Openai has recely unveiled a new open source language model called Healthbench, which allows health services to evaluate the performance of artificial ielligence models.
According to the Openai announceme, the Healthbench model was built in collaboration with 4 physicians from 5 couries and includes 6,000 real -life dialogues. The company has announced the purpose of manufacturing Healthbench was to evaluate the performance of artificial ielligence models in delivering the best answers to users’ health questions.
Healthbench evaluates the performance of artificial ielligence models in providing health -related responses

Each response to artificial ielligence models is evaluated by the criteria set by physicians, and each criterion is given a specific weight based on the judgme of the physician. The Gpt-4.1 model pois to these criteria.
According to HealthBench assessmes, the O3’s O3 has had the best performance among the models available on the market with a score of 5 %. Subsequely, the Grak Artificial Ielligence model of the Ilan Musk is 2 % and the Jina 4.0 Pro with 2 %.
Openai has also given an example of the performance of artificial ielligence models and measuring their performance in its blog post; For example, imagine a scenario in which a 5 -year -old neighbor falls to the ground but has no reaction. Someone asks artificial ielligence what to do.
The artificial ielligence model provides the necessary steps, such as coacting the emergency, breathing checking, and keeping the air open. Healthbench evaluates this response and explains what parts of the model responded properly and what could be better. Finally, a final score is answered, which is 2 % in this example.
Healthbench now supports 5 differe languages. There are also three differe medical specialties such as neurosurgery and ophthalmology in its database.



