OpenAI has launched HealthBench, a benchmark dataset designed to test AI tools developed to answer medical questions. Experts say the benchmark is a big step forward in separating the wheat from the chaff among generative AI tools for healthcare.
HealthBench is OpenAI's first major independent healthcare project. It includes 5,000 “realistic health conversations”, each with detailed assessment tools to evaluate responses generated by AI tools. “Our mission at OpenAI is to ensure that artificial general intelligence (AGI) benefits humanity. Part of that is building and deploying the technology, and part is ensuring that the AI tools and LLMs developed for healthcare are safe and reliable,” said Karan Singhal, head of the company's healthcare AI team.
Generative AI and LLMs in healthcare
Several generative AI tools and large language model (LLM) based solutions are already deployed in healthcare. Although the added value of such solutions has been demonstrated repeatedly, the quality of some tools still falls short of healthcare's high demands. A recent Israeli study underlined this: it concluded that most LLMs, such as ChatGPT, still underperform in medical decision-making.
Despite the growing popularity of AI in healthcare, the research shows that these models often provide inaccurate or inconsistent information, posing risks in medical decision-making. The researchers stress that although AI has potential, human expertise remains essential in the diagnostic process. They therefore recommend using AI only as a support tool for now, not as a replacement for medical professionals.
HealthBench benchmark
The dataset was created with the help of 262 doctors from 60 countries, who contributed more than 57,000 unique criteria for assessing how well AI tools answer health questions. The 5,000 examples in HealthBench are synthesised conversations designed by doctors. “We wanted to balance the benefits of releasing the data with, of course, the privacy constraints of using realistic data,” said Singhal.
The dataset also includes 1,000 difficult examples that current AI models struggled with; OpenAI hopes these can help improve existing and future AI tools. With HealthBench, the company wants AI models to be easier to compare, assess and evaluate. In designing the benchmark, it considered three factors:
- Meaningful: Scores reflect real-world impact. This goes beyond exam questions to include complex, real-life scenarios and workflows that reflect how individuals and doctors interact with models.
- Reliable: Scores are reliable indicators of doctors' judgement. Evaluations should reflect the standards and priorities of healthcare professionals, providing a rigorous basis for improving AI systems.
- Unsaturated: Benchmarks support progress. Current models should show significant room for improvement, encouraging model developers to continuously improve their performance.
Rubric evaluation
HealthBench is a rubric evaluation: each model answer is assessed against a set of doctor-written rubric criteria specific to that conversation. Each criterion indicates something an ideal answer should include or avoid, for example a specific fact that should be stated or unnecessary technical jargon that should be left out. Each criterion carries a point value, weighted to match the doctors' assessment of its importance.
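To make that structure concrete, here is a minimal sketch of what a single rubric criterion might look like in code. The class and field names are assumptions for illustration, not OpenAI's actual schema, and the example criteria are invented:

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    # Hypothetical structure; HealthBench's real schema may differ.
    description: str  # what an ideal answer should include or avoid
    points: float     # weight reflecting the doctors' judged importance

# Illustrative criteria for one conversation:
criteria = [
    RubricCriterion("Advises seeking urgent care for the described symptoms", points=10.0),
    RubricCriterion("Avoids unnecessary technical jargon", points=3.0),
]
```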
HealthBench contains 48,562 unique rubric criteria, which together cover specific facets of model performance. Model responses are judged by a model-based grader (GPT-4.1), which assesses whether each rubric criterion has been met; a response's overall score is the total points of the fulfilled criteria relative to the maximum possible score.
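That scoring rule can be expressed compactly: sum the points of the criteria the grader judges fulfilled and divide by the maximum attainable points. Below is a minimal sketch, assuming the per-criterion verdicts come from the model-based grader and that only positive point values count towards the maximum; the article does not spell out how “avoid” criteria are weighted, so that detail is an assumption:

```python
def score_response(criteria, met):
    """Aggregate per-criterion grader verdicts into one overall score.

    criteria: list of (description, points) pairs written by doctors
    met:      parallel list of booleans, e.g. produced by a GPT-4.1 grader
    """
    max_points = sum(points for _, points in criteria if points > 0)
    earned = sum(points for (_, points), ok in zip(criteria, met) if ok)
    # Normalise so a response that fulfils every criterion scores 1.0.
    return max(0.0, earned / max_points) if max_points else 0.0

# Example: two of three criteria fulfilled -> 15/18 ≈ 0.83.
example = [("states the key fact", 10.0), ("avoids jargon", 3.0), ("advises follow-up", 5.0)]
print(score_response(example, [True, False, True]))
```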