It is no longer big news that AI can effectively support doctors and other healthcare professionals in performing administrative tasks, assessing medical images (ultrasound, CT and MRI scans) and making diagnoses. We also know that AI is still far from being able to completely and independently take over the tasks of doctors, radiologists and other healthcare professionals. Humans are, and will always remain, responsible; AI plays a supporting role. That this supporting role works well has now been demonstrated in an international study led by the Max Planck Institute for Human Development.
The study, conducted in collaboration with partners from the Human Diagnosis Project (San Francisco) and the Institute for Cognitive Sciences and Technologies of the Italian National Research Council (CNR-ISTC Rome), looked at how AI and medical professionals can work together most efficiently and accurately.
AI makes different mistakes than humans
AI solutions, particularly large language models (LLMs) such as ChatGPT, Gemini or Claude, can help in making a diagnosis. However, there are also risks associated with using these tools. Perhaps the best known is the “hallucinating” of AI bots, whereby the model simply makes information up. This is annoying when it happens in an essay for Dutch class at school, but potentially fatal when it comes to a medical assessment or diagnosis.
Hallucinating is something a doctor will not do when assessing symptoms or a medical scan. However, doctors and radiologists can “miss” clues, simply because those clues are invisible to the human eye, or they can draw the wrong conclusion from what they do see. In short, humans make different mistakes than AI tools, and vice versa.
Collaboration between AI and humans
The study, now published in the Proceedings of the National Academy of Sciences, shows that combining human expertise with AI models leads to the most accurate diagnoses. The researchers describe this as building hybrid diagnostic collectives, in which human experts work together with AI systems. According to the researchers, these teams are significantly more accurate than collectives consisting exclusively of humans or of AI. This is particularly true for complex, open-ended diagnostic questions with numerous possible answers, rather than simple yes/no decisions.
‘Our results show that collaboration between humans and AI models has great potential to improve patient safety,’ says lead author Nikolas Zöller, postdoctoral researcher at the Centre for Adaptive Rationality at the Max Planck Institute for Human Development.
The study used data from more than 2,100 so-called medical vignettes from the Human Diagnosis Project: short descriptions of medical case studies, each including the correct diagnosis. The physicians' diagnoses for these vignettes were then compared with those of five leading AI models.
The researchers simulated different diagnostic collectives: individuals, human collectives, AI models and mixed human-AI collectives. In total, they analysed more than 40,000 diagnoses, each classified and evaluated according to the international medical standard SNOMED CT.
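How such a collective arrives at a single diagnosis is, at its simplest, a matter of pooling the candidate diagnoses and letting the most frequently proposed one win. The sketch below only illustrates that idea, with made-up case data and a plain plurality vote; it is not the aggregation procedure used in the study.

```python
from collections import Counter

def collective_diagnosis(diagnoses):
    """Return the most frequently proposed diagnosis in the pool."""
    return Counter(diagnoses).most_common(1)[0][0]

# Hypothetical example: three human diagnosticians and two AI models
# assess the same vignette (all diagnoses invented for illustration).
human_votes = ["pulmonary embolism", "pneumonia", "pulmonary embolism"]
ai_votes = ["pulmonary embolism", "pericarditis"]

print(collective_diagnosis(human_votes + ai_votes))  # pulmonary embolism
```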
Humans and AI complement each other
The study shows that combining multiple AI models improved diagnostic quality. On average, the AI collectives outperformed 85 per cent of human diagnosticians. However, there were numerous cases in which humans performed better. Interestingly, humans often got the diagnosis right when AI failed.
The biggest surprise was that combining both worlds led to a significant increase in accuracy. Even adding a single AI model to a group of human diagnosticians, or vice versa, significantly improved the result. The most reliable results came from collective decisions involving multiple humans and multiple AI solutions.
The explanation for this is that humans and AI systematically make different mistakes. When AI failed, a human professional could compensate for the mistake, and vice versa. This so-called error complementarity is what makes hybrid collectives so powerful. ‘It's not about replacing humans with machines. Rather, we should see artificial intelligence as a complementary tool that unfolds its full potential in collective decision-making,’ says co-author Stefan Herzog, senior research scientist at the Max Planck Institute for Human Development.
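The effect of error complementarity can be shown with a toy calculation. The numbers below are hypothetical, not taken from the study: they simply illustrate that when a human and an AI fail on different cases, the share of cases in which at least one of them is correct, an upper bound on what a well-combined collective could reach, is higher than either accuracy on its own.

```python
# Toy illustration of error complementarity (hypothetical data, not from the study).
# True/False marks whether each diagnostician got a given case right.
human_correct = [True, True, False, True, False, True, False, True]
ai_correct    = [False, True, True, True, True, False, False, True]

n = len(human_correct)
human_acc  = sum(human_correct) / n                                      # 62.5%
ai_acc     = sum(ai_correct) / n                                         # 62.5%
either_acc = sum(h or a for h, a in zip(human_correct, ai_correct)) / n  # 87.5%

print(f"human alone: {human_acc:.1%}, AI alone: {ai_acc:.1%}, "
      f"at least one correct: {either_acc:.1%}")
```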
Limitations
However, the researchers also emphasise the limitations of their work. The study only looked at text-based case descriptions and not at real patients in a real clinical setting. Whether the results can be directly applied in practice remains a question that needs to be investigated in future studies. Similarly, the study focused exclusively on diagnosis, not treatment, and an accurate diagnosis does not necessarily guarantee optimal treatment.
It also remains uncertain how AI-based support systems will be accepted in practice by medical staff and patients. The potential risks of bias and discrimination by both AI and humans, particularly with regard to ethnic, social or gender differences, also require further investigation.