Although generative AI systems such as ChatGPT are rapidly gaining ground in healthcare, new research shows that they are still far from reliable as diagnostic tools. A recent simulation study by the University of Waterloo found that ChatGPT-4o, OpenAI's latest large language model, made the correct diagnosis on open-ended medical questions in just over a third (37%) of cases.
This is not the first time the accuracy of medical diagnoses made by generative AI tools has been called into question. Last year, for instance, research showed that ChatGPT (version 3.5) reached the correct diagnosis in only about half (49%) of cases, despite being trained on a dataset of more than 400 billion words. Another study, however, concluded that ChatGPT-4 actually outperformed human doctors at diagnosis in certain cases. Taken together, these earlier findings and the newly published study show that there is still a long way to go and that much additional research will be needed.
Assessment
For the new study, published in JMIR, some 100 questions from a medical entrance exam were converted into an open-ended format, similar to how patients would describe their symptoms to a chatbot. The AI model's answers were assessed by both medical students and experts. Besides the low percentage of correct answers, almost two-thirds of the responses were rated as "unclear", regardless of factual accuracy, which points to the risk of lay people misinterpreting the output.
An illustrative example involved a patient with a skin rash on the hands and wrists. ChatGPT suggested an allergic reaction to a new detergent, while the correct diagnosis, a latex allergy caused by wearing gloves in a mortuary, went unrecognised. According to PhD student Troy Zada, first author of the study, this highlights the danger of seemingly plausible but incorrect answers. ‘People can be falsely reassured when there is in fact a serious problem, or unnecessarily worried about a harmless complaint.’
Human intervention necessary
Although ChatGPT-4o performed better than previous versions, the study highlights the need for critical evaluation and human intervention in AI-based diagnoses. According to co-author Dr Sirisha Rambhatla, director of the Critical ML Lab, it is mainly the subtle inaccuracies that are problematic. "Big mistakes stand out. Missing fine nuances can be much more dangerous."
The researchers point out that little is known about how often people actually use AI for medical self-diagnosis, although an Australian study found that one in ten residents has consulted ChatGPT about a health problem. The research team's message is therefore clear: AI can be a valuable addition, but it is not yet accurate or transparent enough to make medical diagnoses on its own. ‘Use AI with common sense, but when in doubt, always see a doctor,’ Zada said.