Large language models (LLMs) and large multimodal models (LMMs) are making their way into medical settings, but these technologies have yet to be adequately tested in such critical areas.
How much can we trust these models in high-stakes, real-world scenarios? Recent research from the University of California, Santa Cruz, and Carnegie Mellon University suggests, “Not much.”
In a recent experiment, researchers assessed the reliability of LMMs in medical diagnosis by exploring both general and specific diagnostic questions. They curated a new dataset and examined state-of-the-art models’ performance on X-rays, MRIs, and CT scans of human abdomens, brains, spines, and chests. The findings revealed “alarming” drops in accuracy.
Even advanced models like GPT-4V and Gemini Pro performed about as well as random guessing when tasked with identifying medical conditions. Introducing adversarial pairs, questions that probe for conditions or findings not actually present in the image, decreased accuracy further, with an average decline of 42% across the models tested. “Can we really trust AI in critical areas like medical image diagnosis? No, they are even worse than random,” stated Xin Eric Wang, a professor at UCSC and co-author of the study.
Drastic Accuracy Drops with New ProbMed Dataset
Medical Visual Question Answering (Med-VQA) assesses models’ abilities to interpret medical images. While LMMs have shown some progress on datasets such as VQA-RAD, a dataset of clinician-generated visual questions and answers about radiology images, they falter under deeper probing, according to the researchers.
To investigate further, they developed the Probing Evaluation for Medical Diagnosis (ProbMed) dataset, comprising 6,303 images drawn from two prominent biomedical datasets and covering a variety of scan types. The researchers used GPT-4 to extract metadata about the abnormalities present in each image and generate 57,132 question-answer pairs spanning organ identification, clinical findings, and reasoning about the position of abnormalities.
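The article does not reproduce the paper’s construction pipeline, but the idea can be sketched. The short Python example below shows how binary question-answer pairs might be generated from per-image metadata; the ImageMetadata fields, the question templates, and the make_qa_pairs helper are hypothetical stand-ins rather than the authors’ code.

```python
# Illustrative sketch only: the field names, templates, and helper below are
# hypothetical stand-ins for ProbMed's GPT-4-extracted metadata, not the
# authors' actual pipeline.
from dataclasses import dataclass


@dataclass
class ImageMetadata:
    modality: str             # e.g. "X-ray", "CT", "MRI"
    organ: str                # e.g. "chest", "abdomen", "brain", "spine"
    findings: dict[str, str]  # condition -> position, e.g. {"nodule": "left lower lobe"}


def make_qa_pairs(meta: ImageMetadata) -> list[tuple[str, str]]:
    """Turn one image's metadata into binary (yes/no) diagnostic questions."""
    pairs = [
        (f"Is this a {meta.modality} image?", "yes"),
        (f"Does this image show the {meta.organ}?", "yes"),
    ]
    for condition, position in meta.findings.items():
        pairs.append((f"Is there evidence of {condition} in this image?", "yes"))
        pairs.append((f"Is the {condition} located in the {position}?", "yes"))
    return pairs


# Example with made-up metadata:
meta = ImageMetadata("X-ray", "chest", {"nodule": "left lower lobe"})
for question, answer in make_qa_pairs(meta):
    print(question, "->", answer)
```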
The study put seven state-of-the-art models, including GPT-4V and Gemini Pro, through a rigorous probing evaluation. The researchers paired each original binary diagnostic question with an adversarial query to test whether the models could identify true medical conditions while rejecting false ones. They also required the models to perform procedural diagnosis, which demands a comprehensive approach that connects multiple aspects of each image.
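As an illustration rather than the authors’ method, the sketch below shows one way such a paired evaluation could be scored: each ground-truth question is matched with an adversarial question about a condition that is not present, and the model is credited only when it answers both correctly. The ask_model callable and the condition list are assumptions made for the example.

```python
# Hedged sketch of adversarial-pair scoring, not the authors' evaluation code.
import random
from typing import Callable

CONDITIONS = ["pneumonia", "pleural effusion", "nodule", "cardiomegaly"]  # hypothetical pool


def adversarial_counterpart(true_condition: str) -> str:
    """Ask about a condition that is *not* present; the correct answer is 'no'."""
    absent = [c for c in CONDITIONS if c != true_condition]
    return f"Is there evidence of {random.choice(absent)} in this image?"


def paired_accuracy(
    samples: list[tuple[str, str]],        # (image_path, true_condition)
    ask_model: Callable[[str, str], str],  # (image_path, question) -> "yes" or "no"
) -> float:
    """Credit a sample only if the model confirms the real condition and rejects the fake one."""
    correct = 0
    for image_path, condition in samples:
        original_ok = ask_model(image_path, f"Is there evidence of {condition} in this image?") == "yes"
        adversarial_ok = ask_model(image_path, adversarial_counterpart(condition)) == "no"
        if original_ok and adversarial_ok:
            correct += 1
    return correct / len(samples) if samples else 0.0
```

Under this kind of scoring, a model that simply answers “yes” to everything scores zero, which is why accuracy collapses for models that cannot distinguish real findings from fabricated ones.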
The results were sobering: even the strongest models experienced accuracy drops of at least 10.52% on the ProbMed dataset, with an average decrease of 44.7%. For example, LLaVA-v1-7B saw a staggering 78.89% drop to only 16.5% accuracy, while Gemini Pro and GPT-4V saw drops exceeding 25% and 10.5%, respectively. “Our study reveals a significant vulnerability in LMMs when faced with adversarial questioning,” remarked the researchers.
GPT-4V and Gemini Pro Exhibit Errors in Diagnosis
Notably, while GPT-4V and Gemini Pro excelled in general tasks like recognizing image types (CT, MRI, or X-ray) and organs, they struggled with more specialized diagnostic questions. Their accuracy resembled random guessing, demonstrating a troubling inadequacy in assisting real-life diagnoses.
When examining the errors GPT-4V and Gemini Pro made during the diagnostic process, the researchers found both models susceptible to hallucination errors. Gemini Pro was prone to accepting false conditions, while GPT-4V tended to reject challenging questions. For instance, GPT-4V answered condition-related questions with only 36.9% accuracy, and Gemini Pro answered position-related questions correctly only 26% of the time, with 76.68% of its errors stemming from hallucinations.
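To make the distinction concrete, here is a hedged illustration of the two failure modes the researchers describe; the classify_error helper is hypothetical and not the paper’s analysis code.

```python
# Hedged illustration of the two error modes described above, not the authors' code.
from typing import Optional


def classify_error(question_is_adversarial: bool, model_said_yes: bool) -> Optional[str]:
    """Label an incorrect yes/no answer; return None when the answer is correct."""
    if question_is_adversarial and model_said_yes:
        return "hallucination: accepted a condition absent from the image"
    if not question_is_adversarial and not model_said_yes:
        return "rejection: denied a condition actually present in the image"
    return None
```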
Conversely, specialized models like CheXagent, trained exclusively on chest X-rays, proved most accurate in identifying conditions but faltered with general tasks like organ recognition. Significantly, CheXagent demonstrated expertise transfer by accurately identifying conditions in chest CT scans and MRIs, indicating potential for cross-modality application in real-world scenarios.
“This study underscores the urgent need for more robust evaluations to ensure the reliability of LMMs in critical fields like medical diagnosis,” the researchers emphasized. Their findings highlight a significant gap between current capabilities of LMMs and the demands of real-world medical applications.
Cautious Optimism in AI Medical Applications
Experts in the medical and research communities express concerns about the readiness of AI for medical diagnosis. “Glad to see domain-specific studies corroborating that LLMs and AI should not be deployed in safety-critical infrastructure, a recent shocking trend in the U.S.,” stated Dr. Heidy Khlaaf, an engineering director at Trail of Bits. “These systems require at least 99% accuracy, and LLMs are worse than random. This is literally life-threatening.”
Others echo this sentiment, emphasizing that current AI models lack the domain expertise medicine demands. Concerns over data quality also persist, with observers noting that companies often prioritize cutting costs over investing in domain experts.
In conclusion, the findings from the UCSC and Carnegie Mellon researchers underscore the pressing need for improved evaluation methodologies to ensure the reliability and effectiveness of LMMs in medical diagnosis.