Advancing Medical Imaging Applications with Large Language Models: The MIMIC-Diff-VQA Dataset
The exploration of large language models in medicine hinges on high-quality datasets. Chest X-ray images are prevalent in clinical diagnostics and sit at a pivotal intersection of technology and healthcare. The abundance of visual data combined with detailed case reports supports the development of vision-language models. A key focus area is medical Visual Question Answering (VQA), supported by prominent datasets such as ImageCLEF-VQA-Med and VQA-RAD, which provide question-answer pairs linked to radiology images, including chest X-rays.
Despite the extensive clinical information in X-ray reports, current medical VQA tasks offer limited question types, which narrows their clinical utility. For example, the ImageCLEF-VQA-Med dataset includes only two questions related to chest X-ray images: “Is there any abnormality in this image?” and “What is the main abnormality in this image?” While VQA-RAD provides a wider variety, it features only 315 images.
At KDD 2023, researchers and radiologists from the University of Texas at Arlington, the NIH, RIKEN, the University of Tokyo, and the National Cancer Center unveiled the MIMIC-Diff-VQA dataset, designed to enhance clinical diagnostics. The dataset is grounded in chest X-ray radiology reports and introduces a broader, more detailed, and logically progressive selection of question-answer pairs spanning seven question types.
Additionally, this research introduces a new task: difference VQA, focusing on identifying changes between two images. This task is particularly relevant for radiologists, who frequently compare past images to evaluate patient progress. Potential questions in this domain might include “What has changed compared to the previous image?” and “Has the severity of the disease decreased?”
The MIMIC-Diff-VQA dataset contains 164,654 images and 700,703 questions, making it the largest medical VQA dataset to date. The study also proposes a baseline VQA method built on Graph Neural Networks (GNNs). To manage variations in patient posture across images, the authors use a Faster R-CNN model to extract features for anatomical structures, representing each structure as a node in a graph. The approach integrates implicit, spatial, and semantic relationships, incorporating medical expertise into the model: spatial relationships capture how organs are positioned relative to one another, semantic relationships draw on anatomical and disease-related knowledge, and implicit relationships are learned through fully connected layers that complement the other two.
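As a rough illustration of how detected anatomical regions might be turned into graph nodes, the short Python sketch below pools one feature vector per detected region. The off-the-shelf torchvision detector and the crude per-box pooling are illustrative assumptions, not the authors' actual pipeline.

```python
import torch
import torchvision

# Generic detector as a stand-in for an anatomical-structure detector;
# the paper's exact detector and weights are not part of this sketch.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def regions_to_nodes(image: torch.Tensor, score_thresh: float = 0.5) -> torch.Tensor:
    """Detect regions in a (C, H, W) image and pool one feature vector per region."""
    with torch.no_grad():
        detections = detector([image])[0]              # dict with boxes, labels, scores
    boxes = detections["boxes"][detections["scores"] > score_thresh]
    nodes = []
    for x1, y1, x2, y2 in boxes.round().long().tolist():
        # Crude per-region feature: mean pixel value per channel inside the box.
        # A real pipeline would reuse the detector's RoI-pooled features instead.
        crop = image[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
        nodes.append(crop.float().mean(dim=(1, 2)))
    return torch.stack(nodes) if nodes else torch.empty(0, image.shape[0])
```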
The research aims to propel the advancement of visual question-answering technologies in medicine, particularly by providing benchmarks for large language models like GPT-4, facilitating improved clinical decision-making and patient education.
Current Landscape of Medical Vision-Language Research
Significant efforts have been devoted to leveraging existing medical databases for training deep learning models, with notable resources including MIMIC-CXR, NIH14, and CheXpert. These initiatives generally fall into three categories: direct classification of disease labels, medical report generation, and visual question-answering tasks.
Disease label classification usually starts with rule-based tools like NegBio and CheXpert to extract predefined labels from reports, followed by the categorization of positive and negative samples. Techniques in report generation include contrastive learning, attention models, and encoder-decoder architectures, aimed at translating image data into text that mirrors original reports.
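To make the rule-based step concrete, here is a toy Python sketch of label extraction with simple negation handling, loosely in the spirit of tools like NegBio and the CheXpert labeler; the patterns and label list are illustrative only.

```python
import re

# Toy label vocabulary and negation cue; real labelers use far richer rules.
LABELS = {
    "pneumothorax": r"pneumothorax",
    "pleural effusion": r"pleural effusion",
    "cardiomegaly": r"cardiomegaly|enlarged cardiac silhouette",
}
NEGATION = r"\b(no|without|free of|negative for)\b[^.]*?"

def extract_labels(report: str) -> dict:
    """Return {label: 1 (positive), 0 (negated), None (not mentioned)}."""
    report = report.lower()
    results = {}
    for label, pattern in LABELS.items():
        if re.search(NEGATION + pattern, report):
            results[label] = 0          # mentioned but negated
        elif re.search(pattern, report):
            results[label] = 1          # mentioned affirmatively
        else:
            results[label] = None       # not mentioned
    return results

print(extract_labels("No pneumothorax. There is a small left pleural effusion."))
# -> {'pneumothorax': 0, 'pleural effusion': 1, 'cardiomegaly': None}
```

Even this toy version hints at why uncertainty and hedged language ("cannot exclude", "possible") trip up purely rule-based extraction.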
Despite progress, challenges persist in clinical applications. For instance, Natural Language Processing (NLP) rules often struggle with uncertainty and negation in disease label classification, leading to inaccurate label extraction. Moreover, simplistic labels may fail to reflect the diversity of clinical diseases. While report generation systems can uncover latent information from images, they may not directly answer specific physician inquiries related to unique clinical scenarios.
Visual question-answering tasks emerge as a practical solution, allowing precise responses to specific questions that physicians or patients might ask, such as “Is there a pneumothorax in the image?” with a definitive response of “No.” However, existing VQA datasets like ImageCLEF-VQA-Med provide only a limited set of general questions, undermining their capacity to offer valuable clinical insights.
Although VQA-RAD includes 11 question types, its collection of only 315 images is too small to exploit the potential of deep learning models, which thrive on large datasets. To address this gap in medical vision-language research, the study introduces the difference VQA task and constructs the extensive MIMIC-Diff-VQA dataset accordingly.
Overview of the MIMIC-Diff-VQA Dataset
The MIMIC-Diff-VQA dataset features 164,654 images and 700,703 questions across seven clinically relevant question types: anomaly, presence, orientation, location, severity, type, and difference. The first six categories align with traditional VQA, while the last specifically addresses differences between two images.
The MIMIC-Diff-VQA dataset was built from the wealth of chest X-ray images and radiology reports in MIMIC-CXR. Construction began with the extraction of a KeyInfo dataset, which encapsulates the essential information from each report: the confirmed abnormal objects and their attributes, along with any negative objects mentioned in the report.
To ensure a high standard of quality in the dataset, the researchers employed a systematic "Extract-Check-Modify" approach. Initial data extraction utilized rules guided by regular expressions, followed by verification through both manual and automated methods, such as ScispaCy for entity extraction and part-of-speech analysis.
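The sketch below illustrates the spirit of this pipeline under simplified assumptions: regular expressions propose candidate findings and attributes, and ScispaCy entities serve as the automated check. The regex, attribute list, and choice of the en_core_sci_sm model are assumptions, not the authors' exact rules.

```python
import re
import spacy

# Requires the ScispaCy biomedical model en_core_sci_sm to be installed.
nlp = spacy.load("en_core_sci_sm")

FINDING_PATTERN = re.compile(
    r"(?P<attr>small|moderate|large|mild|severe)?\s*"
    r"(?P<finding>pleural effusion|pneumothorax|atelectasis|opacity)",
    re.IGNORECASE,
)

def extract_keyinfo(sentence: str) -> list[dict]:
    """Extract (finding, attribute) pairs, keeping only those ScispaCy also recognizes."""
    entities = {ent.text.lower() for ent in nlp(sentence).ents}
    keyinfo = []
    for match in FINDING_PATTERN.finditer(sentence):
        finding = match.group("finding").lower()
        # Cross-check the regex hit against ScispaCy's entity spans.
        if any(finding in ent or ent in finding for ent in entities):
            keyinfo.append({"finding": finding, "attribute": match.group("attr")})
    return keyinfo

print(extract_keyinfo("There is a small pleural effusion."))
```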
After constructing the KeyInfo dataset, the researchers formulated relevant questions and answers based on multiple visits per patient, culminating in the creation of the MIMIC-Diff-VQA dataset.
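As a simplified illustration of how difference questions could be derived from two visits, the sketch below compares the finding sets recorded for a reference and a current study; the KeyInfo structure and answer phrasing are assumptions, not the dataset's exact templates.

```python
# Sketch: turn per-visit finding sets into one difference question-answer pair.
def difference_qa(keyinfo_prev: set[str], keyinfo_curr: set[str]) -> tuple[str, str]:
    question = "What has changed compared to the previous image?"
    new_findings = sorted(keyinfo_curr - keyinfo_prev)
    resolved = sorted(keyinfo_prev - keyinfo_curr)
    parts = []
    if new_findings:
        parts.append("new findings: " + ", ".join(new_findings))
    if resolved:
        parts.append("resolved findings: " + ", ".join(resolved))
    answer = "; ".join(parts) if parts else "no change"
    return question, answer

print(difference_qa({"pleural effusion"}, {"pleural effusion", "pneumothorax"}))
# -> ('What has changed compared to the previous image?', 'new findings: pneumothorax')
```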
Quality Assurance and Baseline Model
To further guarantee quality, three human validators manually reviewed 1,700 questions and answers, achieving an impressive average accuracy of 97.4%.
Complementing the dataset, the study introduced a graph-based model tailored for chest X-rays and the difference VQA task. This model accounts for variations caused by patient posture over time during imaging. By identifying anatomical structures and extracting corresponding features as graph nodes, the design aims to mitigate posture-related discrepancies.
Each node represents a blend of anatomical characteristics and question features. The study employs various pre-trained models for each anatomical structure to gather comprehensive information about potential lesions in the images.
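A minimal sketch of this fusion step, assuming concatenation of region and question features followed by a linear projection; the dimensions and fusion choice are illustrative, not the paper's exact design.

```python
import torch

class NodeFusion(torch.nn.Module):
    """Blend per-region anatomical features with the question embedding per node."""
    def __init__(self, region_dim: int = 1024, question_dim: int = 512, node_dim: int = 512):
        super().__init__()
        self.proj = torch.nn.Linear(region_dim + question_dim, node_dim)

    def forward(self, region_feats: torch.Tensor, question_feat: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, region_dim); question_feat: (question_dim,)
        q = question_feat.unsqueeze(0).expand(region_feats.size(0), -1)
        return torch.relu(self.proj(torch.cat([region_feats, q], dim=-1)))

# Example: 18 anatomical regions fused with one question embedding.
node_feats = NodeFusion()(torch.randn(18, 1024), torch.randn(512))
```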
In the “multi-relationship graph network module,” the researchers applied three graph network relationships—implicit, spatial, and semantic—to compute the final graph features. Implicit relationships are derived using fully connected networks, while spatial relationships consider 11 different connections between nodes. The semantic relationships stem from two knowledge graphs, analyzing co-occurrence relationships between diseases and anatomical links.
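The following sketch shows one plausible way such a multi-relationship layer could be wired up: one round of message passing per relation (implicit, spatial, semantic), with the three results summed. The adjacency construction and aggregation are simplified assumptions rather than the authors' exact module.

```python
import torch

class MultiRelationGraphLayer(torch.nn.Module):
    """One graph layer with separate weights per relation type."""
    def __init__(self, node_dim: int = 512):
        super().__init__()
        self.weights = torch.nn.ModuleDict({
            rel: torch.nn.Linear(node_dim, node_dim)
            for rel in ("implicit", "spatial", "semantic")
        })

    def forward(self, nodes: torch.Tensor, adjacency: dict) -> torch.Tensor:
        # nodes: (N, node_dim); adjacency[rel]: (N, N) row-normalized matrix.
        out = 0
        for rel, linear in self.weights.items():
            out = out + adjacency[rel] @ linear(nodes)   # message passing per relation
        return torch.relu(out)

# Toy usage: 5 nodes, a fully connected "implicit" graph, placeholder spatial/semantic graphs.
nodes = torch.randn(5, 512)
full = torch.full((5, 5), 1 / 5)
layer = MultiRelationGraphLayer()
graph_feats = layer(nodes, {"implicit": full, "spatial": torch.eye(5), "semantic": torch.eye(5)})
```

In practice the spatial adjacency would encode the 11 positional relation types and the semantic adjacency would come from the disease co-occurrence and anatomical knowledge graphs described above.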
Finally, the study implements a global average pooling technique on the features generated by these relationships to derive the final image graph features. By subtracting feature representations from two images, the study computes the difference graph features, which are then processed via attention mechanisms to produce the final answers using an LSTM-based answer generator.
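A compact sketch of this readout under simplified assumptions: the attention step is omitted, and the pooling, subtraction, and decoder sizes are illustrative rather than the authors' exact configuration.

```python
import torch

class AnswerDecoder(torch.nn.Module):
    """Tiny LSTM decoder that maps a difference feature to answer-token logits."""
    def __init__(self, feat_dim: int = 512, vocab_size: int = 1000, max_len: int = 20):
        super().__init__()
        self.lstm = torch.nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.vocab = torch.nn.Linear(feat_dim, vocab_size)
        self.max_len = max_len

    def forward(self, diff_feat: torch.Tensor) -> torch.Tensor:
        # diff_feat: (batch, feat_dim); feed it as the input at every time step.
        steps = diff_feat.unsqueeze(1).repeat(1, self.max_len, 1)
        hidden, _ = self.lstm(steps)
        return self.vocab(hidden)                      # (batch, max_len, vocab_size) logits

def difference_feature(nodes_main: torch.Tensor, nodes_ref: torch.Tensor) -> torch.Tensor:
    # Global average pooling over each image's nodes, then an elementwise difference.
    return nodes_main.mean(dim=0, keepdim=True) - nodes_ref.mean(dim=0, keepdim=True)

logits = AnswerDecoder()(difference_feature(torch.randn(5, 512), torch.randn(5, 512)))
```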
Conclusion and Future Directions
This research presents a novel medical Difference VQA task alongside the large-scale MIMIC-Diff-VQA dataset, aiming to advance technology within academia and provide robust support to the medical community, particularly in clinical decision-making and patient education.
By developing a knowledge-enhanced, multi-relationship graph network model, the study establishes benchmarks that significantly improve on current leading methods. Despite these advancements, some limitations remain, including the need to handle scenarios with multiple anomalies in various locations and to improve synonym merging. The model may also misclassify different presentations of the same abnormality.
In summary, the MIMIC-Diff-VQA dataset and its accompanying model crucially enhance medical visual question-answering capabilities, paving the way for further research and development in this vital field.