Artificial Intelligence Faces Challenges in Distinguishing Anatomy's Left and Right in Scans
In a recent study, researchers tested the ability of AI models, including ChatGPT and other vision-language models (VLMs), to accurately determine the relative positions of organs in medical scans. The findings suggest that these models can show moderate to good performance, but their success varies depending on several factors.
The study, which involved four VLMs - GPT-4o, Llama3.2, Pixtral, and DeepSeek's JanusPro - aimed to address several research questions. The first question tested whether current top-tier VLMs could accurately determine relative positions in radiological images.
To answer this, the authors used the SimpleITK framework, together with segmentations from the TotalSegmentator project, to extract axial image slices from volumetric data. On each slice they placed challenge locations at least 50px apart, each at least double the size of the markers, and generated corresponding question/answer pairs.
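The sampling step can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual code: the function names and the marker size are hypothetical, and a plain NumPy array stands in for a slice extracted with SimpleITK and TotalSegmentator.

```python
import numpy as np

MIN_DIST_PX = 50    # minimum separation between the two challenge locations
MARKER_SIZE_PX = 10  # hypothetical marker size; locations also stay >= 2x this apart

def sample_point_pair(shape, rng):
    """Sample two (row, col) locations on a slice satisfying the distance rules."""
    min_dist = max(MIN_DIST_PX, 2 * MARKER_SIZE_PX)
    h, w = shape
    while True:
        r1, c1 = rng.integers(0, h), rng.integers(0, w)
        r2, c2 = rng.integers(0, h), rng.integers(0, w)
        if np.hypot(r1 - r2, c1 - c2) >= min_dist:
            return (r1, c1), (r2, c2)

def make_question(p1, p2):
    """Build a left/right question/answer pair for two marked points.
    In image coordinates a smaller column index is further left.
    (Ties in the column coordinate are not handled in this sketch.)"""
    answer = "left" if p1[1] < p2[1] else "right"
    return "Is marker A to the left or right of marker B?", answer

rng = np.random.default_rng(0)
a, b = sample_point_pair((512, 512), rng)  # a typical axial CT slice size
question, answer = make_question(a, b)
```

Because the ground truth comes directly from the sampled pixel coordinates, question/answer pairs can be generated at scale without manual annotation.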
The authors found that while AI models can identify organ positions to some extent, their performance is influenced by several key factors.
Firstly, the models' training and domain specialization play a significant role. Models like CheXagent, pre-trained specifically on chest X-rays, demonstrate better accuracy in identifying chest organs compared to other organs and imaging modalities such as MRI or CT scans. Open-source general purpose models have lower accuracy compared to proprietary models, but MiniGPT-v2 performs relatively well even without domain-specific training.
Secondly, the use of anatomical prior knowledge is crucial. VLMs such as GPT-4o rely heavily on prior anatomical knowledge encoded in their language component rather than purely on visual cues when determining relative organ positions on radiological images. Removing anatomical terms from prompts and using only visual markers forced the models to depend on image content, leading to marked performance drops, consistent with this reliance on prior knowledge.
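The prompt ablation just described can be illustrated with two templates: one that names the anatomy, so the model can fall back on prior knowledge, and one that refers only to neutral visual markers. The wording below is hypothetical and not the study's exact prompts.

```python
def anatomical_prompt(organ_a: str, organ_b: str) -> str:
    """Baseline prompt: names the organs, so the model can answer from
    prior anatomical knowledge without actually inspecting the image."""
    return (f"In this axial CT slice, is the {organ_a} to the left "
            f"or to the right of the {organ_b}?")

def marker_only_prompt() -> str:
    """Ablated prompt: refers only to neutral visual markers, forcing the
    model to rely on the image content alone."""
    return ("Two markers, A and B, are drawn on this image. "
            "Is marker A to the left or to the right of marker B?")

baseline = anatomical_prompt("liver", "spleen")
ablated = marker_only_prompt()
```

Comparing accuracy under the two templates separates what the model knows about anatomy from what it can actually see.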
Thirdly, the type of image markers and the models' ability to interpret image orientation also impact performance. Visual markers (letters, numbers, dots) can improve performance slightly for some models but are not sufficient alone to reliably determine relative organ positions. Accurate determination also depends on the model correctly interpreting the orientation and rotation of images.
Furthermore, the study indicates that when the text prompt describes what the attached image is, the model tends to treat the image as a typical example of that category, assuming many of its properties from prior knowledge instead of examining the image that was actually submitted.
The authors also found that models perform better on certain scan types (e.g., chest X-rays) than others (MRI, CT). Transfer of learned knowledge across modalities is limited. Additionally, the study showed that proprietary models tend to outperform open-source ones, and pretraining on large annotated datasets focused on specific anatomy improves organ localization accuracy.
In complex, less structured scenarios, these models still struggle. However, for structured, guideline-driven diagnoses, such as identifying infarct location in STEMI, ChatGPT achieves high concordance with physicians, reflecting good anatomical understanding linked to clinical decision-making.
To test whether visual markers could help VLMs determine relative positions in radiological images, the study repeated the experiments using CT slices annotated with letters, numbers, or red and blue dots. It also tested the models on rotated or flipped CT slices, where GPT-4o and Pixtral achieved substantial accuracy improvements.
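Generating rotated and flipped test variants of a slice is straightforward with NumPy; the sketch below is an illustration under the assumption that each slice is a 2-D array, not the authors' code.

```python
import numpy as np

def orientation_variants(slice_2d: np.ndarray) -> dict:
    """Return the original slice plus rotated and mirrored copies.
    Note that a left-right flip changes laterality: what was on the left
    is now on the right, so the ground-truth answer for a left/right
    question must be flipped along with the image."""
    return {
        "original": slice_2d,
        "rot90": np.rot90(slice_2d, k=1),
        "rot180": np.rot90(slice_2d, k=2),
        "flipped_lr": np.fliplr(slice_2d),
    }

slice_2d = np.arange(16).reshape(4, 4)  # stand-in for a CT slice
variants = orientation_variants(slice_2d)
```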
The results showed accuracies near 50 percent across all models, indicating performance at chance level. GPT-4o and Pixtral showed small accuracy gains when letter or number markers were used, while JanusPro and Llama3.2 saw little to no benefit.
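Whether an observed accuracy is actually distinguishable from the 50 percent chance level can be checked with a simple exact two-sided binomial test. This standard-library sketch is an illustration of that check, not the study's statistical method.

```python
from math import comb

def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: sum the probabilities of all outcomes
    at most as likely as the observed count under the null hypothesis."""
    probs = [comb(trials, k) * p**k * (1 - p)**(trials - k)
             for k in range(trials + 1)]
    observed = probs[successes]
    # Small tolerance guards against floating-point ties.
    return sum(q for q in probs if q <= observed + 1e-12)

# 52/100 correct is statistically indistinguishable from guessing,
# while 70/100 clearly is not.
p_chance = binomial_two_sided_p(52, 100)
p_better = binomial_two_sided_p(70, 100)
```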
In conclusion, while AI models like ChatGPT and specialized VLMs can perform fairly well in recognizing organ positions in medical scans when benefiting from anatomical knowledge and appropriate training, their success is limited by factors such as modality differences, dependence on prior knowledge rather than pure vision, and variable performance across model types. Enhancing multimodal training and integrating real-time anatomical and clinical context could improve their accuracy further.