Evaluating advanced artificial intelligence in oncology education and clinical knowledge assessment
DOI: https://doi.org/10.18203/2320-6012.ijrms20252009

Keywords: Artificial intelligence, Medical oncology, Multimodal large language model, ChatGPT

Abstract
Background: The rapid advancement of artificial intelligence (AI), particularly multimodal large language models (MLLMs), holds promise for revolutionizing oncology practice. This study evaluates the performance of two MLLMs, GPT-4o and Gemini Advanced, on oncology examination questions from the American Society of Clinical Oncology Self-Evaluation Program (ASCO-SEP) question bank.
Methods: A total of 832 multiple-choice questions covering various oncological tasks were extracted from the ASCO-SEP question bank. Both models were independently presented with these questions, and their responses were compared to the official answer key. Statistical analyses were performed to assess accuracy differences between the models.
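A minimal sketch of the scoring and comparison step described above, for illustration only. The file name, column names, and the choice of a chi-square test on the 2x2 correct/incorrect table are assumptions, not details from the paper, which does not specify its statistical test in the abstract.

```python
# Hypothetical sketch: score two models against an answer key and
# compare their accuracies. Assumes a CSV "asco_sep_responses.csv"
# with columns question_id, correct_answer, gpt4o_answer, gemini_answer
# (all names are illustrative, not from the paper).

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("asco_sep_responses.csv")  # hypothetical file

# Score each model against the official answer key.
gemini_correct = int((df["gemini_answer"] == df["correct_answer"]).sum())
gpt4o_correct = int((df["gpt4o_answer"] == df["correct_answer"]).sum())
n = len(df)

# 2x2 contingency table: rows = model, columns = correct / incorrect.
table = [
    [gemini_correct, n - gemini_correct],
    [gpt4o_correct, n - gpt4o_correct],
]
chi2, p, dof, expected = chi2_contingency(table)

print(f"Gemini Advanced accuracy: {gemini_correct / n:.2%}")
print(f"GPT-4o accuracy:          {gpt4o_correct / n:.2%}")
print(f"chi-square p = {p:.3f}")
```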
Results: Gemini Advanced outperformed GPT-4o, achieving 74.84% accuracy versus 60% (p=0.025). Gemini Advanced excelled consistently across all task categories, particularly in making diagnoses, ordering and interpreting test results, and recommending treatment. Both models struggled with questions on pathophysiology and basic science knowledge.
Conclusions: While both MLLMs demonstrate a substantial grasp of oncological knowledge, Gemini Advanced performs markedly better, highlighting the influence of model architecture and training data. These findings underscore the potential of AI to augment clinical practice and medical education, but emphasize the need for further improvement, particularly in handling complex clinical scenarios and integrating foundational science knowledge.