Chinese generative AI models challenge western AI in clinical chemistry MCQs: A Benchmarking follow-up study on AI use in health education
Date
2025-02-08
Publisher
Mesopotamian Press
Abstract
Background: The emergence of Chinese generative AI (genAI) models, such as DeepSeek and Qwen, has introduced strong competition to Western genAI models. These advancements hold significant potential in healthcare education. However, benchmarking the performance of genAI models in specialized medical disciplines is crucial to assess their strengths and limitations. This study builds on prior research evaluating ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard against human postgraduate students in Medical Laboratory Sciences, now incorporating DeepSeek and Qwen to assess their effectiveness in Clinical Chemistry Multiple-Choice Questions (MCQs).
Methods: This study followed the METRICS framework for genAI-based healthcare evaluations, assessing six models using 60 Clinical Chemistry MCQs previously administered to 20 MSc students. The facility index and Bloom’s taxonomy classification were used to benchmark performance. GenAI models included DeepSeek-V3, Qwen 2.5-Max, ChatGPT-4, ChatGPT-3.5, Microsoft Bing, and Google Bard, evaluated in a controlled, non-interactive environment using standardized prompts.
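The facility index used for benchmarking is simply the proportion of respondents (or models) answering a given MCQ correctly. As a minimal sketch, assuming per-item correctness is recorded as booleans (the data below are hypothetical, not taken from the study):

```python
def facility_index(responses):
    """Facility index for one MCQ: fraction of correct responses.

    responses: iterable of booleans, True = correct answer.
    """
    responses = list(responses)
    return sum(responses) / len(responses)

# Hypothetical item answered by 20 students, 15 correctly
item_responses = [True] * 15 + [False] * 5
print(facility_index(item_responses))  # 0.75
```

An index near 1.0 marks an easy item, near 0.0 a difficult one, which is what allows per-category comparison across Bloom's taxonomy levels.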
Results: The evaluated genAI models showed varying accuracy across Bloom’s taxonomy levels. DeepSeek-V3 (0.92) and ChatGPT-4 (1.00) outperformed humans (0.74) in the Remember category, while Qwen 2.5-Max (0.94) and ChatGPT-4 (0.94) surpassed human performance (0.61) in the Understand category. ChatGPT-4 (+23.25%, p < 0.001), DeepSeek-V3 (+18.25%, p = 0.001), and Qwen 2.5-Max (+18.25%, p = 0.001) significantly outperformed human students. Decision tree analysis identified cognitive category as the strongest predictor of genAI accuracy (p < 0.001), with Chinese AI models performing comparably to ChatGPT-4 in lower-order tasks but exhibiting lower accuracy in higher-order domains.
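The percentage gains reported above (e.g. +23.25% for ChatGPT-4) reflect the mean difference between a model's per-item accuracy and the students' per-item facility indices. A minimal sketch of that comparison, using invented facility indices for five items (the study used 60 MCQs; the exact statistical test is not restated here):

```python
import statistics

# Hypothetical per-item facility indices for humans and one genAI model
human = [0.70, 0.55, 0.80, 0.60, 0.75]
model = [0.95, 0.80, 1.00, 0.85, 0.90]

# Per-item accuracy difference (model minus human), averaged over items
diffs = [m - h for m, h in zip(model, human)]
mean_gain = statistics.mean(diffs)
print(f"{mean_gain * 100:+.2f}%")  # +22.00%
```

The paired per-item differences would then feed a significance test to obtain p-values like those reported.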
Conclusions: The findings highlight the growing capabilities of Chinese genAI models in healthcare education, demonstrating that DeepSeek and Qwen can compete with, and in some areas outperform, Western genAI models. However, their relative weakness in higher-order reasoning raises concerns about their ability to fully replace human cognitive processes in clinical decision-making. As genAI becomes increasingly integrated into health education, concerns regarding academic integrity, genAI dependence, and the validity of MCQ-based assessments must be addressed. The study underscores the need for a re-evaluation of medical assessment strategies, ensuring that students develop critical thinking skills rather than relying on genAI for knowledge retrieval.
Description
This paper examines the use of artificial intelligence in healthcare, with a focus on reducing medical negligence, improving patient safety, and strengthening accountability through proper training and regulation. It identifies gaps in legal frameworks, staff capacity, and risk management when new digital tools are introduced in health systems. The study recommends policy and institutional reforms to support the safe, ethical, and transparent use of AI in healthcare delivery. The paper supports SDG 3 on good health and well-being, SDG 4 on quality education, SDG 9 on industry, innovation, and infrastructure, and SDG 16 on peace, justice, and strong institutions. It also aligns with the Uganda National Development Plan IV aspirations on human capital development, digital transformation, improved social services, and stronger governance.
Keywords
AI, Benchmarking, LLM, DeepSeek, Qwen
Citation
Sallam, M., Al-Mahzoum, K., Eid, H., Al-Salahat, K., Sallam, M., Ali, G., & Mijwil, M. M. (2025). Chinese generative AI models challenge western AI in clinical chemistry MCQs: A benchmarking follow-up study on AI use in health education. Babylonian Journal of Artificial Intelligence, 2025, 1-14.