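Evaluating the efficacy of large language models in answering questions from parents of children with congenital lens dislocation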

doi:10.6040/j.issn.1671-7554.0.2025.0262

Abstract

Abstract: Objective: To evaluate the accuracy, completeness, and emotional supportiveness of domestically developed (Chinese) open-source large language models (LLMs) in answering common diagnosis and treatment questions from parents of children with congenital ectopia lentis (CEL), and to explore the feasibility of using LLMs as intelligent health education assistants for these parents. Methods: A question bank of 33 CEL-related diagnosis and treatment questions was constructed. Three senior ophthalmologists specializing in cataract independently rated, under blinded conditions, the answers generated by three LLMs (Kimi chat, Doubao, and DeepSeek-R1) on Likert scales (1-6 for accuracy, 1-3 for completeness and for emotional support). Based on the preliminary results, DeepSeek-R1, the best overall performer, was then evaluated on the full question bank. Results: Among the three LLMs, DeepSeek-R1 performed best. Across all questions, the proportions of its answers meeting the thresholds for accuracy (≥5 points), completeness (≥2 points), and emotional support (≥2 points) were 78.8%, 87.9%, and 69.7%, respectively. The evaluators recommended its answers in 75.8% of cases (150/198). Its responses were strong on treatment, prognosis, and symptoms, but slightly weaker on disease diagnosis. DeepSeek-R1's responses had a higher word count than the human answers (P<0.05), and word count correlated positively with completeness scores (r_s≈0.608, P<0.05). The intraclass correlation coefficient among the three raters was above 0.700 for all ratings, indicating good reliability. Conclusion: DeepSeek-R1 shows high accuracy, completeness, and emotional support in answering CEL-related diagnosis and treatment questions. However, its use for questions about disease diagnosis calls for caution and should take place under professional guidance.

Key words: Congenital ectopia lentis, Large language model, DeepSeek-R1, Health education, Question-answering performance, Generation quality
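The summary statistics named in the abstract (the share of ratings at or above a threshold, and a Spearman rank correlation between answer length and completeness) can be illustrated with a minimal sketch. The abstract does not state which statistical software the authors used, so the function names below are hypothetical and the scores are made-up toy values, not study data. Only the final line uses figures from the abstract (150 recommended out of 198).

```python
from statistics import mean


def proportion_at_least(scores, threshold):
    """Share of scores that meet or exceed the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy ratings (NOT from the study), for illustration only.
accuracy_scores = [6, 5, 4, 5, 6, 3]        # 1-6 Likert scale
word_counts = [320, 410, 150, 500, 280, 610]
completeness = [2, 3, 1, 3, 2, 3]           # 1-3 Likert scale

print(proportion_at_least(accuracy_scores, 5))
print(spearman(word_counts, completeness))

# Figures from the abstract: 150 of 198 recommendations.
print(round(150 / 198 * 100, 1))  # → 75.8
```

The rank step assigns tied values their average rank, which matters here because Likert scores on a 1-3 scale tie heavily.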

CLC number: R776

CHEN Yumeng, ZHANG Yue, ZHANG Wulin, YANG Guoxing, XU Yanhui, HAN Aijun, LIU Caijuan, GUO Yuyu, CHEN Zhimin. Evaluating the efficacy of large language models in answering questions from parents of children with congenital lens dislocation[J]. Journal of Shandong University (Health Sciences), 2026, 64(5): 88-95.

