Decision performance and auxiliary value of large language models in preoperative management of orthopedic surgery

doi:10.6040/j.issn.1671-7554.0.2025.1327

Abstract

Abstract: Objective To explore how well different generation modes of large language models (such as DeepSeek, ChatGPT, etc.) perform in preoperative management and how much they help junior physicians make decisions. Methods The medical records of 100 orthopedic inpatients at Qilu Hospital of Shandong University were randomly selected from January to August 2025. Patients scheduled for Grade I, II, or III surgery and patients scheduled for non-joint-replacement surgery were excluded, leaving 87 patients. Guidelines on perioperative management were retrieved from PubMed and UpToDate. After text processing and vectorization, these guidelines were used to build a perioperative management knowledge base that supplied external knowledge for subsequent model calls and question answering. The anonymized patient records were uploaded to four versions of the DeepSeek model [DeepSeek Chat version (V3), DeepSeek Chat + knowledge base version, DeepSeek Deep Thinking version (R1), and DeepSeek R1 + knowledge base version], and the same questions were posed to each under an identical "Instruction-Context-Input-Output (ICIO)" prompt framework. The model outputs were evaluated both objectively and subjectively. Results The DeepSeek R1 model achieved accuracy rates of 75.86% and 78.16% in the Revised Cardiac Risk Index (RCRI) scoring and risk classification tasks, respectively, significantly outperforming the Chat series models. All four model versions showed moderate accuracy in American Society of Anesthesiologists (ASA) physical status classification and in surgical feasibility judgment, with the R1 version performing slightly better. Adding the knowledge base slightly improved RCRI scoring accuracy in the Chat version only (+4.6%) and reduced performance in the R1 version. In the subjective evaluation, junior physicians generally considered the answers of the R1 series models to be of greater clinical reference value: their average score (4.19±0.72) was significantly higher than that of the Chat series (Chat version: 3.06±0.06; Chat + knowledge base version: 2.97±0.03) (P<0.05), suggesting that the R1 model is more practical and more acceptable for preoperative decision support. Conclusion The DeepSeek R1 model shows good application potential in orthopedic preoperative anesthesia risk assessment and clinical decision support. However, knowledge base construction and task adaptation need further optimization to improve the model's reliability and generalizability in real clinical scenarios.

Key words: Large language model, DeepSeek, Preoperative decision-making, Knowledge base, Revised cardiac risk index score
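
For readers unfamiliar with the RCRI task on which the models were scored, the following is a minimal sketch and not the authors' evaluation code. It assumes the commonly used six-factor form of the index (one point per factor present) and the usual four-class grouping by total points. The field names, the `accuracy` helper, and the sample patient are hypothetical. The counts 66/87 and 68/87 are an inference showing that the reported percentages are consistent with whole-number counts out of 87 patients; the abstract does not state them.

```python
from dataclasses import dataclass, fields


@dataclass
class RcriFactors:
    """Six RCRI risk factors; each one present contributes one point."""
    high_risk_surgery: bool
    ischemic_heart_disease: bool
    congestive_heart_failure: bool
    cerebrovascular_disease: bool
    insulin_treated_diabetes: bool
    creatinine_above_2_mg_dl: bool


def rcri_score(factors: RcriFactors) -> int:
    """Count how many of the six factors are present (0 to 6)."""
    return sum(bool(getattr(factors, f.name)) for f in fields(factors))


def rcri_class(score: int) -> str:
    """Map a point total to a risk class: 0 -> I, 1 -> II, 2 -> III, 3+ -> IV."""
    if not 0 <= score <= 6:
        raise ValueError("RCRI score must be between 0 and 6")
    return ("I", "II", "III", "IV")[min(score, 3)]


def accuracy(predicted: list, reference: list) -> float:
    """Fraction of model answers that match the reference answers."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


if __name__ == "__main__":
    # Hypothetical patient: ischemic heart disease plus insulin-treated diabetes.
    patient = RcriFactors(False, True, False, False, True, False)
    score = rcri_score(patient)
    print(score, rcri_class(score))

    # Assumed counts that reproduce the reported accuracy rates for 87 patients.
    print(f"{66 / 87:.2%}", f"{68 / 87:.2%}")
```

Comparing each model's score and class against a clinician-derived reference with `accuracy` is one straightforward way to obtain accuracy rates of the kind reported above.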

CLC number: R684

WEI Shusheng, WU Haibo, LI Songlin, WEN Zhenlin, YANG Changao, LU Qunshan, LIU Peilai. Decision performance and auxiliary value of large language models in preoperative management of orthopedic surgery[J]. Journal of Shandong University (Health Sciences), 2026, 64(2): 104-110.

