Journal of Shandong University (Health Sciences), 2025, Vol. 63, Issue (8): 1-16. doi: 10.6040/j.issn.1671-7554.0.2025.0568

• Big DataEnabled, AI Foundation ModelDriven Multimodal Cohort Design and Analysis-Expert Review •    

Theoretical and methodological framework for multimodal big data cohort design based on AI language representation

XUE Fuzhong1,2,3   

  1. Department of Medical Dataology, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China;
  2. National Institute of Health and Medical Big Data, Jinan 250003, Shandong, China;
  3. Qilu Hospital of Shandong University, Jinan 250012, Shandong, China
  • Published: 2025-08-25

Abstract: This paper proposes a theoretical and methodological framework for multimodal cohort design based on artificial intelligence (AI) language representation. It departs from the conventional paradigm of epidemiological cohort studies and establishes a model for language-based multimodal integration. Using Transformer-based models, the framework maps heterogeneous medical data (health records, electronic medical records, medical imaging, and genomic information, among others) into a unified low-dimensional embedding space. It is organized around a three-layer architecture of “Digital Omics-Digital Biomarkers-Digital Phenotypes” and introduces key methods including embedding vector generation, causal inference, and multimodal data fusion. The study defines the PICLS criteria for digital biomarkers: predictability, interpretability, computability, latent-variable structure, and stability. Digital phenotypes must additionally satisfy an endpoint criterion, which yields the PICLSE criteria and is intended to ensure their clinical utility in disease prediction and intervention. On the technical side, the paper details the full process of embedding generation, data encoding and decoding, database construction, and biomarker extraction. A case study on scarlet fever surveillance demonstrates how the proposed multimodal embedded cohort can be applied to clinical screening and intelligent early warning. The framework offers a new paradigm for epidemiological cohort research and provides methodological support for advancing precision medicine and smart public health.
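
The idea of mapping several data modalities into one shared embedding space can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract names Transformer-based encoders, whereas this toy uses fixed random linear projections as stand-ins, and every name, dimension, and value below (`make_projection`, `fuse`, `SHARED_DIM`, the modality names, and the feature vectors) is a hypothetical assumption chosen only for illustration.

```python
import math
import random

SHARED_DIM = 4  # assumed size of the shared embedding space (illustrative only)


def make_projection(in_dim, out_dim, seed):
    """Hypothetical stand-in for a trained encoder head: a fixed random linear map."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Multiply an (out_dim x in_dim) matrix by a vector of length in_dim."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fuse(modality_features, projections):
    """Project each modality into the shared space, normalize, then average."""
    projected = [
        l2_normalize(project(projections[name], features))
        for name, features in modality_features.items()
    ]
    return [sum(v[i] for v in projected) / len(projected) for i in range(SHARED_DIM)]


# Toy per-modality feature sizes (assumed, not from the paper).
MODALITY_DIMS = {"ehr_text": 8, "imaging": 6, "genomics": 5}
PROJECTIONS = {
    name: make_projection(dim, SHARED_DIM, seed=index)
    for index, (name, dim) in enumerate(MODALITY_DIMS.items())
}

# One hypothetical participant with made-up feature vectors.
participant = {
    "ehr_text": [0.1 * k for k in range(8)],
    "imaging": [1.0, 0.0, 0.5, 0.25, 0.0, 0.75],
    "genomics": [0.0, 1.0, 2.0, 1.0, 0.0],
}

embedding = fuse(participant, PROJECTIONS)
print(len(embedding))
```

Averaging normalized projections is only one simple fusion choice; it keeps each modality's contribution on the same scale, but it does not capture the cross-modal interactions that attention-based fusion would.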

Key words: AI language representation, Multimodal cohort, Digital omics, Digital biomarkers, Digital phenotypes, PICLS/PICLSE criteria

CLC Number: R181.2+3