Journal of Shandong University (Health Sciences) ›› 2025, Vol. 63 ›› Issue (8): 51-60. doi: 10.6040/j.issn.1671-7554.0.2025.0510

• Clinical Research •

Cancer subtype clustering via multimodal decoupled contrastive learning

ZHANG Runze1, XUE Fuzhong1,2,3, YANG Fan1,2,3

  1. Department of Medical Dataology, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250012, Shandong, China;
    2. National Institute of Health and Medical Big Data, Jinan 250003, Shandong, China;
    3. Qilu Hospital of Shandong University, Jinan 250012, Shandong, China
  • Published: 2025-08-25
  • Corresponding authors: YANG Fan. E-mail: fanyang@sdu.edu.cn; XUE Fuzhong. E-mail: xuefzh@sdu.edu.cn
  • Funding:
    National Natural Science Foundation of China (82273736, 62272278)

Abstract: Objective To propose a cancer subtype clustering model that integrates graph convolutional networks (GCN), a self-attention mechanism, and decoupled contrastive learning, based on multi-omics data for five cancer types from The Cancer Genome Atlas (TCGA). Methods The model took four types of omics data for five cancer types from the TCGA database as input. For each omics type, a sample-to-sample relational graph was constructed, and a GCN was used to extract intra-omics structural information so that feature differences between samples were better preserved. The features from the different omics were concatenated and then fused by attention weighting, which automatically learned the relative importance of the omics modalities and the complementary relationships among them. Finally, decoupled contrastive learning was applied: different augmented views of the same sample were used for unsupervised training, guiding the model to identify potential cancer subtypes in the absence of ground-truth labels. Results The model showed good clustering performance on all five cancer datasets and effectively divided samples into distinct subtypes. In survival analysis, the survival curves of the subtypes were significantly separated, indicating that the identified subtypes differed in prognosis. Some subtypes also differed markedly in clinical characteristics. Compared with several existing methods, the proposed model achieved favorable results on multiple evaluation metrics, produced more stable clustering results, and showed stronger biological interpretability. Conclusion The proposed cancer subtype clustering model effectively integrates multi-omics data through the combined use of GCN, a self-attention mechanism, and contrastive learning. It significantly improves the accuracy and clinical interpretability of cancer subtype clustering, offers a new perspective for research on cancer heterogeneity, and can support the development of personalized treatment strategies in precision medicine.
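
The first two steps named in the Methods (a per-omics sample graph passed through a GCN, then attention-weighted fusion across omics) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The k-nearest-neighbour graph built on cosine similarity, the single untrained GCN layer, the weight matrices, the `query` vector, and the toy data are all assumptions made for illustration, and the fusion here is a simplified softmax-weighted sum rather than the concatenate-then-attend design the abstract describes.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_adjacency(features, k):
    """Symmetric 0/1 graph linking each sample to its k most similar samples, plus self-loops."""
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # self-loop keeps a sample's own features in the aggregation
        ranked = sorted(
            ((cosine_similarity(features[i], features[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for _, j in ranked[:k]:
            adj[i][j] = adj[j][i] = 1.0
    return adj


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; degrees are positive because of self-loops."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(a_hat, x, w):
    """One graph-convolution step: ReLU(A_hat X W)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(a_hat, x), w)]


def attention_fuse(embeddings, query):
    """Softmax-weighted sum of one sample's per-omics embeddings (simplified attention)."""
    scores = [sum(e * q for e, q in zip(emb, query)) for emb in embeddings]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(embeddings[0])
    fused = [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]
    return fused, weights


# Toy data: 4 samples, two hypothetical omics types with 3 and 2 features.
omics_a = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.0, 0.2, 0.9]]
omics_b = [[0.2, 0.8], [0.1, 0.9], [0.9, 0.1], [0.8, 0.2]]
w_a = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # illustrative, untrained weights
w_b = [[1.0, 0.0], [0.0, 1.0]]

adj_a = knn_adjacency(omics_a, k=1)
hidden_a = gcn_layer(normalize_adjacency(adj_a), omics_a, w_a)
hidden_b = gcn_layer(normalize_adjacency(knn_adjacency(omics_b, k=1)), omics_b, w_b)

# Fuse the two omics embeddings of sample 0 with a hypothetical query vector.
fused, weights = attention_fuse([hidden_a[0], hidden_b[0]], query=[1.0, 1.0])
```

In a trained model the weight matrices and the attention parameters would be learned jointly with the clustering objective; here they are fixed only so the data flow is visible.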

Key words: Cancer subtype clustering, Multi-omics, Graph convolutional network, Self-attention mechanism, Decoupled contrastive learning
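
The decoupled contrastive learning step listed in the key words can be sketched in the same spirit. The abstract does not give the loss, so the block below assumes the generic decoupled form, in which the positive pair is left out of the denominator of the usual InfoNCE objective. For anchor $z_i$ with positive $z_i'$ in the other view, the per-sample loss is $-\mathrm{sim}(z_i, z_i')/\tau + \log \sum_{k \neq i} [\exp(\mathrm{sim}(z_i, z_k)/\tau) + \exp(\mathrm{sim}(z_i, z_k')/\tau)]$. The function name, the temperature value, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def decoupled_contrastive_loss(view_a, view_b, temperature=0.5):
    """Mean decoupled contrastive loss over two augmented views of the same n >= 2 samples.

    view_a[i] and view_b[i] are embeddings of two augmentations of sample i.
    The positive pair is excluded from the log-sum of negatives (the "decoupling").
    """
    n = len(view_a)
    total = 0.0
    for anchors, others in ((view_a, view_b), (view_b, view_a)):
        for i in range(n):
            positive = cosine(anchors[i], others[i]) / temperature
            negatives = 0.0
            for k in range(n):
                if k == i:
                    continue  # skip the anchor itself and its positive
                negatives += math.exp(cosine(anchors[i], anchors[k]) / temperature)
                negatives += math.exp(cosine(anchors[i], others[k]) / temperature)
            total += -positive + math.log(negatives)
    return total / (2 * n)


# Toy check: views that agree sample-by-sample should score lower than mismatched views.
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
shuffled_b = [view_b[1], view_b[2], view_b[0]]

aligned_loss = decoupled_contrastive_loss(view_a, view_b)
mismatched_loss = decoupled_contrastive_loss(view_a, shuffled_b)
```

Minimizing this quantity pulls the two views of each sample together and pushes different samples apart without using subtype labels, which is the role the abstract assigns to the contrastive stage.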

CLC Number: 

  • R730.43