Journal of Shandong University (Health Sciences), 2016, Vol. 54, Issue 4: 89-93. doi: 10.6040/j.issn.1671-7554.0.2015.1186
GONG Xiaoyun (公晓云)1, SHEN Xiaotao (申小涛)2, XU Jing (徐静)1, ZHANG Tao (张涛)1, ZHU Zhengjiang (朱正江)2, XUE Fuzhong (薛付忠)1
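Abstract: Objective To explore how the classification accuracy of statistical classification methods changes as the number of normally distributed variables in metabolomics data gradually increases. Methods Eleven simulated metabolomics datasets were first generated, with the number of normally distributed variables increasing from one dataset to the next. The data were then analyzed with traditional non-machine-learning statistical methods [Bayes discriminant analysis, Fisher discriminant analysis, and partial least squares discriminant analysis (PLS-DA)] and with machine learning methods [random forest (RF) and support vector machine (SVM)], and the changes in classification accuracy were compared. Finally, two real-data examples were analyzed to assess whether the simulation results were reasonable. Results The normality of metabolomics data had a considerable effect on the results of Bayes discriminant analysis, Fisher discriminant analysis, and PLS-DA: classification accuracy rose as the number of normally distributed variables increased. Normality had essentially no effect on RF and SVM. Conclusion Traditional non-machine-learning methods place some requirement on data normality during statistical analysis, whereas machine learning methods place essentially no such requirement and maintain a consistently high and stable classification accuracy.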
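The simulation design described in the Methods can be illustrated with a minimal sketch. It is not the authors' procedure: the sample size, the number of variables, the log-normal choice for non-normal variables, the class shift, and every function name below are assumptions made for illustration, and a simple Gaussian naive Bayes rule stands in for the normality-dependent discriminant methods. It builds 11 two-class datasets in which 0 to 10 of 10 variables are normally distributed and records test accuracy for each.

```python
import math
import random


def simulate_dataset(n_per_class, n_vars, n_normal, shift=1.0, seed=0):
    """Two-class data: the first n_normal variables are Gaussian,
    the remaining ones are log-normal (an assumed form of non-normality)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = []
            for j in range(n_vars):
                z = rng.gauss(label * shift, 1.0)
                # Exponentiating a Gaussian draw yields a right-skewed value.
                row.append(z if j < n_normal else math.exp(z))
            X.append(row)
            y.append(label)
    return X, y


def fit_gaussian_nb(X, y):
    """Estimate per-class mean, variance, and prior (variables treated as independent)."""
    params = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(c) / n for c in cols]
        variances = [sum((v - m) ** 2 for v in c) / n + 1e-9
                     for c, m in zip(cols, means)]
        params[label] = (means, variances, n / len(y))
    return params


def predict(params, x):
    """Return the class with the largest Gaussian log-posterior."""
    def log_post(label):
        means, variances, prior = params[label]
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(params, key=log_post)


def accuracy(n_normal, n_vars=10, seed=0):
    """Train on one simulated sample and score on an independent one."""
    X_train, y_train = simulate_dataset(50, n_vars, n_normal, seed=seed)
    X_test, y_test = simulate_dataset(50, n_vars, n_normal, seed=seed + 1)
    params = fit_gaussian_nb(X_train, y_train)
    hits = sum(predict(params, x) == t for x, t in zip(X_test, y_test))
    return hits / len(y_test)


# Eleven datasets: 0, 1, ..., 10 normally distributed variables out of 10.
results = {k: accuracy(k) for k in range(11)}
for k, acc in results.items():
    print(f"normal variables: {k:2d}  accuracy: {acc:.2f}")
```

Repeating the loop with a random forest or SVM implementation from a library of choice would give the comparison the abstract describes. The accuracies this toy setup produces depend on the assumed parameters and should not be read as reproducing the paper's results.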