
Journal of Shandong University (Health Sciences), 2016, Vol. 54, Issue (4): 89-93. doi: 10.6040/j.issn.1671-7554.0.2015.1186


Influence of normality of metabolomics data on the classification accuracy of diseases

GONG Xiaoyun1, SHEN Xiaotao2, XU Jing1, ZHANG Tao1, ZHU Zhengjiang2, XUE Fuzhong1

  1. Department of Biostatistics, School of Public Health, Shandong University, Jinan 250012, Shandong, China;
  2. Interdisciplinary Research Center on Biology and Chemistry, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, Shanghai 200120, China
  • Received: 2015-11-29 Online: 2016-04-10 Published: 2016-04-10
  • Corresponding author: XUE Fuzhong. E-mail: xuefzh@sdu.edu.cn
  • Funding: Shandong Province Postdoctoral Innovation Project (201302032)

Abstract: Objective To investigate how the classification accuracy of statistical classification methods changes as the number of normally distributed variables in metabolomics data gradually increases. Methods First, 11 metabolomics datasets were simulated in which the number of normally distributed variables increased step by step. Second, five methods were applied and their classification accuracy was compared: the traditional non-machine-learning methods Bayes discrimination, Fisher discrimination, and partial least squares discriminant analysis (PLS-DA), and the machine learning methods random forest (RF) and support vector machine (SVM). Finally, the plausibility of the simulation results was evaluated with two real datasets. Results The normality of the metabolomics data strongly influenced the results of Bayes discrimination, Fisher discrimination, and PLS-DA: their classification accuracy increased as the number of normally distributed variables increased. Normality had essentially no effect on the classification accuracy of RF and SVM. Conclusion The traditional non-machine-learning methods (Bayes discrimination, Fisher discrimination, and PLS-DA) place some requirements on data normality, whereas the machine learning methods (RF and SVM) place essentially none, and their classification accuracy remained consistently high and stable.

Key words: Data Normality, Classification Accuracy, Bayes Discrimination, Fisher Discrimination, Partial Least Squares Discriminant Analysis, Random Forest, Support Vector Machine
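
The simulation design described in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and everything specific in it is an assumption made for illustration: 10 variables (so that 0 to 10 normal variables gives 11 datasets), log-normal values standing in for the non-normal variables, a mean shift of 0.5 between the two classes, 100 samples per class, and a diagonal-covariance Gaussian Bayes classifier standing in for Bayes discrimination. The function names `simulate`, `fit_gaussian_bayes`, `predict`, and `accuracy` are hypothetical.

```python
import math
import random
import statistics


def simulate(n_per_class, n_vars, n_normal, shift, rng):
    """Two-class data: the first n_normal variables are Gaussian,
    the remaining ones are log-normal (skewed). Assumed design."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = []
            for j in range(n_vars):
                z = rng.gauss(label * shift, 1.0)
                row.append(z if j < n_normal else math.exp(z))
            X.append(row)
            y.append(label)
    return X, y


def fit_gaussian_bayes(X, y):
    """Estimate a mean and standard deviation per class and per variable."""
    params = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        params[label] = [(statistics.fmean(c), statistics.stdev(c)) for c in cols]
    return params


def predict(params, x):
    """Pick the class with the larger Gaussian log-likelihood.
    Equal class priors are assumed because the simulated classes are balanced."""
    def log_lik(stats):
        return sum(-math.log(s) - 0.5 * ((v - m) / s) ** 2
                   for v, (m, s) in zip(x, stats))
    return max(params, key=lambda label: log_lik(params[label]))


def accuracy(n_normal, n_vars=10, seed=0):
    """Train on one simulated dataset and score on an independent one."""
    rng = random.Random(seed)
    X_train, y_train = simulate(100, n_vars, n_normal, 0.5, rng)
    X_test, y_test = simulate(100, n_vars, n_normal, 0.5, rng)
    params = fit_gaussian_bayes(X_train, y_train)
    hits = sum(predict(params, x) == t for x, t in zip(X_test, y_test))
    return hits / len(y_test)


if __name__ == "__main__":
    # 11 datasets: 0, 1, ..., 10 normally distributed variables out of 10.
    for k in range(11):
        print(k, accuracy(k))
```

Running the script prints one accuracy per dataset. A single seed gives a noisy estimate, so any comparison across datasets would need repeated simulations; the sketch shows only the structure of the experiment, not the reported results.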

CLC number: R195.1