JOURNAL OF SHANDONG UNIVERSITY (HEALTH SCIENCES) ›› 2016, Vol. 54 ›› Issue (4): 89-93. doi: 10.6040/j.issn.1671-7554.0.2015.1186


Influence of normality of metabolomics data on the classification accuracy of diseases

GONG Xiaoyun1, SHEN Xiaotao2, XU Jing1, ZHANG Tao1, ZHU Zhengjiang2, XUE Fuzhong1   

  1. Department of Biostatistics, School of Public Health, Shandong University, Jinan 250012, Shandong, China;
    2. Interdisciplinary Research Center on Biology and Chemistry, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, Shanghai 200120, China
  • Received: 2015-11-29 Online: 2016-04-10 Published: 2016-04-10

Abstract: Objective To investigate how classification accuracy changes as the number of normally distributed variables increases. Methods First, 11 metabolomics datasets were simulated in which the number of normally distributed variables increased gradually. Second, the classification accuracy of 5 statistical methods was compared: the traditional methods Bayes discrimination, Fisher discrimination, and partial least squares discriminant analysis (PLS-DA), and the machine learning methods random forest (RF) and support vector machine (SVM). Finally, the plausibility of the simulation results was evaluated with 2 real datasets. Results The normality of metabolomics data influenced the classification accuracy of Bayes discrimination, Fisher discrimination, and PLS-DA, and the accuracy of these methods increased with the number of normally distributed variables. In contrast, normality had no discernible effect on the classification accuracy of SVM and RF. Conclusion Traditional statistical methods such as Bayes discrimination, Fisher discrimination, and PLS-DA depend on the normality of metabolomics data, whereas machine learning methods such as SVM and RF depend little on it and can produce higher and more stable classification accuracy.
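The simulation design summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and several details are assumptions made only for illustration: the step size (0% to 100% normal variables in 10% steps), the non-normal shape (log-normal), the sample sizes, the class shift, and the use of a Gaussian naive Bayes rule as a simple stand-in for Bayes discrimination. All function names (`simulate_dataset`, `fit_gaussian_nb`, `predict`, `holdout_accuracy`, `run_experiment`) are hypothetical.

```python
import math
import random
import statistics


def simulate_dataset(n_normal, n_vars=20, n_per_class=60, shift=0.8, rng=None):
    """Two-class data: the first n_normal variables are Gaussian, the
    remaining ones are log-normal (an assumed non-normal shape)."""
    if rng is None:
        rng = random.Random(0)
    X, y = [], []
    for label in (0, 1):
        mu = shift * label  # class 1 is shifted on the underlying Gaussian scale
        for _ in range(n_per_class):
            row = []
            for j in range(n_vars):
                z = rng.gauss(mu, 1.0)
                row.append(z if j < n_normal else math.exp(z))
            X.append(row)
            y.append(label)
    return X, y


def fit_gaussian_nb(X, y):
    """Per-class mean and variance of every variable (equal priors assumed)."""
    params = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        params[label] = [(statistics.fmean(c), statistics.variance(c)) for c in cols]
    return params


def predict(params, x):
    """Return the class with the largest Gaussian log-likelihood for x."""
    def loglik(stats):
        return sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, (m, v) in zip(x, stats)
        )
    return max(params, key=lambda label: loglik(params[label]))


def holdout_accuracy(X, y, rng, test_fraction=0.3):
    """Accuracy on a random hold-out split (the split scheme is an assumption)."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    params = fit_gaussian_nb([X[i] for i in train], [y[i] for i in train])
    hits = sum(predict(params, X[i]) == y[i] for i in test)
    return hits / n_test


def run_experiment(seed=1, n_vars=20):
    """Simulate 11 datasets with 0%, 10%, ..., 100% normal variables."""
    rng = random.Random(seed)
    results = []
    for step in range(11):
        n_normal = n_vars * step // 10
        X, y = simulate_dataset(n_normal, n_vars=n_vars, rng=rng)
        results.append((n_normal, holdout_accuracy(X, y, rng)))
    return results


if __name__ == "__main__":
    for n_normal, acc in run_experiment():
        print(f"{n_normal:2d} normal variables: accuracy {acc:.3f}")
```

A single hold-out split is noisy, so any trend in one run of this sketch should not be read as a reproduction of the reported results; repeating the experiment over many seeds and averaging would be the natural next step.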

Key words: Data Normality, Classification Accuracy, Bayes Discrimination, Fisher Discrimination, Random Forest, Partial Least Squares Discrimination Analysis, Support Vector Machine

CLC Number: R195.1