Collect. Czech. Chem. Commun. 2011, 76, 243-264
Published online 2011-03-08 09:10:42

Use of advanced statistical learning methods and principal component analysis in quantitative structure–genotoxicity relationship study of amines

Yueying Ren a,b,*, Baowei Zhao a,b and Xiaojun Yao c

a Engineering Research Center for Cold and Arid Regions Water Resource Comprehensive Utilization, Ministry of Education, Lanzhou 730070, China
b School of Environmental and Municipal Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
c Department of Chemistry, Lanzhou University, Lanzhou 730000, China


This paper highlights the use of advanced nonlinear modeling and subset selection techniques in constructing a predictive model for the genotoxicity of amines. The essential requirements for a reliable model were each considered carefully. Chemicals were represented by a large number of CODESSA descriptors. The full sample was divided into a training set and a test set by principal component analysis (PCA). Six descriptors selected by the best multi-linear regression (BMLR) method in the CODESSA program were used as inputs to build nonlinear models with advanced statistical learning methods such as support vector machine (SVM) and projection pursuit regression (PPR). The models were validated in three ways: internal cross-validation (CV), a test set, and an independent validation set. The analysis shows that the nonlinear models produced better results than the linear models and that the PPR model outperformed the rest, in the following order: PPR > SVM > linear SVM ≥ BMLR. In addition, the relationships between the descriptors and the mutagenic behavior of the compounds are discussed.
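
The workflow summarized above (PCA-guided division into training and test sets, then linear and nonlinear models compared by cross-validation and a held-out test set) can be illustrated with a minimal Python sketch using scikit-learn. Everything in it is an assumption made for illustration: the data are synthetic rather than CODESSA descriptors, the helper `pca_split` is hypothetical and only one plausible way to use PCA scores for splitting, ordinary least squares stands in for the BMLR model, and PPR and the independent validation set are omitted because scikit-learn provides no PPR estimator.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the paper's data): 100 "compounds" x 6 "descriptors"
# with a deliberately nonlinear response.
X = rng.normal(size=(100, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=100)


def pca_split(X, test_fraction=0.25):
    """Hypothetical helper: rank samples by their first principal component
    score and send every k-th one to the test set, so that both sets cover
    the same region of descriptor space."""
    scaled = StandardScaler().fit_transform(X)
    scores = PCA(n_components=1).fit_transform(scaled)[:, 0]
    order = np.argsort(scores)
    step = int(round(1 / test_fraction))
    test_idx = order[step // 2 :: step]
    train_idx = np.setdiff1d(order, test_idx)
    return train_idx, test_idx


train_idx, test_idx = pca_split(X)

# Linear baseline (stand-in for a multi-linear model) versus an RBF-kernel SVM.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "svm_rbf": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}

results = {}
for name, model in models.items():
    # Internal validation: 5-fold cross-validated R^2 on the training set.
    cv_r2 = cross_val_score(
        model, X[train_idx], y[train_idx], cv=5, scoring="r2"
    ).mean()
    # External validation: R^2 on the held-out test set.
    model.fit(X[train_idx], y[train_idx])
    test_r2 = model.score(X[test_idx], y[test_idx])
    results[name] = (cv_r2, test_r2)
    print(f"{name}: CV R2 = {cv_r2:.3f}, test R2 = {test_r2:.3f}")
```

Selecting every k-th sample along the first principal component is chosen here only because it keeps the test set inside the descriptor range of the training set; the splitting procedure actually used in the paper may differ.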

Keywords: Quantitative structure–genotoxicity relationship; Amine; Principal component analysis; Support vector machine; Projection pursuit regression.

References: 51 live references.