Maximal Information Coefficient and Support Vector Regression Based Nonlinear Feature Selection and QSAR Modeling on Toxicity of Alcohol Compounds to Tadpoles of Rana temporaria

Bibliographic details
Main author: Wang, Lifeng
Publication date: 2019
Other authors: Xing, Pengwei; Wang, Cong; Zhou, Xiaomao; Dai, Zhijun; Bai, Lianyang
Document type: Article
Language: eng
Source title: Journal of the Brazilian Chemical Society (Online)
Full text: http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532019000200279
Abstract: Efficient evaluation of the biotoxicity of organic compounds is vital to resource utilization and environmental protection. In this study, the toxicity of 110 alcohol compounds to tadpoles of Rana temporaria is taken as the dependent variable, and each compound is represented by 1388 physicochemical parameters (features) calculated with PCLIENT. A three-step feature selection pipeline refines the feature subset: 282 features significantly correlated with biotoxicity are first selected via the maximal information coefficient (MIC); 138 descriptors that contribute positively to model performance are retained after a support vector regression (SVR) based backward elimination; and 18 descriptors are finally selected by a forward selection process that integrates minimal redundancy maximal relevance (mRMR), MIC and SVR. For feature subsets with different numbers of variables, quantitative structure-activity relationship (QSAR) models are built using multiple linear regression (MLR), partial least squares regression (PLS) and SVR. The independent prediction evaluation index, Q2, increases from -74.787, 0.824 and 0.868 to 0.892, 0.878 and 0.940 for the three regression models, respectively. The results suggest that nonlinear feature selection based on MIC and SVR can effectively eliminate irrelevant descriptors, and that SVR outperforms classical statistical models for QSAR modeling on high-dimensional data with nonlinear relationships between features. The proposed methods have potential applications in QSAR research, for example in modeling the biotoxicity of compounds.
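The abstract reports an independent-prediction Q2 as low as -74.787 for MLR on the unreduced descriptor set, which can look surprising next to values near 0.9. The record does not state the formula used, so the sketch below assumes one common external-validation definition, Q2 = 1 - PRESS / SS, with SS taken about the training-set mean. The function name `q2_external` and all numbers are hypothetical and are not data from the paper. It shows only that Q2 has no lower bound: predictions that err by more than the spread of the observed values drive it far below zero.

```python
from statistics import fmean


def q2_external(y_true, y_pred, y_train_mean):
    """Assumed definition: 1 - PRESS / SS, with SS about the training-set mean."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # PRESS: sum of squared prediction errors on the independent test set.
    press = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # SS: squared deviations of the test observations from the training mean.
    ss = sum((yt - y_train_mean) ** 2 for yt in y_true)
    if ss == 0:
        raise ValueError("test observations all equal the training mean")
    return 1.0 - press / ss


# Hypothetical toy values, chosen only to illustrate the sign of Q2.
y_train = [1.0, 2.0, 3.0, 4.0]
y_test = [1.5, 2.5, 3.5]
good_pred = [1.4, 2.6, 3.4]    # small errors: Q2 close to 1
bad_pred = [10.0, -8.0, 12.0]  # errors far larger than the data spread

train_mean = fmean(y_train)
q2_good = q2_external(y_test, good_pred, train_mean)
q2_bad = q2_external(y_test, bad_pred, train_mean)
print(f"Q2 (good predictions): {q2_good:.3f}")
print(f"Q2 (bad predictions):  {q2_bad:.3f}")
```

Under this assumed definition, a strongly negative Q2 means the model predicts the test compounds worse than simply guessing the training mean. That would be consistent with a linear model overfitting when descriptors far outnumber the 110 compounds.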
id SBQ-2_7b0bc76029603e82c352bef72bdfa9d7
oai_identifier_str oai:scielo:S0103-50532019000200279
network_acronym_str SBQ-2
network_name_str Journal of the Brazilian Chemical Society (Online)
repository_id_str
spelling http://jbcs.sbq.org.br
https://old.scielo.br/oai/scielo-oai.php
ISSN 1678-4790, 0103-5053
2019-01-14T00:00:00Z
dc.title.none.fl_str_mv Maximal Information Coefficient and Support Vector Regression Based Nonlinear Feature Selection and QSAR Modeling on Toxicity of Alcohol Compounds to Tadpoles of Rana temporaria
dc.contributor.author.fl_str_mv Wang, Lifeng
Xing, Pengwei
Wang, Cong
Zhou, Xiaomao
Dai, Zhijun
Bai, Lianyang
dc.subject.por.fl_str_mv alcohol compounds
Rana temporaria
feature selection
support vector regression (SVR)
quantitative structure-activity relationship (QSAR)
publishDate 2019
dc.date.none.fl_str_mv 2019-02-01
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.identifier.uri.fl_str_mv http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532019000200279
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 10.21577/0103-5053.20180176
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv text/html
dc.publisher.none.fl_str_mv Sociedade Brasileira de Química
dc.source.none.fl_str_mv Journal of the Brazilian Chemical Society v.30 n.2 2019
reponame:Journal of the Brazilian Chemical Society (Online)
instname:Sociedade Brasileira de Química (SBQ)
instacron:SBQ
repository.name.fl_str_mv Journal of the Brazilian Chemical Society (Online) - Sociedade Brasileira de Química (SBQ)
repository.mail.fl_str_mv office@jbcs.sbq.org.br
_version_ 1750318181344346112