Maximal Information Coefficient and Support Vector Regression Based Nonlinear Feature Selection and QSAR Modeling on Toxicity of Alcohol Compounds to Tadpoles of Rana temporaria
Main author: | Wang, Lifeng |
---|---|
Publication date: | 2019 |
Other authors: | Xing, Pengwei; Wang, Cong; Zhou, Xiaomao; Dai, Zhijun; Bai, Lianyang |
Document type: | Article |
Language: | eng |
Source title: | Journal of the Brazilian Chemical Society (Online) |
Full text: | http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532019000200279 |
Abstract: | Efficient evaluation of the biotoxicity of organic compounds is vital to resource utilization and environmental protection. In this study, the toxicity of 110 alcohol compounds to tadpoles of Rana temporaria is taken as the dependent variable, and each compound is represented by 1388 physicochemical parameters (features) calculated with PCLIENT. A three-step feature selection pipeline is developed to refine the feature subset: first, 282 features that correlate significantly with biotoxicity are preselected using the maximal information coefficient (MIC); second, 138 descriptors that contribute positively to model performance are retained after a support vector regression (SVR) based backward elimination; third, 18 descriptors are selected by a forward selection process that integrates minimal redundancy maximal relevance (mRMR), MIC and SVR. For feature subsets of different sizes, quantitative structure-activity relationship (QSAR) models are built using multiple linear regression (MLR), partial least squares regression (PLS) and SVR. The independent prediction evaluation index, Q2, increases from -74.787, 0.824 and 0.868 to 0.892, 0.878 and 0.940 for MLR, PLS and SVR, respectively. The results suggest that nonlinear feature selection based on MIC and SVR can effectively eliminate irrelevant descriptors, and that SVR outperforms classical statistical models for QSAR modeling on high-dimensional data with nonlinear relationships between features. The proposed methods have potential applications in QSAR research, such as predicting the biotoxicity of compounds. |
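
The abstract reports a Q2 as low as -74.787 for MLR on the larger descriptor set. A minimal sketch of how such a value can arise, assuming Q2 is the common predictive squared correlation, 1 - PRESS / SS_total. The record does not state the exact definition used, so the function name `q2`, the `reference_mean` parameter, and the sample values below are illustrative assumptions, not taken from the study.

```python
from typing import Optional, Sequence


def q2(y_true: Sequence[float], y_pred: Sequence[float],
       reference_mean: Optional[float] = None) -> float:
    """Predictive squared correlation: 1 - PRESS / SS_total.

    reference_mean is the baseline the model is compared against. It
    defaults to the mean of y_true (an assumption; some QSAR work uses
    the training-set mean instead).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    if reference_mean is None:
        reference_mean = sum(y_true) / len(y_true)
    # PRESS: squared prediction errors on the evaluation samples.
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # SS_total: squared deviations from the baseline (mean) predictor.
    ss_total = sum((t - reference_mean) ** 2 for t in y_true)
    if ss_total == 0:
        raise ValueError("y_true has no variance around the reference mean")
    return 1.0 - press / ss_total


# Made-up toxicity values for illustration only (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
good_fit = [1.1, 1.9, 3.2, 3.8]
wild_fit = [10.0, -8.0, 15.0, -6.0]

print(q2(observed, good_fit))  # close to 1
print(q2(observed, wild_fit))  # far below 0
```

Under this definition Q2 has no lower bound: a model whose predictions are worse than simply predicting the mean scores below zero, which is consistent with an overfitted linear model on many more descriptors than compounds.
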
id |
SBQ-2_7b0bc76029603e82c352bef72bdfa9d7 |
---|---|
oai_identifier_str |
oai:scielo:S0103-50532019000200279 |
network_acronym_str |
SBQ-2 |
network_name_str |
Journal of the Brazilian Chemical Society (Online) |
spelling |
Journal; http://jbcs.sbq.org.br; NGO; https://old.scielo.br/oai/scielo-oai.php; office@jbcs.sbq.org.br; 1678-4790; 0103-5053; opendoar:2019-01-14T00:00; 2019-01-14T00:00:00Z |
author |
Wang, Lifeng |
author_facet |
Wang, Lifeng; Xing, Pengwei; Wang, Cong; Zhou, Xiaomao; Dai, Zhijun; Bai, Lianyang |
dc.subject.por.fl_str_mv |
alcohol compounds; Rana temporaria; feature selection; support vector regression (SVR); quantitative structure-activity relationship (QSAR) |
dc.date.none.fl_str_mv |
2019-02-01 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.relation.none.fl_str_mv |
10.21577/0103-5053.20180176 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
dc.format.none.fl_str_mv |
text/html |
dc.publisher.none.fl_str_mv |
Sociedade Brasileira de Química |
dc.source.none.fl_str_mv |
Journal of the Brazilian Chemical Society v.30 n.2 2019; reponame:Journal of the Brazilian Chemical Society (Online); instname:Sociedade Brasileira de Química (SBQ); instacron:SBQ |
instname_str |
Sociedade Brasileira de Química (SBQ) |
instacron_str |
SBQ |
repository.name.fl_str_mv |
Journal of the Brazilian Chemical Society (Online) - Sociedade Brasileira de Química (SBQ) |
repository.mail.fl_str_mv |
office@jbcs.sbq.org.br |