Basic validation procedures for regression models in QSAR and QSPR studies: theory and application
Main author: | Kiralj, Rudolf |
---|---|
Publication date: | 2009 |
Other authors: | Ferreira, Márcia M. C. |
Document type: | Article |
Language: | eng |
Source title: | Journal of the Brazilian Chemical Society (Online) |
Full text: | http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021 |
Abstract: | Four quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) data sets were selected from the literature and used to build regression models with 75, 56, 50 and 15 training samples. The models were validated by leave-one-out crossvalidation, leave-N-out crossvalidation (LNO), external validation, y-randomization and bootstrapping. The validations showed that the size of the training set is the crucial factor in determining model performance, which deteriorates as the data set becomes smaller. Models from very small data sets cannot be thoroughly validated, and they fail or behave atypically in certain validations (chance correlation, lack of robustness to resampling and LNO), regardless of their good performance in leave-one-out crossvalidation, fitting and even external validation. A simple determination of the critical N in LNO has been introduced by using the limit of 0.1 for oscillations in Q², quantified as the variation range in single LNO and two standard deviations in multiple LNO. It has been demonstrated that 10-25 y-randomization and bootstrap runs are sufficient for a typical model validation. Bootstrap schemes based on hierarchical cluster analysis give more reliable and reasonable results than bootstraps relying only on randomization of the complete data set. Data quality, in terms of the statistical significance of descriptor-y relationships, is the second most important factor for model performance. Variable selection that does not eliminate insignificant descriptor-y relationships may lead to situations in which they are not detected during model validation, especially when dealing with large data sets. |
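The LNO stability rule and the y-randomization check named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-descriptor ordinary least squares model as a stand-in for the regression methods used in the article, defines Q² as 1 - PRESS / SS around the overall mean of y, and reads the "critical N" as the largest N up to which two standard deviations of Q² over repeated LNO runs stay within 0.1. The function names (`fit_line`, `q2_lno`, `critical_n`, `y_randomization`) and all default parameters are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (one descriptor; an assumption)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def q2_lno(x, y, n_out, rng):
    """Q^2 from one leave-N-out pass over a random partition of the samples."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    press = 0.0
    for start in range(0, len(idx), n_out):
        held = set(idx[start:start + n_out])
        train = [i for i in idx if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        # Accumulate squared prediction errors for the held-out samples.
        press += sum((y[i] - (a + b * x[i])) ** 2 for i in held)
    y_mean = statistics.fmean(y)
    ss = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / ss


def critical_n(x, y, max_n, repeats=10, limit=0.1, seed=0):
    """Largest N for which 2*std of Q^2 over repeated LNO stays within `limit`."""
    rng = random.Random(seed)
    crit = 0
    for n in range(1, max_n + 1):
        q2s = [q2_lno(x, y, n, rng) for _ in range(repeats)]
        if 2 * statistics.stdev(q2s) > limit:
            break
        crit = n
    return crit


def y_randomization(x, y, runs=10, seed=0):
    """Compare leave-one-out Q^2 of the real y with Q^2 for shuffled y vectors."""
    rng = random.Random(seed)
    q2_real = q2_lno(x, y, 1, rng)
    q2_rand = []
    for _ in range(runs):
        y_perm = list(y)
        rng.shuffle(y_perm)
        q2_rand.append(q2_lno(x, y_perm, 1, rng))
    return q2_real, q2_rand


if __name__ == "__main__":
    # Synthetic data for illustration only; not taken from the article.
    gen = random.Random(1)
    xs = [0.5 * i for i in range(30)]
    ys = [1.0 + 2.0 * xi + gen.gauss(0.0, 0.5) for xi in xs]
    print("critical N:", critical_n(xs, ys, max_n=10))
    real, rand = y_randomization(xs, ys, runs=10)
    print("Q2 (real y):", real, "max Q2 (shuffled y):", max(rand))
```

A model that passes this kind of check keeps Q² nearly constant as N grows and scores far higher on the real y than on any shuffled y. The abstract's figure of 10-25 runs corresponds to the `runs` and `repeats` arguments in this sketch.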
id |
SBQ-2_ab9bb9a821ce0e284f729cdc7253123b |
---|---|
oai_identifier_str |
oai:scielo:S0103-50532009000400021 |
network_acronym_str |
SBQ-2 |
network_name_str |
Journal of the Brazilian Chemical Society (Online) |
repository_id_str |
|
spelling |
Basic validation procedures for regression models in QSAR and QSPR studies: theory and application; leave-one-out crossvalidation; leave-N-out crossvalidation; y-randomization; external validation; bootstrapping; Sociedade Brasileira de Química; 2009-01-01; info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; text/html; http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021; Journal of the Brazilian Chemical Society v.20 n.4 2009; reponame:Journal of the Brazilian Chemical Society (Online); instname:Sociedade Brasileira de Química (SBQ); instacron:SBQ; 10.1590/S0103-50532009000400021; info:eu-repo/semantics/openAccess; Kiralj,Rudolf; Ferreira,Márcia M. C.; eng; 2009-06-10T00:00:00Z; oai:scielo:S0103-50532009000400021; Revista; http://jbcs.sbq.org.br; ONG; https://old.scielo.br/oai/scielo-oai.php; office@jbcs.sbq.org.br; 1678-4790; 0103-5053; opendoar:2009-06-10T00:00; Journal of the Brazilian Chemical Society (Online) - Sociedade Brasileira de Química (SBQ); false |
dc.title.none.fl_str_mv |
Basic validation procedures for regression models in QSAR and QSPR studies: theory and application |
author |
Kiralj,Rudolf |
author_facet |
Kiralj,Rudolf; Ferreira,Márcia M. C. |
author_role |
author |
author2 |
Ferreira,Márcia M. C. |
author2_role |
author |
dc.contributor.author.fl_str_mv |
Kiralj,Rudolf; Ferreira,Márcia M. C. |
dc.subject.por.fl_str_mv |
leave-one-out crossvalidation; leave-N-out crossvalidation; y-randomization; external validation; bootstrapping |
publishDate |
2009 |
dc.date.none.fl_str_mv |
2009-01-01 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021 |
url |
http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
10.1590/S0103-50532009000400021 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
text/html |
dc.publisher.none.fl_str_mv |
Sociedade Brasileira de Química |
publisher.none.fl_str_mv |
Sociedade Brasileira de Química |
dc.source.none.fl_str_mv |
Journal of the Brazilian Chemical Society v.20 n.4 2009; reponame:Journal of the Brazilian Chemical Society (Online); instname:Sociedade Brasileira de Química (SBQ); instacron:SBQ |
instname_str |
Sociedade Brasileira de Química (SBQ) |
instacron_str |
SBQ |
institution |
SBQ |
reponame_str |
Journal of the Brazilian Chemical Society (Online) |
collection |
Journal of the Brazilian Chemical Society (Online) |
repository.name.fl_str_mv |
Journal of the Brazilian Chemical Society (Online) - Sociedade Brasileira de Química (SBQ) |
repository.mail.fl_str_mv |
office@jbcs.sbq.org.br |
_version_ |
1750318169835175936 |