Basic validation procedures for regression models in QSAR and QSPR studies: theory and application

Bibliographic details
Main author: Kiralj, Rudolf
Publication date: 2009
Other authors: Ferreira, Márcia M. C.
Document type: Article
Language: eng
Source title: Journal of the Brazilian Chemical Society (Online)
Full text: http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021
Abstract: Four quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) data sets were selected from the literature and used to build regression models with 75, 56, 50 and 15 training samples. The models were validated by leave-one-out crossvalidation, leave-N-out crossvalidation (LNO), external validation, y-randomization and bootstrapping. The validations have shown that the size of the training set is the crucial factor in determining model performance, which deteriorates as the data set becomes smaller. Models from very small data sets cannot be thoroughly validated, and they fail or behave atypically in certain validations (chance correlation, lack of robustness to resampling and LNO), regardless of their good performance in leave-one-out crossvalidation, fitting and even external validation. A simple determination of the critical N in LNO has been introduced by using the limit of 0.1 for oscillations in Q², quantified as the variation range in single LNO and as two standard deviations in multiple LNO. It has been demonstrated that 10-25 y-randomization and bootstrap runs are sufficient for a typical model validation. Bootstrap schemes based on hierarchical cluster analysis give more reliable and reasonable results than bootstraps relying only on randomization of the complete data set. Data quality, in terms of the statistical significance of descriptor-y relationships, is the second important factor for model performance. Variable selection that does not eliminate insignificant descriptor-y relationships may lead to situations in which they are not detected during model validation, especially when dealing with large data sets.
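The critical-N criterion mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes ordinary least squares as the regression method, defines Q² as 1 - PRESS/SS with SS taken around the mean of all y values, and reads the "multiple LNO" rule as "two standard deviations of Q² over repeated runs must stay within 0.1". The function names `q2_lno` and `critical_n` and all parameters are hypothetical.

```python
import numpy as np

def q2_lno(X, y, n_out, rng):
    """Q^2 from one leave-N-out run (assumed definition: 1 - PRESS / SS_total)."""
    n = len(y)
    order = rng.permutation(n)              # random assignment of samples to blocks
    press = 0.0
    for start in range(0, n, n_out):
        test = order[start:start + n_out]   # the last block may be shorter
        train = np.setdiff1d(order, test)
        # Ordinary least squares with an intercept (an assumption of this sketch)
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(test)), X[test]]) @ coef
        press += np.sum((y[test] - pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def critical_n(X, y, max_n, runs=10, limit=0.1, seed=0):
    """Largest N, scanning upward from 1, whose repeated-LNO Q^2 values
    satisfy 2 * std <= limit. max_n should stay well below len(y)."""
    rng = np.random.default_rng(seed)
    last_ok = 0
    for n_out in range(1, max_n + 1):
        q2 = [q2_lno(X, y, n_out, rng) for _ in range(runs)]
        if 2.0 * np.std(q2, ddof=1) > limit:
            break
        last_ok = n_out
    return last_ok

if __name__ == "__main__":
    # Synthetic data only; these are not the data sets from the paper.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 2))
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)
    print("LOO Q2:", q2_lno(X, y, 1, rng))
    print("critical N:", critical_n(X, y, max_n=15))
```

With N = 1 the run reduces to leave-one-out, so Q² does not depend on the shuffling and the spread over repeated runs is essentially zero; the spread typically grows as N increases, which is what the 0.1 limit is meant to detect.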
id SBQ-2_ab9bb9a821ce0e284f729cdc7253123b
oai_identifier_str oai:scielo:S0103-50532009000400021
network_acronym_str SBQ-2
network_name_str Journal of the Brazilian Chemical Society (Online)
repository_id_str
spelling Sociedade Brasileira de Química; 2009-01-01; info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; text/html; http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021; Journal of the Brazilian Chemical Society v.20 n.4 2009; reponame:Journal of the Brazilian Chemical Society (Online); instname:Sociedade Brasileira de Química (SBQ); instacron:SBQ; 10.1590/S0103-50532009000400021; info:eu-repo/semantics/openAccess; Kiralj,Rudolf; Ferreira,Márcia M. C.; eng; 2009-06-10T00:00:00Z; oai:scielo:S0103-50532009000400021; Revista; http://jbcs.sbq.org.br; ONG; https://old.scielo.br/oai/scielo-oai.php; office@jbcs.sbq.org.br; 1678-4790; 0103-5053; opendoar:2009-06-10T00:00; Journal of the Brazilian Chemical Society (Online) - Sociedade Brasileira de Química (SBQ); false
title Basic validation procedures for regression models in QSAR and QSPR studies: theory and application
author Kiralj,Rudolf
author_role author
author2 Ferreira,Márcia M. C.
author2_role author
dc.subject.por.fl_str_mv leave-one-out crossvalidation
leave-N-out crossvalidation
y-randomization
external validation
bootstrapping
publishDate 2009
dc.date.none.fl_str_mv 2009-01-01
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021
url http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 10.1590/S0103-50532009000400021
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv text/html
dc.publisher.none.fl_str_mv Sociedade Brasileira de Química
publisher.none.fl_str_mv Sociedade Brasileira de Química
dc.source.none.fl_str_mv Journal of the Brazilian Chemical Society v.20 n.4 2009
reponame:Journal of the Brazilian Chemical Society (Online)
instname:Sociedade Brasileira de Química (SBQ)
instacron:SBQ
instname_str Sociedade Brasileira de Química (SBQ)
instacron_str SBQ
institution SBQ
reponame_str Journal of the Brazilian Chemical Society (Online)
collection Journal of the Brazilian Chemical Society (Online)
repository.name.fl_str_mv Journal of the Brazilian Chemical Society (Online) - Sociedade Brasileira de Química (SBQ)
repository.mail.fl_str_mv office@jbcs.sbq.org.br
_version_ 1750318169835175936