Basic validation procedures for regression models in QSAR and QSPR studies: theory and application

Kiralj, Rudolf; Ferreira, Márcia M. C.


Bibliographic details
Main author:	Kiralj, Rudolf
Publication date:	2009
Other authors:	Ferreira, Márcia M. C.
Document type:	Article
Language:	eng
Source title:	Journal of the Brazilian Chemical Society (Online)
Full text:	http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021
Abstract:	Four quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) data sets were selected from the literature and used to build regression models with 75, 56, 50 and 15 training samples. The models were validated by leave-one-out crossvalidation, leave-N-out crossvalidation (LNO), external validation, y-randomization and bootstrapping. The validations showed that the size of the training set is the crucial factor in determining model performance, which deteriorates as the data set becomes smaller. Models from very small data sets cannot be thoroughly validated, and they fail or behave atypically in certain validations (chance correlation, lack of robustness to resampling and LNO), regardless of their good performance in leave-one-out crossvalidation, fitting and even external validation. A simple determination of the critical N in LNO has been introduced by using the limit of 0.1 for oscillations in Q², quantified as the variation range in single LNO and two standard deviations in multiple LNO. It has been demonstrated that 10-25 y-randomization and bootstrap runs are sufficient for a typical model validation. Bootstrap schemes based on hierarchical cluster analysis give more reliable and reasonable results than bootstraps relying only on randomization of the complete data set. Data quality, in terms of the statistical significance of descriptor-y relationships, is the second important factor for model performance. Variable selection that does not eliminate insignificant descriptor-y relationships may lead to situations in which they are not detected during model validation, especially when dealing with large data sets.
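
The validation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes ordinary least squares as the regression method, synthetic data in place of the article's data sets, and the common definition Q² = 1 - PRESS/SS_tot. All function names (`fit_predict`, `q2_lno`, `lno_oscillation`, `y_randomization`) are hypothetical and chosen only for this illustration.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Fit y ~ intercept + X by least squares and predict X_test.
    Ordinary least squares is an assumption of this sketch."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def q2_lno(X, y, n_out=1, rng=None):
    """Q2 = 1 - PRESS / SS_tot, leaving out blocks of n_out samples.
    n_out=1 gives leave-one-out. If rng is given, the sample order is
    shuffled first, which corresponds to one run of multiple LNO."""
    order = np.arange(len(y)) if rng is None else rng.permutation(len(y))
    press = 0.0
    for start in range(0, len(y), n_out):
        test = order[start:start + n_out]
        train = np.setdiff1d(order, test)
        residuals = y[test] - fit_predict(X[train], y[train], X[test])
        press += np.sum(residuals ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def lno_oscillation(X, y, n_values, repeats=10, seed=0):
    """Return (range of Q2 over N in single LNO,
    two standard deviations of Q2 per N in multiple LNO)."""
    rng = np.random.default_rng(seed)
    single = np.array([q2_lno(X, y, n) for n in n_values])
    multi = np.array([[q2_lno(X, y, n, rng) for _ in range(repeats)]
                      for n in n_values])
    return single.max() - single.min(), 2.0 * multi.std(axis=1, ddof=1)

def y_randomization(X, y, runs=10, seed=0):
    """Leave-one-out Q2 of models refitted on randomly permuted y."""
    rng = np.random.default_rng(seed)
    return np.array([q2_lno(X, rng.permutation(y)) for _ in range(runs)])

# Synthetic example: 50 samples, 3 descriptors (illustrative data only).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

q2_loo = q2_lno(X, y)
single_range, multi_2sd = lno_oscillation(X, y, n_values=range(1, 11))
q2_random = y_randomization(X, y, runs=10)

print("Q2 (leave-one-out):", round(q2_loo, 3))
print("Single-LNO range of Q2:", round(single_range, 4))
print("Largest 2*SD in multiple LNO:", round(float(multi_2sd.max()), 4))
print("Largest Q2 after y-randomization:", round(float(q2_random.max()), 3))
```

Under the criterion stated in the abstract, a model would be considered stable up to a given N as long as `single_range` and the corresponding entries of `multi_2sd` stay below 0.1; a y-randomized Q² close to the real Q² would point to chance correlation.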

Item metadata

id	SBQ-2_ab9bb9a821ce0e284f729cdc7253123b
oai_identifier_str	oai:scielo:S0103-50532009000400021
network_acronym_str	SBQ-2
network_name_str	Journal of the Brazilian Chemical Society (Online)
repository_id_str
spelling	2009-06-10T00:00:00Z; Revista; http://jbcs.sbq.org.br; ONG; https://old.scielo.br/oai/scielo-oai.php; office@jbcs.sbq.org.br; 1678-4790; 0103-5053; opendoar:2009-06-10T00:00
dc.title.none.fl_str_mv	Basic validation procedures for regression models in QSAR and QSPR studies: theory and application
author	Kiralj,Rudolf
author_facet	Kiralj,Rudolf; Ferreira,Márcia M. C.
author_role	author
author2	Ferreira,Márcia M. C.
author2_role	author
dc.contributor.author.fl_str_mv	Kiralj,Rudolf; Ferreira,Márcia M. C.
dc.subject.por.fl_str_mv	leave-one-out crossvalidation; leave-N-out crossvalidation; y-randomization; external validation; bootstrapping
publishDate	2009
dc.date.none.fl_str_mv	2009-01-01
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021
url	http://old.scielo.br/scielo.php?script=sci_arttext&pid=S0103-50532009000400021
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	10.1590/S0103-50532009000400021
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	text/html
dc.publisher.none.fl_str_mv	Sociedade Brasileira de Química
publisher.none.fl_str_mv	Sociedade Brasileira de Química
dc.source.none.fl_str_mv	Journal of the Brazilian Chemical Society v.20 n.4 2009; reponame:Journal of the Brazilian Chemical Society (Online); instname:Sociedade Brasileira de Química (SBQ); instacron:SBQ
instname_str	Sociedade Brasileira de Química (SBQ)
instacron_str	SBQ
institution	SBQ
reponame_str	Journal of the Brazilian Chemical Society (Online)
collection	Journal of the Brazilian Chemical Society (Online)
repository.name.fl_str_mv	Journal of the Brazilian Chemical Society (Online) - Sociedade Brasileira de Química (SBQ)
repository.mail.fl_str_mv	office@jbcs.sbq.org.br
_version_	1750318169835175936
