Entropy, mutual information, and population structure in genome-wide selection
Main author: | Simiqueli, Guilherme Ferreira |
---|---|
Publication date: | 2020 |
Document type: | Thesis |
Language: | eng |
Source title: | LOCUS Repositório Institucional da UFV |
Full text: | https://locus.ufv.br//handle/123456789/28716 |
Abstract: | Different populations can be combined in the training set to improve the predictive ability of genomic prediction models. However, this practice has not always resulted in higher predictive ability, and some studies have proposed accounting for the population structure effect to improve prediction. Different strategies, such as principal component covariates, uni- and multi-population models, alternative genomic relationship matrices, admixture proportion covariates, or a mix of them, have been applied to genomic prediction. Thus, the first chapter evaluates some combinations of these strategies to support the decision of whether to account for population structure in genomic prediction. Simulated polygenic traits with heritabilities of 0.1 and 0.5 and real data were used to evaluate the strategies. Bias was lower when the multi-population model was used for the low-heritability simulated trait. The accuracy for the high-heritability trait was lower for strategies that used alternative genomic matrices accounting for differences in allele frequency, but only in admixed populations. Further, for real data, two commonly used genomic relationship matrices showed lower predictive ability for all traits, which are likely controlled by few quantitative trait loci. Therefore, whether accounting for population structure yields lower bias without reducing accuracy, and consequently successful genomic prediction, depends on trait heritability, trait architecture, and the admixture level of the population. The second chapter addresses the fact that random k-fold cross-validation in genome-wide selection can give high estimates of predictive ability because of the high degree of kinship between the training and validation sets. However, many tree breeding populations are less genetically related to the training sets and have different levels of phenotypic diversity. Therefore, this chapter proposes novel methods of splitting cross-validation sets that account for genetic similarity and phenotypic diversity, estimated via mutual information and entropy, respectively. These methods were also used to assess how the distribution of phenotypic and genotypic information affects genome-wide selection of trees. The methods fitted more reliable models that agree with the entropy of tree breeding populations and their genetic relatedness to the training sets. Validation sets with more phenotypic diversity showed higher predictive ability and lower bias. Therefore, phenotypic diversity should be added to tree breeding populations for higher genetic gain, better estimation of genomic breeding values, and consistent long-term tree breeding success. Keywords: Population structure. Accuracy. Bias. Mutual information. Entropy. K-fold cross-validation. |
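
The abstract names entropy as the measure of phenotypic diversity and mutual information as the measure of genetic similarity, but this record does not give the estimators used in the thesis. The sketch below is therefore only an illustration of the two quantities under stated assumptions: phenotypes are discretized with equal-width bins, genotypes are coded as 0/1/2 allele counts, and the function names (`shannon_entropy`, `mutual_information`, `bin_phenotypes`) are hypothetical, not taken from the thesis.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a, b) / (p(a) * p(b)) simplifies to c * n / (count_x[a] * count_y[b])
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi


def bin_phenotypes(values, n_bins=4):
    """Assumed discretization: equal-width bins over the observed range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Illustrative data only (made up for this sketch, not from the thesis).
phenotypes = [10.2, 11.5, 9.8, 14.1, 13.3, 12.0, 10.9, 15.4]
diversity = shannon_entropy(bin_phenotypes(phenotypes))

# Marker genotypes of two individuals, coded as copies of one allele (0/1/2).
individual_a = [0, 1, 2, 1, 0, 2, 1, 1]
individual_b = [0, 1, 2, 2, 0, 2, 1, 0]
similarity = mutual_information(individual_a, individual_b)
```

Under these assumptions, a candidate validation set with higher `diversity` carries more phenotypic variation, and a higher `similarity` between individuals in different sets indicates closer genetic relatedness. How such values are turned into fold assignments is not described in this record.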
id |
UFV_7516e8cb6ab5798e909753c274f61ead |
---|---|
oai_identifier_str |
oai:locus.ufv.br:123456789/28716 |
network_acronym_str |
UFV |
network_name_str |
LOCUS Repositório Institucional da UFV |
repository_id_str |
2145 |
dc.title.none.fl_str_mv |
Entropy, mutual information, and population structure in genome-wide selection / Entropia, informação mútua e estrutura de populações na seleção genômica ampla |
title |
Entropy, mutual information, and population structure in genome-wide selection |
author |
Simiqueli, Guilherme Ferreira |
author_facet |
Simiqueli, Guilherme Ferreira |
author_role |
author |
dc.contributor.none.fl_str_mv |
Resende, Marcos Deon Vilela de http://lattes.cnpq.br/3748640680505163 |
dc.contributor.author.fl_str_mv |
Simiqueli, Guilherme Ferreira |
dc.subject.por.fl_str_mv |
Genomics; Genetic improvement; Population structure; Entropy; Prediction; Quantitative genetics |
topic |
Genomics; Genetic improvement; Population structure; Entropy; Prediction; Quantitative genetics |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020-07-23 2022-03-04T13:28:01Z 2022-03-04T13:28:01Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
SIMIQUELI, Guilherme Ferreira. Entropy, mutual information, and population structure in genome-wide selection. 2020. 122 leaves. Thesis (Doctorate in Genetics and Breeding) - Universidade Federal de Viçosa, Viçosa. 2020. https://locus.ufv.br//handle/123456789/28716 |
identifier_str_mv |
SIMIQUELI, Guilherme Ferreira. Entropy, mutual information, and population structure in genome-wide selection. 2020. 122 leaves. Thesis (Doctorate in Genetics and Breeding) - Universidade Federal de Viçosa, Viçosa. 2020. |
url |
https://locus.ufv.br//handle/123456789/28716 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal de Viçosa |
publisher.none.fl_str_mv |
Universidade Federal de Viçosa |
dc.source.none.fl_str_mv |
reponame:LOCUS Repositório Institucional da UFV instname:Universidade Federal de Viçosa (UFV) instacron:UFV |
instname_str |
Universidade Federal de Viçosa (UFV) |
instacron_str |
UFV |
institution |
UFV |
reponame_str |
LOCUS Repositório Institucional da UFV |
collection |
LOCUS Repositório Institucional da UFV |
repository.name.fl_str_mv |
LOCUS Repositório Institucional da UFV - Universidade Federal de Viçosa (UFV) |
repository.mail.fl_str_mv |
fabiojreis@ufv.br |
_version_ |
1817560037004935168 |