Entropy, mutual information, and population structure in genome-wide selection

Bibliographic details
Main author: Simiqueli, Guilherme Ferreira
Publication date: 2020
Document type: Thesis
Language: English
Source title: LOCUS Repositório Institucional da UFV
Full text: https://locus.ufv.br//handle/123456789/28716
Abstract: Different populations can be combined in the training set to improve the predictive ability of genomic prediction models. However, this practice has not always resulted in higher predictive ability, and some studies have proposed accounting for population structure effects to improve prediction. Different strategies, such as principal component covariates, uni- and multi-population models, alternative genomic relationship matrices, admixture proportion covariates, or a mix of them, have been applied to genomic prediction. Thus, the first chapter aims to evaluate some combinations of these strategies to support the decision of whether or not to account for population structure in genomic prediction. Simulated polygenic traits with heritabilities of 0.1 and 0.5 and real data were used to evaluate the strategies. Bias was lower when a multi-population model was used for the low-heritability simulated trait. The accuracy for the high-heritability trait was lower, only in admixed populations, for strategies that used alternative genomic matrices accounting for differences in allele frequency. Further, for real data, two commonly used genomic relationship matrices showed lower predictive ability for all traits, which are likely controlled by few quantitative trait loci. Therefore, whether accounting for population structure yields lower bias without reducing accuracy, and consequently successful genomic prediction, depends on trait heritability, trait architecture, and the admixture level of the population. The second chapter addresses the fact that random k-fold cross-validation in genome-wide selection can provide high estimates of predictive ability because of the high degree of kinship between the training and validation sets. However, many tree breeding populations are less genetically related to the training sets and have different levels of phenotypic diversity. Therefore, this chapter proposed novel methods of splitting cross-validation sets that account for genetic similarity and phenotypic diversity, estimated via mutual information and entropy, respectively. These methods were also used to verify how the distribution of phenotypic and genotypic information affects genome-wide selection of trees. The new methods fitted more reliable models, consistent with the entropy of tree breeding populations and their genetic relatedness to the training sets. Validation sets with more phenotypic diversity showed higher predictive ability and lower bias. Therefore, phenotypic diversity should be added to tree breeding populations for higher genetic gain, better estimation of genomic breeding values, and consistent long-term tree breeding success.
Keywords: Population structure. Accuracy. Bias. Mutual information. Entropy. K-fold cross-validation.
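The abstract names entropy as the measure of phenotypic diversity and mutual information as the measure of genetic similarity, but it does not give the estimators used in the thesis. As a rough illustration only, the sketch below uses the standard Shannon definitions on discrete data. The function names, the equal-width binning of phenotypes, and the 0/1/2 genotype coding are all assumptions made for this example and are not taken from the thesis.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p(a, b) / (p(a) * p(b)) simplifies to c * n / (count_a * count_b)
        mi += p_ab * math.log2(c * n / (px[a] * py[b]))
    return mi


def bin_phenotypes(phenotypes, n_bins):
    """Assumed helper: equal-width binning of continuous phenotypes."""
    lo, hi = min(phenotypes), max(phenotypes)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant phenotype
    return [min(int((p - lo) / width), n_bins - 1) for p in phenotypes]


# Hypothetical data: phenotypes of four trees and SNP genotypes (0/1/2)
# of two individuals at four markers.
phenotype_bins = bin_phenotypes([1.0, 2.0, 3.0, 4.0], n_bins=2)
diversity = shannon_entropy(phenotype_bins)
similarity = mutual_information([0, 1, 2, 0], [0, 1, 2, 0])
```

Under these assumptions, a candidate validation set with higher `diversity` would count as more phenotypically diverse, and pairs of individuals with higher `similarity` would count as more genetically alike. How the thesis turns such quantities into cross-validation splits is not stated in this record.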
id UFV_7516e8cb6ab5798e909753c274f61ead
oai_identifier_str oai:locus.ufv.br:123456789/28716
network_acronym_str UFV
network_name_str LOCUS Repositório Institucional da UFV
repository_id_str 2145
dc.title.none.fl_str_mv Entropy, mutual information, and population structure in genome-wide selection
Entropia, informação mútua e estrutura de populações na seleção genômica ampla
title Entropy, mutual information, and population structure in genome-wide selection
spellingShingle Entropy, mutual information, and population structure in genome-wide selection
Simiqueli, Guilherme Ferreira
Genômica
Melhoramento genético
Estrutura populacional
Entropia
Predição
Genética Quantitativa
title_short Entropy, mutual information, and population structure in genome-wide selection
title_full Entropy, mutual information, and population structure in genome-wide selection
title_fullStr Entropy, mutual information, and population structure in genome-wide selection
title_full_unstemmed Entropy, mutual information, and population structure in genome-wide selection
title_sort Entropy, mutual information, and population structure in genome-wide selection
author Simiqueli, Guilherme Ferreira
author_facet Simiqueli, Guilherme Ferreira
author_role author
dc.contributor.none.fl_str_mv Resende, Marcos Deon Vilela de
http://lattes.cnpq.br/3748640680505163
dc.contributor.author.fl_str_mv Simiqueli, Guilherme Ferreira
dc.subject.por.fl_str_mv Genômica
Melhoramento genético
Estrutura populacional
Entropia
Predição
Genética Quantitativa
topic Genômica
Melhoramento genético
Estrutura populacional
Entropia
Predição
Genética Quantitativa
publishDate 2020
dc.date.none.fl_str_mv 2020-07-23
2022-03-04T13:28:01Z
2022-03-04T13:28:01Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv SIMIQUELI, Guilherme Ferreira. Entropy, mutual information, and population structure in genome-wide selection. 2020. 122 f. Tese (Doutorado em Genética e Melhoramento) - Universidade Federal de Viçosa, Viçosa. 2020.
https://locus.ufv.br//handle/123456789/28716
identifier_str_mv SIMIQUELI, Guilherme Ferreira. Entropy, mutual information, and population structure in genome-wide selection. 2020. 122 f. Tese (Doutorado em Genética e Melhoramento) - Universidade Federal de Viçosa, Viçosa. 2020.
url https://locus.ufv.br//handle/123456789/28716
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Viçosa
publisher.none.fl_str_mv Universidade Federal de Viçosa
dc.source.none.fl_str_mv reponame:LOCUS Repositório Institucional da UFV
instname:Universidade Federal de Viçosa (UFV)
instacron:UFV
instname_str Universidade Federal de Viçosa (UFV)
instacron_str UFV
institution UFV
reponame_str LOCUS Repositório Institucional da UFV
collection LOCUS Repositório Institucional da UFV
repository.name.fl_str_mv LOCUS Repositório Institucional da UFV - Universidade Federal de Viçosa (UFV)
repository.mail.fl_str_mv fabiojreis@ufv.br
_version_ 1817560037004935168