Sample dimensioning in principal component analyses, canonical variables, and grouping in soybean cultivars

Bibliographic details
Main author: Souza, Rafael Rodrigues de
Publication date: 2024
Document type: Thesis
Language: por
Source title: Manancial - Repositório Digital da UFSM
dARK ID: ark:/26339/001300000n39x
Full text: http://repositorio.ufsm.br/handle/1/32041
Abstract: Research on genetic divergence in soybean relies heavily on indirect methodologies based on phenotypic traits. Principal components, canonical variables, and hierarchical analyses are among the most widely applied. Although these tools are broadly applicable, their use is not always supported by a representative sample. In other words, the sample size is usually not defined beforehand, so it is often chosen empirically. The present study therefore aims to analyze how genetic divergence techniques respond to variation in the number of sampled plants; to define a reference sample size for principal component, canonical variable, and clustering techniques in soybean; and to propose new, robust approaches for defining sample size. To this end, field trials were conducted during the 2017/2018 growing season at two locations in Rio Grande do Sul and on three sowing dates, totaling six experiments. Each experimental unit consisted of five rows, three meters long, spaced 0.45 meters apart. A randomized complete block design was used to evaluate 20 soybean cultivars, with three replications in each experiment. At maturity, ten morphological traits were assessed on 20 plants per experimental unit, totaling 7,200 individually measured plants (6 experiments × 20 cultivars × 3 replications × 20 plants). Next, bootstrap simulations (resampling with replacement) were performed for sampling scenarios ranging from 1 to 100 plants per experimental unit to evaluate the eigenvalues of the principal components, the canonical components of the canonical variables, and the cophenetic correlation coefficient obtained from combinations of nine dissimilarity measures and seven clustering methods. These bootstrap simulations were carried out separately for each of the six experiments, followed by a joint analysis of the experiments.
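The resampling scheme described above can be illustrated with a minimal sketch. It is not the thesis code: the data are synthetic, only two traits are used so that the largest eigenvalue has a closed form, and computing the eigenvalue from experimental-unit means is an assumption. The names `largest_eigenvalue`, `bootstrap_eigenvalues`, and `interval_width` are hypothetical.

```python
import random
import statistics


def largest_eigenvalue(rows):
    """Largest eigenvalue of the 2x2 covariance matrix of two-trait rows."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    # Closed form for a symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
    return (sxx + syy) / 2 + (((sxx - syy) / 2) ** 2 + sxy ** 2) ** 0.5


def bootstrap_eigenvalues(units, plants_per_unit, n_boot, rng):
    """Bootstrap the largest eigenvalue for one sampling scenario.

    Each replicate redraws `plants_per_unit` plants with replacement inside
    every experimental unit, averages them, and recomputes the eigenvalue
    from the unit means (an assumed simplification).
    """
    estimates = []
    for _ in range(n_boot):
        unit_means = []
        for plants in units:
            draw = rng.choices(plants, k=plants_per_unit)
            unit_means.append((statistics.fmean(p[0] for p in draw),
                               statistics.fmean(p[1] for p in draw)))
        estimates.append(largest_eigenvalue(unit_means))
    return estimates


def interval_width(estimates):
    """Width of the 95% percentile bootstrap interval."""
    ordered = sorted(estimates)
    lower = ordered[int(0.025 * (len(ordered) - 1))]
    upper = ordered[int(0.975 * (len(ordered) - 1))]
    return upper - lower


# Synthetic stand-in data: 20 units, 20 measured plants each, two traits.
rng = random.Random(42)
units = []
for _ in range(20):
    trait1_mean = rng.gauss(80, 10)  # hypothetical unit mean, trait 1
    trait2_mean = rng.gauss(50, 8)   # hypothetical unit mean, trait 2
    units.append([(rng.gauss(trait1_mean, 12), rng.gauss(trait2_mean, 10))
                  for _ in range(20)])

widths = {n: interval_width(bootstrap_eigenvalues(units, n, 200, rng))
          for n in (1, 5, 20, 100)}
for n, width in widths.items():
    print(n, round(width, 2))
```

Because plants are drawn with replacement, scenarios larger than the 20 measured plants per unit (up to 100 in the abstract) remain possible. The interval width is expected to shrink as the number of plants per unit grows.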
For sample dimensioning in the principal component technique, the method of error as a percentage of the mean was used. In the second study, on canonical variables, sample size was estimated with an approach combining nonlinear models and the maximum curvature point. In the third study, a methodology for defining sample size was developed based on unsupervised machine learning and Bayesian optimization, together with a modification of the maximum curvature point method that uses perpendicular distances. The estimates of the eigenvalues of the canonical variables and of the cophenetic correlation coefficient improved gradually as the number of sampled plants increased. Eighteen plants per experimental unit were enough to estimate the first two principal components, whereas 36 plants were needed to estimate the canonical variables. In the hierarchical analyses, the representative sample size varied with the dissimilarity measure and the clustering method used. Nevertheless, 27 plants per experimental unit are suggested as sufficient for representative sampling in hierarchical analyses. These results make it possible to optimize the use of principal components, canonical variables, and hierarchical analyses, ensuring the reliability of their results and avoiding empirical decisions about sample size in soybean.
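One common way to locate a maximum curvature point through perpendicular distances is to take the point of a precision curve that lies farthest from the chord joining its first and last points. The sketch below illustrates that heuristic only. Whether it matches the modification developed in the thesis is an assumption, and the 1/sqrt(n) curve and the function name `max_curvature_by_perpendicular_distance` are hypothetical.

```python
import math


def max_curvature_by_perpendicular_distance(xs, ys):
    """Return (x, distance) for the point farthest from the chord that
    joins the first and last points of the curve (xs, ys)."""
    x1, y1, x2, y2 = xs[0], ys[0], xs[-1], ys[-1]
    chord_length = math.hypot(x2 - x1, y2 - y1)
    best_x, best_distance = xs[0], 0.0
    for x, y in zip(xs, ys):
        # Perpendicular distance from (x, y) to the line through the endpoints.
        distance = abs((y2 - y1) * x - (x2 - x1) * y
                       + x2 * y1 - y2 * x1) / chord_length
        if distance > best_distance:
            best_x, best_distance = x, distance
    return best_x, best_distance


# Sampling scenarios from the abstract: 1 to 100 plants per experimental unit.
sample_sizes = list(range(1, 101))
# Hypothetical precision curve: error shrinking in proportion to 1/sqrt(n).
errors = [1 / math.sqrt(n) for n in sample_sizes]

knee, distance = max_curvature_by_perpendicular_distance(sample_sizes, errors)
print(knee)
```

The selected point does not change when either axis is rescaled, because rescaling multiplies every distance to the chord by the same constant. It does depend on the range of sample sizes evaluated, since the endpoints define the chord.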
id UFSM_d1f8c4bf93d1d15785c98f4aaecd3885
oai_identifier_str oai:repositorio.ufsm.br:1/32041
network_acronym_str UFSM
network_name_str Manancial - Repositório Digital da UFSM
repository_id_str
spelling Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES; Biblioteca Digital de Teses e Dissertações; https://repositorio.ufsm.br/; https://repositorio.ufsm.br/oai/request; opendoar:2024-06-17T11:14:56
dc.title.none.fl_str_mv Dimensionamento amostral em análises de componentes principais, variáveis canônicas e agrupamento em cultivares de soja
Sample dimensioning in principal component analyses, canonical variables, and grouping in soybean cultivars
title Dimensionamento amostral em análises de componentes principais, variáveis canônicas e agrupamento em cultivares de soja
author Souza, Rafael Rodrigues de
author_facet Souza, Rafael Rodrigues de
author_role author
dc.contributor.none.fl_str_mv Cargnelutti Filho, Alberto
http://lattes.cnpq.br/0233728865094243
Haesbaert, Fernando Machado
Carvalho, Ivan Ricardo
Toebe, Marcos
Maziero, Sandra Maria
dc.contributor.author.fl_str_mv Souza, Rafael Rodrigues de
dc.subject.por.fl_str_mv Bootstrap
Extreme Gradient Boosting
Modeling
Experimental planning
Modelagem
Planejamento experimental
CNPQ::CIENCIAS AGRARIAS::AGRONOMIA
publishDate 2024
dc.date.none.fl_str_mv 2024-06-17T11:14:56Z
2024-06-17T11:14:56Z
2024-05-17
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://repositorio.ufsm.br/handle/1/32041
dc.identifier.dark.fl_str_mv ark:/26339/001300000n39x
url http://repositorio.ufsm.br/handle/1/32041
identifier_str_mv ark:/26339/001300000n39x
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Santa Maria
Brasil
Agronomia
UFSM
Programa de Pós-Graduação em Agronomia
Centro de Ciências Rurais
dc.source.none.fl_str_mv reponame:Manancial - Repositório Digital da UFSM
instname:Universidade Federal de Santa Maria (UFSM)
instacron:UFSM
instname_str Universidade Federal de Santa Maria (UFSM)
instacron_str UFSM
institution UFSM
reponame_str Manancial - Repositório Digital da UFSM
collection Manancial - Repositório Digital da UFSM
repository.name.fl_str_mv Manancial - Repositório Digital da UFSM - Universidade Federal de Santa Maria (UFSM)
repository.mail.fl_str_mv atendimento.sib@ufsm.br||tedebc@gmail.com
_version_ 1815172365354532864