Dimensionamento amostral em análises de componentes principais, variáveis canônicas e agrupamento em cultivares de soja
Main author: | Souza, Rafael Rodrigues de |
---|---|
Publication date: | 2024 |
Document type: | Thesis |
Language: | por |
Source title: | Manancial - Repositório Digital da UFSM |
dARK ID: | ark:/26339/001300000n39x |
Full text: | http://repositorio.ufsm.br/handle/1/32041 |
Abstract: | Research on genetic divergence in soybean relies heavily on indirect methodologies based on phenotypic traits. Principal components, canonical variables, and hierarchical analyses are among the most widely applied. Although these tools are broadly applicable, their use is not always supported by a representative sample. In other words, the sample size is usually not defined beforehand, so decisions are often made empirically. The present study therefore aims to analyze how genetic divergence techniques respond to variation in the number of sampled plants; to define a reference sample size for principal component, canonical variable, and grouping techniques in soybean; and to propose new, robust approaches for defining sample size. Field trials were conducted during the 2017/2018 growing season at two locations in Rio Grande do Sul and on three sowing dates, totaling six experiments. Each experimental unit consisted of five rows, 3 m long, spaced 0.45 m apart. A randomized complete block design was used to evaluate 20 soybean cultivars, with three replications in each experiment. During grain maturation, ten morphological traits were assessed in 20 plants per experimental unit, totaling 7,200 individually measured plants (6 experiments × 20 cultivars × 3 replications × 20 plants). Next, simulations with replacement (bootstrap resampling) were performed in sampling scenarios ranging from 1 to 100 plants per experimental unit to evaluate the eigenvalues of the principal components, the canonical components of the canonical variables, and the cophenetic correlation coefficient resulting from the combination of nine dissimilarity measures and seven grouping methods. These bootstrap simulations were carried out separately for each of the six experiments, followed by a joint analysis of the experiments. 
For the principal component technique, sample size was determined with the method of error as a percentage of the mean. For the second study, on canonical variables, sample size was estimated with an approach that combined nonlinear models and the maximum curvature point. In the third study, a methodology for defining sample size was developed based on unsupervised machine learning, together with Bayesian optimization and a modification of the maximum curvature method using perpendicular distances. The estimates of the eigenvalues of the canonical variables and of the cophenetic correlation coefficient improved gradually as the number of sampled plants increased. Eighteen plants per experimental unit were sufficient to estimate the first two principal components, whereas 36 plants were required to estimate the canonical variables. In the hierarchical analyses, the representative sample size varied with the dissimilarity measure and the grouping method used. Nevertheless, 27 plants per experimental unit are suggested as sufficient for representative sampling in hierarchical analyses. It is thus possible to optimize the use of principal components, canonical variables, and hierarchical analyses, ensuring the reliability of their results and avoiding empirical decisions about sample size in soybean. |
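The resampling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code: the data are synthetic, the function names `bootstrap_first_eigenvalue` and `knee_by_perpendicular_distance` are hypothetical, and the knee rule shown is a generic "farthest point from the chord" criterion standing in for the modified maximum curvature method, whose exact formulation is not given in this record. The sketch draws plants with replacement within each experimental unit, tracks how the precision of the first principal component eigenvalue changes with sample size, and then picks a candidate sample size from the resulting curve.

```python
import numpy as np


def bootstrap_first_eigenvalue(plants, n_sampled, n_boot=100, seed=0):
    """Bootstrap the largest eigenvalue of the trait correlation matrix.

    plants: array of shape (units, plants_per_unit, traits).
    In each replicate, n_sampled plants are drawn WITH replacement inside
    every experimental unit and averaged before the eigenanalysis.
    """
    rng = np.random.default_rng(seed)
    n_units, n_plants, _ = plants.shape
    rows = np.arange(n_units)[:, None]
    eigenvalues = np.empty(n_boot)
    for b in range(n_boot):
        picks = rng.integers(0, n_plants, size=(n_units, n_sampled))
        unit_means = plants[rows, picks].mean(axis=1)   # (units, traits)
        corr = np.corrcoef(unit_means, rowvar=False)    # (traits, traits)
        eigenvalues[b] = np.linalg.eigvalsh(corr)[-1]   # eigvalsh sorts ascending
    return eigenvalues


def knee_by_perpendicular_distance(x, y):
    """Return the x whose point lies farthest from the chord joining the
    first and last points of the curve (both axes rescaled to [0, 1])."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xs = (x - x.min()) / (x.max() - x.min())
    ys = (y - y.min()) / (y.max() - y.min())
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    dist = np.abs(dy * (xs - xs[0]) - dx * (ys - ys[0])) / np.hypot(dx, dy)
    return x[np.argmax(dist)]


if __name__ == "__main__":
    # Synthetic stand-in for ONE experiment: 60 units (20 cultivars x 3
    # blocks), 20 plants per unit, 10 traits. These are not the thesis data.
    rng = np.random.default_rng(42)
    plants = rng.normal(size=(60, 1, 10)) + 2.0 * rng.normal(size=(60, 20, 10))

    sizes = np.arange(1, 101)          # scenarios of 1 to 100 plants per unit
    widths = []
    for n in sizes:
        eig = bootstrap_first_eigenvalue(plants, n)
        low, high = np.percentile(eig, [2.5, 97.5])
        widths.append(high - low)      # width of the 95% percentile interval
    print("suggested plants per unit:",
          knee_by_perpendicular_distance(sizes, widths))
```

Rescaling both axes before measuring distances is a design choice of this sketch: without it, the axis with the larger numeric range would dominate the result.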
id |
UFSM_d1f8c4bf93d1d15785c98f4aaecd3885 |
---|---|
oai_identifier_str |
oai:repositorio.ufsm.br:1/32041 |
network_acronym_str |
UFSM |
network_name_str |
Manancial - Repositório Digital da UFSM |
repository_id_str |
|
spelling |
Dimensionamento amostral em análises de componentes principais, variáveis canônicas e agrupamento em cultivares de soja Sample dimensioning in principal component analyses, canonical variables, and grouping in soybean cultivars Bootstrap Extreme Gradient Boosting Modeling Experimental planning Modelagem Planejamento experimental CNPQ::CIENCIAS AGRARIAS::AGRONOMIA Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES Universidade Federal de Santa Maria Brasil Agronomia UFSM Programa de Pós-Graduação em Agronomia Centro de Ciências Rurais Cargnelutti Filho, Alberto http://lattes.cnpq.br/0233728865094243 Haesbaert, Fernando Machado Carvalho, Ivan Ricardo Toebe, Marcos Maziero, Sandra Maria Souza, Rafael Rodrigues de 2024-06-17T11:14:56Z 2024-06-17T11:14:56Z 2024-05-17 info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/doctoralThesis application/pdf http://repositorio.ufsm.br/handle/1/32041 ark:/26339/001300000n39x por Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ info:eu-repo/semantics/openAccess reponame:Manancial - Repositório Digital da UFSM instname:Universidade Federal de Santa Maria (UFSM) instacron:UFSM 2024-06-17T11:14:56Z oai:repositorio.ufsm.br:1/32041 Biblioteca Digital de Teses e Dissertações https://repositorio.ufsm.br/ ONG https://repositorio.ufsm.br/oai/request atendimento.sib@ufsm.br||tedebc@gmail.com opendoar: 2024-06-17T11:14:56 Manancial - Repositório Digital da UFSM - Universidade Federal de Santa Maria (UFSM) false |
dc.title.none.fl_str_mv |
Dimensionamento amostral em análises de componentes principais, variáveis canônicas e agrupamento em cultivares de soja Sample dimensioning in principal component analyses, canonical variables, and grouping in soybean cultivars |
title |
Dimensionamento amostral em análises de componentes principais, variáveis canônicas e agrupamento em cultivares de soja |
author |
Souza, Rafael Rodrigues de |
author_facet |
Souza, Rafael Rodrigues de |
author_role |
author |
dc.contributor.none.fl_str_mv |
Cargnelutti Filho, Alberto http://lattes.cnpq.br/0233728865094243 Haesbaert, Fernando Machado Carvalho, Ivan Ricardo Toebe, Marcos Maziero, Sandra Maria |
dc.contributor.author.fl_str_mv |
Souza, Rafael Rodrigues de |
dc.subject.por.fl_str_mv |
Bootstrap Extreme Gradient Boosting Modeling Experimental planning Modelagem Planejamento experimental CNPQ::CIENCIAS AGRARIAS::AGRONOMIA |
publishDate |
2024 |
dc.date.none.fl_str_mv |
2024-06-17T11:14:56Z 2024-06-17T11:14:56Z 2024-05-17 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://repositorio.ufsm.br/handle/1/32041 |
dc.identifier.dark.fl_str_mv |
ark:/26339/001300000n39x |
url |
http://repositorio.ufsm.br/handle/1/32041 |
identifier_str_mv |
ark:/26339/001300000n39x |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal de Santa Maria Brasil Agronomia UFSM Programa de Pós-Graduação em Agronomia Centro de Ciências Rurais |
dc.source.none.fl_str_mv |
reponame:Manancial - Repositório Digital da UFSM instname:Universidade Federal de Santa Maria (UFSM) instacron:UFSM |
instname_str |
Universidade Federal de Santa Maria (UFSM) |
instacron_str |
UFSM |
institution |
UFSM |
reponame_str |
Manancial - Repositório Digital da UFSM |
collection |
Manancial - Repositório Digital da UFSM |
repository.name.fl_str_mv |
Manancial - Repositório Digital da UFSM - Universidade Federal de Santa Maria (UFSM) |
repository.mail.fl_str_mv |
atendimento.sib@ufsm.br||tedebc@gmail.com |
_version_ |
1815172365354532864 |