Application of regularization methods to high-dimensional data as tool for predicting the geographic origin of the saltwater clam Ruditapes philippinarum

Bibliographic details
Main author: Sampaio, Clara Yokochi de Sousa
Publication date: 2022
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/138511
Abstract: As science and technology have evolved, data dimensionality has grown and, alongside it, the need to solve problems involving such complex data. Generically, a data analysis problem is termed high-dimensional when the number of variables used to explain a phenomenon exceeds the number of observations of that phenomenon in the dataset. In the context of Linear and Generalized Linear Models, high-dimensional datasets render the Fisher Information Matrix non-invertible, which interferes with the estimation of the model parameters. Most regression models rely on intricate iterative numerical methods to estimate the model coefficients, and these often require the Fisher Information Matrix to be non-singular. To tackle the difficulties that emerge when the model’s Fisher Information Matrix is singular, a series of regularization methods have been used to identify which predictor variables are significantly linked to the outcome and to estimate their coefficients. Ridge, Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic Net were among the earliest regularization techniques and, because of this, they are seen as closely linked. The algorithms behind these three methods differ in few aspects, seemingly in such a way that LASSO overcomes Ridge’s difficulties and Elastic Net overcomes LASSO’s. To counter fraud connected to the mislabeling of product origin, regularization methods were applied to predict the location of origin of Ruditapes philippinarum, a species of saltwater clam that is commercially harvested for human consumption. The dataset comprises 30 clam samples described by 44 composition features, with the purpose of identifying which features distinguish between three geographic origins: Ria de Vigo, Ria de Aveiro and Estuário do Tejo, i.e., a classical Multinomial Logistic Regression problem. However, given the high dimensionality of the dataset (more variables than observations), the estimation of the model coefficients poses, as explained above, further difficulties. To overcome this problem, the three aforementioned regularization methods were applied to model the origin of the clams. Additionally, since a dataset of only 30 samples makes model validation challenging, the resampling technique of Monte Carlo Cross-Validation was also implemented. Finally, the results of the three methods are compared.
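Illustrative note: the Monte Carlo Cross-Validation mentioned in the abstract can be sketched as repeated random train/test splits of the 30 samples. The sketch below is a minimal illustration and not the thesis's implementation: the function name `monte_carlo_cv_splits`, the number of repetitions, the 20% test fraction and the seed are all assumptions, since the abstract does not state them.

```python
import random


def monte_carlo_cv_splits(n_samples, n_splits=100, test_fraction=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random splits.

    Hypothetical helper: n_splits, test_fraction and seed are assumed
    values, not taken from the thesis.
    """
    rng = random.Random(seed)
    # Size of the held-out set, at least one sample.
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_splits):
        shuffled = indices[:]
        rng.shuffle(shuffled)  # a fresh random permutation per repetition
        test_idx = sorted(shuffled[:n_test])
        train_idx = sorted(shuffled[n_test:])
        yield train_idx, test_idx


# 30 samples, as in the clam dataset; 5 repetitions keep the output short.
splits = list(monte_carlo_cv_splits(30, n_splits=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

In each repetition a regularized model would be fitted on the training indices and scored on the held-out indices, and the scores averaged over repetitions. Unlike k-fold cross-validation, the test sets of different repetitions may overlap.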
id RCAP_5b913e776ce6cb04554f33272c7e2e5f
oai_identifier_str oai:run.unl.pt:10362/138511
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Application of regularization methods to high-dimensional data as tool for predicting the geographic origin of the saltwater clam Ruditapes philippinarum
title Application of regularization methods to high-dimensional data as tool for predicting the geographic origin of the saltwater clam Ruditapes philippinarum
spellingShingle Application of regularization methods to high-dimensional data as tool for predicting the geographic origin of the saltwater clam Ruditapes philippinarum
Sampaio, Clara Yokochi de Sousa
elasticnet
high-dimensional data
LASSO
multinomial logistic regression
ridge
Domínio/Área Científica::Ciências Naturais::Matemáticas
author Sampaio, Clara Yokochi de Sousa
author_facet Sampaio, Clara Yokochi de Sousa
author_role author
dc.contributor.none.fl_str_mv Bispo, Regina
RUN
dc.contributor.author.fl_str_mv Sampaio, Clara Yokochi de Sousa
dc.subject.por.fl_str_mv elasticnet
high-dimensional data
LASSO
multinomial logistic regression
ridge
Domínio/Área Científica::Ciências Naturais::Matemáticas
topic elasticnet
high-dimensional data
LASSO
multinomial logistic regression
ridge
Domínio/Área Científica::Ciências Naturais::Matemáticas
publishDate 2022
dc.date.none.fl_str_mv 2022-05-24T16:24:30Z
2022-01
2022-01-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/138511
url http://hdl.handle.net/10362/138511
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799138090474799104