Application of regularization methods to high-dimensional data as tool for predicting the geographic origin of the saltwater clam Ruditapes philippinarum
Main author: | Sampaio, Clara Yokochi de Sousa |
---|---|
Publication date: | 2022 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/138511 |
Abstract: | As a consequence of the evolution of science and technology, data dimensionality has been growing and, alongside it, the need to solve problems involving these complex types of data. Generically, a data analysis problem is termed high-dimensional when the number of variables used to explain a certain phenomenon is higher than the number of instances of that phenomenon in the dataset. In the context of Linear and Generalized Linear Models, high-dimensional datasets make the Fisher Information Matrix non-invertible, which interferes with the estimation of the model parameters. Most regression models resort to intricate numerical and iterative methods to estimate the model coefficients, and these often require the Fisher Information Matrix to be non-singular. To tackle the difficulties that emerge when the model's Fisher Information Matrix is singular, a series of regularization methods have been used to determine which predictor variables are significantly linked to the outcome and to estimate their coefficients. Ridge, Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic Net were among the earliest regularization techniques and, because of this, they are seen as closely linked. The algorithms behind these three methods differ in few aspects, seemingly in such a way that LASSO overcomes Ridge's difficulties and Elastic Net overcomes LASSO's. To counter fraud connected to the mislabeling of product origin, regularization methods were applied to predict the location of origin of Ruditapes philippinarum, a species of saltwater clam that is commercially harvested for human consumption. The dataset used comprises 30 clam samples, with information on 44 composition features, with the purpose of identifying which features distinguish between three geographic origins: Ria de Vigo, Ria de Aveiro and Estuário do Tejo, i.e., a classical Multinomial Logistic Regression problem. However, given the high dimensionality of the dataset (more variables than observations), the estimation of the model coefficients poses, as explained above, further difficulties. To overcome this problem, the three aforementioned regularization methods were applied to model the origin of the clams. Additionally, since datasets of only 30 samples challenge the process of model validation, the resampling technique of Monte Carlo Cross-Validation was also implemented. We conclude by comparing the results of the three methods. |
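
The abstract names Monte Carlo Cross-Validation as the resampling scheme used because only 30 samples are available. As a rough illustration only, and not the thesis's actual code, the sketch below generates repeated random train/test splits of 30 sample indices in plain Python. The function name `monte_carlo_splits`, the 80/20 split, the 100 repetitions and the seed are all assumptions chosen for the example.

```python
import random


def monte_carlo_splits(n_samples, n_repeats, test_fraction, seed=0):
    """Yield (train, test) index lists from repeated random subsampling.

    Unlike k-fold cross-validation, each repetition draws a fresh random
    test set, so a sample may be held out in several repetitions or in none.
    """
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # First n_test shuffled indices are held out, the rest are for fitting.
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


# 30 samples as in the abstract; 100 repetitions and a 20% test share are assumed.
splits = list(monte_carlo_splits(n_samples=30, n_repeats=100, test_fraction=0.2))
train, test = splits[0]
print(len(splits), len(train), len(test))
```

In a full analysis, each split would be used to fit a regularized model on the training indices and score it on the held-out ones, with the scores averaged over all repetitions.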
id | RCAP_5b913e776ce6cb04554f33272c7e2e5f |
---|---|
oai_identifier_str | oai:run.unl.pt:10362/138511 |
network_acronym_str | RCAP |
network_name_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str | 7160 |
author | Sampaio, Clara Yokochi de Sousa |
author_facet | Sampaio, Clara Yokochi de Sousa |
author_role | author |
dc.contributor.none.fl_str_mv | Bispo, Regina; RUN |
dc.contributor.author.fl_str_mv | Sampaio, Clara Yokochi de Sousa |
dc.subject.por.fl_str_mv | elastic net; high-dimensional data; LASSO; multinomial logistic regression; ridge; Domínio/Área Científica::Ciências Naturais::Matemáticas |
topic | elastic net; high-dimensional data; LASSO; multinomial logistic regression; ridge; Domínio/Área Científica::Ciências Naturais::Matemáticas |
publishDate | 2022 |
dc.date.none.fl_str_mv | 2022-05-24T16:24:30Z 2022-01 2022-01-01T00:00:00Z |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/masterThesis |
format | masterThesis |
status_str | publishedVersion |
dc.identifier.uri.fl_str_mv | http://hdl.handle.net/10362/138511 |
url | http://hdl.handle.net/10362/138511 |
dc.language.iso.fl_str_mv | eng |
language | eng |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
eu_rights_str_mv | openAccess |
dc.format.none.fl_str_mv | application/pdf |
dc.source.none.fl_str_mv | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str | Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str | RCAAP |
institution | RCAAP |
reponame_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv | |
_version_ | 1799138090474799104 |