Similarity-based predictive models: Sensitivity analysis and a biological application with multi-attributes
Main author: | Sanchez, Jeniffer D. |
---|---|
Publication date: | 2023 |
Other authors: | Rêgo, Leandro C.; Ospina, Raydonal; Leiva, Víctor; Chesneau, Christophe; Castro, Cecília |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | https://hdl.handle.net/1822/85507 |
Abstract: | Predictive models based on empirical similarity are instrumental in biology and data science, where the premise is to measure the likeness of one observation with others in the same dataset. Biological datasets often encompass data that can be categorized. When using empirical similarity-based predictive models, two strategies for handling categorical covariates exist. The first strategy retains categorical covariates in their original form, applying distance measures and allocating weights to each covariate. In contrast, the second strategy creates binary variables, representing each variable level independently, and computes similarity measures solely through the Euclidean distance. This study performs a sensitivity analysis of these two strategies using computational simulations, and applies the results to a biological context. We use a linear regression model as a reference point, and consider two methods for estimating the model parameters, alongside exponential and fractional inverse similarity functions. The sensitivity is evaluated by determining the coefficient of variation of the parameter estimators across the three models as a measure of relative variability. Our results suggest that the first strategy excels over the second one in effectively dealing with categorical variables, and offers greater parsimony due to the use of fewer parameters. |
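The two strategies compared in the abstract can be illustrated with a minimal Python sketch. The record gives no formulas, so everything here is an assumption made for illustration: the similarity-weighted average as the predictor, `exp(-d)` and `1 / (1 + d)` as the exponential and fractional inverse similarity functions, a 0/1 mismatch distance for categorical covariates, and all function names and toy data. It is not the authors' implementation.

```python
import math
import statistics


def exponential_similarity(distance):
    # Assumed form of the exponential similarity function: s = exp(-d).
    return math.exp(-distance)


def fractional_similarity(distance):
    # Assumed form of the fractional inverse similarity function: s = 1 / (1 + d).
    return 1.0 / (1.0 + distance)


def weighted_mixed_distance(x, z, weights, categorical_idx):
    # Strategy 1 (sketch): keep categorical covariates as they are and use
    # one weight per covariate. Numeric covariates contribute a squared
    # difference, categorical ones a 0/1 mismatch (an assumed choice).
    total = 0.0
    for j, (a, b) in enumerate(zip(x, z)):
        if j in categorical_idx:
            diff = 0.0 if a == b else 1.0
        else:
            diff = (a - b) ** 2
        total += weights[j] * diff
    return total


def dummy_encode(x, categorical_levels):
    # Strategy 2 (sketch): replace each categorical covariate by one binary
    # variable per level. categorical_levels maps covariate index -> levels.
    encoded = []
    for j, value in enumerate(x):
        if j in categorical_levels:
            encoded.extend(
                1.0 if value == level else 0.0 for level in categorical_levels[j]
            )
        else:
            encoded.append(float(value))
    return encoded


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity_prediction(similarities, responses):
    # Similarity-weighted average of the observed responses.
    return sum(s * y for s, y in zip(similarities, responses)) / sum(similarities)


def coefficient_of_variation(estimates):
    # Relative variability: sample standard deviation over absolute mean.
    return statistics.stdev(estimates) / abs(statistics.mean(estimates))


# Toy data (hypothetical): one numeric and one categorical covariate.
observed_x = [(1.0, "A"), (2.0, "B"), (1.5, "A")]
observed_y = [2.0, 4.0, 3.0]
new_x = (1.2, "A")

# Strategy 1 with exponential similarity and illustrative weights.
weights = [1.0, 0.5]
sims_1 = [
    exponential_similarity(weighted_mixed_distance(x, new_x, weights, {1}))
    for x in observed_x
]
pred_strategy_1 = similarity_prediction(sims_1, observed_y)

# Strategy 2 with fractional similarity on dummy-coded covariates.
levels = {1: ["A", "B"]}
new_encoded = dummy_encode(new_x, levels)
sims_2 = [
    fractional_similarity(euclidean_distance(dummy_encode(x, levels), new_encoded))
    for x in observed_x
]
pred_strategy_2 = similarity_prediction(sims_2, observed_y)

print(pred_strategy_1, pred_strategy_2)
```

In this sketch the first strategy needs one weight per original covariate, while dummy coding adds one column per category level, which is consistent with the parsimony remark in the abstract. `coefficient_of_variation` would be applied to a list of parameter estimates collected over repeated simulation runs.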
id |
RCAP_a6a6e2128ea2e5694444f8aa0b31a729 |
---|---|
oai_identifier_str |
oai:repositorium.sdum.uminho.pt:1822/85507 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Similarity-based predictive models: Sensitivity analysis and a biological application with multi-attributes; Biological data; Coefficient of variation; Data science; Distance measures; Estimation methods; Predictive modeling; Monte Carlo simulation; Similarity functions; Ciências Naturais::Matemáticas; Parcerias para a implementação dos objetivos; ANCD - Agenția Națională pentru Cercetare și Dezvoltare (UIDB/00013/2020); MDPI; Universidade do Minho; Sanchez, Jeniffer D.; Rêgo, Leandro C.; Ospina, Raydonal; Leiva, Víctor; Chesneau, Christophe; Castro, Cecília; 2023-07; 2023-07-01T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; application/pdf; https://hdl.handle.net/1822/85507; eng; 2079-7737; 10.3390/biology12070959; https://www.mdpi.com/2079-7737/12/7/959; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2023-08-12T01:17:30Z; oai:repositorium.sdum.uminho.pt:1822/85507; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-19T19:00:42.764442; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
Similarity-based predictive models: Sensitivity analysis and a biological application with multi-attributes |
author |
Sanchez, Jeniffer D. |
author_facet |
Sanchez, Jeniffer D.; Rêgo, Leandro C.; Ospina, Raydonal; Leiva, Víctor; Chesneau, Christophe; Castro, Cecília |
author_role |
author |
author2 |
Rêgo, Leandro C.; Ospina, Raydonal; Leiva, Víctor; Chesneau, Christophe; Castro, Cecília |
author2_role |
author author author author author |
dc.contributor.none.fl_str_mv |
Universidade do Minho |
dc.contributor.author.fl_str_mv |
Sanchez, Jeniffer D.; Rêgo, Leandro C.; Ospina, Raydonal; Leiva, Víctor; Chesneau, Christophe; Castro, Cecília |
dc.subject.por.fl_str_mv |
Biological data; Coefficient of variation; Data science; Distance measures; Estimation methods; Predictive modeling; Monte Carlo simulation; Similarity functions; Natural Sciences::Mathematics; Partnerships for the implementation of the goals |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-07; 2023-07-01T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://hdl.handle.net/1822/85507 |
url |
https://hdl.handle.net/1822/85507 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
2079-7737; 10.3390/biology12070959; https://www.mdpi.com/2079-7737/12/7/959 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
MDPI |
publisher.none.fl_str_mv |
MDPI |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |