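Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis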
Main author: | Azevedo, Diogo |
---|---|
Publication date: | 2023 |
Other authors: | Maria Rodrigues, Ana; Canhão, Helena; Carvalho, Alexandra M.; Souto, André |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/149625 |
Abstract: | Funding Information: The authors acknowledge Fundação para a Ciência e Tecnologia, LASIGE Research Unit, ref. UIDB/00408/2020 and ref. UIDP/00408/2020 and Instituto de Telecomunicações Research Unit, ref. UIDB/50008/2020, and UIDP/50008/2020. The authors also acknowledge the Project PREDICT (PTDC/CCI-CIF/29877/2017), funded by Fundo Europeu de Desenvolvimento Regional (FEDER), through Programa Operacional Regional LISBOA (LISBOA2020), and by national funds, through Fundação para a Ciência e Tecnologia (FCT), and projects MATISSE (DSAIPA/DS/0026/2019), MONET (PTDC/CCI-BIO/4180/2020) and SmartGlauco (PTDC/CTM-REF/2679/2020). Publisher Copyright: © 2023 by the authors. |
id |
RCAP_0d02a19d0329c81c847043b2db3e7a2a |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/149625 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis
clustering by compression; clustering techniques; CompLearn; Kolmogorov complexity; normalized compression distance; Zgli; Analytical Chemistry; Information Systems; Biochemistry; Atomic and Molecular Physics, and Optics; Instrumentation; Electrical and Electronic Engineering
Funding Information: The authors acknowledge Fundação para a Ciência e Tecnologia, LASIGE Research Unit, ref. UIDB/00408/2020 and ref. UIDP/00408/2020 and Instituto de Telecomunicações Research Unit, ref. UIDB/50008/2020, and UIDP/50008/2020. The authors also acknowledge the Project PREDICT (PTDC/CCI-CIF/29877/2017), funded by Fundo Europeu de Desenvolvimento Regional (FEDER), through Programa Operacional Regional LISBOA (LISBOA2020), and by national funds, through Fundação para a Ciência e Tecnologia (FCT), and projects MATISSE (DSAIPA/DS/0026/2019), MONET (PTDC/CCI-BIO/4180/2020) and SmartGlauco (PTDC/CTM-REF/2679/2020). Publisher Copyright: © 2023 by the authors.
The normalized compression distance (NCD) is a similarity measure between a pair of finite objects based on compression. Clustering methods usually use distances (e.g., Euclidean distance, Manhattan distance) to measure the similarity between objects. The NCD is yet another distance with particular characteristics that can be used to build the starting distance matrix for methods such as hierarchical clustering or K-medoids. In this work, we propose Zgli, a novel Python module that enables the user to compute the NCD between files inside a given folder. Inspired by the CompLearn Linux command line tool, this module iterates on it by providing new text file compressors, a new compression-by-column option for tabular data, such as CSV files, and an encoder for small files made up of categorical data. Our results demonstrate that compression by column can yield better results than previous methods in the literature when clustering tabular data. Additionally, the categorical encoder shows that it can augment categorical data, allowing the use of the NCD for new data types. One of the advantages is that using this new feature does not require knowledge or context of the data. Furthermore, the fact that the new proposed module is written in Python, one of the most popular programming languages for machine learning, potentiates its use by developers to tackle problems with a new approach based on compression. This pipeline was tested in clinical data and proved a promising computational strategy by providing patient stratification via clusters aiding in precision medicine.
NOVA Medical School|Faculdade de Ciências Médicas (NMS|FCM); Comprehensive Health Research Centre (CHRC) - pólo NMS; Centro de Estudos de Doenças Crónicas (CEDOC); RUN
Azevedo, Diogo; Maria Rodrigues, Ana; Canhão, Helena; Carvalho, Alexandra M.; Souto, André
2023-02-23T22:21:09Z; 2023-02; 2023-02-01T00:00:00Z
info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; application/pdf
http://hdl.handle.net/10362/149625
eng
1424-8220; PURE: 53992210; https://doi.org/10.3390/s23031219
info:eu-repo/semantics/openAccess
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP
2024-03-11T05:31:31Z
oai:run.unl.pt:10362/149625
Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160
2024-03-20T03:53:48.729423
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
false |
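The abstract in the field above describes the NCD as a compression-based distance that can seed a distance matrix for hierarchical clustering or K-medoids. A minimal sketch of that idea follows, using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. This is not the Zgli API: the function names `compressed_size`, `ncd`, and `ncd_matrix` are hypothetical, and `zlib` from the Python standard library is an assumed stand-in for whichever compressors the module actually provides.

```python
import zlib
from typing import List


def compressed_size(data: bytes) -> int:
    """Length in bytes of the input after zlib compression (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between x and y.

    Assumption: zlib is used as the compressor C; any real compressor
    only approximates the ideal (Kolmogorov-complexity) distance.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(objects: List[bytes]) -> List[List[float]]:
    """Pairwise NCD matrix.

    Each pair is computed once and mirrored, so the matrix is symmetric
    by construction; the diagonal is set to zero by convention.
    """
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(objects[i], objects[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

To cluster files in a folder, one would read each file as bytes, pass the list to `ncd_matrix`, and hand the result to a clustering routine that accepts precomputed distances (for example, SciPy's hierarchical clustering after converting the square matrix to condensed form). Because zlib matches repeats only within a 32 KiB window, this particular compressor is a poor choice for large inputs.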
dc.title.none.fl_str_mv |
Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis |
title |
Zgli |
spellingShingle |
Zgli; Azevedo, Diogo; clustering by compression; clustering techniques; CompLearn; Kolmogorov complexity; normalized compression distance; Zgli; Analytical Chemistry; Information Systems; Biochemistry; Atomic and Molecular Physics, and Optics; Instrumentation; Electrical and Electronic Engineering |
title_short |
Zgli |
title_full |
Zgli |
title_fullStr |
Zgli |
title_full_unstemmed |
Zgli |
title_sort |
Zgli |
author |
Azevedo, Diogo |
author_facet |
Azevedo, Diogo; Maria Rodrigues, Ana; Canhão, Helena; Carvalho, Alexandra M.; Souto, André |
author_role |
author |
author2 |
Maria Rodrigues, Ana; Canhão, Helena; Carvalho, Alexandra M.; Souto, André |
author2_role |
author; author; author; author |
dc.contributor.none.fl_str_mv |
NOVA Medical School|Faculdade de Ciências Médicas (NMS|FCM); Comprehensive Health Research Centre (CHRC) - pólo NMS; Centro de Estudos de Doenças Crónicas (CEDOC); RUN |
dc.contributor.author.fl_str_mv |
Azevedo, Diogo; Maria Rodrigues, Ana; Canhão, Helena; Carvalho, Alexandra M.; Souto, André |
dc.subject.por.fl_str_mv |
clustering by compression; clustering techniques; CompLearn; Kolmogorov complexity; normalized compression distance; Zgli; Analytical Chemistry; Information Systems; Biochemistry; Atomic and Molecular Physics, and Optics; Instrumentation; Electrical and Electronic Engineering |
topic |
clustering by compression; clustering techniques; CompLearn; Kolmogorov complexity; normalized compression distance; Zgli; Analytical Chemistry; Information Systems; Biochemistry; Atomic and Molecular Physics, and Optics; Instrumentation; Electrical and Electronic Engineering |
description |
Funding Information: The authors acknowledge Fundação para a Ciência e Tecnologia, LASIGE Research Unit, ref. UIDB/00408/2020 and ref. UIDP/00408/2020 and Instituto de Telecomunicações Research Unit, ref. UIDB/50008/2020, and UIDP/50008/2020. The authors also acknowledge the Project PREDICT (PTDC/CCI-CIF/29877/2017), funded by Fundo Europeu de Desenvolvimento Regional (FEDER), through Programa Operacional Regional LISBOA (LISBOA2020), and by national funds, through Fundação para a Ciência e Tecnologia (FCT), and projects MATISSE (DSAIPA/DS/0026/2019), MONET (PTDC/CCI-BIO/4180/2020) and SmartGlauco (PTDC/CTM-REF/2679/2020). Publisher Copyright: © 2023 by the authors. |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-02-23T22:21:09Z; 2023-02; 2023-02-01T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/149625 |
url |
http://hdl.handle.net/10362/149625 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
1424-8220; PURE: 53992210; https://doi.org/10.3390/s23031219 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799138128090365952 |