Zgli

Bibliographic details
Main author: Azevedo, Diogo
Publication date: 2023
Other authors: Maria Rodrigues, Ana; Canhão, Helena; Carvalho, Alexandra M.; Souto, André
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/149625
Abstract: Funding Information: The authors acknowledge Fundação para a Ciência e Tecnologia, LASIGE Research Unit, ref. UIDB/00408/2020 and ref. UIDP/00408/2020, and Instituto de Telecomunicações Research Unit, ref. UIDB/50008/2020 and ref. UIDP/50008/2020. The authors also acknowledge the Project PREDICT (PTDC/CCI-CIF/29877/2017), funded by Fundo Europeu de Desenvolvimento Regional (FEDER), through Programa Operacional Regional LISBOA (LISBOA2020), and by national funds, through Fundação para a Ciência e Tecnologia (FCT), and projects MATISSE (DSAIPA/DS/0026/2019), MONET (PTDC/CCI-BIO/4180/2020) and SmartGlauco (PTDC/CTM-REF/2679/2020). Publisher Copyright: © 2023 by the authors.
id RCAP_0d02a19d0329c81c847043b2db3e7a2a
oai_identifier_str oai:run.unl.pt:10362/149625
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis
Keywords: clustering by compression; clustering techniques; CompLearn; Kolmogorov complexity; normalized compression distance; Zgli; Analytical Chemistry; Information Systems; Biochemistry; Atomic and Molecular Physics, and Optics; Instrumentation; Electrical and Electronic Engineering
Abstract: The normalized compression distance (NCD) is a compression-based similarity measure between a pair of finite objects. Clustering methods usually rely on distances (e.g., Euclidean distance, Manhattan distance) to measure the similarity between objects. The NCD is another such distance, with particular characteristics, that can be used to build the starting distance matrix for methods such as hierarchical clustering or K-medoids. In this work, we propose Zgli, a novel Python module that enables the user to compute the NCD between files inside a given folder. Inspired by the CompLearn Linux command-line tool, the module extends it with new text file compressors, a new compression-by-column option for tabular data such as CSV files, and an encoder for small files made up of categorical data.
Our results demonstrate that compression by column can yield better results than previous methods in the literature when clustering tabular data. Additionally, the categorical encoder can augment categorical data, allowing the NCD to be used on new data types. One advantage of this feature is that it requires no knowledge of the data or its context. Furthermore, because the proposed module is written in Python, one of the most popular programming languages for machine learning, developers can readily use it to tackle problems with a new, compression-based approach. The pipeline was tested on clinical data and proved to be a promising computational strategy, providing patient stratification via clusters and thereby aiding precision medicine.
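The NCD described in the abstract can be illustrated with a short sketch. This is not Zgli's or CompLearn's API; the function names `ncd` and `ncd_matrix` are hypothetical, and the example assumes the commonly used definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes, with `zlib` from the Python standard library as the default compressor.

```python
import random
import zlib
from typing import Callable, List

# A compressor is any function mapping bytes to compressed bytes
# (zlib.compress, bz2.compress and lzma.compress all fit this shape).
Compressor = Callable[[bytes], bytes]


def ncd(x: bytes, y: bytes, compress: Compressor = zlib.compress) -> float:
    """Normalized compression distance between two byte strings (hypothetical helper)."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(objects: List[bytes], compress: Compressor = zlib.compress) -> List[List[float]]:
    """Pairwise NCD matrix, usable as the starting distance matrix for clustering."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]  # assumption: diagonal fixed at 0 by convention
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(objects[i], objects[j], compress)
            matrix[i][j] = d
            matrix[j][i] = d  # assumption: the matrix is symmetrized
    return matrix


# Illustrative inputs: a highly repetitive text and deterministic pseudo-random bytes.
text = b"the quick brown fox jumps over the lazy dog. " * 50
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(2000))

distances = ncd_matrix([text, text, noise])
print(distances[0][1] < distances[0][2])  # identical texts are closer than text vs. noise
```

To apply the same idea to files in a folder, the bytes of each file (for example via `pathlib.Path.read_bytes`) could be passed to `ncd_matrix`, and the resulting matrix handed to a hierarchical clustering or K-medoids routine that accepts precomputed distances.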
dc.title.none.fl_str_mv Zgli
A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis
dc.contributor.none.fl_str_mv NOVA Medical School|Faculdade de Ciências Médicas (NMS|FCM)
Comprehensive Health Research Centre (CHRC) - pólo NMS
Centro de Estudos de Doenças Crónicas (CEDOC)
RUN
dc.contributor.author.fl_str_mv Azevedo, Diogo
Maria Rodrigues, Ana
Canhão, Helena
Carvalho, Alexandra M.
Souto, André
dc.subject.por.fl_str_mv clustering by compression
clustering techniques
CompLearn
Kolmogorov complexity
normalized compression distance
Zgli
Analytical Chemistry
Information Systems
Biochemistry
Atomic and Molecular Physics, and Optics
Instrumentation
Electrical and Electronic Engineering
publishDate 2023
dc.date.none.fl_str_mv 2023-02-23T22:21:09Z
2023-02
2023-02-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/149625
url http://hdl.handle.net/10362/149625
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 1424-8220
PURE: 53992210
https://doi.org/10.3390/s23031219
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP