Methodology to identify a gene expression signature by merging microarray datasets

Bibliographic details
Main author: Fajarda, Olga
Publication date: 2023
Other authors: Almeida, João Rafael, Duarte-Pereira, Sara, Silva, Raquel M., Oliveira, José Luís
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10400.14/40884
Abstract: A vast number of microarray datasets have been produced as a way to identify differentially expressed genes and gene expression signatures. A better understanding of these biological processes can help in the diagnosis and prognosis of diseases, as well as in the therapeutic response to drugs. However, most of the available datasets are composed of a reduced number of samples, leading to low statistical, predictive and generalization power. One way to overcome this problem is by merging several microarray datasets into a single dataset, which is typically a challenging task. Statistical methods or supervised machine learning algorithms are usually used to determine gene expression signatures. Nevertheless, statistical methods require an arbitrary threshold to be defined, and supervised machine learning methods can be ineffective when applied to high-dimensional datasets like microarrays. We propose a methodology to identify gene expression signatures by merging microarray datasets. This methodology uses statistical methods to obtain several sets of differentially expressed genes and uses supervised machine learning algorithms to select the gene expression signature. This methodology was validated using two distinct research applications: one using heart failure and the other using autism spectrum disorder microarray datasets. For the first, we obtained a gene expression signature composed of 117 genes, with a classification accuracy of approximately 98%. For the second use case, we obtained a gene expression signature composed of 79 genes, with a classification accuracy of approximately 82%. This methodology was implemented in the R language and is available, under the MIT licence, at https://github.com/bioinformatics-ua/MicroGES.
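The two-stage idea described in the abstract (a statistical test yields several candidate sets of differentially expressed genes, then a classifier chooses among them) can be illustrated with a minimal Python sketch. It is not the MicroGES implementation, which is written in R. The per-dataset z-scoring used for merging, the Welch t-statistic thresholds, the nearest-centroid classifier, the toy data and all function names are assumptions made purely for illustration.

```python
import random
import statistics


def zscore_within_dataset(dataset):
    """Standardise each gene inside one dataset (assumed stand-in for batch correction)."""
    columns = list(zip(*dataset["samples"]))  # one tuple of values per gene
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    scaled = [[(v - m) / s for v, m, s in zip(row, means, sds)]
              for row in dataset["samples"]]
    return {"genes": dataset["genes"], "samples": scaled, "labels": dataset["labels"]}


def merge_datasets(datasets):
    """Keep only genes present in every dataset and stack the standardised samples."""
    common = [g for g in datasets[0]["genes"]
              if all(g in d["genes"] for d in datasets[1:])]
    rows, labels = [], []
    for d in map(zscore_within_dataset, datasets):
        idx = [d["genes"].index(g) for g in common]
        rows.extend([[row[i] for i in idx] for row in d["samples"]])
        labels.extend(d["labels"])
    return common, rows, labels


def welch_t(a, b):
    """Welch t-statistic for two groups with possibly unequal variances."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def candidate_gene_sets(rows, labels, thresholds):
    """One candidate set of gene indices per threshold on |t|."""
    t_abs = []
    for j in range(len(rows[0])):
        case = [r[j] for r, y in zip(rows, labels) if y == 1]
        ctrl = [r[j] for r, y in zip(rows, labels) if y == 0]
        t_abs.append(abs(welch_t(case, ctrl)))
    return {t: [j for j, v in enumerate(t_abs) if v >= t] for t in thresholds}


def loo_accuracy(rows, labels, gene_idx):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen genes."""
    correct = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        dist = {}
        for cls in (0, 1):
            members = [r for k, (r, l) in enumerate(zip(rows, labels))
                       if k != i and l == cls]
            centroid = [statistics.mean(m[j] for m in members) for j in gene_idx]
            dist[cls] = sum((row[j] - c) ** 2 for j, c in zip(gene_idx, centroid))
        correct += min(dist, key=dist.get) == y
    return correct / len(rows)


def select_signature(rows, labels, thresholds):
    """Pick the candidate set with the best accuracy, preferring smaller sets on ties."""
    best = None
    for t, idx in candidate_gene_sets(rows, labels, thresholds).items():
        if not idx:
            continue
        key = (loo_accuracy(rows, labels, idx), -len(idx))
        if best is None or key > best[0]:
            best = (key, t, idx)
    if best is None:
        raise ValueError("no gene passed any threshold")
    return best[1], best[2], best[0][0]


def toy_dataset(rng, genes, n_per_class, shift, informative):
    """Synthetic data: informative genes are shifted by 3 units in class 1."""
    samples, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            samples.append([rng.gauss(shift + (3.0 * label if g in informative else 0.0), 1.0)
                            for g in genes])
            labels.append(label)
    return {"genes": genes, "samples": samples, "labels": labels}


rng = random.Random(0)
informative = {"G1", "G2"}
datasets = [
    toy_dataset(rng, ["G1", "G2", "G3", "G4", "G5", "G6"], 8, 0.0, informative),
    toy_dataset(rng, ["G2", "G1", "G3", "G4", "G5", "G7"], 8, 5.0, informative),  # offset batch
]
common, rows, labels = merge_datasets(datasets)
threshold, idx, accuracy = select_signature(rows, labels, [2.0, 4.0, 6.0])
signature = [common[j] for j in idx]
print(threshold, signature, round(accuracy, 2))
```

Because the candidate genes are selected on all samples before cross-validation, the accuracy printed by this sketch is optimistic; a real analysis would nest the gene selection inside each fold.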
id RCAP_a54cf679c9ea4bd1ffb514a639b83318
oai_identifier_str oai:repositorio.ucp.pt:10400.14/40884
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Methodology to identify a gene expression signature by merging microarray datasets
dc.contributor.none.fl_str_mv Veritati - Repositório Institucional da Universidade Católica Portuguesa
dc.contributor.author.fl_str_mv Fajarda, Olga
Almeida, João Rafael
Duarte-Pereira, Sara
Silva, Raquel M.
Oliveira, José Luís
dc.subject.por.fl_str_mv Autism spectrum disorder
Gene expression signature
Heart failure
LSVM
Microarray data
Neural network
Random forest
dc.date.none.fl_str_mv 2023-04-19T13:07:02Z
2023-06
2023-06-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.14/40884
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 0010-4825
10.1016/j.compbiomed.2023.106867
85152129348
37060770
000982862600001
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP