BioAutoML: Democratizing Machine Learning in Life Sciences

Bibliographic details
Main author: Bonidia, Robson Parmezan
Publication date: 2024
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/
Abstract: Recent technological advances have enabled an exponential expansion of biological sequence data and the extraction of meaningful information from it through Machine Learning (ML) algorithms. This knowledge has improved the understanding of the mechanisms underlying several fatal diseases, e.g., cancer and COVID-19, helping to develop innovative solutions such as CRISPR-based gene editing, coronavirus vaccines, and precision medicine. These advances benefit society and the economy, directly affecting people's lives in areas such as health care, drug discovery, forensic analysis, and food analysis. Nevertheless, ML approaches applied to biological data require representative, quantitative, and informative features. Because many ML algorithms can handle only numerical data, sequences must first be translated into feature vectors. This process, known as feature extraction, is a fundamental step in building high-quality ML-based models in bioinformatics: it enables the feature engineering stage, in which suitable features are designed and selected. Feature engineering, ML algorithm selection, and hyperparameter tuning are often manual, time-consuming processes that require extensive domain knowledge from a human expert. To address this problem, we developed a new package, BioAutoML, which automatically runs an end-to-end ML pipeline. BioAutoML extracts numerical and informative features from biological sequence databases and uses Automated ML (AutoML) to automate feature selection, the recommendation of ML algorithm(s), and hyperparameter tuning. BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) metalearning (algorithm recommendation and hyperparameter tuning modules). Our experimental results, which assess the relevance of our proposal, are robust across different problem domains, such as SARS-CoV-2, anticancer peptides, HIV sequences, and non-coding RNAs. According to our systematic review, our proposal is innovative compared with available studies in the literature, being the first to propose automated feature engineering and metalearning for biological sequences. BioAutoML has high potential to significantly reduce the expertise required to use ML pipelines, aiding researchers in combating diseases, particularly in low- and middle-income countries. This initiative can give biologists, physicians, epidemiologists, and other stakeholders an opportunity to use these techniques widely to enhance the health and well-being of their communities.
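As an illustration of the feature extraction step the abstract describes (turning a sequence into a numerical feature vector), the following is a minimal Python sketch that computes normalized k-mer frequencies. It is not taken from BioAutoML: the function name kmer_frequencies, its parameters, and the choice of k-mer counting as the descriptor are assumptions made here for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper for illustration; not part of the BioAutoML package.
    """
    sequence = sequence.upper()
    # Enumerate every possible k-mer so the vector has a fixed length.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows of length k along the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


# Single-nucleotide composition of a short DNA sequence.
features = kmer_frequencies("ACGTACGT", k=1)
print(features)  # [0.25, 0.25, 0.25, 0.25]
```

A vector of this kind has the same length for every sequence (4^k entries for DNA), which is what lets sequences of different lengths be passed to ML algorithms that accept only numerical input.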
id USP_9f74ae3ef30e12f15d30e49461664403
oai_identifier_str oai:teses.usp.br:tde-01042024-092414
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
dc.title.none.fl_str_mv BioAutoML: Democratizing Machine Learning in Life Sciences
BioAutoML: Democratizando Aprendizado de Máquina nas Ciências da Vida
author Bonidia, Robson Parmezan
author_facet Bonidia, Robson Parmezan
author_role author
dc.contributor.none.fl_str_mv Carvalho, André Carlos Ponce de Leon Ferreira de
dc.contributor.author.fl_str_mv Bonidia, Robson Parmezan
dc.subject.por.fl_str_mv Automated feature engineering
BioAutoML
Biological sequences
Mathematical descriptors
MathFeature
Metalearning
publishDate 2024
dc.date.none.fl_str_mv 2024-01-31
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1809090272777207808