BioAutoML: Democratizing Machine Learning in Life Sciences
Main author: | Bonidia, Robson Parmezan |
---|---|
Publication date: | 2024 |
Document type: | Thesis |
Language: | eng |
Source title: | Biblioteca Digital de Teses e Dissertações da USP |
Full text: | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/ |
Abstract: | Recent technological advances have enabled an exponential expansion of biological sequence data and the extraction of meaningful information from it through Machine Learning (ML) algorithms. This knowledge has improved the understanding of the mechanisms related to several fatal diseases, e.g., cancer and COVID-19, helping to develop innovative solutions such as CRISPR-based gene editing, coronavirus vaccines, and precision medicine. These advances benefit our society and economy, directly impacting people's lives in various areas, such as health care, drug discovery, forensic analysis, and food analysis. Nevertheless, ML approaches applied to biological data require representative, quantitative, and informative features. Because many ML algorithms can handle only numerical data, sequences need to be translated into feature vectors. This process, known as feature extraction, is a fundamental step in building high-quality ML-based models in bioinformatics, as it enables the feature engineering stage, in which suitable features are designed and selected. Feature engineering, ML algorithm selection, and hyperparameter tuning are often manual and time-consuming processes that require extensive domain knowledge and are performed by a human expert. To address this problem, we developed a new package, BioAutoML, which automatically runs an end-to-end ML pipeline. BioAutoML extracts numerical and informative features from biological sequence databases and automates feature selection, the recommendation of ML algorithm(s), and hyperparameter tuning, using Automated ML (AutoML). BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) metalearning (algorithm recommendation and hyperparameter tuning modules). Our experiments, which assess the relevance of our proposal, indicate robust results across different problem domains, such as SARS-CoV-2, anticancer peptides, HIV sequences, and non-coding RNAs. According to our systematic review, our proposal is innovative compared to available studies in the literature, being the first study to propose automated feature engineering and metalearning for biological sequences. BioAutoML has a high potential to significantly reduce the expertise required to use ML pipelines, aiding researchers in combating diseases, particularly in low- and middle-income countries. This initiative can provide biologists, physicians, epidemiologists, and other stakeholders with an opportunity for widespread use of these techniques to enhance the health and well-being of their communities. |
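
The abstract notes that sequences must be translated into numerical feature vectors before most ML algorithms can use them. As a minimal sketch of that idea only, the snippet below computes relative k-mer frequencies for a DNA sequence. It is not BioAutoML's API: the function name `kmer_frequencies`, its parameters, and the choice of k-mer frequencies as the descriptor are assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(sequence: str, k: int = 2, alphabet: str = "ACGT") -> list[float]:
    """Hypothetical helper: map a sequence to relative k-mer frequencies.

    The output has one entry per possible k-mer over `alphabet`, in the
    order produced by itertools.product, so every sequence yields a
    vector of the same length regardless of its own length.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing symbols outside the alphabet, e.g. N
            counts[window] += 1
            total += 1
    if total == 0:
        # No valid k-mer found (sequence too short or fully ambiguous).
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Example: a 6-nucleotide sequence becomes a 16-dimensional vector for k=2.
features = kmer_frequencies("ACGTAC", k=2)
```

A fixed-length vector like this is what downstream feature selection, algorithm recommendation, and hyperparameter tuning steps would consume; the actual descriptors used in the thesis may differ.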
id |
USP_9f74ae3ef30e12f15d30e49461664403 |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-01042024-092414 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
dc.title.none.fl_str_mv |
BioAutoML: Democratizing Machine Learning in Life Sciences; BioAutoML: Democratizando Aprendizado de Máquina nas Ciências da Vida |
title |
BioAutoML: Democratizing Machine Learning in Life Sciences |
spellingShingle |
BioAutoML: Democratizing Machine Learning in Life Sciences; Bonidia, Robson Parmezan; Automated feature engineering; BioAutoML; Biological sequences; Descritores matemáticos; Engenharia de características automatizada; Mathematical descriptors; MathFeature; Meta-aprendizado; Metalearning; Sequências biológicas |
title_short |
BioAutoML: Democratizing Machine Learning in Life Sciences |
title_full |
BioAutoML: Democratizing Machine Learning in Life Sciences |
title_fullStr |
BioAutoML: Democratizing Machine Learning in Life Sciences |
title_full_unstemmed |
BioAutoML: Democratizing Machine Learning in Life Sciences |
title_sort |
BioAutoML: Democratizing Machine Learning in Life Sciences |
author |
Bonidia, Robson Parmezan |
author_facet |
Bonidia, Robson Parmezan |
author_role |
author |
dc.contributor.none.fl_str_mv |
Carvalho, André Carlos Ponce de Leon Ferreira de |
dc.contributor.author.fl_str_mv |
Bonidia, Robson Parmezan |
dc.subject.por.fl_str_mv |
Automated feature engineering; BioAutoML; Biological sequences; Descritores matemáticos; Engenharia de características automatizada; Mathematical descriptors; MathFeature; Meta-aprendizado; Metalearning; Sequências biológicas |
topic |
Automated feature engineering; BioAutoML; Biological sequences; Descritores matemáticos; Engenharia de características automatizada; Mathematical descriptors; MathFeature; Meta-aprendizado; Metalearning; Sequências biológicas |
description |
Recent technological advances have enabled an exponential expansion of biological sequence data and the extraction of meaningful information from it through Machine Learning (ML) algorithms. This knowledge has improved the understanding of the mechanisms related to several fatal diseases, e.g., cancer and COVID-19, helping to develop innovative solutions such as CRISPR-based gene editing, coronavirus vaccines, and precision medicine. These advances benefit our society and economy, directly impacting people's lives in various areas, such as health care, drug discovery, forensic analysis, and food analysis. Nevertheless, ML approaches applied to biological data require representative, quantitative, and informative features. Because many ML algorithms can handle only numerical data, sequences need to be translated into feature vectors. This process, known as feature extraction, is a fundamental step in building high-quality ML-based models in bioinformatics, as it enables the feature engineering stage, in which suitable features are designed and selected. Feature engineering, ML algorithm selection, and hyperparameter tuning are often manual and time-consuming processes that require extensive domain knowledge and are performed by a human expert. To address this problem, we developed a new package, BioAutoML, which automatically runs an end-to-end ML pipeline. BioAutoML extracts numerical and informative features from biological sequence databases and automates feature selection, the recommendation of ML algorithm(s), and hyperparameter tuning, using Automated ML (AutoML). BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) metalearning (algorithm recommendation and hyperparameter tuning modules). Our experiments, which assess the relevance of our proposal, indicate robust results across different problem domains, such as SARS-CoV-2, anticancer peptides, HIV sequences, and non-coding RNAs. According to our systematic review, our proposal is innovative compared to available studies in the literature, being the first study to propose automated feature engineering and metalearning for biological sequences. BioAutoML has a high potential to significantly reduce the expertise required to use ML pipelines, aiding researchers in combating diseases, particularly in low- and middle-income countries. This initiative can provide biologists, physicians, epidemiologists, and other stakeholders with an opportunity for widespread use of these techniques to enhance the health and well-being of their communities. |
publishDate |
2024 |
dc.date.none.fl_str_mv |
2024-01-31 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/ |
url |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/ |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
|
dc.rights.driver.fl_str_mv |
Release the content for public access. info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Release the content for public access. |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.coverage.none.fl_str_mv |
|
dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br; atendimento@aguia.usp.br |
_version_ |
1815256576076808192 |