BioAutoML: Democratizing Machine Learning in Life Sciences

Bonidia, Robson Parmezan

BioAutoML: Democratizing Machine Learning in Life Sciences

Detalhes bibliográficos
Autor(a) principal:	Bonidia, Robson Parmezan
Data de Publicação:	2024
Tipo de documento:	Tese
Idioma:	eng
Título da fonte:	Biblioteca Digital de Teses e Dissertações da USP
Texto Completo:	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/
Resumo:	Recent technological advances allowed an exponential expansion of biological sequence data, and the extraction of meaningful information through Machine Learning (ML) algorithms. This knowledge improved the understanding of the mechanisms related to several fatal diseases, e.g., Cancer and COVID-19, helping to develop innovative solutions, such as CRISPR-based gene editing, coronavirus vaccine, and precision medicine. These advances benefit our society and economy, directly impacting peoples lives in various areas, such as health care, drug discovery, forensic analysis, and food analysis. Nevertheless, ML approaches applied to biological data require representative, quantitative, and informative features. Necessarily, as many ML algorithms can handle only numerical data, sequences need to be translated into a feature vector. This process, known as feature extraction, is a fundamental step for the elaboration of high-quality ML-based models in bioinformatics, by allowing the feature engineering stage, with the design and selection of suitable features. Feature engineering, ML algorithm selection, and hyperparameter tuning are often manual and time-consuming processes, requiring extensive domain knowledge, and performed manually by a human expert. To deal with this problem, we developed a new package, BioAutoML, which automatically runs an end-to-end ML pipeline. BioAutoML extracts numerical and informative features from biological sequence databases, automating feature selection, recommendation of ML algorithm(s), and tuning of hyperparameters, using Automated ML (AutoML). BioAutoML has two components, divided into four modules, (1) automated feature engineering (feature extraction and selection modules) and (2) Metalearning (algorithm recommendation and hyperparameter tuning modules). Our experimental results, assessing the relevance of our proposal, indicate robust results for different problem domains, such as SARS-CoV-2, anticancer peptides, HIV sequences, and non-coding RNAs. According to our systematic review, our proposal is innovative compared to available studies in the literature, being the first study to propose automated feature engineering and metalearning for biological sequences. BioAutoML has a high potential to significantly reduce the expertise required to use ML pipelines, aiding researchers in combating diseases, particularly in low- and middle-income countries. This initiative can provide biologists, physicians, epidemiologists, and other stakeholders with an opportunity for widespread use of these techniques to enhance the health and well-being of their communities.

Metadados do item

id	USP_9f74ae3ef30e12f15d30e49461664403
oai_identifier_str	oai:teses.usp.br:tde-01042024-092414
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str	2721
spelling	BioAutoML: Democratizing Machine Learning in Life SciencesBioAutoML: Democratizando Aprendizado de Máquina nas Ciências da VidaAutomated feature engineeringBioAutoMLBioAutoMLBiological sequencesDescritores matemáticosEngenharia de características automatizadaMathematical descriptorsMathFeatureMathFeatureMeta-aprendizadoMetalearningSequências biológicasRecent technological advances allowed an exponential expansion of biological sequence data, and the extraction of meaningful information through Machine Learning (ML) algorithms. This knowledge improved the understanding of the mechanisms related to several fatal diseases, e.g., Cancer and COVID-19, helping to develop innovative solutions, such as CRISPR-based gene editing, coronavirus vaccine, and precision medicine. These advances benefit our society and economy, directly impacting peoples lives in various areas, such as health care, drug discovery, forensic analysis, and food analysis. Nevertheless, ML approaches applied to biological data require representative, quantitative, and informative features. Necessarily, as many ML algorithms can handle only numerical data, sequences need to be translated into a feature vector. This process, known as feature extraction, is a fundamental step for the elaboration of high-quality ML-based models in bioinformatics, by allowing the feature engineering stage, with the design and selection of suitable features. Feature engineering, ML algorithm selection, and hyperparameter tuning are often manual and time-consuming processes, requiring extensive domain knowledge, and performed manually by a human expert. To deal with this problem, we developed a new package, BioAutoML, which automatically runs an end-to-end ML pipeline. BioAutoML extracts numerical and informative features from biological sequence databases, automating feature selection, recommendation of ML algorithm(s), and tuning of hyperparameters, using Automated ML (AutoML). BioAutoML has two components, divided into four modules, (1) automated feature engineering (feature extraction and selection modules) and (2) Metalearning (algorithm recommendation and hyperparameter tuning modules). Our experimental results, assessing the relevance of our proposal, indicate robust results for different problem domains, such as SARS-CoV-2, anticancer peptides, HIV sequences, and non-coding RNAs. According to our systematic review, our proposal is innovative compared to available studies in the literature, being the first study to propose automated feature engineering and metalearning for biological sequences. BioAutoML has a high potential to significantly reduce the expertise required to use ML pipelines, aiding researchers in combating diseases, particularly in low- and middle-income countries. This initiative can provide biologists, physicians, epidemiologists, and other stakeholders with an opportunity for widespread use of these techniques to enhance the health and well-being of their communities.Avanços tecnológicos recentes permitiram uma expansão exponencial dos dados de sequências biológicas e a extração de informações significativas por meio de algoritmos de Aprendizado de Máquina (AM). Esse conhecimento aprimorou a compreensão dos mecanismos relacionados a várias doenças fatais, como o câncer e a COVID-19, contribuindo para o desenvolvimento de soluções inovadoras, como a edição de genes com base no CRISPR, vacinas contra o coronavírus e medicina de precisão. Esses avanços beneficiam nossa sociedade e economia, impactando diretamente a vida das pessoas em várias áreas, como cuidados de saúde, descoberta de medicamentos, análise forense e análise de alimentos. No entanto, abordagens de AM aplicadas a dados biológicos requerem características representativas, quantitativas e informativas. Necessariamente, uma vez que muitos algoritmos de AM só podem lidar com dados numéricos, as sequências precisam ser traduzidas em um vetor de características. Esse processo, conhecido como extração de características, é uma etapa fundamental para a elaboração de modelos de AM de alta qualidade em bioinformática, permitindo a etapa de engenharia de características, com o design e seleção de características adequadas. A engenharia de características, a seleção de algoritmos de AM e o ajuste de hiperparâmetros são frequentemente processos manuais e demorados, que requerem amplo conhecimento do domínio e são realizados manualmente por um especialista humano. Para lidar com esse problema, desenvolvemos um novo pacote, o BioAutoML, que executa automaticamente um pipeline de AM de ponta a ponta. O BioAutoML extrai características numéricas e informativas de bancos de dados de sequências biológicas, automatizando a seleção de características, a recomendação de algoritmos de AM e o ajuste de hiperparâmetros, usando o Aprendizado de Máquina Automatizado (AutoML). O BioAutoML possui dois componentes, divididos em quatro módulos: (1) engenharia de características automatizada (módulos de extração e seleção de características) e (2) Meta-Aprendizado (módulos de recomendação de algoritmos e ajuste de hiperparâmetros). Nossos resultados experimentais, ao avaliar a relevância de nossa proposta, indicam resultados robustos para diferentes domínios de problemas, como SARS-CoV-2, peptídeos anticancerígenos, sequências de HIV e RNAs não codificadores. De acordo com nossa revisão sistemática, nossa proposta é inovadora em comparação com estudos disponíveis na literatura, sendo o primeiro estudo a propor engenharia de características automatizada e metalearning para sequências biológicas. O BioAutoML tem um alto potencial para reduzir significativamente a expertise necessária para usar pipelines de AM, auxiliando os pesquisadores no combate a doenças, principalmente em países de baixa e média renda. Esta iniciativa pode oferecer aos biólogos, médicos, epidemiologistas e outras partes interessadas a oportunidade de utilizar amplamente essas técnicas para aprimorar a saúde e o bem-estar de suas comunidades.Biblioteca Digitais de Teses e Dissertações da USPCarvalho, André Carlos Ponce de Leon Ferreira deBonidia, Robson Parmezan2024-01-31info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/doctoralThesisapplication/pdfhttps://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2024-04-01T12:58:03Zoai:teses.usp.br:tde-01042024-092414Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.bropendoar:27212024-04-01T12:58:03Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv	BioAutoML: Democratizing Machine Learning in Life Sciences BioAutoML: Democratizando Aprendizado de Máquina nas Ciências da Vida
title	BioAutoML: Democratizing Machine Learning in Life Sciences
spellingShingle	BioAutoML: Democratizing Machine Learning in Life Sciences Bonidia, Robson Parmezan Automated feature engineering BioAutoML BioAutoML Biological sequences Descritores matemáticos Engenharia de características automatizada Mathematical descriptors MathFeature MathFeature Meta-aprendizado Metalearning Sequências biológicas
title_short	BioAutoML: Democratizing Machine Learning in Life Sciences
title_full	BioAutoML: Democratizing Machine Learning in Life Sciences
title_fullStr	BioAutoML: Democratizing Machine Learning in Life Sciences
title_full_unstemmed	BioAutoML: Democratizing Machine Learning in Life Sciences
title_sort	BioAutoML: Democratizing Machine Learning in Life Sciences
author	Bonidia, Robson Parmezan
author_facet	Bonidia, Robson Parmezan
author_role	author
dc.contributor.none.fl_str_mv	Carvalho, André Carlos Ponce de Leon Ferreira de
dc.contributor.author.fl_str_mv	Bonidia, Robson Parmezan
dc.subject.por.fl_str_mv	Automated feature engineering BioAutoML BioAutoML Biological sequences Descritores matemáticos Engenharia de características automatizada Mathematical descriptors MathFeature MathFeature Meta-aprendizado Metalearning Sequências biológicas
topic	Automated feature engineering BioAutoML BioAutoML Biological sequences Descritores matemáticos Engenharia de características automatizada Mathematical descriptors MathFeature MathFeature Meta-aprendizado Metalearning Sequências biológicas
description	Recent technological advances allowed an exponential expansion of biological sequence data, and the extraction of meaningful information through Machine Learning (ML) algorithms. This knowledge improved the understanding of the mechanisms related to several fatal diseases, e.g., Cancer and COVID-19, helping to develop innovative solutions, such as CRISPR-based gene editing, coronavirus vaccine, and precision medicine. These advances benefit our society and economy, directly impacting peoples lives in various areas, such as health care, drug discovery, forensic analysis, and food analysis. Nevertheless, ML approaches applied to biological data require representative, quantitative, and informative features. Necessarily, as many ML algorithms can handle only numerical data, sequences need to be translated into a feature vector. This process, known as feature extraction, is a fundamental step for the elaboration of high-quality ML-based models in bioinformatics, by allowing the feature engineering stage, with the design and selection of suitable features. Feature engineering, ML algorithm selection, and hyperparameter tuning are often manual and time-consuming processes, requiring extensive domain knowledge, and performed manually by a human expert. To deal with this problem, we developed a new package, BioAutoML, which automatically runs an end-to-end ML pipeline. BioAutoML extracts numerical and informative features from biological sequence databases, automating feature selection, recommendation of ML algorithm(s), and tuning of hyperparameters, using Automated ML (AutoML). BioAutoML has two components, divided into four modules, (1) automated feature engineering (feature extraction and selection modules) and (2) Metalearning (algorithm recommendation and hyperparameter tuning modules). Our experimental results, assessing the relevance of our proposal, indicate robust results for different problem domains, such as SARS-CoV-2, anticancer peptides, HIV sequences, and non-coding RNAs. According to our systematic review, our proposal is innovative compared to available studies in the literature, being the first study to propose automated feature engineering and metalearning for biological sequences. BioAutoML has a high potential to significantly reduce the expertise required to use ML pipelines, aiding researchers in combating diseases, particularly in low- and middle-income countries. This initiative can provide biologists, physicians, epidemiologists, and other stakeholders with an opportunity for widespread use of these techniques to enhance the health and well-being of their communities.
publishDate	2024
dc.date.none.fl_str_mv	2024-01-31
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/
url	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-01042024-092414/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Liberar o conteúdo para acesso público. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Liberar o conteúdo para acesso público.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.br
_version_	1815256576076808192

BioAutoML: Democratizing Machine Learning in Life Sciences

Registros relacionados