MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors

Bibliographic details
Main author: Bonidia, Robson P.
Publication date: 2022
Other authors: Domingues, Douglas S. [UNESP], Sanches, Danilo S., de Carvalho, André C P L F
Document type: Article
Language: eng
Source title: Repositório Institucional da UNESP
Full text: http://dx.doi.org/10.1093/bib/bbab434
http://hdl.handle.net/11449/223373
Abstract: One of the main challenges in applying machine learning algorithms to biological sequence data is how to represent a sequence as a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques, such as mathematical descriptors, are not available in existing packages. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (accuracy of 0.6350-0.9897), both when applying only mathematical descriptors and when hybridizing them with well-known descriptors from the literature. Finally, through MathFeature, we outperformed several studies on eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature advances the area by bringing descriptors not available in other packages and by allowing non-experts to use feature extraction techniques.
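Illustrative note: the abstract names numeric mappings and entropy among the descriptor families. The sketch below shows, under stated assumptions, what turning a DNA sequence into a numeric vector could look like. It is not MathFeature's API or the authors' implementation; the function names, the integer mapping values and the choice of k are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical integer mapping, chosen only for illustration;
# the record does not state which mappings MathFeature uses.
INTEGER_MAP = {"T": 0, "C": 1, "A": 2, "G": 3}


def integer_mapping(sequence: str) -> list[int]:
    """Map each nucleotide to an integer, skipping unknown symbols."""
    return [INTEGER_MAP[base] for base in sequence.upper() if base in INTEGER_MAP]


def kmer_shannon_entropy(sequence: str, k: int = 2) -> float:
    """Shannon entropy (in bits) of the k-mer frequency distribution."""
    seq = sequence.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_features(sequence: str, max_k: int = 3) -> list[float]:
    """Fixed-length feature vector: one entropy value for each k in 1..max_k."""
    return [kmer_shannon_entropy(sequence, k) for k in range(1, max_k + 1)]
```

A call such as entropy_features("ACGTACGT") returns one value per k, so sequences of different lengths map to vectors of the same size, which is what a machine learning algorithm needs as input.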
id UNSP_8716e57979452a8a8c551984bd9568f3
oai_identifier_str oai:repositorio.unesp.br:11449/223373
network_acronym_str UNSP
network_name_str Repositório Institucional da UNESP
repository_id_str 2946
Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), FAPESP: 2013/07375-0
Affiliations: Institute of Mathematics and Computer Sciences, University of São Paulo; Group of Genomics and Transcriptomes in Plants, Institute of Biosciences, São Paulo State University (UNESP); Department of Computer Science, Federal University of Technology - Paraná (UTFPR)
dc.contributor.none.fl_str_mv Universidade de São Paulo (USP)
Universidade Estadual Paulista (UNESP)
UTFPR
dc.contributor.author.fl_str_mv Bonidia, Robson P.
Domingues, Douglas S. [UNESP]
Sanches, Danilo S.
de Carvalho, André C P L F
dc.subject.por.fl_str_mv biological sequences
feature extraction
GUI-based platform
mathematical descriptors
package
python
dc.date.none.fl_str_mv 2022-04-28T19:50:15Z
2022-04-28T19:50:15Z
2022-01-17
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://dx.doi.org/10.1093/bib/bbab434
Briefings in bioinformatics, v. 23, n. 1, 2022.
1477-4054
http://hdl.handle.net/11449/223373
10.1093/bib/bbab434
2-s2.0-85123814372
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv Briefings in bioinformatics
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.source.none.fl_str_mv Scopus
reponame:Repositório Institucional da UNESP
instname:Universidade Estadual Paulista (UNESP)
instacron:UNESP
repository.name.fl_str_mv Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP)
repository.mail.fl_str_mv
_version_ 1808128482219130880