Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study

Bibliographic details
Main author: Barbon, Rafael Silva
Publication date: 2022
Other authors: Akabane, Ademar Takeo
Document type: Article
Language: eng
Source title: Repositório Institucional PUC-Campinas
Full text: http://repositorio.sis.puc-campinas.edu.br/xmlui/handle/123456789/17187
Abstract: The Internet of Things is a paradigm that interconnects smart devices through the internet to provide ubiquitous services to users. Together with Web 2.0 platforms, it generates vast amounts of textual data, so a significant challenge in this context is performing text classification automatically. State-of-the-art results have recently been obtained by employing language models trained from scratch on corpora built from online news. Two models worth highlighting are BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT, a smaller pre-trained general-purpose language representation model distilled from BERT. In this context, through a case study, we apply the two models mentioned above to the text classification task in two languages (English and Brazilian Portuguese) on different datasets. The results show that, for both English and Brazilian Portuguese, DistilBERT trained about 45% faster than its larger counterpart, was about 40% smaller, and preserved about 96% of the larger model's language comprehension on balanced datasets.
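DistilBERT's size/quality trade-off reported above comes from knowledge distillation, where a small student model is trained to reproduce a larger teacher's output distribution. The record contains no code; the following is only an illustrative sketch of the core distillation loss in plain NumPy (not taken from the paper), with the temperature `T` and the toy logits being hypothetical choices:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) between temperature-softened distributions,
    # scaled by T^2 as in the classic distillation formulation.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T**2)

# Toy logits: the student roughly mimics the teacher, so the loss is small.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])
student = np.array([[3.5, 1.2, 0.4], [0.3, 3.0, 0.2]])
loss = distillation_loss(student, teacher)
print(round(loss, 4))
```

In full DistilBERT training this term is combined with the usual supervised loss; the sketch only shows the soft-label matching that lets the student keep most of the teacher's behavior at a fraction of its size.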
id PUC_CAMP-5_a67cf78e1d6ae32b822a081d56265b57
oai_identifier_str oai:repositorio.sis.puc-campinas.edu.br:123456789/17187
network_acronym_str PUC_CAMP-5
network_name_str Repositório Institucional PUC-Campinas
repository_id_str
Funding: Não recebi financiamento (no funding received)
dc.title.none.fl_str_mv Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
Rumo a técnicas de aprendizagem por transferência - BERT, DistilBERT, BERTimbau e DistilBERTimbau para classificação automática de texto de diferentes idiomas: um estudo de caso
title Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
spellingShingle Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
Barbon, Rafael Silva
big data
pre-trained model
BERT
DistilBERT
BERTimbau
DistilBERTimbau
transformer-based machine learning
title_short Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
title_full Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
title_fullStr Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
title_full_unstemmed Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
title_sort Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
author Barbon, Rafael Silva
author_facet Barbon, Rafael Silva
Akabane, Ademar Takeo
author_role author
author2 Akabane, Ademar Takeo
author2_role author
dc.contributor.none.fl_str_mv Pontifícia Universidade Católica de Campinas (PUC-Campinas)
dc.contributor.author.fl_str_mv Barbon, Rafael Silva
Akabane, Ademar Takeo
dc.subject.por.fl_str_mv big data
pre-trained model
BERT
DistilBERT
BERTimbau
DistilBERTimbau
transformer-based machine learning
topic big data
pre-trained model
BERT
DistilBERT
BERTimbau
DistilBERTimbau
transformer-based machine learning
description The Internet of Things is a paradigm that interconnects smart devices through the internet to provide ubiquitous services to users. Together with Web 2.0 platforms, it generates vast amounts of textual data, so a significant challenge in this context is performing text classification automatically. State-of-the-art results have recently been obtained by employing language models trained from scratch on corpora built from online news. Two models worth highlighting are BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT, a smaller pre-trained general-purpose language representation model distilled from BERT. In this context, through a case study, we apply the two models mentioned above to the text classification task in two languages (English and Brazilian Portuguese) on different datasets. The results show that, for both English and Brazilian Portuguese, DistilBERT trained about 45% faster than its larger counterpart, was about 40% smaller, and preserved about 96% of the larger model's language comprehension on balanced datasets.
publishDate 2022
dc.date.none.fl_str_mv 2022-10-26
2024-03-18T14:50:50Z
2024-03-18T14:50:50Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://repositorio.sis.puc-campinas.edu.br/xmlui/handle/123456789/17187
9713891218812963
6781874728187325
url http://repositorio.sis.puc-campinas.edu.br/xmlui/handle/123456789/17187
identifier_str_mv 9713891218812963
6781874728187325
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Sensors
publisher.none.fl_str_mv Sensors
dc.source.none.fl_str_mv reponame:Repositório Institucional PUC-Campinas
instname:Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS)
instacron:PUC_CAMP
instname_str Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS)
instacron_str PUC_CAMP
institution PUC_CAMP
reponame_str Repositório Institucional PUC-Campinas
collection Repositório Institucional PUC-Campinas
repository.name.fl_str_mv Repositório Institucional PUC-Campinas - Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS)
repository.mail.fl_str_mv sbi.bibliotecadigital@puc-campinas.edu.br
_version_ 1798415782749667328