Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study

Bibliographic details
Main author: Barbon, Rafael Silva
Publication date: 2022
Other authors: Akabane, Ademar Takeo
Document type: Article
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS
Full text: http://repositorio.sis.puc-campinas.edu.br/xmlui/handle/123456789/17187
Abstract: The Internet of Things is a paradigm that interconnects smart devices through the internet to provide ubiquitous services to users. Together with Web 2.0 platforms, this paradigm generates vast amounts of textual data, so a significant challenge in this context is performing text classification automatically. State-of-the-art results have recently been obtained by language models trained from scratch on corpora of online news. Two such models stand out: BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT, a smaller pre-trained general-purpose language representation model. In this context, through a case study, we perform the text classification task with both models, for two languages (English and Brazilian Portuguese), on different datasets. The results show that, for both English and Brazilian Portuguese, DistilBERT trained about 45% faster than its larger counterpart, was about 40% smaller, and preserved about 96% of its language comprehension ability on balanced datasets.
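The abstract describes fine-tuning pre-trained BERT-family encoders for text classification in two languages. The paper itself provides no code; the sketch below only illustrates, with the Hugging Face `transformers` library, how such checkpoints are typically loaded with a classification head. The English and BERTimbau checkpoint names are the well-known public Hub identifiers; the DistilBERTimbau identifier is a placeholder, and `num_labels` depends on the dataset at hand.

```python
# Illustrative sketch (not from the paper): map the four compared models to
# checkpoint names and attach an untrained sequence-classification head.
CHECKPOINTS = {
    "bert-en": "bert-base-uncased",
    "distilbert-en": "distilbert-base-uncased",
    "bertimbau": "neuralmind/bert-base-portuguese-cased",
    "distilbertimbau": "distilbertimbau-checkpoint",  # placeholder id
}


def load_classifier(key: str, num_labels: int):
    """Load a tokenizer and a pre-trained encoder with a fresh classification head."""
    # Imported lazily so the mapping above can be inspected without the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = CHECKPOINTS[key]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=num_labels
    )
    return tokenizer, model
```

A call such as `load_classifier("distilbert-en", num_labels=4)` would return the pair to be fine-tuned on a labeled dataset (for example with `transformers.Trainer`), which is the transfer-learning setup the study compares across the four models.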
Record id: PCAM_a67cf78e1d6ae32b822a081d56265b57
OAI identifier: oai:repositorio.sis.puc-campinas.edu.br:123456789/17187
Repository: Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS (PCAM, repository id 4886)
Institution: Pontifícia Universidade Católica de Campinas (PUC-Campinas)
Alternative title (Portuguese): Rumo a técnicas de aprendizagem por transferência - BERT, DistilBERT, BERTimbau e DistilBERTimbau para classificação automática de texto de diferentes idiomas: um estudo de caso
Published in: Sensors
Date issued: 2022-10-26
Date accessioned/available: 2024-03-18
Keywords: big data; pre-trained model; BERT; DistilBERT; BERTimbau; DistilBERTimbau; transformer-based machine learning
Funding: none received
Version/type/access: published version; article; open access
Program: Sistemas de Infraestrutura Urbana (Online)
Lattes IDs: 9713891218812963; 6781874728187325
Full-text PDF: http://repositorio.sis.puc-campinas.edu.br/xmlui/bitstream/123456789/17187/1/Barbon%2c%20Rafael%20Silva%20-%20Towards%20Transfer%20Learning.pdf
Contact: sbi.bibliotecadigital@puc-campinas.edu.br