Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
Main author: | Barbon, Rafael Silva |
---|---|
Publication date: | 2022 |
Other authors: | Akabane, Ademar Takeo |
Document type: | Article |
Language: | eng |
Source title: | Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS |
Full text: | http://repositorio.sis.puc-campinas.edu.br/xmlui/handle/123456789/17187 |
Abstract: | The Internet of Things is a paradigm that interconnects many smart devices through the internet to provide ubiquitous services to users. Together with Web 2.0 platforms, this paradigm generates vast amounts of textual data, so a significant challenge in this context is performing text classification automatically. State-of-the-art results have recently been obtained by employing language models trained from scratch on corpora built from online news. Two such models stand out: BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT, a smaller pre-trained general-purpose language representation model. In this context, through a case study, we perform the text classification task with these two models in two languages (English and Brazilian Portuguese) on different datasets. The results show that, for both English and Brazilian Portuguese, DistilBERT's training was about 45% faster than that of its larger counterpart; the model was also 40% smaller and preserved about 96% of BERT's language comprehension skills on balanced datasets. |
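The 40% size figure in the abstract can be sanity-checked against the parameter counts commonly cited in the DistilBERT literature (roughly 110M parameters for BERT-base and 66M for DistilBERT). Those counts are not stated in this record and are assumptions here; this is only a rough arithmetic check, not a reproduction of the study's measurements:

```python
# Commonly cited parameter counts (assumed, not taken from this record):
# BERT-base has roughly 110M parameters, DistilBERT roughly 66M.
bert_params = 110_000_000
distilbert_params = 66_000_000

# Fraction of parameters removed by distillation.
size_reduction = 1 - distilbert_params / bert_params

print(f"DistilBERT has {size_reduction:.0%} fewer parameters than BERT-base")
# → DistilBERT has 40% fewer parameters than BERT-base
```

With these assumed counts the reduction comes out to exactly 40%, which is consistent with the abstract's size claim.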
id |
PCAM_a67cf78e1d6ae32b822a081d56265b57 |
---|---|
oai_identifier_str |
oai:repositorio.sis.puc-campinas.edu.br:123456789/17187 |
network_acronym_str |
PCAM |
network_name_str |
Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS |
repository_id_str |
4886 |
spelling |
Authors: Barbon, Rafael Silva; Akabane, Ademar Takeo
Institution: Pontifícia Universidade Católica de Campinas (PUC-Campinas)
Accessioned/available: 2024-03-18T14:50:50Z; issued: 2022-10-26
URI: http://repositorio.sis.puc-campinas.edu.br/xmlui/handle/123456789/17187
Lattes IDs: 9713891218812963; 6781874728187325
Abstract: The Internet of Things is a paradigm that interconnects many smart devices through the internet to provide ubiquitous services to users. Together with Web 2.0 platforms, this paradigm generates vast amounts of textual data, so a significant challenge in this context is performing text classification automatically. State-of-the-art results have recently been obtained by employing language models trained from scratch on corpora built from online news. Two such models stand out: BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT, a smaller pre-trained general-purpose language representation model. In this context, through a case study, we perform the text classification task with these two models in two languages (English and Brazilian Portuguese) on different datasets. The results show that, for both English and Brazilian Portuguese, DistilBERT's training was about 45% faster than that of its larger counterpart; the model was also 40% smaller and preserved about 96% of BERT's language comprehension skills on balanced datasets.
Funding: none received
Language: eng
Publisher: Sensors
Keywords: big data; pre-trained model; BERT; DistilBERT; BERTimbau; DistilBERTimbau; transformer-based machine learning
Title: Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
Alternative title (Portuguese): Rumo a técnicas de aprendizagem por transferência - BERT, DistilBERT, BERTimbau e DistilBERTimbau para classificação automática de texto de diferentes idiomas: um estudo de caso
Version/type/rights: info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; info:eu-repo/semantics/openAccess
Repository: reponame:Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS; instname:Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS); instacron:PUC_CAMP
Program: Sistemas de Infraestrutura Urbana; modality: Online; Not applicable
LICENSE: license.txt (text/plain; charset=utf-8; 0 bytes); http://repositorio.sis.puc-campinas.edu.br/xmlui/bitstream/123456789/17187/2/license.txt; MD5 d41d8cd98f00b204e9800998ecf8427e
ORIGINAL: Barbon, Rafael Silva - Towards Transfer Learning.pdf (application/pdf; 566274 bytes); http://repositorio.sis.puc-campinas.edu.br/xmlui/bitstream/123456789/17187/1/Barbon%2c%20Rafael%20Silva%20-%20Towards%20Transfer%20Learning.pdf; MD5 036818fe794057539729acdf5936f433
Record: 123456789/17187; last updated 2024-03-18 11:50:50.894; oai:repositorio.sis.puc-campinas.edu.br:123456789/17187
Repository site: http://tede.bibliotecadigital.puc-campinas.edu.br:8080/jspui/; OAI endpoint: http://tede.bibliotecadigital.puc-campinas.edu.br:8080/oai/request
Contact: sbi.bibliotecadigital@puc-campinas.edu.br
opendoar: 4886; harvested: 2024-03-18T14:50:50Z; Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS - Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS) |
dc.title.pt_BR.fl_str_mv |
Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study |
dc.title.alternative.pt_BR.fl_str_mv |
Rumo a técnicas de aprendizagem por transferência - BERT, DistilBERT, BERTimbau e DistilBERTimbau para classificação automática de texto de diferentes idiomas: um estudo de caso |
title |
Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study |
spellingShingle |
Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study
Barbon, Rafael Silva
big data; pre-trained model; BERT; DistilBERT; BERTimbau; DistilBERTimbau; transformer-based machine learning |
title_short |
Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study |
title_full |
Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study |
title_fullStr |
Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study |
title_full_unstemmed |
Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study |
title_sort |
Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study |
author |
Barbon, Rafael Silva |
author_facet |
Barbon, Rafael Silva; Akabane, Ademar Takeo |
author_role |
author |
author2 |
Akabane, Ademar Takeo |
author2_role |
author |
dc.contributor.institution.pt_BR.fl_str_mv |
Pontifícia Universidade Católica de Campinas (PUC-Campinas) |
dc.contributor.author.fl_str_mv |
Barbon, Rafael Silva; Akabane, Ademar Takeo |
dc.subject.por.fl_str_mv |
big data; pre-trained model; BERT; DistilBERT; BERTimbau; DistilBERTimbau; transformer-based machine learning |
topic |
big data; pre-trained model; BERT; DistilBERT; BERTimbau; DistilBERTimbau; transformer-based machine learning |
description |
The Internet of Things is a paradigm that interconnects many smart devices through the internet to provide ubiquitous services to users. Together with Web 2.0 platforms, this paradigm generates vast amounts of textual data, so a significant challenge in this context is performing text classification automatically. State-of-the-art results have recently been obtained by employing language models trained from scratch on corpora built from online news. Two such models stand out: BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT, a smaller pre-trained general-purpose language representation model. In this context, through a case study, we perform the text classification task with these two models in two languages (English and Brazilian Portuguese) on different datasets. The results show that, for both English and Brazilian Portuguese, DistilBERT's training was about 45% faster than that of its larger counterpart; the model was also 40% smaller and preserved about 96% of BERT's language comprehension skills on balanced datasets. |
publishDate |
2022 |
dc.date.issued.fl_str_mv |
2022-10-26 |
dc.date.accessioned.fl_str_mv |
2024-03-18T14:50:50Z |
dc.date.available.fl_str_mv |
2024-03-18T14:50:50Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://repositorio.sis.puc-campinas.edu.br/xmlui/handle/123456789/17187 |
dc.identifier.lattes.pt_BR.fl_str_mv |
9713891218812963; 6781874728187325 |
url |
http://repositorio.sis.puc-campinas.edu.br/xmlui/handle/123456789/17187 |
identifier_str_mv |
9713891218812963; 6781874728187325 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Sensors |
publisher.none.fl_str_mv |
Sensors |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS instname:Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS) instacron:PUC_CAMP |
instname_str |
Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS) |
instacron_str |
PUC_CAMP |
institution |
PUC_CAMP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS |
collection |
Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS |
bitstream.url.fl_str_mv |
http://repositorio.sis.puc-campinas.edu.br/xmlui/bitstream/123456789/17187/2/license.txt http://repositorio.sis.puc-campinas.edu.br/xmlui/bitstream/123456789/17187/1/Barbon%2c%20Rafael%20Silva%20-%20Towards%20Transfer%20Learning.pdf |
bitstream.checksum.fl_str_mv |
d41d8cd98f00b204e9800998ecf8427e 036818fe794057539729acdf5936f433 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da PUC_CAMPINAS - Pontifícia Universidade Católica de Campinas (PUC-CAMPINAS) |
repository.mail.fl_str_mv |
sbi.bibliotecadigital@puc-campinas.edu.br |
_version_ |
1796790716952739840 |