JurisBERT: Transformer-based model for embedding legal texts
Main author: | Charles Felipe Oliveira Viegas |
---|---|
Publication date: | 2022 |
Document type: | Master's thesis (Dissertação) |
Language: | Portuguese (por) |
Source: | Repositório Institucional da UFMS |
Full text: | https://repositorio.ufms.br/handle/123456789/5119 |
Abstract: | We propose JurisBERT, a new extension of BERT (Bidirectional Encoder Representations from Transformers) applied to Semantic Textual Similarity (STS). It delivers considerable improvements in speed and precision while requiring fewer computational resources than other approaches. JurisBERT was trained from scratch on domain-specific texts covering laws, legal doctrine, and precedents, and achieves better precision than other BERT models, which is the main finding of this work. Furthermore, our approach builds on the concept of sublanguage: a model pre-trained in a language (Brazilian Portuguese) is fine-tuned to better serve a specific domain, in our case the legal field. To validate the approach with real data, JurisBERT uses 24,000 pairs of ementas (appellate decision summaries) with similarity degrees ranging from 0 to 3, extracted from the search engines available on Brazilian courts' websites. Our experiments showed that JurisBERT outperforms other models in four scenarios: multilingual BERT and BERTimbau without fine-tuning, by around 22% and 12% in F1, respectively; and with fine-tuning, by around 20% and 4%. Moreover, our approach reduced training steps by a factor of five while using accessible hardware, i.e., low-cost GPGPU architectures. This result demonstrates that heavy pre-trained models such as multilingual BERT and BERTimbau, which require specialized and expensive hardware, are not always the best solution. We have thus shown that training BERT from scratch on domain-specific texts yields greater accuracy and shorter training time than large, general pre-trained models. The source code is available at https://github.com/juridics/brazilian-legal-text-dataset. |
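The dataset pairs described in the abstract carry similarity labels from 0 to 3. The abstract does not detail the training objective, but a common way to use such graded labels with a BERT-based STS model is to normalize them to [0, 1] and compare them against the cosine similarity of the two sentence embeddings. A minimal sketch with toy, hypothetical embedding vectors standing in for JurisBERT outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize_label(label, max_label=3.0):
    """Map a 0-3 similarity degree to the [0, 1] range used as a cosine target."""
    return label / max_label

# Toy 3-dimensional vectors; real sentence embeddings would come from the model.
ementa_a = [0.2, 0.7, 0.1]
ementa_b = [0.25, 0.65, 0.15]

score = cosine_similarity(ementa_a, ementa_b)
target = normalize_label(3)  # a pair annotated as maximally similar
print(round(score, 3), target)
```

During fine-tuning, a regression loss (e.g. mean squared error between `score` and `target`) would drive the embeddings of similar ementas together; at retrieval time, the same cosine score ranks candidate precedents.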
id |
UFMS_a146cbb2ff37b5a348b98eff58e13616 |
---|---|
oai_identifier_str |
oai:repositorio.ufms.br:123456789/5119 |
network_acronym_str |
UFMS |
network_name_str |
Repositório Institucional da UFMS |
repository_id_str |
2124 |
dc.title.pt_BR.fl_str_mv |
JurisBERT: Transformer-based model for embedding legal texts |
title |
JurisBERT: Transformer-based model for embedding legal texts |
spellingShingle |
JurisBERT: Transformer-based model for embedding legal texts Charles Felipe Oliveira Viegas Retrieving Legal Precedents, Semantic Textual Similarity, Sentence Embedding, BERT |
title_short |
JurisBERT: Transformer-based model for embedding legal texts |
title_full |
JurisBERT: Transformer-based model for embedding legal texts |
title_fullStr |
JurisBERT: Transformer-based model for embedding legal texts |
title_full_unstemmed |
JurisBERT: Transformer-based model for embedding legal texts |
title_sort |
JurisBERT: Transformer-based model for embedding legal texts |
author |
Charles Felipe Oliveira Viegas |
author_facet |
Charles Felipe Oliveira Viegas |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
Renato Porfirio Ishii |
dc.contributor.author.fl_str_mv |
Charles Felipe Oliveira Viegas |
contributor_str_mv |
Renato Porfirio Ishii |
dc.subject.por.fl_str_mv |
Retrieving Legal Precedents, Semantic Textual Similarity, Sentence Embedding, BERT |
topic |
Retrieving Legal Precedents, Semantic Textual Similarity, Sentence Embedding, BERT |
publishDate |
2022 |
dc.date.accessioned.fl_str_mv |
2022-09-26T13:35:35Z |
dc.date.available.fl_str_mv |
2022-09-26T13:35:35Z |
dc.date.issued.fl_str_mv |
2022 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufms.br/handle/123456789/5119 |
url |
https://repositorio.ufms.br/handle/123456789/5119 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Fundação Universidade Federal de Mato Grosso do Sul |
dc.publisher.initials.fl_str_mv |
UFMS |
dc.publisher.country.fl_str_mv |
Brasil |
publisher.none.fl_str_mv |
Fundação Universidade Federal de Mato Grosso do Sul |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMS instname:Universidade Federal de Mato Grosso do Sul (UFMS) instacron:UFMS |
instname_str |
Universidade Federal de Mato Grosso do Sul (UFMS) |
instacron_str |
UFMS |
institution |
UFMS |
reponame_str |
Repositório Institucional da UFMS |
collection |
Repositório Institucional da UFMS |
bitstream.url.fl_str_mv |
https://repositorio.ufms.br/bitstream/123456789/5119/-1/JurisBERT__Transformer_based_model_for_embedding_legal_texts.pdf |
bitstream.checksum.fl_str_mv |
11b5dc6f31112d6e055fe6818622ac26 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFMS - Universidade Federal de Mato Grosso do Sul (UFMS) |
repository.mail.fl_str_mv |
ri.prograd@ufms.br |
_version_ |
1807552826518274048 |