JurisBERT: Transformer-based model for embedding legal texts

Bibliographic details
Main author: Charles Felipe Oliveira Viegas
Publication date: 2022
Document type: Master's thesis (Dissertação)
Language: Portuguese (por)
Source: Repositório Institucional da UFMS
Full text: https://repositorio.ufms.br/handle/123456789/5119
Abstract: In this work we propose JurisBERT, a new extension of BERT (Bidirectional Encoder Representations from Transformers) applied to Semantic Textual Similarity (STS), with considerable gains in speed and precision and a lower demand for computational resources than other approaches. JurisBERT was trained from scratch on domain-specific texts so that it can handle laws, legal doctrine, and precedents, and it achieves better precision than other BERT models, which is the main finding of this work. Our approach also builds on the concept of sublanguage: a model pre-trained for a language (Brazilian Portuguese) is refined (fine-tuned) to better serve a specific domain, in our case the legal field. To validate the approach with real data, JurisBERT builds and uses a dataset of 24,000 pairs of ementas (court decision summaries) with similarity degrees ranging from 0 to 3, extracted from the search engines available on Brazilian courts' websites. Our experiments showed that JurisBERT outperforms other models in four scenarios: multilingual BERT and BERTimbau without fine-tuning, by around 22% and 12% in F1, respectively; and with fine-tuning, by around 20% and 4%. Moreover, our approach cut the number of training steps by a factor of five while relying on accessible hardware, i.e., low-cost GPGPU architectures. These results show that heavy pre-trained models such as multilingual BERT and BERTimbau, which require specialized and expensive hardware, are not always the best solution: training BERT from scratch on domain-specific texts yields higher accuracy and shorter training time than large, general-purpose pre-trained models. The source code is available at https://github.com/juridics/brazilian-legal-text-dataset.
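For context, the sketch below illustrates the kind of STS fine-tuning the abstract describes: ementa pairs labeled with similarity degrees from 0 to 3, normalized to [0, 1] and used to fine-tune a Portuguese BERT encoder with a cosine-similarity objective via the sentence-transformers library. The checkpoint name (BERTimbau stands in for a domain checkpoint such as JurisBERT), the example pairs, and the hyperparameters are assumptions for illustration, not the thesis's exact configuration.

# Illustrative sketch only: STS fine-tuning on ementa pairs labeled 0-3.
# Checkpoint, data, and hyperparameters are assumptions, not the thesis setup.
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

# Placeholder pairs; real pairs would come from the brazilian-legal-text-dataset repository.
pairs = [
    ("EMENTA: APELACAO CIVEL. CONTRATO DE SEGURO. ...", "EMENTA: SEGURO DE VIDA. NEGATIVA DE COBERTURA. ...", 3),
    ("EMENTA: DIREITO TRIBUTARIO. ICMS. ...", "EMENTA: HABEAS CORPUS. TRAFICO DE DROGAS. ...", 0),
]

# Wrap a Portuguese BERT encoder; sentence-transformers adds mean pooling to obtain sentence embeddings.
model = SentenceTransformer("neuralmind/bert-base-portuguese-cased")

# Normalize the 0-3 similarity degrees to [0, 1] for the cosine-similarity regression loss.
train_examples = [InputExample(texts=[a, b], label=score / 3.0) for a, b, score in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)

# After fine-tuning, embeddings of similar ementas should score high cosine similarity.
emb = model.encode([pairs[0][0], pairs[0][1]])
print(util.cos_sim(emb[0], emb[1]))

The same recipe applies whether the encoder is a general checkpoint or one pre-trained from scratch on legal text; the thesis's comparison of those two starting points is what the reported F1 differences measure.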
id UFMS_a146cbb2ff37b5a348b98eff58e13616
oai_identifier_str oai:repositorio.ufms.br:123456789/5119
network_acronym_str UFMS
network_name_str Repositório Institucional da UFMS
repository_id_str 2124
dc.title.pt_BR.fl_str_mv JurisBERT: Transformer-based model for embedding legal texts
title JurisBERT: Transformer-based model for embedding legal texts
spellingShingle JurisBERT: Transformer-based model for embedding legal texts
Charles Felipe Oliveira Viegas
Retrieving Legal Precedents, Semantic Textual Similarity, Sentence Embedding, BERT
title_short JurisBERT: Transformer-based model for embedding legal texts
title_full JurisBERT: Transformer-based model for embedding legal texts
title_fullStr JurisBERT: Transformer-based model for embedding legal texts
title_full_unstemmed JurisBERT: Transformer-based model for embedding legal texts
title_sort JurisBERT: Transformer-based model for embedding legal texts
author Charles Felipe Oliveira Viegas
author_facet Charles Felipe Oliveira Viegas
author_role author
dc.contributor.advisor1.fl_str_mv Renato Porfirio Ishii
dc.contributor.author.fl_str_mv Charles Felipe Oliveira Viegas
contributor_str_mv Renato Porfirio Ishii
dc.subject.por.fl_str_mv Retrieving Legal Precedents, Semantic Textual Similarity, Sentence Embedding, BERT
topic Retrieving Legal Precedents, Semantic Textual Similarity, Sentence Embedding, BERT
description In this work we propose JurisBERT, a new extension of BERT (Bidirectional Encoder Representations from Transformers) applied to Semantic Textual Similarity (STS), with considerable gains in speed and precision and a lower demand for computational resources than other approaches. JurisBERT was trained from scratch on domain-specific texts so that it can handle laws, legal doctrine, and precedents, and it achieves better precision than other BERT models, which is the main finding of this work. Our approach also builds on the concept of sublanguage: a model pre-trained for a language (Brazilian Portuguese) is refined (fine-tuned) to better serve a specific domain, in our case the legal field. To validate the approach with real data, JurisBERT builds and uses a dataset of 24,000 pairs of ementas (court decision summaries) with similarity degrees ranging from 0 to 3, extracted from the search engines available on Brazilian courts' websites. Our experiments showed that JurisBERT outperforms other models in four scenarios: multilingual BERT and BERTimbau without fine-tuning, by around 22% and 12% in F1, respectively; and with fine-tuning, by around 20% and 4%. Moreover, our approach cut the number of training steps by a factor of five while relying on accessible hardware, i.e., low-cost GPGPU architectures. These results show that heavy pre-trained models such as multilingual BERT and BERTimbau, which require specialized and expensive hardware, are not always the best solution: training BERT from scratch on domain-specific texts yields higher accuracy and shorter training time than large, general-purpose pre-trained models. The source code is available at https://github.com/juridics/brazilian-legal-text-dataset.
publishDate 2022
dc.date.accessioned.fl_str_mv 2022-09-26T13:35:35Z
dc.date.available.fl_str_mv 2022-09-26T13:35:35Z
dc.date.issued.fl_str_mv 2022
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://repositorio.ufms.br/handle/123456789/5119
url https://repositorio.ufms.br/handle/123456789/5119
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Fundação Universidade Federal de Mato Grosso do Sul
dc.publisher.initials.fl_str_mv UFMS
dc.publisher.country.fl_str_mv Brasil
publisher.none.fl_str_mv Fundação Universidade Federal de Mato Grosso do Sul
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMS
instname:Universidade Federal de Mato Grosso do Sul (UFMS)
instacron:UFMS
instname_str Universidade Federal de Mato Grosso do Sul (UFMS)
instacron_str UFMS
institution UFMS
reponame_str Repositório Institucional da UFMS
collection Repositório Institucional da UFMS
bitstream.url.fl_str_mv https://repositorio.ufms.br/bitstream/123456789/5119/-1/JurisBERT__Transformer_based_model_for_embedding_legal_texts.pdf
bitstream.checksum.fl_str_mv 11b5dc6f31112d6e055fe6818622ac26
bitstream.checksumAlgorithm.fl_str_mv MD5
repository.name.fl_str_mv Repositório Institucional da UFMS - Universidade Federal de Mato Grosso do Sul (UFMS)
repository.mail.fl_str_mv ri.prograd@ufms.br
_version_ 1807552826518274048