JurisBERT: Transformer-based model for embedding legal texts
Main author: | Charles Felipe Oliveira Viegas |
---|---|
Publication date: | 2022 |
Document type: | Master's thesis (Dissertação) |
Language: | Portuguese (por) |
Source: | Repositório Institucional da UFMS |
Full text: | https://repositorio.ufms.br/handle/123456789/5119 |
Abstract: | We propose JurisBERT, a new extension of BERT (Bidirectional Encoder Representations from Transformers) applied to Semantic Textual Similarity (STS). It delivers considerable improvements in speed and precision while requiring fewer computational resources than other approaches. JurisBERT was trained from scratch on domain-specific texts covering laws, legal doctrine, and precedents, and achieves better precision than other BERT models, which is the main finding of this work. Furthermore, our approach builds on the concept of sublanguage: a model pre-trained in a language (Brazilian Portuguese) is fine-tuned to better serve a specific domain, in our case the legal field. To validate the approach with real data, JurisBERT uses 24,000 pairs of ementas (appellate decision summaries) with similarity degrees ranging from 0 to 3, extracted from the search engines available on Brazilian courts' websites. Our experiments showed that JurisBERT outperforms other models in four scenarios: multilingual BERT and BERTimbau without fine-tuning, by around 22% and 12% in F1, respectively; and with fine-tuning, by around 20% and 4%. Moreover, our approach reduced training steps by a factor of five while using accessible hardware, i.e., low-cost GPGPU architectures. This result demonstrates that heavy pre-trained models such as multilingual BERT and BERTimbau, which require specialized and expensive hardware, are not always the best solution. We have thus shown that training BERT from scratch on domain-specific texts yields greater accuracy and shorter training time than large, general pre-trained models. The source code is available at https://github.com/juridics/brazilian-legal-text-dataset. |
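The dataset pairs described in the abstract carry similarity labels from 0 to 3. The abstract does not detail the training objective, but a common way to use such graded labels with a BERT-based STS model is to normalize them to [0, 1] and compare them against the cosine similarity of the two sentence embeddings. A minimal sketch with toy, hypothetical embedding vectors standing in for JurisBERT outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize_label(label, max_label=3.0):
    """Map a 0-3 similarity degree to the [0, 1] range used as a cosine target."""
    return label / max_label

# Toy 3-dimensional vectors; real sentence embeddings would come from the model.
ementa_a = [0.2, 0.7, 0.1]
ementa_b = [0.25, 0.65, 0.15]

score = cosine_similarity(ementa_a, ementa_b)
target = normalize_label(3)  # a pair annotated as maximally similar
print(round(score, 3), target)
```

During fine-tuning, a regression loss (e.g. mean squared error between `score` and `target`) would drive the embeddings of similar ementas together; at retrieval time, the same cosine score ranks candidate precedents.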
id |
UFMS_a146cbb2ff37b5a348b98eff58e13616 |
---|---|
oai_identifier_str |
oai:repositorio.ufms.br:123456789/5119 |
network_acronym_str |
UFMS |
network_name_str |
Repositório Institucional da UFMS |
repository_id_str |
2124 |
dc.title.pt_BR.fl_str_mv |
JurisBERT: Transformer-based model for embedding legal texts |
title |
JurisBERT: Transformer-based model for embedding legal texts |
spellingShingle |
JurisBERT: Transformer-based model for embedding legal texts Charles Felipe Oliveira Viegas Retrieving Legal Precedents, Semantic Textual Similarity, Sentence Embedding, BERT |
title_short |
JurisBERT: Transformer-based model for embedding legal texts |
title_full |
JurisBERT: Transformer-based model for embedding legal texts |
title_fullStr |
JurisBERT: Transformer-based model for embedding legal texts |
title_full_unstemmed |
JurisBERT: Transformer-based model for embedding legal texts |
title_sort |
JurisBERT: Transformer-based model for embedding legal texts |
author |
Charles Felipe Oliveira Viegas |
author_facet |
Charles Felipe Oliveira Viegas |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
Renato Porfirio Ishii |
dc.contributor.author.fl_str_mv |
Charles Felipe Oliveira Viegas |
contributor_str_mv |
Renato Porfirio Ishii |
dc.subject.por.fl_str_mv |
Retrieving Legal Precedents, Semantic Textual Similarity, Sentence Embedding, BERT |
topic |
Retrieving Legal Precedents, Semantic Textual Similarity, Sentence Embedding, BERT |
publishDate |
2022 |
dc.date.accessioned.fl_str_mv |
2022-09-26T13:35:35Z |
dc.date.available.fl_str_mv |
2022-09-26T13:35:35Z |
dc.date.issued.fl_str_mv |
2022 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufms.br/handle/123456789/5119 |
url |
https://repositorio.ufms.br/handle/123456789/5119 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Fundação Universidade Federal de Mato Grosso do Sul |
dc.publisher.initials.fl_str_mv |
UFMS |
dc.publisher.country.fl_str_mv |
Brasil |
publisher.none.fl_str_mv |
Fundação Universidade Federal de Mato Grosso do Sul |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMS instname:Universidade Federal de Mato Grosso do Sul (UFMS) instacron:UFMS |
instname_str |
Universidade Federal de Mato Grosso do Sul (UFMS) |
instacron_str |
UFMS |
institution |
UFMS |
reponame_str |
Repositório Institucional da UFMS |
collection |
Repositório Institucional da UFMS |
bitstream.url.fl_str_mv |
https://repositorio.ufms.br/bitstream/123456789/5119/-1/JurisBERT__Transformer_based_model_for_embedding_legal_texts.pdf |
bitstream.checksum.fl_str_mv |
11b5dc6f31112d6e055fe6818622ac26 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFMS - Universidade Federal de Mato Grosso do Sul (UFMS) |
repository.mail.fl_str_mv |
ri.prograd@ufms.br |
_version_ |
1807552826518274048 |