Representação de sentenças jurídicas no contexto de agrupamento automático

Gonçalves, Cristiano Oliveira

Bibliographic details
Main author:	Gonçalves, Cristiano Oliveira
Publication date:	2022
Document type:	Master's dissertation
Language:	por
Source title:	Repositório Institucional da UFABC
Full text:	http://biblioteca.ufabc.edu.br/index.php?codigo_sophia=124239
Abstract:	Advisor: Prof. Dr. Thiago Ferreira Covões

Item metadata

id	UFBC_3847a779bb4011c1f9c80af04f34b98a
oai_identifier_str	oai:BDTD:124239
network_acronym_str	UFBC
network_name_str	Repositório Institucional da UFABC
repository_id_str
spelling	Representação de sentenças jurídicas no contexto de agrupamento automático
Keywords: AGRUPAMENTO TEXTUAL; REPRESENTAÇÃO TEXTUAL; JURIMETRIA; TEXT CLUSTERING; TEXT REPRESENTATION; JURIMETRICS; PROGRAMA DE PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO - UFABC
Advisor: Prof. Dr. Thiago Ferreira Covões
Master's dissertation - Universidade Federal do ABC, Programa de Pós-Graduação em Ciência da Computação, Santo André, 2022.
Abstract: The digitization of documents in the Brazilian judiciary facilitates access to information of public interest. However, to extract metrics of interest from this growing repository, the documents must be organized in a way that makes relevant information easy to retrieve, and machine learning techniques can reduce the human effort needed to organize a large corpus. This work analyzed how well different machine learning techniques associate legal terms, judged against the associations made by human specialists. To that end, we developed a web scraper, software that consolidates online content, to build a corpus of first-instance court rulings. The corpus comprises 40,009 documents, totaling 24,139,185 tokens.
FastText, GloVe, and Word2Vec were evaluated on their ability to associate terms in accordance with the Legal Thesaurus of the Federal Supreme Court (Tesauro Jurídico do Supremo Tribunal Federal, TSTF). They were compared when trained on general-domain Portuguese and on legal-domain Portuguese. The general-domain FastText model showed the greatest similarity between terms that the TSTF lists as associated. Even so, the legal-domain FastText performed comparably to or better than the general-domain GloVe and Word2Vec models.
We also evaluated FastText, GloVe, Word2Vec, Doc2Vec, and the hashing trick on the task of clustering first-instance court rulings by the subject to which they belong. We compared models trained on the general and legal domains using the mean V-Measure and its standard deviation. We conclude that the 300-dimensional legal-domain FastText achieved results equivalent or superior to those of the general-domain models. We also observed that the choice of technique influences performance more than the choice of hyperparameters does.
Another factor analyzed was the similarity of documents on different subjects. For this analysis we used the best model produced in the legal domain, the 300-dimensional FastText. We conclude that, despite the uncertainty of the representation created by the model itself, there seem to be documents on different subjects that are very similar to each other.
Finally, we evaluated how performance improves with the volume of legal documents used in training, and found that beyond approximately 800,000 tokens, equivalent to approximately 1,500 rulings, the marginal performance gains of the 300-dimensional FastText diminish. Adding more documents from the same corpus yields only very small incremental gains, and the computational cost appears to grow faster than the V-Measure.
Covões, Thiago Ferreira; Silva, Nádia Félix Felipe da; Mena-Chalco, Jesús Pascual; Gonçalves, Cristiano Oliveira; 2022; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; 125 f. : il.; http://biblioteca.ufabc.edu.br/index.php?codigo_sophia=124239; http://biblioteca.ufabc.edu.br/index.php?codigo_sophia=124239&midiaext=80773; Cover: http://biblioteca.ufabc.edu.br/php/capa.php?obra=124239; por; reponame:Repositório Institucional da UFABC; instname:Universidade Federal do ABC (UFABC); instacron:UFABC; info:eu-repo/semantics/openAccess; 2023-04-20T17:54:12Z; oai:BDTD:124239; Repositório Institucional; PUB; http://www.biblioteca.ufabc.edu.br/oai/oai.php; opendoar:; 2023-04-20T17:54:12; Repositório Institucional da UFABC - Universidade Federal do ABC (UFABC); false
dc.title.none.fl_str_mv	Representação de sentenças jurídicas no contexto de agrupamento automático
title	Representação de sentenças jurídicas no contexto de agrupamento automático
spellingShingle	Representação de sentenças jurídicas no contexto de agrupamento automático Gonçalves, Cristiano Oliveira AGRUPAMENTO TEXTUAL REPRESENTAÇÃO TEXTUAL JURIMETRIA TEXT CLUSTERING TEXT REPRESENTATION JURIMETRICS PROGRAMA DE PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO - UFABC
title_short	Representação de sentenças jurídicas no contexto de agrupamento automático
title_full	Representação de sentenças jurídicas no contexto de agrupamento automático
title_fullStr	Representação de sentenças jurídicas no contexto de agrupamento automático
title_full_unstemmed	Representação de sentenças jurídicas no contexto de agrupamento automático
title_sort	Representação de sentenças jurídicas no contexto de agrupamento automático
author	Gonçalves, Cristiano Oliveira
author_facet	Gonçalves, Cristiano Oliveira
author_role	author
dc.contributor.none.fl_str_mv	Covões, Thiago Ferreira Silva, Nádia Félix Felipe da Mena-Chalco, Jesús Pascual
dc.contributor.author.fl_str_mv	Gonçalves, Cristiano Oliveira
dc.subject.por.fl_str_mv	AGRUPAMENTO TEXTUAL REPRESENTAÇÃO TEXTUAL JURIMETRIA TEXT CLUSTERING TEXT REPRESENTATION JURIMETRICS PROGRAMA DE PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO - UFABC
topic	AGRUPAMENTO TEXTUAL REPRESENTAÇÃO TEXTUAL JURIMETRIA TEXT CLUSTERING TEXT REPRESENTATION JURIMETRICS PROGRAMA DE PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO - UFABC
description	Advisor: Prof. Dr. Thiago Ferreira Covões
publishDate	2022
dc.date.none.fl_str_mv	2022
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://biblioteca.ufabc.edu.br/index.php?codigo_sophia=124239
url	http://biblioteca.ufabc.edu.br/index.php?codigo_sophia=124239
dc.language.iso.fl_str_mv	por
language	por
dc.relation.none.fl_str_mv	http://biblioteca.ufabc.edu.br/index.php?codigo_sophia=124239&midiaext=80773 Cover: http://biblioteca.ufabc.edu.br/php/capa.php?obra=124239
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf 125 f. : il.
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFABC instname:Universidade Federal do ABC (UFABC) instacron:UFABC
instname_str	Universidade Federal do ABC (UFABC)
instacron_str	UFABC
institution	UFABC
reponame_str	Repositório Institucional da UFABC
collection	Repositório Institucional da UFABC
repository.name.fl_str_mv	Repositório Institucional da UFABC - Universidade Federal do ABC (UFABC)
repository.mail.fl_str_mv
_version_	1801502110346379264
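
The abstract in this record compares models by their mean V-Measure and its standard deviation. As a minimal sketch of that metric, assuming the standard definition of V-Measure as the weighted harmonic mean of homogeneity and completeness (the record does not say which implementation the author used), the Python below computes it from two label sequences. The function names, the `beta` parameter, and the toy subject and cluster labels are illustrative assumptions, not data from the dissertation.

```python
from collections import Counter
from math import log
from statistics import mean, stdev


def _entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) for two aligned label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * log(c / given_counts[g]) for (g, _t), c in joint.items()
    )


def v_measure(classes, clusters, beta=1.0):
    """Return (homogeneity, completeness, V-Measure) for a clustering.

    classes  -- reference labels (here: the subject of each document)
    clusters -- cluster ids assigned by the algorithm under evaluation
    beta     -- weight of completeness relative to homogeneity
    """
    if not classes or len(classes) != len(clusters):
        raise ValueError("label sequences must be non-empty and aligned")
    h_classes = _entropy(classes)
    h_clusters = _entropy(clusters)
    # Homogeneity: each cluster contains members of a single class.
    homogeneity = 1.0 if h_classes == 0 else max(
        0.0, 1.0 - _conditional_entropy(classes, clusters) / h_classes
    )
    # Completeness: all members of a class fall in the same cluster.
    completeness = 1.0 if h_clusters == 0 else max(
        0.0, 1.0 - _conditional_entropy(clusters, classes) / h_clusters
    )
    denominator = beta * homogeneity + completeness
    if denominator == 0:
        return homogeneity, completeness, 0.0
    v = (1.0 + beta) * homogeneity * completeness / denominator
    return homogeneity, completeness, v


# Hypothetical toy data: six documents, three subjects, three clustering runs.
subjects = ["tax", "tax", "labor", "labor", "family", "family"]
runs = [
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 2, 2, 2],
    [0, 1, 1, 1, 2, 2],
]
scores = [v_measure(subjects, run)[2] for run in runs]
print("mean V-Measure:", mean(scores))
print("standard deviation:", stdev(scores))
```

Averaging over several runs, as the toy loop does, reflects how the abstract reports results; the number of runs and the clustering algorithm behind them are not specified in this record.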
