A methodology using parallel environments to optimize text classification in legal documents (Uma metodologia usando ambientes paralelos para otimização da classificação de textos aplicada a documentos jurídicos)

Mastella, Juliana Obino

Bibliographic details
Main author:	Mastella, Juliana Obino
Publication date:	2020
Document type:	Master's thesis
Language:	Portuguese (por)
Source title:	Biblioteca Digital de Teses e Dissertações da PUC_RS
Full text:	http://tede2.pucrs.br/tede2/handle/tede/9908
Abstract:	Recent years have seen exponential growth in the volume, variability, and velocity of data. Most of this data is unstructured, which makes its analysis harder. In this scenario, the use of Natural Language Processing (NLP) tools for text classification has attracted researchers from many knowledge domains, notably the legal sciences. Law inherently depends on the analysis of large volumes of text, which makes it a promising area for applying NLP tools. Choosing an algorithm to solve a specific text classification problem is not a trivial task. The quality and viability of the chosen classification approach depend on the problem to be solved, on the volume and behavior of the data, and on making the best use of the available computational resources so that results are delivered in time. Motivated by the problem of automatically classifying legal texts for use in the electronic case files of a Brazilian State Court, this research proposes a methodology to optimize the choice of parameters for the legal document classification algorithm by parallelizing the training of Bi-LSTM Recurrent Neural Networks. In an application to real data, 107,010 petitions from a Brazilian State Court, with previously annotated classes, were used to train 216 Recurrent Neural Networks in parallel. At the end of training, the best individual model reached F1 = 0.846. Combining the 4 best models through an ensemble technique produced a final model with lower performance than the best individual one (F1 = 0.826). Parallel training of the models yielded a result superior to most of the tested parameterizations (10% better than the worst parameterization tested and 9.8% better than the average) in approximately one twentieth of the time it would take to test the same possibilities sequentially.
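The parallel parameter sweep described in the abstract can be pictured with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the three parameter names and their values are hypothetical (chosen only so that the grid has 216 combinations, the count reported above), `train_and_score` is a toy stand-in that returns a deterministic score instead of training a Bi-LSTM, and the thread pool stands in for whatever parallel environment the dissertation actually used. The `sum_rule` helper shows one common way to combine the class probabilities of several models; the Portuguese version of the abstract names the sum rule for the ensemble, but the code is not the author's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical search space: 6 * 6 * 6 = 216 configurations. Only the count
# comes from the abstract; the parameters themselves are illustrative.
GRID = {
    "hidden_units": [32, 64, 96, 128, 192, 256],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2],
}


def train_and_score(config):
    """Stand-in for training one network and returning its validation score.

    The formula is a deterministic toy penalty, not a real model or real F1.
    """
    score = 0.8
    score -= abs(config["hidden_units"] - 128) / 2000
    score -= abs(config["dropout"] - 0.2) / 5
    score -= abs(config["learning_rate"] - 1e-3) * 2
    return config, score


def sweep(grid, max_workers=8):
    """Evaluate every combination in the grid concurrently, best score first."""
    keys = list(grid)
    configs = [dict(zip(keys, values)) for values in product(*grid.values())]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(train_and_score, configs))
    return sorted(results, key=lambda item: item[1], reverse=True)


def sum_rule(prob_vectors):
    """Sum-rule ensemble: add per-class probabilities, return the top class index."""
    totals = [sum(column) for column in zip(*prob_vectors)]
    return max(range(len(totals)), key=totals.__getitem__)


ranked = sweep(GRID)
best_config, best_score = ranked[0]
top4 = ranked[:4]  # candidates for an ensemble of the 4 best models
```

Threads are used here only to keep the sketch self-contained. Real network training would more likely be dispatched as separate processes or cluster jobs, one per configuration, with the ranking and ensemble steps applied to the collected scores afterwards.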

Item metadata

id	P_RS_90409396b86c333e74d5ed1ec5905e2d
oai_identifier_str	oai:tede2.pucrs.br:tede/9908
network_acronym_str	P_RS
network_name_str	Biblioteca Digital de Teses e Dissertações da PUC_RS
repository_id_str
spelling	De Rose, César Augusto Fonticielha; Mastella, Juliana Obino. Combining the 4 best individual results through an ensemble technique, using the sum rule, did not improve performance (F1 = 0.826). Submitted by PPG Ciência da Computação (ppgcc@pucrs.br) on 2021-10-14T13:19:25Z. Approved for entry into archive by Sheila Dias (sheila.dias@pucrs.br) on 2021-10-14T17:41:16Z (GMT). Made available in DSpace on 2021-10-14T17:50:39Z (GMT). No. of bitstreams: 1, JULIANA OBINO MASTELLA_DIS.pdf: 1387653 bytes, checksum: d545a6b285249f7f12c6b6826d9baa36 (MD5). Previous issue date: 2020-08-31. Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES. The work has no restriction on publication. Biblioteca Digital de Teses e Dissertações, http://tede2.pucrs.br/tede2/, https://tede2.pucrs.br/oai/request
dc.title.por.fl_str_mv	Uma metodologia usando ambientes paralelos para otimização da classificação de textos aplicada a documentos jurídicos
dc.title.alternative.eng.fl_str_mv	A methodology using parallel environments to optimize text classification in legal documents
title	Uma metodologia usando ambientes paralelos para otimização da classificação de textos aplicada a documentos jurídicos
author	Mastella, Juliana Obino
author_facet	Mastella, Juliana Obino
author_role	author
dc.contributor.advisor1.fl_str_mv	De Rose, César Augusto Fonticielha
dc.contributor.author.fl_str_mv	Mastella, Juliana Obino
contributor_str_mv	De Rose, César Augusto Fonticielha
dc.subject.por.fl_str_mv	Classificação de Textos; Algoritmos de Classificação; Mineração de Textos; Classificação de Documentos; Documentos Jurídicos; PLN; Paralelismo
topic	Classificação de Textos; Algoritmos de Classificação; Mineração de Textos; Classificação de Documentos; Documentos Jurídicos; PLN; Paralelismo; Text Classification; Classification Algorithms; Text Mining; Documents Classification; Legal Documents; NLP; Parameter Sweep; Parallelism; CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO
dc.subject.eng.fl_str_mv	Text Classification; Classification Algorithms; Text Mining; Documents Classification; Legal Documents; NLP; Parameter Sweep; Parallelism
dc.subject.cnpq.fl_str_mv	CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO
publishDate	2020
dc.date.issued.fl_str_mv	2020-08-31
dc.date.accessioned.fl_str_mv	2021-10-14T17:50:39Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://tede2.pucrs.br/tede2/handle/tede/9908
url	http://tede2.pucrs.br/tede2/handle/tede/9908
dc.language.iso.fl_str_mv	por
language	por
dc.relation.program.fl_str_mv	-4570527706994352458
dc.relation.confidence.fl_str_mv	500 500 600
dc.relation.cnpq.fl_str_mv	-862078257083325301
dc.relation.sponsorship.fl_str_mv	3590462550136975366
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Pontifícia Universidade Católica do Rio Grande do Sul
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv	PUCRS
dc.publisher.country.fl_str_mv	Brasil
dc.publisher.department.fl_str_mv	Escola Politécnica
publisher.none.fl_str_mv	Pontifícia Universidade Católica do Rio Grande do Sul
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da PUC_RS instname:Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS) instacron:PUC_RS
instname_str	Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
instacron_str	PUC_RS
institution	PUC_RS
reponame_str	Biblioteca Digital de Teses e Dissertações da PUC_RS
collection	Biblioteca Digital de Teses e Dissertações da PUC_RS
bitstream.url.fl_str_mv	http://tede2.pucrs.br/tede2/bitstream/tede/9908/4/JULIANA+OBINO+MASTELLA_DIS.pdf.jpg http://tede2.pucrs.br/tede2/bitstream/tede/9908/3/JULIANA+OBINO+MASTELLA_DIS.pdf.txt http://tede2.pucrs.br/tede2/bitstream/tede/9908/2/JULIANA+OBINO+MASTELLA_DIS.pdf http://tede2.pucrs.br/tede2/bitstream/tede/9908/1/license.txt
bitstream.checksum.fl_str_mv	44dcb561327fab5f07832f2775da0682 3defc0af3944c7f37111f8b5e527d6a0 d545a6b285249f7f12c6b6826d9baa36 220e11f2d3ba5354f917c7035aadef24
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da PUC_RS - Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
repository.mail.fl_str_mv	biblioteca.central@pucrs.br
_version_	1799765352454815744
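
The MD5 values in bitstream.checksum.fl_str_mv appear in the same order as the files in bitstream.url.fl_str_mv, so a downloaded file can be checked against the record. The following Python sketch shows one way to do that with the standard library; the function names and the local file path are hypothetical, and nothing here is part of the repository's own tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected_md5):
    """Compare a local file against a checksum copied from the record."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, after saving the dissertation PDF locally under a name of your choice, `matches_record("dissertation.pdf", "d545a6b285249f7f12c6b6826d9baa36")` should return True if the download is intact.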
