Detecção de linguagem tóxica aplicada a textos em português

Trajano, Douglas de Oliveira

Bibliographic details
Main author:	Trajano, Douglas de Oliveira
Publication date:	2023
Document type:	Master's thesis
Language:	por
Source title:	Biblioteca Digital de Teses e Dissertações da PUC_RS
Full text:	https://tede2.pucrs.br/tede2/handle/tede/10782
Abstract:	The advent of social media has transformed the way individuals and communities interact and communicate. Messages on social media may contain expressions of opinion and messages of support, but they can also contain hate speech. The proliferation of hate speech in the digital sphere has become an increasingly pressing issue, with polarized opinions and a sense of anonymity and impunity among users often serving as contributing factors. Haters, users who spread hate speech, can be found in discussions on a variety of topics, including politics, entertainment, gaming, and corporate environments. Natural Language Processing (NLP) can contribute tools to ensure healthy communication and protect users’ rights online. NLP applications are efficient, standardized, and automated, avoiding the need for manual moderation of such content. In this study, we used advanced machine learning and deep learning techniques to develop a system for detecting toxic language in Portuguese messages. The dataset used to train the models consists of 6,354 comments (with the possibility of extending to 13,538) manually annotated by experts. This dataset, made available as part of the work, has annotations for 5 NLP tasks, using a hierarchical annotation scheme with different levels of granularity. The experimental results demonstrate the usefulness of this dataset for developing NLP systems aimed at detecting toxic language in Portuguese texts.

Item metadata

id	P_RS_8906fb7ee1613f675ce82076bf05a626
oai_identifier_str	oai:tede2.pucrs.br:tede/10782
network_acronym_str	P_RS
network_name_str	Biblioteca Digital de Teses e Dissertações da PUC_RS
repository_id_str
spelling	Submitted by PPG Ciência da Computação (ppgcc@pucrs.br) on 2023-05-11T18:15:15Z. Approved for entry into archive by Sheila Dias (sheila.dias@pucrs.br) on 2023-05-25T16:55:30Z (GMT). Made available in DSpace on 2023-05-25T17:02:12Z (GMT). Previous issue date: 2023-02-27. No. of bitstreams: 1 (DOUGLAS DE OLIVEIRA TRAJANO_DIS.pdf: 1508066 bytes, checksum: c0d1176f79f14a57ab325767922b4d63 (MD5)).
dc.title.por.fl_str_mv	Detecção de linguagem tóxica aplicada a textos em português
title	Detecção de linguagem tóxica aplicada a textos em português
title_short	Detecção de linguagem tóxica aplicada a textos em português
title_full	Detecção de linguagem tóxica aplicada a textos em português
title_fullStr	Detecção de linguagem tóxica aplicada a textos em português
title_full_unstemmed	Detecção de linguagem tóxica aplicada a textos em português
title_sort	Detecção de linguagem tóxica aplicada a textos em português
author	Trajano, Douglas de Oliveira
author_facet	Trajano, Douglas de Oliveira
author_role	author
dc.contributor.advisor1.fl_str_mv	Bordini, Rafael Heitor
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/4589262718627942
dc.contributor.authorLattes.fl_str_mv	http://lattes.cnpq.br/5924591783668175
dc.contributor.author.fl_str_mv	Trajano, Douglas de Oliveira
contributor_str_mv	Bordini, Rafael Heitor
dc.subject.por.fl_str_mv	Processamento de Linguagem Natural; Extração de Informações; Classificação de Texto; Reconhecimento de Entidades; Detecção de Discurso de Ódio; Linguagem Tóxica; Comentário Ofensivo; Segurança Online; Comentário Tóxico; Toxicidade; Racismo; Homofobia; Xenofobia
topic	Processamento de Linguagem Natural; Extração de Informações; Classificação de Texto; Reconhecimento de Entidades; Detecção de Discurso de Ódio; Linguagem Tóxica; Comentário Ofensivo; Segurança Online; Comentário Tóxico; Toxicidade; Racismo; Homofobia; Xenofobia; Natural Language Processing; Information Extraction; Text Classification; Named Entity Recognition; Hate Speech Detection; Toxic Language; Offensive Comment; Toxic Comment; Toxicity; Racism; Homophobia; Xenophobia; CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO
dc.subject.eng.fl_str_mv	Natural Language Processing; Information Extraction; Text Classification; Named Entity Recognition; Hate Speech Detection; Toxic Language; Offensive Comment; Toxic Comment; Toxicity; Racism; Homophobia; Xenophobia
dc.subject.cnpq.fl_str_mv	CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO
publishDate	2023
dc.date.accessioned.fl_str_mv	2023-05-25T17:02:12Z
dc.date.issued.fl_str_mv	2023-02-27
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://tede2.pucrs.br/tede2/handle/tede/10782
url	https://tede2.pucrs.br/tede2/handle/tede/10782
dc.language.iso.fl_str_mv	por
language	por
dc.relation.program.fl_str_mv	-4570527706994352458
dc.relation.confidence.fl_str_mv	500 500
dc.relation.cnpq.fl_str_mv	-862078257083325301
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Pontifícia Universidade Católica do Rio Grande do Sul
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv	PUCRS
dc.publisher.country.fl_str_mv	Brasil
dc.publisher.department.fl_str_mv	Escola Politécnica
publisher.none.fl_str_mv	Pontifícia Universidade Católica do Rio Grande do Sul
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da PUC_RS instname:Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS) instacron:PUC_RS
instname_str	Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
instacron_str	PUC_RS
institution	PUC_RS
reponame_str	Biblioteca Digital de Teses e Dissertações da PUC_RS
collection	Biblioteca Digital de Teses e Dissertações da PUC_RS
bitstream.url.fl_str_mv	https://tede2.pucrs.br/tede2/bitstream/tede/10782/4/DOUGLAS+DE+OLIVEIRA+TRAJANO_DIS.pdf.jpg https://tede2.pucrs.br/tede2/bitstream/tede/10782/3/DOUGLAS+DE+OLIVEIRA+TRAJANO_DIS.pdf.txt https://tede2.pucrs.br/tede2/bitstream/tede/10782/2/DOUGLAS+DE+OLIVEIRA+TRAJANO_DIS.pdf https://tede2.pucrs.br/tede2/bitstream/tede/10782/1/license.txt
bitstream.checksum.fl_str_mv	b23668d436fdb29a190ff74765b1721a 2041e03b74068982ad61188e71d95f16 c0d1176f79f14a57ab325767922b4d63 220e11f2d3ba5354f917c7035aadef24
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da PUC_RS - Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
repository.mail.fl_str_mv	biblioteca.central@pucrs.br
_version_	1799765361091936256
