Humor and offense speech classification and scoring using natural language processing

Mathias, Marcelo Custódio

Bibliographic details
Main author:	Mathias, Marcelo Custódio
Publication date:	2022
Document type:	Dissertation
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10071/26655
Abstract:	Identifying humor and offense can be an arduous task even for humans, and it is more challenging still to translate it into a logical process that a machine can follow. This work aims to develop machine learning models for this task. It is based on SemEval 2021 Task 7 (Detecting and Rating Humor and Offense), in which participants were challenged to identify and score humorous and offensive texts and to detect controversial sentences, and which encouraged the use of current state-of-the-art techniques in Natural Language Processing. The objective is to identify and propose the best setup for Humor Detection and related tasks using a common dataset of eight thousand sentences, each labeled with a binary humor indicator, a humor rating, a binary controversy indicator, and an offense rating. This document presents a solution based on BERT (Bidirectional Encoder Representations from Transformers), which uses Transformers to interpret sentences in both directions (bidirectionally) and thereby gives the model a much richer perception of context. It compares the performance of three BERT variants (BERTBASE, DistilBERT, and RoBERTa), each designed to better fit different tasks in industry and academia. DistilBERT produced the most accurate results in the Humor Detection and Humor Rating tasks, RoBERTa performed best in the controversy detection task, and BERTBASE performed best in the Offensiveness Ranking task.

Item metadata

id	RCAP_c9cc5fecc10fb1cfebb1455c3ce92290
oai_identifier_str	oai:repositorio.iscte-iul.pt:10071/26655
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Humor and offense speech classification and scoring using natural language processing
author	Mathias, Marcelo Custódio
author_facet	Mathias, Marcelo Custódio
author_role	author
dc.contributor.author.fl_str_mv	Mathias, Marcelo Custódio
dc.subject.por.fl_str_mv	Humor detection; NLP; BERT for humor; Controversiality detection; Offensiveness detection; Detecção de humor; PNL; BERT; Detecção controvérsia; Detecção ofensa
topic	Humor detection; NLP; BERT for humor; Controversiality detection; Offensiveness detection; Detecção de humor; PNL; BERT; Detecção controvérsia; Detecção ofensa
publishDate	2022
dc.date.none.fl_str_mv	2022-12-16T10:06:50Z 2022-12-05T00:00:00Z 2022-12-05 2022-10
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10071/26655 TID:203120639
url	http://hdl.handle.net/10071/26655
identifier_str_mv	TID:203120639
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799134847949602816
