Offensive language classification in social media: using deep learning

Wang, Susan Wei Chen

Bibliographic details
Main author:	Wang, Susan Wei Chen
Publication date:	2020
Document type:	Dissertation
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text:	http://hdl.handle.net/10362/106926
Abstract:	Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics

Item metadata

id	RCAP_722c23fa8e7169c3d47315cbc5b71f61
oai_identifier_str	oai:run.unl.pt:10362/106926
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str	7160
spelling	Offensive language classification in social media: using deep learning Offensive Language Hate Speech Toxic Language Abusive Language Social Media Twitter BERT Transformers Text Classification NLP Natural Language Processing Deep Learning Ensembles Discurso de Odio Linguagem Ofensiva Linguagem Abusiva Linguagem Tóxica Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
As social media usage becomes more integrated into our daily lives, the impact of online abuse also becomes more prevalent. Research in the area of Offensive Language Classification is extensive, and studies often proceed in parallel. The Offensive Language Identification Dataset (OLID) schema was introduced with the aim of consolidating related tasks by categorising offense into a three-level hierarchy: detection of offensive posts (Level A), distinguishing between targeted and untargeted offenses (Level B), and identifying the target of the offense (Level C). This thesis presents our contribution to the Offensive Language Classification Task (English SubTask A) of OffensEval 2020, and a follow-up study of Offense Type Classification (SubTask B) and Offense Target Identification (SubTask C) of OffensEval 2019. These tasks follow the OLID schema, where each level corresponds to an individual subtask. For SubTask A, the dataset is examined in detail and the most uncertain partitions are removed by under-sampling the training set. We improved model performance by increasing data quality and by taking advantage of further offensive language classification datasets. We fine-tuned separate BERT models on individual datasets and experimented with different ensemble approaches, including SVMs, gradient boosting, AdaBoost and logistic regression, to obtain a final ensemble classification model that improved the macro-F1 score. Our best model, an average ensemble of four different BERT models, achieved 11th place out of 82 participants with a macro-F1 score of 0.91344 in the English SubTask A. The datasets for SubTasks B and C are highly unbalanced, and modifying the classification thresholds improved classifier performance on the minority classes, which in turn improved overall performance. Again using the BERT architecture, the models achieved macro-F1 scores of 0.71367 for SubTask B and 0.643352 for SubTask C, equivalent to 5th and 2nd place in the respective tasks. We showed that BERT is an effective architecture for offensive language classification and propose that further performance gains are possible by improving data quality.
Marinho, Zita Alexandra Magalhães RUN Wang, Susan Wei Chen 2020-11-11T12:01:50Z 2020-10-15 2020-10-15T00:00:00Z info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/masterThesis application/pdf http://hdl.handle.net/10362/106926 TID:202535924 eng info:eu-repo/semantics/openAccess reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP 2024-03-11T04:51:41Z oai:run.unl.pt:10362/106926 Portal Agregador ONG https://www.rcaap.pt/oai/openaire opendoar:7160 2024-03-20T03:40:48.949464 Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação false
dc.title.none.fl_str_mv	Offensive language classification in social media: using deep learning
title	Offensive language classification in social media: using deep learning
spellingShingle	Offensive language classification in social media: using deep learning Wang, Susan Wei Chen Offensive Language Hate Speech Toxic Language Abusive Language Social Media Twitter BERT Transformers Text Classification NLP Natural Language Processing Deep Learning Ensembles Discurso de Odio Linguagem Ofensiva Linguagem Abusiva Linguagem Tóxica
title_short	Offensive language classification in social media: using deep learning
title_full	Offensive language classification in social media: using deep learning
title_fullStr	Offensive language classification in social media: using deep learning
title_full_unstemmed	Offensive language classification in social media: using deep learning
title_sort	Offensive language classification in social media: using deep learning
author	Wang, Susan Wei Chen
author_facet	Wang, Susan Wei Chen
author_role	author
dc.contributor.none.fl_str_mv	Marinho, Zita Alexandra Magalhães RUN
dc.contributor.author.fl_str_mv	Wang, Susan Wei Chen
dc.subject.por.fl_str_mv	Offensive Language Hate Speech Toxic Language Abusive Language Social Media Twitter BERT Transformers Text Classification NLP Natural Language Processing Deep Learning Ensembles Discurso de Odio Linguagem Ofensiva Linguagem Abusiva Linguagem Tóxica
topic	Offensive Language Hate Speech Toxic Language Abusive Language Social Media Twitter BERT Transformers Text Classification NLP Natural Language Processing Deep Learning Ensembles Discurso de Odio Linguagem Ofensiva Linguagem Abusiva Linguagem Tóxica
description	Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
publishDate	2020
dc.date.none.fl_str_mv	2020-11-11T12:01:50Z 2020-10-15 2020-10-15T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10362/106926 TID:202535924
url	http://hdl.handle.net/10362/106926
identifier_str_mv	TID:202535924
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799138022132809728
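
The abstract recorded above mentions three ideas that a small example may make more concrete: an average ensemble of several models, modified classification thresholds for unbalanced classes, and the macro-F1 score. The following is only a minimal sketch under stated assumptions, not the thesis's implementation. It assumes the ensemble averages each model's predicted probability for the positive class, and it combines the ensemble and the threshold change in one toy example even though the abstract reports them for different subtasks. All function names, probabilities, and labels are hypothetical.

```python
from statistics import mean


def average_ensemble(prob_lists):
    """Average the positive-class probability of each example across models."""
    return [mean(per_example) for per_example in zip(*prob_lists)]


def predict(probs, threshold=0.5):
    """Label an example 1 when its probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probs]


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels and outputs of four models (made-up numbers).
y_true = [1, 0, 1, 0, 0, 1]
model_probs = [
    [0.9, 0.2, 0.4, 0.1, 0.6, 0.3],
    [0.8, 0.1, 0.5, 0.2, 0.4, 0.4],
    [0.7, 0.3, 0.3, 0.1, 0.5, 0.2],
    [0.8, 0.2, 0.4, 0.2, 0.3, 0.3],
]

avg = average_ensemble(model_probs)

# Compare the default threshold with a lower one that favours the positive class.
for threshold in (0.5, 0.35):
    preds = predict(avg, threshold)
    print(threshold, preds, round(macro_f1(y_true, preds), 4))
```

On this toy data, lowering the threshold from 0.5 to 0.35 recovers one more positive example and raises macro-F1 from 0.625 to about 0.667. In practice a threshold would be chosen on held-out validation data rather than on the examples being scored.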
