End-to-End Pipeline For Analysing Media Coverage of Corruption in Portugal

Bibliographic details
Main author: Marques, Afonso Manuel Cunha
Publication date: 2022
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/139557
Abstract: In this work, we propose a data pipeline for collecting and analysing news articles on corruption and connected criminality from the Portuguese media outlets collaborating with our project. Our approach uses media text to support analyses of the perception of corruption, which until now have relied on public questionnaires and expert scores [1]. Concretely, we use the Mediacloud API [2], through its Portugal - National geographical collection, together with web scraping techniques to build a relational database of 18119 articles from 14 Portuguese news sources covering the period 01/01/2015 - 31/12/2020. We compared two pre-trained named entity recognition taggers in an internal manual annotation task and chose one of them to automatically extract information from the collected articles, namely the categories of the Second HAREM’s selective scenario [3]: organizations, persons, locations, dates and values. Furthermore, we present three research avenues aimed at extracting insights from the database and improving its usability:
• gauging the intensity and quality (founded vs. unfounded claims) of corruption cases in local municipalities, as presented by the Portuguese media, following the data collection of [4];
• enabling case studies by retrieving salient operations/cases from article titles;
• aggregating similar news articles by experimenting with topic modelling techniques, LDA [5] and Top2Vec [6].
id RCAP_587dbc4d2a3fc90d69a82f5c6423a79b
oai_identifier_str oai:run.unl.pt:10362/139557
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv End-to-End Pipeline For Analysing Media Coverage of Corruption in Portugal
author Marques, Afonso Manuel Cunha
author_facet Marques, Afonso Manuel Cunha
author_role author
dc.contributor.none.fl_str_mv Rodrigues, Rui
Peralta, Susana
RUN
dc.contributor.author.fl_str_mv Marques, Afonso Manuel Cunha
dc.subject.por.fl_str_mv Corruption
Media
Big Data
Local Governance
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
publishDate 2022
dc.date.none.fl_str_mv 2022-06-07T10:01:02Z
2022-02
2022-02-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/139557
url http://hdl.handle.net/10362/139557
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799138093422346240