End-to-End Pipeline For Analysing Media Coverage of Corruption in Portugal
Main author: | Marques, Afonso Manuel Cunha |
---|---|
Publication date: | 2022 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
Full text: | http://hdl.handle.net/10362/139557 |
Abstract: | In this work, we propose a data pipeline for the collection and analysis of news articles on corruption and connected criminality from the Portuguese media outlets collaborating with our project. Our approach uses media text to support analyses of the perception of corruption, which until now have relied on public questionnaires and expert scores [1]. Concretely, we use the Mediacloud API [2], through its Portugal - National geographical collection, together with web scraping techniques to build a relational database of 18119 articles from 14 Portuguese news sources, covering the period 01/01/2015 to 31/12/2020. We compared two pre-trained named entity recognition taggers on an internal manual annotation task and chose one of them to automatically extract information from the collected articles, namely the categories of Second HAREM's selective scenario [3]: organizations, persons, locations, dates and values. Furthermore, we present three research avenues aimed at extracting insights from the database and improving its usability: • gauging the intensity and quality (founded vs. unfounded claims) of corruption cases in local municipalities, as presented by the Portuguese media, following the data collection of [4]; • enabling case studies by retrieving salient operations/cases from article titles; • aggregating similar news articles by experimenting with topic modelling techniques, LDA [5] and Top2Vec [6]. |
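
The relational database described in the abstract could be laid out roughly as follows. This is a minimal sketch using Python's standard `sqlite3` module. The table names, column names, category labels and the demo rows are all illustrative assumptions; the record does not describe the thesis's actual schema or database engine. Only the five entity categories and the 2015 to 2020 date range come from the abstract.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption
# made for illustration, not the design used in the thesis.
SCHEMA = """
CREATE TABLE source (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE article (
    id           INTEGER PRIMARY KEY,
    source_id    INTEGER NOT NULL REFERENCES source(id),
    title        TEXT NOT NULL,
    -- ISO dates compare correctly as text; range taken from the abstract.
    publish_date TEXT NOT NULL
                 CHECK (publish_date BETWEEN '2015-01-01' AND '2020-12-31')
);
CREATE TABLE entity (
    id         INTEGER PRIMARY KEY,
    article_id INTEGER NOT NULL REFERENCES article(id),
    -- The five categories listed in the abstract (labels are assumed).
    category   TEXT NOT NULL CHECK (category IN
               ('ORGANIZATION', 'PERSON', 'LOCATION', 'DATE', 'VALUE')),
    text       TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO source (id, name) VALUES (1, 'Example Outlet')")
    conn.execute(
        "INSERT INTO article (id, source_id, title, publish_date) "
        "VALUES (1, 1, 'Placeholder headline', '2018-05-14')"
    )
    conn.executemany(
        "INSERT INTO entity (article_id, category, text) VALUES (?, ?, ?)",
        [(1, "ORGANIZATION", "Example Org"), (1, "LOCATION", "Lisboa")],
    )
    conn.commit()
    return conn


def entities_per_category(conn):
    """Return a dict mapping each entity category to its number of mentions."""
    rows = conn.execute(
        "SELECT category, COUNT(*) FROM entity "
        "GROUP BY category ORDER BY category"
    ).fetchall()
    return dict(rows)
```

Keeping entities in their own table, keyed by article, would let per-category or per-source aggregations be expressed as plain SQL queries; the `CHECK` constraints reject rows outside the stated categories and period.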
id |
RCAP_587dbc4d2a3fc90d69a82f5c6423a79b |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/139557 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
End-to-End Pipeline For Analysing Media Coverage of Corruption in Portugal |
author |
Marques, Afonso Manuel Cunha |
author_facet |
Marques, Afonso Manuel Cunha |
author_role |
author |
dc.contributor.none.fl_str_mv |
Rodrigues, Rui; Peralta, Susana; RUN |
dc.contributor.author.fl_str_mv |
Marques, Afonso Manuel Cunha |
dc.subject.por.fl_str_mv |
Corruption; Media; Big Data; Local Governance; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática |
topic |
Corruption; Media; Big Data; Local Governance; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-06-07T10:01:02Z; 2022-02; 2022-02-01T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/139557 |
url |
http://hdl.handle.net/10362/139557 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799138093422346240 |