The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts

Bibliographic details
Main author: Inácio, Marcio Lima
Publication date: 2022
Other authors: Cabezudo, Marco Antonio Sobrevilla; Ramisch, Renata; Di Felippo, Ariani; Pardo, Thiago Alexandre Salgueiro
Document type: preprint
Language: eng
Source title: SciELO Preprints
Full text: https://preprints.scielo.org/index.php/scielo/preprint/view/4652
Abstract: One of the most popular semantic representation languages in Natural Language Processing (NLP) is Abstract Meaning Representation (AMR). This formalism encodes the meaning of single sentences in directed rooted graphs. For English, there is a large annotated corpus that provides qualitative and reusable data for building or improving existing NLP methods and applications. For building AMR corpora for non-English languages, including Brazilian Portuguese, automatic and manual strategies have been conducted. The automatic annotation methods are essentially based on the cross-linguistic alignment of parallel corpora and the inheritance of the AMR annotation. The manual strategies focus on adapting the AMR English guidelines to a target language. Both annotation strategies have to deal with some phenomena that are challenging. This paper explores in detail some characteristics of Portuguese for which the AMR model had to be adapted and introduces two annotated corpora: AMRNews, a corpus of 870 annotated sentences from journalistic texts, and OpiSums-PT-AMR, comprising 404 opinionated sentences in AMR.
id SCI-1_98f3d7a0d8bf7c29f504b26a699b0b5a
oai_identifier_str oai:ops.preprints.scielo.org:preprint/4652
network_acronym_str SCI-1
network_name_str SciELO Preprints
repository_id_str
dc.title.none.fl_str_mv The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts
O corpus AMR-PT e a anotação semântica de sentenças desafiadoras de textos jornalísticos e opinativos
author Inácio, Marcio Lima
dc.contributor.author.fl_str_mv Inácio, Marcio Lima
Cabezudo, Marco Antonio Sobrevilla
Ramisch, Renata
Di Felippo, Ariani
Pardo, Thiago Alexandre Salgueiro
dc.subject.por.fl_str_mv anotação de corpus
representação de conhecimento
semântica
corpus annotation
knowledge representation
semantics
publishDate 2022
dc.date.none.fl_str_mv 2022-08-30
dc.type.driver.fl_str_mv info:eu-repo/semantics/preprint
info:eu-repo/semantics/publishedVersion
format preprint
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://preprints.scielo.org/index.php/scielo/preprint/view/4652
10.1590/1678-460x202255159
url https://preprints.scielo.org/index.php/scielo/preprint/view/4652
identifier_str_mv 10.1590/1678-460x202255159
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv https://preprints.scielo.org/index.php/scielo/article/view/4652/9050
dc.rights.driver.fl_str_mv https://creativecommons.org/licenses/by/4.0
info:eu-repo/semantics/openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by/4.0
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv SciELO Preprints
dc.source.none.fl_str_mv reponame:SciELO Preprints
instname:SciELO
instacron:SCI
instname_str SciELO
instacron_str SCI
institution SCI
reponame_str SciELO Preprints
collection SciELO Preprints
repository.name.fl_str_mv SciELO Preprints - SciELO
repository.mail.fl_str_mv scielo.submission@scielo.org
_version_ 1797047829825323008