The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts
Main author: | Inácio, Marcio Lima |
---|---|
Publication date: | 2022 |
Other authors: | Cabezudo, Marco Antonio Sobrevilla; Ramisch, Renata; Di Felippo, Ariani; Pardo, Thiago Alexandre Salgueiro |
Document type: | preprint |
Language: | eng |
Source title: | SciELO Preprints |
Full text: | https://preprints.scielo.org/index.php/scielo/preprint/view/4652 |
Abstract: | One of the most popular semantic representation languages in Natural Language Processing (NLP) is Abstract Meaning Representation (AMR). This formalism encodes the meaning of single sentences in directed rooted graphs. For English, there is a large annotated corpus that provides qualitative and reusable data for building or improving existing NLP methods and applications. For building AMR corpora for non-English languages, including Brazilian Portuguese, automatic and manual strategies have been conducted. The automatic annotation methods are essentially based on the cross-linguistic alignment of parallel corpora and the inheritance of the AMR annotation. The manual strategies focus on adapting the AMR English guidelines to a target language. Both annotation strategies have to deal with some phenomena that are challenging. This paper explores in detail some characteristics of Portuguese for which the AMR model had to be adapted and introduces two annotated corpora: AMRNews, a corpus of 870 annotated sentences from journalistic texts, and OpiSums-PT-AMR, comprising 404 opinionated sentences in AMR. |
id |
SCI-1_98f3d7a0d8bf7c29f504b26a699b0b5a |
---|---|
oai_identifier_str |
oai:ops.preprints.scielo.org:preprint/4652 |
network_acronym_str |
SCI-1 |
network_name_str |
SciELO Preprints |
repository_id_str |
|
spelling |
The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts; O corpus AMR-PT e a anotação semântica de sentenças desafiadoras de textos jornalísticos e opinativos; corpus annotation; knowledge representation; semantics; One of the most popular semantic representation languages in Natural Language Processing (NLP) is Abstract Meaning Representation (AMR). This formalism encodes the meaning of single sentences in directed rooted graphs. For English, there is a large annotated corpus that provides qualitative and reusable data for building or improving existing NLP methods and applications. For building AMR corpora for non-English languages, including Brazilian Portuguese, automatic and manual strategies have been conducted. The automatic annotation methods are essentially based on the cross-linguistic alignment of parallel corpora and the inheritance of the AMR annotation. The manual strategies focus on adapting the AMR English guidelines to a target language. Both annotation strategies have to deal with some phenomena that are challenging. This paper explores in detail some characteristics of Portuguese for which the AMR model had to be adapted and introduces two annotated corpora: AMRNews, a corpus of 870 annotated sentences from journalistic texts, and OpiSums-PT-AMR, comprising 404 opinionated sentences in AMR.; SciELO Preprints; 2022-08-30; info:eu-repo/semantics/preprint; info:eu-repo/semantics/publishedVersion; application/pdf; https://preprints.scielo.org/index.php/scielo/preprint/view/4652; 10.1590/1678-460x202255159; eng; https://preprints.scielo.org/index.php/scielo/article/view/4652/9050; Copyright (c) 2022 Marcio Lima Inácio, Marco Antonio Sobrevilla Cabezudo, Renata Ramisch, Ariani Di Felippo, Thiago Alexandre Salgueiro Pardo; https://creativecommons.org/licenses/by/4.0; info:eu-repo/semantics/openAccess; Inácio, Marcio Lima; Cabezudo, Marco Antonio Sobrevilla; Ramisch, Renata; Di Felippo, Ariani; Pardo, Thiago Alexandre Salgueiro; reponame:SciELO Preprints; instname:SciELO; instacron:SCI; 2022-08-30T20:24:58Z; oai:ops.preprints.scielo.org:preprint/4652; Preprint server; https://preprints.scielo.org/index.php/scielo; NGO; https://preprints.scielo.org/index.php/scielo/oai; scielo.submission@scielo.org; opendoar:2022-08-30T20:24:58; SciELO Preprints - SciELO; false |
dc.title.none.fl_str_mv |
The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts; O corpus AMR-PT e a anotação semântica de sentenças desafiadoras de textos jornalísticos e opinativos |
title |
The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts |
author |
Inácio, Marcio Lima |
author_role |
author |
author2 |
Cabezudo, Marco Antonio Sobrevilla; Ramisch, Renata; Di Felippo, Ariani; Pardo, Thiago Alexandre Salgueiro |
author2_role |
author; author; author; author |
dc.contributor.author.fl_str_mv |
Inácio, Marcio Lima; Cabezudo, Marco Antonio Sobrevilla; Ramisch, Renata; Di Felippo, Ariani; Pardo, Thiago Alexandre Salgueiro |
dc.subject.por.fl_str_mv |
corpus annotation; knowledge representation; semantics |
topic |
corpus annotation; knowledge representation; semantics |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-08-30 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/preprint; info:eu-repo/semantics/publishedVersion |
format |
preprint |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://preprints.scielo.org/index.php/scielo/preprint/view/4652; 10.1590/1678-460x202255159 |
url |
https://preprints.scielo.org/index.php/scielo/preprint/view/4652 |
identifier_str_mv |
10.1590/1678-460x202255159 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
https://preprints.scielo.org/index.php/scielo/article/view/4652/9050 |
dc.rights.driver.fl_str_mv |
https://creativecommons.org/licenses/by/4.0; info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
https://creativecommons.org/licenses/by/4.0 |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
SciELO Preprints |
publisher.none.fl_str_mv |
SciELO Preprints |
dc.source.none.fl_str_mv |
reponame:SciELO Preprints; instname:SciELO; instacron:SCI |
instname_str |
SciELO |
instacron_str |
SCI |
institution |
SCI |
reponame_str |
SciELO Preprints |
collection |
SciELO Preprints |
repository.name.fl_str_mv |
SciELO Preprints - SciELO |
repository.mail.fl_str_mv |
scielo.submission@scielo.org |
_version_ |
1797047829825323008 |