An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese
Main author: | Duran, Magali |
---|---|
Publication date: | 2022 |
Other authors: | Nunes, Maria das Graças Volpe; Lopes, Lucelene; Pardo, Thiago Alexandre Salgueiro |
Document type: | Article |
Language: | por |
Source title: | Domínios de Lingu@gem |
Full text: | https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632 |
Abstract: | With advances in Natural Language Processing (NLP), corpora have become prominent resources. Beyond supporting linguistic studies, they are the basis for training Machine Learning models and developing cutting-edge computational applications. Annotated corpora are in particularly high demand, but producing them requires another essential resource, the annotation manual, which instantiates the annotation model of interest for the language in question and lays out the annotation decisions to be adopted. In this paper, we explore issues in developing manuals for annotating Brazilian Portuguese corpora according to the Universal Dependencies (UD) model, which is widely adopted in the field. We discuss the evolution of NLP and the use of corpora; the fundamental issues, resources, and tools related to syntactic representation; the Universal Dependencies model; and the main decisions made in instantiating the UD guidelines for Brazilian Portuguese. For practical and didactic reasons, we divided the manual into two parts: the PoS Tag Annotation Manual (morphosyntactic annotation) and the Dependency Relations Annotation Manual. Both resulted from the process reported in this paper and are freely available on the POeTiSA project's website. |
id |
UFU-12_8854a04c9ecfaa71b63dd41a88396c5c |
---|---|
oai_identifier_str |
oai:ojs.www.seer.ufu.br:article/63632 |
network_acronym_str |
UFU-12 |
network_name_str |
Domínios de Lingu@gem |
repository_id_str |
|
spelling |
An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese; Manual de anotação como recurso de Processamento de Linguagem Natural: o modelo Universal Dependencies em língua portuguesa; Corpora anotados; Manual de anotação; Universal Dependencies; Árvores de dependência; Português brasileiro; Annotated corpora; Annotation manual; Universal Dependencies; Dependency trees; Brazilian Portuguese. With advances in Natural Language Processing (NLP), corpora have become prominent resources. Beyond supporting linguistic studies, they are the basis for training Machine Learning models and developing cutting-edge computational applications. Annotated corpora are in particularly high demand, but producing them requires another essential resource, the annotation manual, which instantiates the annotation model of interest for the language in question and lays out the annotation decisions to be adopted. In this paper, we explore issues in developing manuals for annotating Brazilian Portuguese corpora according to the Universal Dependencies (UD) model, which is widely adopted in the field. We discuss the evolution of NLP and the use of corpora; the fundamental issues, resources, and tools related to syntactic representation; the Universal Dependencies model; and the main decisions made in instantiating the UD guidelines for Brazilian Portuguese. For practical and didactic reasons, we divided the manual into two parts: the PoS Tag Annotation Manual (morphosyntactic annotation) and the Dependency Relations Annotation Manual. Both resulted from the process reported in this paper and are freely available on the POeTiSA project's website. PP/UFU 2022-09-12 info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion application/pdf text/xml https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632 10.14393/DL52-v16n4a2022-13 Domínios de Lingu@gem; Vol. 16 No. 4 (2022): The computational treatment of Brazilian Portuguese; 1608-1643 Domínios de Lingu@gem; Vol. 16 Núm. 4 (2022): El tratamiento computacional del portugués brasileño; 1608-1643 Domínios de Lingu@gem; v. 16 n. 4 (2022): Tratamento Computacional do Português Brasileiro; 1608-1643 1980-5799 reponame:Domínios de Lingu@gem instname:Universidade Federal de Uberlândia (UFU) instacron:UFU por https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632/34631 https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632/35238 Copyright (c) 2022 Magali Duran, Maria das Graças Volpe Nunes, Lucelene Lopes, Thiago Alexandre Salgueiro Pardo http://creativecommons.org/licenses/by-nc-nd/4.0 info:eu-repo/semantics/openAccess Duran, Magali; Nunes, Maria das Graças Volpe; Lopes, Lucelene; Pardo, Thiago Alexandre Salgueiro 2022-12-09T18:36:45Z oai:ojs.www.seer.ufu.br:article/63632 Revista https://seer.ufu.br/index.php/dominiosdelinguagem PUB https://seer.ufu.br/index.php/dominiosdelinguagem/oai revistadominios@ileel.ufu.br 1980-5799 1980-5799 opendoar:2022-12-09T18:36:45 Domínios de Lingu@gem - Universidade Federal de Uberlândia (UFU) false |
dc.title.none.fl_str_mv |
An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese; Manual de anotação como recurso de Processamento de Linguagem Natural: o modelo Universal Dependencies em língua portuguesa |
title |
An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese |
spellingShingle |
An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese; Duran, Magali; Corpora anotados; Manual de anotação; Universal Dependencies; Árvores de dependência; Português brasileiro; Annotated corpora; Annotation manual; Universal Dependencies; Dependency trees; Brazilian Portuguese |
title_short |
An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese |
title_full |
An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese |
title_fullStr |
An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese |
title_full_unstemmed |
An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese |
title_sort |
An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese |
author |
Duran, Magali |
author_facet |
Duran, Magali; Nunes, Maria das Graças Volpe; Lopes, Lucelene; Pardo, Thiago Alexandre Salgueiro |
author_role |
author |
author2 |
Nunes, Maria das Graças Volpe; Lopes, Lucelene; Pardo, Thiago Alexandre Salgueiro |
author2_role |
author author author |
dc.contributor.author.fl_str_mv |
Duran, Magali; Nunes, Maria das Graças Volpe; Lopes, Lucelene; Pardo, Thiago Alexandre Salgueiro |
dc.subject.por.fl_str_mv |
Corpora anotados; Manual de anotação; Universal Dependencies; Árvores de dependência; Português brasileiro; Annotated corpora; Annotation manual; Universal Dependencies; Dependency trees; Brazilian Portuguese |
topic |
Corpora anotados; Manual de anotação; Universal Dependencies; Árvores de dependência; Português brasileiro; Annotated corpora; Annotation manual; Universal Dependencies; Dependency trees; Brazilian Portuguese |
description |
With advances in Natural Language Processing (NLP), corpora have become prominent resources. Beyond supporting linguistic studies, they are the basis for training Machine Learning models and developing cutting-edge computational applications. Annotated corpora are in particularly high demand, but producing them requires another essential resource, the annotation manual, which instantiates the annotation model of interest for the language in question and lays out the annotation decisions to be adopted. In this paper, we explore issues in developing manuals for annotating Brazilian Portuguese corpora according to the Universal Dependencies (UD) model, which is widely adopted in the field. We discuss the evolution of NLP and the use of corpora; the fundamental issues, resources, and tools related to syntactic representation; the Universal Dependencies model; and the main decisions made in instantiating the UD guidelines for Brazilian Portuguese. For practical and didactic reasons, we divided the manual into two parts: the PoS Tag Annotation Manual (morphosyntactic annotation) and the Dependency Relations Annotation Manual. Both resulted from the process reported in this paper and are freely available on the POeTiSA project's website. |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-09-12 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632 10.14393/DL52-v16n4a2022-13 |
url |
https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632 |
identifier_str_mv |
10.14393/DL52-v16n4a2022-13 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.none.fl_str_mv |
https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632/34631 https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632/35238 |
dc.rights.driver.fl_str_mv |
http://creativecommons.org/licenses/by-nc-nd/4.0 info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
http://creativecommons.org/licenses/by-nc-nd/4.0 |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf text/xml |
dc.publisher.none.fl_str_mv |
PP/UFU |
publisher.none.fl_str_mv |
PP/UFU |
dc.source.none.fl_str_mv |
Domínios de Lingu@gem; Vol. 16 No. 4 (2022): The computational treatment of Brazilian Portuguese; 1608-1643 Domínios de Lingu@gem; Vol. 16 Núm. 4 (2022): El tratamiento computacional del portugués brasileño; 1608-1643 Domínios de Lingu@gem; v. 16 n. 4 (2022): Tratamento Computacional do Português Brasileiro; 1608-1643 1980-5799 reponame:Domínios de Lingu@gem instname:Universidade Federal de Uberlândia (UFU) instacron:UFU |
instname_str |
Universidade Federal de Uberlândia (UFU) |
instacron_str |
UFU |
institution |
UFU |
reponame_str |
Domínios de Lingu@gem |
collection |
Domínios de Lingu@gem |
repository.name.fl_str_mv |
Domínios de Lingu@gem - Universidade Federal de Uberlândia (UFU) |
repository.mail.fl_str_mv |
revistadominios@ileel.ufu.br |
_version_ |
1797067717702844416 |