An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese

Bibliographic details
Main author: Duran, Magali
Publication date: 2022
Other authors: Nunes, Maria das Graças Volpe, Lopes, Lucelene, Pardo, Thiago Alexandre Salgueiro
Document type: Article
Language: Portuguese (por)
Source title: Domínios de Lingu@gem
Full text: https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632
Abstract: As Natural Language Processing (NLP) has advanced, corpora have become prominent resources. Beyond supporting linguistic studies, they are the basis for training Machine Learning models and developing cutting-edge computational applications. Annotated corpora are in particular demand, but producing them requires another essential resource: the annotation manual, which instantiates the annotation model of interest for the language in question and sets out the annotation decisions to be adopted. In this paper, we explore issues related to developing manuals for annotating Brazilian Portuguese corpora according to the Universal Dependencies (UD) model, which is widely adopted in the field. We discuss the evolution of NLP and the use of corpora; the fundamental issues, resources, and tools related to syntactic representation; the UD model; and the main decisions made in instantiating the UD guidelines for Brazilian Portuguese. For practical and didactic reasons, we divided the manual into two parts: the PoS Tag Annotation Manual (morphosyntactic annotation) and the Dependency Relations Annotation Manual. Both resulted from the process reported in this paper and are freely available on the POeTiSA project's website.
id UFU-12_8854a04c9ecfaa71b63dd41a88396c5c
oai_identifier_str oai:ojs.www.seer.ufu.br:article/63632
network_acronym_str UFU-12
network_name_str Domínios de Lingu@gem
repository_id_str
dc.title.none.fl_str_mv An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese
Manual de anotação como recurso de Processamento de Linguagem Natural: o modelo Universal Dependencies em língua portuguesa
title An annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese
author Duran, Magali
author_facet Duran, Magali
Nunes, Maria das Graças Volpe
Lopes, Lucelene
Pardo, Thiago Alexandre Salgueiro
author_role author
author2 Nunes, Maria das Graças Volpe
Lopes, Lucelene
Pardo, Thiago Alexandre Salgueiro
author2_role author
author
author
dc.contributor.author.fl_str_mv Duran, Magali
Nunes, Maria das Graças Volpe
Lopes, Lucelene
Pardo, Thiago Alexandre Salgueiro
dc.subject.por.fl_str_mv Corpora anotados
Manual de anotação
Universal Dependencies
Árvores de dependência
Português brasileiro
Annotated corpora
Annotation manual
Universal Dependencies
Dependency trees
Brazilian Portuguese
publishDate 2022
dc.date.none.fl_str_mv 2022-09-12
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632
10.14393/DL52-v16n4a2022-13
url https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632
identifier_str_mv 10.14393/DL52-v16n4a2022-13
dc.language.iso.fl_str_mv por
language por
dc.relation.none.fl_str_mv https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632/34631
https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/63632/35238
dc.rights.driver.fl_str_mv http://creativecommons.org/licenses/by-nc-nd/4.0
info:eu-repo/semantics/openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc-nd/4.0
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
text/xml
dc.publisher.none.fl_str_mv PP/UFU
publisher.none.fl_str_mv PP/UFU
dc.source.none.fl_str_mv Domínios de Lingu@gem; Vol. 16 No. 4 (2022): The computational treatment of Brazilian Portuguese; 1608-1643
Domínios de Lingu@gem; Vol. 16 Núm. 4 (2022): El tratamiento computacional del portugués brasileño; 1608-1643
Domínios de Lingu@gem; v. 16 n. 4 (2022): Tratamento Computacional do Português Brasileiro; 1608-1643
1980-5799
reponame:Domínios de Lingu@gem
instname:Universidade Federal de Uberlândia (UFU)
instacron:UFU
instname_str Universidade Federal de Uberlândia (UFU)
instacron_str UFU
institution UFU
reponame_str Domínios de Lingu@gem
collection Domínios de Lingu@gem
repository.name.fl_str_mv Domínios de Lingu@gem - Universidade Federal de Uberlândia (UFU)
repository.mail.fl_str_mv revistadominios@ileel.ufu.br
_version_ 1797067717702844416