Recognition of the vocabulary of popular Brazilian newspapers with a freely available computational dictionary

Bibliographic details
Main author: Finatto, Maria José Bocorny
Publication date: 2019
Other authors: Vale, Oto Araújo; Laporte, Éric
Document type: Article
Language: Portuguese, English
Source title: Alfa (São José do Rio Preto. Online)
Full text: https://periodicos.fclar.unesp.br/alfa/article/view/11234
Abstract: We report an experiment that checks how well two versions of a computational dictionary of Brazilian Portuguese, DELAF PB 2004 and DELAF PB 2015, recognize a set of words from popular written Portuguese. The dictionary is freely available for linguistic analyses of Brazilian Portuguese and for other research, which justifies a critical study. The words come from the PorPopular corpus, composed of popular newspapers: Diário Gaúcho (DG) and the Bahian newspaper Massa! (MA). From DG, we studied a set of texts totaling 984,465 words (tokens), published in 2008 in the spelling used before the Orthographic Agreement of the Portuguese Language, adopted in 2009. From MA, we examined 215,776 words (tokens) from issues published in 2012, 2014 and 2015, all in the new spelling. The verification involved: a) generating lists of the unique words used in DG and MA; b) comparing these lists with the entry lists of the two versions of DELAF PB; c) assessing the coverage of this vocabulary; d) proposing ways of including the items not covered. On average, 19% of the types in the DG corpus were unknown to DELAF PB 2004 and 2015; in the MA sample, the average was 13%. The dictionary version had only a slight effect on item recognition performance.
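Steps a) to c) of the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names are hypothetical, the tokenization rule is an assumption, and the entry layout "form,lemma.POS:features" is assumed for the DELAF lines (escaped commas are not handled). The sample text and the sample entries, including their grammatical codes, are invented for illustration only.

```python
import re

# Assumption: a word is a run of letters, optionally joined by hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def word_types(text):
    """Step a): return the set of distinct lowercase word forms (types)."""
    return {m.group(0).lower() for m in WORD_RE.finditer(text)}


def delaf_forms(lines):
    """Step b): collect the inflected form that precedes the first comma
    of each entry, assuming lines shaped like 'form,lemma.POS:features'."""
    forms = set()
    for line in lines:
        line = line.strip()
        if line and "," in line:
            forms.add(line.split(",", 1)[0].lower())
    return forms


def unknown_rate(types, forms):
    """Step c): return the sorted unknown types and their share of all types."""
    unknown = sorted(types - forms)
    rate = len(unknown) / len(types) if types else 0.0
    return unknown, rate


# Invented sample data, not taken from PorPopular or DELAF PB.
sample_text = "O guri viu o jogo. O jogo foi tri bom."
sample_delaf = [
    "o,o.DET:ms",
    "viu,ver.V:J3s",
    "jogo,jogo.N:ms",
    "foi,ser.V:J3s",
    "bom,bom.A:ms",
]

types = word_types(sample_text)
unknown, rate = unknown_rate(types, delaf_forms(sample_delaf))
print(unknown, round(rate, 2))
```

Counting over types rather than tokens matches the way the abstract reports its percentages; a token-based rate would weight frequent words more heavily and would generally differ.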
id UNESP-4_21e284d52de5588c979815808d41d955
oai_identifier_str oai:ojs.pkp.sfu.ca:article/11234
network_acronym_str UNESP-4
network_name_str Alfa (São José do Rio Preto. Online)
repository_id_str
dc.title.none.fl_str_mv Recognition of the vocabulary of popular Brazilian newspapers with a freely available computational dictionary
Reconhecimento do vocabulário de jornais populares brasileiros por um dicionário computacional de acesso livre
title Recognition of the vocabulary of popular Brazilian newspapers with a freely available computational dictionary
spellingShingle Recognition of the vocabulary of popular Brazilian newspapers with a freely available computational dictionary
Finatto, Maria José Bocorny
Popular newspapers
Lexic
Vocabulary
Computational dictionary
Lexical coverage
Recognition of words
Brazilian Portuguese
Jornais populares
Léxico
Vocabulário
Dicionário computacional
Cobertura lexical
Reconhecimento de palavras
Português brasileiro
author Finatto, Maria José Bocorny
author_facet Finatto, Maria José Bocorny
Vale, Oto Araújo
Laporte, Éric
author_role author
author2 Vale, Oto Araújo
Laporte, Éric
author2_role author
author
dc.contributor.author.fl_str_mv Finatto, Maria José Bocorny
Vale, Oto Araújo
Laporte, Éric
dc.subject.por.fl_str_mv Popular newspapers
Lexic
Vocabulary
Computational dictionary
Lexical coverage
Recognition of words
Brazilian Portuguese
Jornais populares
Léxico
Vocabulário
Dicionário computacional
Cobertura lexical
Reconhecimento de palavras
Português brasileiro
topic Popular newspapers
Lexic
Vocabulary
Computational dictionary
Lexical coverage
Recognition of words
Brazilian Portuguese
Jornais populares
Léxico
Vocabulário
Dicionário computacional
Cobertura lexical
Reconhecimento de palavras
Português brasileiro
publishDate 2019
dc.date.none.fl_str_mv 2019-04-15
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://periodicos.fclar.unesp.br/alfa/article/view/11234
10.1590/1981-5794-1904-3
url https://periodicos.fclar.unesp.br/alfa/article/view/11234
identifier_str_mv 10.1590/1981-5794-1904-3
dc.language.iso.fl_str_mv por
eng
language por
eng
dc.relation.none.fl_str_mv https://periodicos.fclar.unesp.br/alfa/article/view/11234/8182
https://periodicos.fclar.unesp.br/alfa/article/view/11234/8178
dc.rights.driver.fl_str_mv Copyright (c) 2019 ALFA: Revista de Linguística
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Copyright (c) 2019 ALFA: Revista de Linguística
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv UNESP
publisher.none.fl_str_mv UNESP
dc.source.none.fl_str_mv ALFA: Revista de Linguística; v. 63 n. 1 (2019)
1981-5794
reponame:Alfa (São José do Rio Preto. Online)
instname:Universidade Estadual Paulista (UNESP)
instacron:UNESP
instname_str Universidade Estadual Paulista (UNESP)
instacron_str UNESP
institution UNESP
reponame_str Alfa (São José do Rio Preto. Online)
collection Alfa (São José do Rio Preto. Online)
repository.name.fl_str_mv Alfa (São José do Rio Preto. Online) - Universidade Estadual Paulista (UNESP)
repository.mail.fl_str_mv alfa@unesp.br
_version_ 1800214377483206656