AstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5

Detalhes bibliográficos
Autor(a) principal: Katiuscia de Moraes Andrade
Data de Publicação: 2013
Tipo de documento: Dissertação
Idioma: por
Título da fonte: Biblioteca Digital de Teses e Dissertações da UFC
Texto Completo: http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=11588
Resumo: AstrolÃbio is a compiled corpus, with multidimensional annotation, and shared under Creative Commons Attribution-NonCommercial 3.0 Unported licence. It is a corpus, in Brazilian Portuguese, that uses advanced technologies to text processing and corpora annotation. AstrolÃbio has multidimensional annotation based on TEI P5 guidelines, that prescribes XML metalanguage. Through these guidelines, essential structures from the annotated documents were preserved, keeping the transcription as reliable as possible to the original. By using tag <choice>, it enabled keep, in the same archive, linguistic variation phenomena, orthographic and punctuation errors, as the respectives corrected and normalized forms, and also makes possible the visualization of added and deleted terms. To automatize the integration of many levels of annotation, Astro was used, it is a software that works with several Python modules to Natural Language Processing (NLP), including Aelius and Enchant. To POS tagging, Aelius, a package that uses Natural Language Toolkit (NLTK) libraries, was utilized. From Aelius, AeliusHunPosMacMorpho was chosen, it is a tagger based on HunPos and trained by MAC-Morpho, a corpus composed of journalistic texts. The 9spell checking was made by Enchant, a large library with API (Application Programming Interface) in C and C++ languages. The tagger chosen from inside training corpus MacMorpho,. AstrolÃbio's texts were produced during text production workshops from the second edition of Rota das Especiarias project, realized on first semester of 2012, with public school students from Camocim, Barroquinha e Jijoca de Jericoacoara, cities located in CearÃ. Until this moment of AstrolÃbio's creation, concluded stages are texts selection, compilation and the first step of automatic annotation by Astro. AstrolÃbio corpus is already partially avaiable at Rota das Especiarias' website (www.rotadasespeciarias.art.br). Soon, the corpus will be submitted to University of Oxford Text Archive (OTA). As we observed from corpora scene of Portuguese, there's no corpus, in Brazilian Portuguese, with this level of annotation.
id UFC_ad2e1554fc276e4e9a9c6224a533f9a2
oai_identifier_str oai:www.teses.ufc.br:7934
network_acronym_str UFC
network_name_str Biblioteca Digital de Teses e Dissertações da UFC
spelling info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisAstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5AstrolÃbio: a corpus of school writings of Cearà multi-dimensionally annotated according to TEI P52013-02-18Leonel Figueiredo de Alencar Araripe30808545353Rosemeire Selma Monteiro93020473934VlÃdia CÃlia Monteiro Pinheiro35849487387http://lattes.cnpq.br/299128156551893486891170320http://lattes.cnpq.br/3356857811523580Katiuscia de Moraes AndradeUniversidade Federal do CearÃPrograma de PÃs-GraduaÃÃo em LingÃÃsticaUFCBRComputacional linguistics Corpus linguistics TEI P5 NLTK Spell checking POS taggingLinguÃstica Computacional LinguÃstica de Corpus TEI P5 NLTK CorreÃÃo automÃtica de textos Etiquetagem morfossintÃtica LINGUISTICAAstrolÃbio is a compiled corpus, with multidimensional annotation, and shared under Creative Commons Attribution-NonCommercial 3.0 Unported licence. It is a corpus, in Brazilian Portuguese, that uses advanced technologies to text processing and corpora annotation. AstrolÃbio has multidimensional annotation based on TEI P5 guidelines, that prescribes XML metalanguage. Through these guidelines, essential structures from the annotated documents were preserved, keeping the transcription as reliable as possible to the original. By using tag <choice>, it enabled keep, in the same archive, linguistic variation phenomena, orthographic and punctuation errors, as the respectives corrected and normalized forms, and also makes possible the visualization of added and deleted terms. To automatize the integration of many levels of annotation, Astro was used, it is a software that works with several Python modules to Natural Language Processing (NLP), including Aelius and Enchant. To POS tagging, Aelius, a package that uses Natural Language Toolkit (NLTK) libraries, was utilized. From Aelius, AeliusHunPosMacMorpho was chosen, it is a tagger based on HunPos and trained by MAC-Morpho, a corpus composed of journalistic texts. The 9spell checking was made by Enchant, a large library with API (Application Programming Interface) in C and C++ languages. The tagger chosen from inside training corpus MacMorpho,. AstrolÃbio's texts were produced during text production workshops from the second edition of Rota das Especiarias project, realized on first semester of 2012, with public school students from Camocim, Barroquinha e Jijoca de Jericoacoara, cities located in CearÃ. Until this moment of AstrolÃbio's creation, concluded stages are texts selection, compilation and the first step of automatic annotation by Astro. AstrolÃbio corpus is already partially avaiable at Rota das Especiarias' website (www.rotadasespeciarias.art.br). Soon, the corpus will be submitted to University of Oxford Text Archive (OTA). As we observed from corpora scene of Portuguese, there's no corpus, in Brazilian Portuguese, with this level of annotation. AstrolÃbio à um corpus compilado, anotado multidimensionalmente e disponibilizado eletronicamente sob a licenÃa Creative Commons Attribution-NonCommercial 3.0 Unported. Trata-se de um corpus, em PortuguÃs brasileiro, que emprega avanÃadas tecnologias para o processamento de texto e anotaÃÃo de corpora. AstrolÃbio possui anotaÃÃo multidimensional baseada na codificaÃÃo TEI P5, que prescreve o uso metalinguagem XML. Com o uso dessa codificaÃÃo, preservaram-se caracterÃsticas essenciais da estrutura e do conteÃdo dos documentos anotados, tornando a transcriÃÃo o mais fiel possÃvel ao original. Por meio do emprego da tag <choice>, foi possÃvel reunir, em um mesmo arquivo, fenÃmenos de variaÃÃo linguÃstica, erros ortogrÃficos e de pontuaÃÃo, bem como as respectivas formas corrigidas e normalizadas, alÃm de possibilitar a visualizaÃÃo de termos que foram acrescidos ou suprimidos. Para a integraÃÃo automÃtica dos vÃrios nÃveis de anotaÃÃo, utilizou-se o Astro, um software que utiliza diversos mÃdulos em Python para o Processamento da Linguagem Natural (PLN), como o Aelius e o Enchant. Na etiquetagem morfossintÃtica, utilizou-se o pacote Aelius, que, por sua vez, recorre à biblioteca Natural Language Toolkit (NLTK). O etiquetador escolhido, dentro do Aelius, foi o AeliusHunposMacMorpho, criado a partir do etiquetador Hunpos, treinado no corpus de textos jornalÃsticos MAC-Morpho. Efetivou-se a correÃÃo ortogrÃfica com o Enchant, uma vasta biblioteca com API (Application Programming Interface) em linguagem C e C++. Os textos que compÃem esse corpus foram produzidos durante as oficinas de produÃÃo textual da segunda ediÃÃo do projeto Rota das Especiarias, realizadas no primeiro semestre de 2012, com alunos de escolas pÃblicas das cidades cearenses de Camocim, Barroquinha e Jijoca de Jericoacoara. Atà o presente momento da construÃÃo do AstrolÃbio, encontram-se concluÃdas as etapas de seleÃÃo, escanerizaÃÃo, compilaÃÃo e a primeira fase de anotaÃÃo automÃtica dos textos por meio do Astro. O corpus AstrolÃbio jà se encontra parcialmente disponÃvel no sÃtio eletrÃnico Rota das Especiarias (www.rotadasespeciarias.art.br). Em breve, serà submetido ao repositÃrio eletrÃnico University of Oxford Text Archive (OTA). Pelo que se observou do panorama de corpora do PortuguÃs, inexiste um corpus, em PortuguÃs Brasileiro, com esse nÃvel de anotaÃÃo. CoordenaÃÃo de AperfeiÃoamento de Pessoal de NÃvel Superiorhttp://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=11588application/pdfinfo:eu-repo/semantics/openAccessporreponame:Biblioteca Digital de Teses e Dissertações da UFCinstname:Universidade Federal do Cearáinstacron:UFC2019-01-21T11:24:45Zmail@mail.com -
dc.title.pt.fl_str_mv AstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5
dc.title.alternative..fl_str_mv AstrolÃbio: a corpus of school writings of Cearà multi-dimensionally annotated according to TEI P5
title AstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5
spellingShingle AstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5
Katiuscia de Moraes Andrade
LinguÃstica Computacional
LinguÃstica de Corpus
TEI P5
NLTK
CorreÃÃo automÃtica de textos
Etiquetagem morfossintÃtica
LINGUISTICA
title_short AstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5
title_full AstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5
title_fullStr AstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5
title_full_unstemmed AstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5
title_sort AstrolÃbio: um corpus de redaÃÃes escolares do Cearà anotado multidimensionalmente conforme a TEI P5
author Katiuscia de Moraes Andrade
author_facet Katiuscia de Moraes Andrade
author_role author
dc.contributor.advisor1.fl_str_mv Leonel Figueiredo de Alencar Araripe
dc.contributor.advisor1ID.fl_str_mv 30808545353
dc.contributor.referee1.fl_str_mv Rosemeire Selma Monteiro
dc.contributor.referee1ID.fl_str_mv 93020473934
dc.contributor.referee2.fl_str_mv VlÃdia CÃlia Monteiro Pinheiro
dc.contributor.referee2ID.fl_str_mv 35849487387
dc.contributor.referee2Lattes.fl_str_mv http://lattes.cnpq.br/2991281565518934
dc.contributor.authorID.fl_str_mv 86891170320
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/3356857811523580
dc.contributor.author.fl_str_mv Katiuscia de Moraes Andrade
contributor_str_mv Leonel Figueiredo de Alencar Araripe
Rosemeire Selma Monteiro
VlÃdia CÃlia Monteiro Pinheiro
dc.subject.por.fl_str_mv LinguÃstica Computacional
LinguÃstica de Corpus
TEI P5
NLTK
CorreÃÃo automÃtica de textos
Etiquetagem morfossintÃtica
topic LinguÃstica Computacional
LinguÃstica de Corpus
TEI P5
NLTK
CorreÃÃo automÃtica de textos
Etiquetagem morfossintÃtica
LINGUISTICA
dc.subject.cnpq.fl_str_mv LINGUISTICA
dc.description.sponsorship.fl_txt_mv CoordenaÃÃo de AperfeiÃoamento de Pessoal de NÃvel Superior
dc.description.abstract..fl_txt_mv AstrolÃbio is a compiled corpus, with multidimensional annotation, and shared under Creative Commons Attribution-NonCommercial 3.0 Unported licence. It is a corpus, in Brazilian Portuguese, that uses advanced technologies to text processing and corpora annotation. AstrolÃbio has multidimensional annotation based on TEI P5 guidelines, that prescribes XML metalanguage. Through these guidelines, essential structures from the annotated documents were preserved, keeping the transcription as reliable as possible to the original. By using tag <choice>, it enabled keep, in the same archive, linguistic variation phenomena, orthographic and punctuation errors, as the respectives corrected and normalized forms, and also makes possible the visualization of added and deleted terms. To automatize the integration of many levels of annotation, Astro was used, it is a software that works with several Python modules to Natural Language Processing (NLP), including Aelius and Enchant. To POS tagging, Aelius, a package that uses Natural Language Toolkit (NLTK) libraries, was utilized. From Aelius, AeliusHunPosMacMorpho was chosen, it is a tagger based on HunPos and trained by MAC-Morpho, a corpus composed of journalistic texts. The 9spell checking was made by Enchant, a large library with API (Application Programming Interface) in C and C++ languages. The tagger chosen from inside training corpus MacMorpho,. AstrolÃbio's texts were produced during text production workshops from the second edition of Rota das Especiarias project, realized on first semester of 2012, with public school students from Camocim, Barroquinha e Jijoca de Jericoacoara, cities located in CearÃ. Until this moment of AstrolÃbio's creation, concluded stages are texts selection, compilation and the first step of automatic annotation by Astro. AstrolÃbio corpus is already partially avaiable at Rota das Especiarias' website (www.rotadasespeciarias.art.br). Soon, the corpus will be submitted to University of Oxford Text Archive (OTA). As we observed from corpora scene of Portuguese, there's no corpus, in Brazilian Portuguese, with this level of annotation.
dc.description.abstract.por.fl_txt_mv AstrolÃbio à um corpus compilado, anotado multidimensionalmente e disponibilizado eletronicamente sob a licenÃa Creative Commons Attribution-NonCommercial 3.0 Unported. Trata-se de um corpus, em PortuguÃs brasileiro, que emprega avanÃadas tecnologias para o processamento de texto e anotaÃÃo de corpora. AstrolÃbio possui anotaÃÃo multidimensional baseada na codificaÃÃo TEI P5, que prescreve o uso metalinguagem XML. Com o uso dessa codificaÃÃo, preservaram-se caracterÃsticas essenciais da estrutura e do conteÃdo dos documentos anotados, tornando a transcriÃÃo o mais fiel possÃvel ao original. Por meio do emprego da tag <choice>, foi possÃvel reunir, em um mesmo arquivo, fenÃmenos de variaÃÃo linguÃstica, erros ortogrÃficos e de pontuaÃÃo, bem como as respectivas formas corrigidas e normalizadas, alÃm de possibilitar a visualizaÃÃo de termos que foram acrescidos ou suprimidos. Para a integraÃÃo automÃtica dos vÃrios nÃveis de anotaÃÃo, utilizou-se o Astro, um software que utiliza diversos mÃdulos em Python para o Processamento da Linguagem Natural (PLN), como o Aelius e o Enchant. Na etiquetagem morfossintÃtica, utilizou-se o pacote Aelius, que, por sua vez, recorre à biblioteca Natural Language Toolkit (NLTK). O etiquetador escolhido, dentro do Aelius, foi o AeliusHunposMacMorpho, criado a partir do etiquetador Hunpos, treinado no corpus de textos jornalÃsticos MAC-Morpho. Efetivou-se a correÃÃo ortogrÃfica com o Enchant, uma vasta biblioteca com API (Application Programming Interface) em linguagem C e C++. Os textos que compÃem esse corpus foram produzidos durante as oficinas de produÃÃo textual da segunda ediÃÃo do projeto Rota das Especiarias, realizadas no primeiro semestre de 2012, com alunos de escolas pÃblicas das cidades cearenses de Camocim, Barroquinha e Jijoca de Jericoacoara. Atà o presente momento da construÃÃo do AstrolÃbio, encontram-se concluÃdas as etapas de seleÃÃo, escanerizaÃÃo, compilaÃÃo e a primeira fase de anotaÃÃo automÃtica dos textos por meio do Astro. O corpus AstrolÃbio jà se encontra parcialmente disponÃvel no sÃtio eletrÃnico Rota das Especiarias (www.rotadasespeciarias.art.br). Em breve, serà submetido ao repositÃrio eletrÃnico University of Oxford Text Archive (OTA). Pelo que se observou do panorama de corpora do PortuguÃs, inexiste um corpus, em PortuguÃs Brasileiro, com esse nÃvel de anotaÃÃo.
description AstrolÃbio is a compiled corpus, with multidimensional annotation, and shared under Creative Commons Attribution-NonCommercial 3.0 Unported licence. It is a corpus, in Brazilian Portuguese, that uses advanced technologies to text processing and corpora annotation. AstrolÃbio has multidimensional annotation based on TEI P5 guidelines, that prescribes XML metalanguage. Through these guidelines, essential structures from the annotated documents were preserved, keeping the transcription as reliable as possible to the original. By using tag <choice>, it enabled keep, in the same archive, linguistic variation phenomena, orthographic and punctuation errors, as the respectives corrected and normalized forms, and also makes possible the visualization of added and deleted terms. To automatize the integration of many levels of annotation, Astro was used, it is a software that works with several Python modules to Natural Language Processing (NLP), including Aelius and Enchant. To POS tagging, Aelius, a package that uses Natural Language Toolkit (NLTK) libraries, was utilized. From Aelius, AeliusHunPosMacMorpho was chosen, it is a tagger based on HunPos and trained by MAC-Morpho, a corpus composed of journalistic texts. The 9spell checking was made by Enchant, a large library with API (Application Programming Interface) in C and C++ languages. The tagger chosen from inside training corpus MacMorpho,. AstrolÃbio's texts were produced during text production workshops from the second edition of Rota das Especiarias project, realized on first semester of 2012, with public school students from Camocim, Barroquinha e Jijoca de Jericoacoara, cities located in CearÃ. Until this moment of AstrolÃbio's creation, concluded stages are texts selection, compilation and the first step of automatic annotation by Astro. AstrolÃbio corpus is already partially avaiable at Rota das Especiarias' website (www.rotadasespeciarias.art.br). Soon, the corpus will be submitted to University of Oxford Text Archive (OTA). As we observed from corpora scene of Portuguese, there's no corpus, in Brazilian Portuguese, with this level of annotation.
publishDate 2013
dc.date.issued.fl_str_mv 2013-02-18
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
status_str publishedVersion
format masterThesis
dc.identifier.uri.fl_str_mv http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=11588
url http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=11588
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal do CearÃ
dc.publisher.program.fl_str_mv Programa de PÃs-GraduaÃÃo em LingÃÃstica
dc.publisher.initials.fl_str_mv UFC
dc.publisher.country.fl_str_mv BR
publisher.none.fl_str_mv Universidade Federal do CearÃ
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da UFC
instname:Universidade Federal do Ceará
instacron:UFC
reponame_str Biblioteca Digital de Teses e Dissertações da UFC
collection Biblioteca Digital de Teses e Dissertações da UFC
instname_str Universidade Federal do Ceará
instacron_str UFC
institution UFC
repository.name.fl_str_mv -
repository.mail.fl_str_mv mail@mail.com
_version_ 1643295186169626624