Tokenization of Portuguese: resolving the hard cases

Bibliographic details
Main author: Branco, António Horta
Publication date: 2003
Other authors: Silva, João
Document type: Report
Language: por
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10451/14199
Abstract: This research note addresses the issue of ambiguous strings: strings of non-whitespace characters whose tokenization, depending on the specific occurrence, yields one or more than one token. Strings of this sort, which typically coincide with orthographically contracted forms, are shown to raise the problem of undesired circularity between tokenization and tagging, under the standard view that tokenization takes place before tagging. This apparently minor, low-level issue is critically important for three reasons: these strings correspond mostly to function words; they are quite frequent, covering over 2% of a corpus; and their careless treatment would introduce unrecoverable degradation of performance at a very early stage of language processing, which would in turn trigger further and wider loss of accuracy in all subsequent processing stages. We argue for a resolution of this circularity on the basis of a new, two-level approach to tokenization. This approach is also shown to help with the problem of sentence chunking at periods that are ambiguous between marking the end of an abbreviation and the end of a sentence.
id RCAP_5930bf8b4c3180827a6f2b689c07d237
oai_identifier_str oai:repositorio.ul.pt:10451/14199
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str 7160
dc.title.none.fl_str_mv Tokenization of Portuguese: resolving the hard cases
dc.contributor.none.fl_str_mv Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv Branco, António Horta
Silva, João
dc.subject.por.fl_str_mv Tokenization
sentence chunking
tagging
publishDate 2003
dc.date.none.fl_str_mv 2003-03
2003-03-01T00:00:00Z
2009-02-10T13:11:40Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/report
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10451/14199
dc.language.iso.fl_str_mv por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Department of Informatics, University of Lisbon
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799134258627870720