Tokenization of Portuguese: resolving the hard cases
Main author: | |
---|---|
Publication date: | 2003 |
Other authors: | |
Document type: | Report |
Language: | por |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10451/14199 |
Abstract: | This research note addresses the issue of ambiguous strings: strings of non-whitespace characters whose tokenization, depending on the specific occurrence, yields either one token or more than one. Strings of this sort, which typically coincide with orthographically contracted forms, are shown to raise a problem of undesired circularity between tokenization and tagging under the standard view that tokenization takes place before tagging. This apparently minor, low-level issue is critically important for three reasons: these strings correspond mostly to function words; they are quite frequent, covering over 2% of a corpus; and treating them carelessly would introduce unrecoverable degradation of performance at a very early stage of language processing, which would in turn trigger further and wider loss of accuracy in all subsequent processing stages. We argue for resolving this circularity on the basis of a new, two-level approach to tokenization. This approach is also shown to help with the problem of sentence chunking at periods that are ambiguous between marking the end of an abbreviation and the end of a sentence. |
id |
RCAP_5930bf8b4c3180827a6f2b689c07d237 |
---|---|
oai_identifier_str |
oai:repositorio.ul.pt:10451/14199 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Tokenization of Portuguese: resolving the hard cases |
title |
Tokenization of Portuguese: resolving the hard cases |
spellingShingle |
Tokenization of Portuguese: resolving the hard cases; Branco, António Horta; Tokenization; sentence chunking; tagging |
title_short |
Tokenization of Portuguese: resolving the hard cases |
title_full |
Tokenization of Portuguese: resolving the hard cases |
title_fullStr |
Tokenization of Portuguese: resolving the hard cases |
title_full_unstemmed |
Tokenization of Portuguese: resolving the hard cases |
title_sort |
Tokenization of Portuguese: resolving the hard cases |
author |
Branco, António Horta |
author_facet |
Branco, António Horta; Silva, João |
author_role |
author |
author2 |
Silva, João |
author2_role |
author |
dc.contributor.none.fl_str_mv |
Repositório da Universidade de Lisboa |
dc.contributor.author.fl_str_mv |
Branco, António Horta; Silva, João |
dc.subject.por.fl_str_mv |
Tokenization; sentence chunking; tagging |
topic |
Tokenization; sentence chunking; tagging |
description |
This research note addresses the issue of ambiguous strings: strings of non-whitespace characters whose tokenization, depending on the specific occurrence, yields either one token or more than one. Strings of this sort, which typically coincide with orthographically contracted forms, are shown to raise a problem of undesired circularity between tokenization and tagging under the standard view that tokenization takes place before tagging. This apparently minor, low-level issue is critically important for three reasons: these strings correspond mostly to function words; they are quite frequent, covering over 2% of a corpus; and treating them carelessly would introduce unrecoverable degradation of performance at a very early stage of language processing, which would in turn trigger further and wider loss of accuracy in all subsequent processing stages. We argue for resolving this circularity on the basis of a new, two-level approach to tokenization. This approach is also shown to help with the problem of sentence chunking at periods that are ambiguous between marking the end of an abbreviation and the end of a sentence. |
publishDate |
2003 |
dc.date.none.fl_str_mv |
2003-03 2003-03-01T00:00:00Z 2009-02-10T13:11:40Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/report |
format |
report |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10451/14199 |
url |
http://hdl.handle.net/10451/14199 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Department of Informatics, University of Lisbon |
publisher.none.fl_str_mv |
Department of Informatics, University of Lisbon |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799134258627870720 |