Tokenization of Portuguese: resolving the hard cases

Branco, António Horta; Silva, João

Bibliographic details
Main author:	Branco, António Horta
Publication date:	2003
Other authors:	Silva, João
Document type:	Report
Language:	por
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10451/14199
Abstract:	This research note addresses the issue of ambiguous strings: strings of non-whitespace characters whose tokenization, depending on the specific occurrence, yields one token or more than one. Strings of this sort, which typically coincide with orthographically contracted forms, are shown to raise the problem of undesired circularity between tokenization and tagging, under the standard view that tokenization takes place before tagging. This apparently minor, low-level issue is critically important for three reasons: these strings correspond mostly to function words; they are quite frequent, covering over 2% of a corpus; and their careless treatment would introduce unrecoverable degradation of performance at a very early stage of language processing, which would in turn trigger further and wider loss of accuracy in all subsequent processing stages. We argue for a resolution of this circularity on the basis of a new, two-level approach to tokenization. This approach is also shown to help with sentence chunking at periods that are ambiguous between marking the end of an abbreviation and marking the end of a sentence.

Item metadata

id	RCAP_5930bf8b4c3180827a6f2b689c07d237
oai_identifier_str	oai:repositorio.ul.pt:10451/14199
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Tokenization of Portuguese: resolving the hard cases
title	Tokenization of Portuguese: resolving the hard cases
spellingShingle	Tokenization of Portuguese: resolving the hard cases Branco, António Horta Tokenization sentence chunking tagging
title_short	Tokenization of Portuguese: resolving the hard cases
title_full	Tokenization of Portuguese: resolving the hard cases
title_fullStr	Tokenization of Portuguese: resolving the hard cases
title_full_unstemmed	Tokenization of Portuguese: resolving the hard cases
title_sort	Tokenization of Portuguese: resolving the hard cases
author	Branco, António Horta
author_facet	Branco, António Horta; Silva, João
author_role	author
author2	Silva, João
author2_role	author
dc.contributor.none.fl_str_mv	Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv	Branco, António Horta; Silva, João
dc.subject.por.fl_str_mv	Tokenization; sentence chunking; tagging
topic	Tokenization; sentence chunking; tagging
description	This research note addresses the issue of ambiguous strings: strings of non-whitespace characters whose tokenization, depending on the specific occurrence, yields one token or more than one. Strings of this sort, which typically coincide with orthographically contracted forms, are shown to raise the problem of undesired circularity between tokenization and tagging, under the standard view that tokenization takes place before tagging. This apparently minor, low-level issue is critically important for three reasons: these strings correspond mostly to function words; they are quite frequent, covering over 2% of a corpus; and their careless treatment would introduce unrecoverable degradation of performance at a very early stage of language processing, which would in turn trigger further and wider loss of accuracy in all subsequent processing stages. We argue for a resolution of this circularity on the basis of a new, two-level approach to tokenization. This approach is also shown to help with sentence chunking at periods that are ambiguous between marking the end of an abbreviation and marking the end of a sentence.
publishDate	2003
dc.date.none.fl_str_mv	2003-03 2003-03-01T00:00:00Z 2009-02-10T13:11:40Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/report
format	report
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10451/14199
url	http://hdl.handle.net/10451/14199
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Department of Informatics, University of Lisbon
publisher.none.fl_str_mv	Department of Informatics, University of Lisbon
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799134258627870720
