A multilingual and multidomain study on dialog act recognition using character-level tokenization

Ribeiro, E.; Ribeiro, R.; de Matos, D. M.

Bibliographic details
Main author:	Ribeiro, E.
Publication date:	2019
Other authors:	Ribeiro, R., de Matos, D. M.
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10071/17539
Abstract:	Automatic dialog act recognition is an important step for dialog systems since it reveals the intention behind the words uttered by its conversational partners. Although most approaches on the task use word-level tokenization, there is information at the sub-word level that is related to the function of the words and, consequently, their intention. Thus, in this study, we explored the use of character-level tokenization to capture that information. We explored the use of multiple character windows of different sizes to capture morphological aspects, such as affixes and lemmas, as well as inter-word information. Furthermore, we assessed the importance of punctuation and capitalization for the task. To broaden the conclusions of our study, we performed experiments on dialogs in three languages—English, Spanish, and German—which have different morphological characteristics. Furthermore, the dialogs cover multiple domains and are annotated with both domain-dependent and domain-independent dialog act labels. The achieved results not only show that the character-level approach leads to similar or better performance than the state-of-the-art word-level approaches on the task, but also that both approaches are able to capture complementary information. Thus, the best results are achieved by combining tokenization at both levels.
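
The idea of reading an utterance through several character windows of different sizes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `char_windows`, the window sizes, and the `keep_case` / `keep_punct` switches are assumptions chosen for illustration, and the sketch only extracts the windows rather than training a classifier on them.

```python
import string


def char_windows(utterance, sizes=(3, 5), keep_case=True, keep_punct=True):
    """Return {size: [windows]} of overlapping character windows.

    Hypothetical helper: sizes and flags are illustrative assumptions,
    not values taken from the study.
    """
    text = utterance
    if not keep_case:
        # Discard capitalization information.
        text = text.lower()
    if not keep_punct:
        # Discard punctuation information.
        text = "".join(ch for ch in text if ch not in string.punctuation)
    # Slide a window of each size over the raw character sequence.
    # Spaces are kept, so windows can span word boundaries.
    return {
        n: [text[i:i + n] for i in range(len(text) - n + 1)]
        for n in sizes
    }


windows = char_windows("Is it working?", sizes=(3, 5))
print(windows[3][:3])   # short windows can isolate affix-like fragments
print(windows[5][:2])   # wider windows can cover parts of two words
```

Short windows tend to isolate fragments such as the suffix "ing", while wider windows straddle spaces and so carry some inter-word context; switching `keep_case` or `keep_punct` off is one simple way to probe how much those cues matter.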

Item metadata

id	RCAP_6a6e418fd6eec78a48e66a7fa8c1e18f
oai_identifier_str	oai:repositorio.iscte-iul.pt:10071/17539
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	A multilingual and multidomain study on dialog act recognition using character-level tokenization | Dialog act recognition | Character-level | Multilinguality | Multidomain | Automatic dialog act recognition is an important step for dialog systems since it reveals the intention behind the words uttered by its conversational partners. Although most approaches on the task use word-level tokenization, there is information at the sub-word level that is related to the function of the words and, consequently, their intention. Thus, in this study, we explored the use of character-level tokenization to capture that information. We explored the use of multiple character windows of different sizes to capture morphological aspects, such as affixes and lemmas, as well as inter-word information. Furthermore, we assessed the importance of punctuation and capitalization for the task. To broaden the conclusions of our study, we performed experiments on dialogs in three languages—English, Spanish, and German—which have different morphological characteristics. Furthermore, the dialogs cover multiple domains and are annotated with both domain-dependent and domain-independent dialog act labels. The achieved results not only show that the character-level approach leads to similar or better performance than the state-of-the-art word-level approaches on the task, but also that both approaches are able to capture complementary information. Thus, the best results are achieved by combining tokenization at both levels. | MDPI AG | 2019-03-08T15:20:26Z | 2019-01-01T00:00:00Z | 2019 | 2019-03-08T15:19:55Z | info:eu-repo/semantics/publishedVersion | info:eu-repo/semantics/article | application/pdf | http://hdl.handle.net/10071/17539 | eng | 2078-2489 | 10.3390/info10030094 | Ribeiro, E. | Ribeiro, R. | de Matos, D. M. | info:eu-repo/semantics/openAccess | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) | instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação | instacron:RCAAP | 2023-11-09T17:49:25Z | oai:repositorio.iscte-iul.pt:10071/17539 | Portal Agregador | ONG | https://www.rcaap.pt/oai/openaire | opendoar:7160 | 2024-03-19T22:24:16.365941 | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação | false
dc.title.none.fl_str_mv	A multilingual and multidomain study on dialog act recognition using character-level tokenization
title	A multilingual and multidomain study on dialog act recognition using character-level tokenization
spellingShingle	A multilingual and multidomain study on dialog act recognition using character-level tokenization | Ribeiro, E. | Dialog act recognition | Character-level | Multilinguality | Multidomain
title_short	A multilingual and multidomain study on dialog act recognition using character-level tokenization
title_full	A multilingual and multidomain study on dialog act recognition using character-level tokenization
title_fullStr	A multilingual and multidomain study on dialog act recognition using character-level tokenization
title_full_unstemmed	A multilingual and multidomain study on dialog act recognition using character-level tokenization
title_sort	A multilingual and multidomain study on dialog act recognition using character-level tokenization
author	Ribeiro, E.
author_facet	Ribeiro, E.; Ribeiro, R.; de Matos, D. M.
author_role	author
author2	Ribeiro, R.; de Matos, D. M.
author2_role	author author
dc.contributor.author.fl_str_mv	Ribeiro, E.; Ribeiro, R.; de Matos, D. M.
dc.subject.por.fl_str_mv	Dialog act recognition; Character-level; Multilinguality; Multidomain
topic	Dialog act recognition; Character-level; Multilinguality; Multidomain
description	Automatic dialog act recognition is an important step for dialog systems since it reveals the intention behind the words uttered by its conversational partners. Although most approaches on the task use word-level tokenization, there is information at the sub-word level that is related to the function of the words and, consequently, their intention. Thus, in this study, we explored the use of character-level tokenization to capture that information. We explored the use of multiple character windows of different sizes to capture morphological aspects, such as affixes and lemmas, as well as inter-word information. Furthermore, we assessed the importance of punctuation and capitalization for the task. To broaden the conclusions of our study, we performed experiments on dialogs in three languages—English, Spanish, and German—which have different morphological characteristics. Furthermore, the dialogs cover multiple domains and are annotated with both domain-dependent and domain-independent dialog act labels. The achieved results not only show that the character-level approach leads to similar or better performance than the state-of-the-art word-level approaches on the task, but also that both approaches are able to capture complementary information. Thus, the best results are achieved by combining tokenization at both levels.
publishDate	2019
dc.date.none.fl_str_mv	2019-03-08T15:20:26Z 2019-01-01T00:00:00Z 2019 2019-03-08T15:19:55Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10071/17539
url	http://hdl.handle.net/10071/17539
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	2078-2489 10.3390/info10030094
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	MDPI AG
publisher.none.fl_str_mv	MDPI AG
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799134805154070528
