A multilingual and multidomain study on dialog act recognition using character-level tokenization

Bibliographic details
Main author: Ribeiro, E.
Publication date: 2019
Other authors: Ribeiro, R., de Matos, D. M.
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10071/17539
Abstract: Automatic dialog act recognition is an important step for dialog systems since it reveals the intention behind the words uttered by their conversational partners. Although most approaches to the task use word-level tokenization, there is information at the sub-word level that is related to the function of the words and, consequently, their intention. Thus, in this study, we explored the use of character-level tokenization to capture that information. We explored the use of multiple character windows of different sizes to capture morphological aspects, such as affixes and lemmas, as well as inter-word information. Furthermore, we assessed the importance of punctuation and capitalization for the task. To broaden the conclusions of our study, we performed experiments on dialogs in three languages (English, Spanish, and German) which have different morphological characteristics. Furthermore, the dialogs cover multiple domains and are annotated with both domain-dependent and domain-independent dialog act labels. The achieved results not only show that the character-level approach leads to similar or better performance than the state-of-the-art word-level approaches to the task, but also that both approaches are able to capture complementary information. Thus, the best results are achieved by combining tokenization at both levels.
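Illustrative note: the idea of reading an utterance through character windows of several sizes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `char_windows`, the default sizes (3, 5, 7), and the `keep_case` and `keep_punct` switches are hypothetical choices made here for illustration. The record does not state which window sizes or preprocessing the study used.

```python
import string


def char_windows(utterance, sizes=(3, 5, 7), keep_case=True, keep_punct=True):
    """Return {window size: list of character n-grams} for one utterance.

    All names and defaults are hypothetical; they are not taken from the paper.
    """
    text = utterance
    if not keep_case:
        # Drop capitalization to see what is lost without it.
        text = text.lower()
    if not keep_punct:
        # Drop ASCII punctuation to see what is lost without it.
        text = "".join(ch for ch in text if ch not in string.punctuation)

    windows = {}
    for n in sizes:
        # Slide a window of n characters over the text. Spaces are kept,
        # so wider windows can span word boundaries (inter-word information),
        # while narrow ones tend to cover affix-sized fragments.
        windows[n] = [text[i:i + n] for i in range(len(text) - n + 1)]
    return windows


if __name__ == "__main__":
    for size, grams in char_windows("Is it?", sizes=(3,)).items():
        print(size, grams)
```

In a full system each window size would typically feed its own feature extractor, and the switches would let punctuation and capitalization be ablated independently. That wiring is outside the scope of this sketch.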
id RCAP_6a6e418fd6eec78a48e66a7fa8c1e18f
oai_identifier_str oai:repositorio.iscte-iul.pt:10071/17539
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.subject.por.fl_str_mv Dialog act recognition
Character-level
Multilinguality
Multidomain
publishDate 2019
dc.date.none.fl_str_mv 2019-03-08T15:20:26Z
2019-01-01T00:00:00Z
2019
2019-03-08T15:19:55Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10071/17539
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 2078-2489
10.3390/info10030094
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv MDPI AG
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
_version_ 1799134805154070528