Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
Main author: | Guerreiro, N. M. |
---|---|
Publication date: | 2021 |
Other authors: | Rei, R., Batista, F. |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10071/23064 |
Abstract: | This paper proposes a flexible approach for punctuation prediction that can be used to produce state-of-the-art results in a multilingual scenario. We performed experiments using transcripts of TED Talks from the IWSLT 2017 and IWSLT 2011 evaluation campaigns. Our experiments show that recognition errors in the ASR output degrade the performance of our models, in line with the related literature. Our monolingual models perform consistently on human-edited transcripts of German, Dutch, Portuguese and Romanian, suggesting that, with pre-trained contextual models, commas may be more difficult to predict than periods. We trained a single multilingual model that predicts punctuation in multiple languages and achieves results comparable with those of the monolingual models, providing evidence of the potential of a single multilingual model to solve the task for multiple languages. We then argue that current punctuation systems in the literature implicitly depend on correct segmentation of the ASR outputs, because they rely on positional information to solve the punctuation task. This requirement is too strong for a real-life application. Through several experiments, we show that our method of training and testing models is more robust to different segmentations. These contributions are of particular importance in our multilingual pipeline, since they avoid training a different model for each of the languages involved and ensure that the model is more robust to incorrect segmentation of the ASR outputs than other methods in the literature. To the best of our knowledge, we report the first experiments using a single multilingual model for punctuation restoration in multiple languages. |
id | RCAP_f87e3243a70b8f1a1db4331e84e5414b |
oai_identifier_str | oai:repositorio.iscte-iul.pt:10071/23064 |
network_acronym_str | RCAP |
network_name_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str | 7160 |
author | Guerreiro, N. M. |
author2 | Rei, R.; Batista, F. |
dc.subject.por.fl_str_mv | Punctuation marks; Intelligent subtitles; Pre-trained embeddings; Speech transcripts; Sentence boundaries; Multilingual embeddings |
publishDate | 2021 |
dc.date.none.fl_str_mv | 2021-01-01T00:00:00Z; 2021; 2021-09-01T16:08:22Z; 2023-08-14T00:00:00Z |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/article |
dc.identifier.uri.fl_str_mv | http://hdl.handle.net/10071/23064 |
dc.language.iso.fl_str_mv | eng |
dc.relation.none.fl_str_mv | 0957-4174; 10.1016/j.eswa.2021.115740 |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
dc.format.none.fl_str_mv | application/pdf |
dc.publisher.none.fl_str_mv | Pergamon/Elsevier |
dc.source.none.fl_str_mv | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |