Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts

Bibliographic details
Main author: Guerreiro, N. M.
Publication date: 2021
Other authors: Rei, R., Batista, F.
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10071/23064
Abstract: This paper proposes a flexible approach for punctuation prediction that can be used to produce state-of-the-art results in a multilingual scenario. We have performed experiments using transcripts of TED Talks from the IWSLT 2017 and IWSLT 2011 evaluation campaigns. Our experiments show that the recognition errors of the ASR output degrade the performance of our models, in line with related literature. Our monolingual models perform consistently on human-edited transcripts of German, Dutch, Portuguese and Romanian, suggesting that, with pre-trained contextual models, commas may be more difficult to predict than periods. We have trained a single multilingual model that predicts punctuation in multiple languages and achieves results comparable with those of the monolingual models, which provides evidence of the potential of a single multilingual model to solve the task for multiple languages. We then argue that current punctuation systems in the literature implicitly depend on correct segmentation of ASR outputs, because they rely on positional information to solve the punctuation task. This is too strong a requirement for a real-life application. Through several experiments, we show that our method for training and testing models is more robust to different segmentations. These contributions are of particular importance in our multilingual pipeline, since they avoid training a different model for each of the languages involved, and they guarantee that the model will be more robust to incorrect segmentation of the ASR outputs than other methods in the literature. To the best of our knowledge, we report the first experiments using a single multilingual model for punctuation restoration in multiple languages.
Record ID: RCAP_f87e3243a70b8f1a1db4331e84e5414b
OAI identifier: oai:repositorio.iscte-iul.pt:10071/23064
Repository ID: 7160
Keywords: Punctuation marks; Intelligent subtitles; Pre-trained embeddings; Speech transcripts; Sentence boundaries; Multilingual embeddings
publishDate 2021
dc.date.none.fl_str_mv 2021-01-01T00:00:00Z
2021
2021-09-01T16:08:22Z
2023-08-14T00:00:00Z
Version: published version (info:eu-repo/semantics/publishedVersion)
Type: article (info:eu-repo/semantics/article)
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2021.115740
Access rights: open access (info:eu-repo/semantics/openAccess)
Format: application/pdf
Publisher: Pergamon/Elsevier
Institution: Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação (RCAAP)