Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts

Guerreiro, N. M.; Rei, R.; Batista, F.

Bibliographic details
Main author:	Guerreiro, N. M.
Publication date:	2021
Other authors:	Rei, R., Batista, F.
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10071/23064
Abstract:	This paper proposes a flexible approach for punctuation prediction that can be used to produce state-of-the-art results in a multilingual scenario. We have performed experiments using transcripts of TED Talks from the IWSLT 2017 and IWSLT 2011 evaluation campaigns. Our experiments show that the recognition errors of the ASR output degrade the performance of our models, in line with related literature. Our monolingual models, which use pre-trained contextual models, perform consistently on human-edited transcripts of German, Dutch, Portuguese and Romanian, and the results suggest that commas may be more difficult to predict than periods. We have trained a single multilingual model that predicts punctuation in multiple languages and achieves results comparable with those of the monolingual models, revealing evidence of the potential of using a single multilingual model to solve the task for multiple languages. Then, we argue that current punctuation systems in the literature are implicitly dependent on correct segmentation of ASR outputs, because they rely on positional information to solve the punctuation task. This is too strong a requirement for use in a real-life application. Through several experiments, we show that our method to train and test models is more robust to different segmentations. These contributions are of particular importance in our multilingual pipeline, since they avoid training a different model for each of the involved languages, and they make the model more robust to incorrect segmentation of the ASR outputs in comparison with other methods in the literature. To the best of our knowledge, we report the first experiments using a single multilingual model for punctuation restoration in multiple languages.

Item metadata

id	RCAP_f87e3243a70b8f1a1db4331e84e5414b
oai_identifier_str	oai:repositorio.iscte-iul.pt:10071/23064
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
title	Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
title_short	Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
title_full	Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
title_fullStr	Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
title_full_unstemmed	Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
title_sort	Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
author	Guerreiro, N. M.
author_facet	Guerreiro, N. M. Rei, R. Batista, F.
author_role	author
author2	Rei, R. Batista, F.
author2_role	author author
dc.contributor.author.fl_str_mv	Guerreiro, N. M. Rei, R. Batista, F.
dc.subject.por.fl_str_mv	Punctuation marks Intelligent subtitles Pre-trained embeddings Speech transcripts Sentence boundaries Multilingual embeddings
topic	Punctuation marks Intelligent subtitles Pre-trained embeddings Speech transcripts Sentence boundaries Multilingual embeddings
publishDate	2021
dc.date.none.fl_str_mv	2021-01-01T00:00:00Z 2021 2021-09-01T16:08:22Z 2023-08-14T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10071/23064
url	http://hdl.handle.net/10071/23064
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	0957-4174 10.1016/j.eswa.2021.115740
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Pergamon/Elsevier
publisher.none.fl_str_mv	Pergamon/Elsevier
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799134730411573248
