Recovering capitalization and punctuation marks for automatic speech recognition: case study for Portuguese broadcast news

Bibliographic details
Main author: Batista, F.
Publication date: 2008
Other authors: Caseiro, D., Mamede, N., Trancoso, I.
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10071/22063
Abstract: This article presents a study on recovering punctuation marks and capitalization information from European Portuguese broadcast news speech transcriptions. Different approaches were tested for capitalization, both generative and discriminative, using finite state transducers automatically built from language models, and maximum entropy models. Several resources were used, including lexica, written newspaper corpora and speech transcriptions. Finite state transducers produced the best results for written newspaper corpora, but the maximum entropy approach also proved to be a good choice, suitable for the capitalization of speech transcriptions and allowing straightforward on-the-fly capitalization. Evaluation results are presented both for written newspaper corpora and for broadcast news speech transcriptions. The frequency of each punctuation mark in broadcast news (BN) speech transcriptions was analyzed for three languages: English, Spanish and Portuguese. The punctuation task was performed using a maximum entropy modeling approach, which combines different types of information, both lexical and acoustic. The contribution of each feature was analyzed individually, and separate results are given for each focus condition, making it possible to analyze the performance differences between planned and spontaneous speech. All results were evaluated on speech transcriptions of a Portuguese broadcast news corpus. The benefits of enriching speech recognition with punctuation and capitalization are shown in an example illustrating the effects of the described experiments on spoken texts.
id RCAP_d01731226c0c0e187e3f0d430d1564a7
oai_identifier_str oai:repositorio.iscte-iul.pt:10071/22063
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Recovering capitalization and punctuation marks for automatic speech recognition: case study for Portuguese broadcast news
author Batista, F.
dc.contributor.author.fl_str_mv Batista, F.
Caseiro, D.
Mamede, N.
Trancoso, I.
dc.subject.por.fl_str_mv Rich transcription
Punctuation recovery
Sentence boundary detection
Capitalization
Truecasing
Maximum entropy
Language modeling
Weighted finite state transducers
publishDate 2008
dc.date.none.fl_str_mv 2008-01-01T00:00:00Z
2008
2021-02-18T10:47:13Z
2021-02-18T10:45:22Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10071/22063
url http://hdl.handle.net/10071/22063
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 0167-6393
10.1016/j.specom.2008.05.008
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Elsevier
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação