Controle de estilo na síntese de voz em português brasileiro usando redes neurais profundas

Bibliographic details
Main author: Tunnermann, Daniel
Publication date: 2021
Document type: Master's dissertation
Language: por
Source title: Repositório Institucional da UFG
Full text: http://repositorio.bc.ufg.br/tede/handle/tede/12724
Abstract: The popularization of computer programs capable of emulating a dialogue between machines and people, known as chatbots, has driven the development of human-computer interface solutions. In this context, there is significant demand for conversational voice interfaces that include, at a minimum, the ability of the machine to understand words and synthesize speech. The use of neural networks has led to a new state of the art in speech synthesis. Mean Opinion Score (MOS) tests show that speech synthesized by this method has quality similar to speech recorded in a studio by humans. Even with this quality, these methods have difficulty reproducing the various ways of speaking the same text to convey information that goes beyond the content, such as emotion, intensity, speed, and emphasis. Therefore, new models have been developed to control the style of the generated speech and to transfer style from one audio segment to others. Despite these recent advances, existing studies concentrate on the synthesis of texts in English or Mandarin. The application of style control methods to produce variations in Brazilian Portuguese is scarce or non-existent. The research presented here developed a neural network architecture for speech synthesis in Brazilian Portuguese capable of controlling the style of the synthesized speech. This control allows changes in pitch and speaking rate. In MOS evaluation, the constructed model obtained 4.1 on a scale from 1 (Poor) to 5 (Excellent), supporting the subjective assessment that the synthesized audio is of good quality. Examples of audio generated by the developed models are available at shorturl.at/etFJP and https://mrfalante.com.br/sobre. Real-time synthesis using models resulting from this research can be performed at https://cybervox.ai.
id UFG-2_fa787c75d800ee75309e8084afd218dc
oai_identifier_str oai:repositorio.bc.ufg.br:tede/12724
network_acronym_str UFG-2
network_name_str Repositório Institucional da UFG
repository_id_str
provenance Submitted by Marlene Santos (marlene.bc.ufg@gmail.com) on 2023-04-03T19:22:35Z. Approved for entry into archive by Luciana Ferreira (lucgeral@gmail.com) on 2023-04-04T11:01:27Z (GMT). Made available in DSpace on 2023-04-04T11:01:27Z (GMT). Previous issue date: 2021-08-26. No. of bitstreams: 2. Dissertação - Daniel Tunnermann - 2021.pdf: 2429803 bytes, checksum: 4242667c233ba237068b5060d827927b (MD5); license_rdf: 805 bytes, checksum: 4460e5956bc1d1639be9ae6146a50347 (MD5).
dc.title.pt_BR.fl_str_mv Controle de estilo na síntese de voz em português brasileiro usando redes neurais profundas
dc.title.alternative.eng.fl_str_mv Speech synthesis with style control in Brazilian Portuguese using neural networks
title Controle de estilo na síntese de voz em português brasileiro usando redes neurais profundas
author Tunnermann, Daniel
author_facet Tunnermann, Daniel
author_role author
dc.contributor.advisor1.fl_str_mv Soares, Anderson da Silva
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/1096941114079527
dc.contributor.referee1.fl_str_mv Soares, Anderson da Silva
dc.contributor.referee2.fl_str_mv Galvão Filho, Arlindo Rodrigues
dc.contributor.referee3.fl_str_mv Gonçalves, Cristhiane
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/7894945584957831
dc.contributor.author.fl_str_mv Tunnermann, Daniel
contributor_str_mv Soares, Anderson da Silva
Soares, Anderson da Silva
Galvão Filho, Arlindo Rodrigues
Gonçalves, Cristhiane
dc.subject.por.fl_str_mv Síntese de voz
Text-to-speech
Transferência de estilo
Redes neurais
dc.subject.eng.fl_str_mv Speech synthesis
Text-to-speech
Style transfer
Neural networks
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
publishDate 2021
dc.date.issued.fl_str_mv 2021-08-26
dc.date.accessioned.fl_str_mv 2023-04-04T11:01:27Z
dc.date.available.fl_str_mv 2023-04-04T11:01:27Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv TUNNERMANN, Daniel. Controle de estilo na síntese de voz em português brasileiro usando redes neurais profundas. 2021. 50 f. Dissertação (Mestrado em Ciência da Computação) - Universidade Federal de Goiás, Goiânia, 2021.
dc.identifier.uri.fl_str_mv http://repositorio.bc.ufg.br/tede/handle/tede/12724
identifier_str_mv TUNNERMANN, Daniel. Controle de estilo na síntese de voz em português brasileiro usando redes neurais profundas. 2021. 50 f. Dissertação (Mestrado em Ciência da Computação) - Universidade Federal de Goiás, Goiânia, 2021.
url http://repositorio.bc.ufg.br/tede/handle/tede/12724
dc.language.iso.fl_str_mv por
language por
dc.relation.program.fl_str_mv 20
dc.relation.confidence.fl_str_mv 500
500
500
500
dc.relation.department.fl_str_mv 26
dc.relation.cnpq.fl_str_mv 125
dc.relation.sponsorship.fl_str_mv 5
dc.rights.driver.fl_str_mv Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Goiás
dc.publisher.program.fl_str_mv Programa de Pós-graduação em Ciência da Computação (INF)
dc.publisher.initials.fl_str_mv UFG
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv Instituto de Informática - INF (RMG)
publisher.none.fl_str_mv Universidade Federal de Goiás
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFG
instname:Universidade Federal de Goiás (UFG)
instacron:UFG
instname_str Universidade Federal de Goiás (UFG)
instacron_str UFG
institution UFG
reponame_str Repositório Institucional da UFG
collection Repositório Institucional da UFG
bitstream.url.fl_str_mv http://repositorio.bc.ufg.br/tede/bitstreams/2d871467-c58a-4fa5-83c2-ea03b87ab715/download
http://repositorio.bc.ufg.br/tede/bitstreams/929fca6f-93c4-43c4-8355-9c7fa5a88543/download
http://repositorio.bc.ufg.br/tede/bitstreams/a54f3de1-c41b-4527-b4f7-1e1b2690c8fe/download
bitstream.checksum.fl_str_mv 8a4605be74aa9ea9d79846c1fba20a33
4460e5956bc1d1639be9ae6146a50347
4242667c233ba237068b5060d827927b
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFG - Universidade Federal de Goiás (UFG)
repository.mail.fl_str_mv tasesdissertacoes.bc@ufg.br
_version_ 1798044391722450944