A method for lexical tone classification in audio-visual speech

Bibliographic details
Lead author: João Vítor Possamai de Menezes
Publication date: 2020
Other authors: Maria Mendes Cantoni, Denis Burnham, Adriano Vilela Barbosa
Document type: Article
Language: English (eng)
Source title: Repositório Institucional da UFMG
Full text: https://doi.org/10.20396/joss.v9i00.14960
http://hdl.handle.net/1843/49361
Author ORCID iDs: http://orcid.org/0000-0002-7612-9754
https://orcid.org/0000-0001-9515-1802
http://orcid.org/0000-0002-1980-3458
http://orcid.org/0000-0003-1083-8256
Abstract: This work presents a method for lexical tone classification in audio-visual speech. The method is applied to a speech data set consisting of syllables and words produced by a female native speaker of Cantonese. The data were recorded in an audio-visual speech production experiment. The visual component of speech was measured by tracking the positions of active markers placed on the speaker's face, whereas the acoustic component was measured with an ordinary microphone. A pitch tracking algorithm is used to estimate F0 from the acoustic signal. A head motion compensation procedure is applied to the tracked marker positions in order to separate the head and face motion components. The data are then organized into four signal groups: F0, Face, Head, Face+Head. The signals in each group are parameterized by means of a polynomial approximation and then used to train an LDA (Linear Discriminant Analysis) classifier that maps the input signals onto the output classes (the lexical tones of the language). One classifier is trained for each signal group. The ability of each signal group to predict the correct lexical tones was assessed by the accuracy of the corresponding LDA classifier, estimated by k-fold cross-validation. The classifiers for all signal groups performed above chance, with F0 achieving the highest accuracy, followed by Face+Head, Face, and Head, in that order. The differences in performance between all signal groups were statistically significant.
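
The pipeline summarized in the abstract (polynomial parameterization of each time-varying signal, one LDA classifier per signal group, accuracy estimated by k-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the polynomial degree, the placeholder signals and tone labels, and the use of NumPy and scikit-learn are assumptions made only for this example.

    # Minimal sketch (not the authors' code) of the classification pipeline
    # described in the abstract: each signal is reduced to the coefficients
    # of a low-order polynomial fit, and the coefficient vectors are
    # classified with LDA, scored by k-fold cross-validation. The data,
    # polynomial degree, and NumPy/scikit-learn stack are assumptions.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    def poly_features(signal, degree=3):
        """Return the polynomial coefficients approximating one signal."""
        t = np.linspace(0.0, 1.0, len(signal))    # normalized time axis
        return np.polyfit(t, signal, deg=degree)  # degree + 1 coefficients

    # Placeholder data: 40 random "contours" from two hypothetical tone classes.
    rng = np.random.default_rng(0)
    signals = [rng.standard_normal(100).cumsum() for _ in range(40)]
    tones = np.array([0, 1] * 20)

    X = np.vstack([poly_features(s) for s in signals])  # one row per token
    clf = LinearDiscriminantAnalysis()
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, tones, cv=cv)      # per-fold accuracy
    print(f"mean accuracy: {scores.mean():.2f}")

In the study itself, one such classifier would be trained per signal group (F0, Face, Head, Face+Head) and the cross-validated accuracies compared across groups.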
Keywords: Multimodal speech; Lexical tone; Cantonese language; Statistical learning; Linear discriminant analysis; Speech (Fala)
Journal: Journal of Speech Sciences
ISSN: 2236-9740
Article page: https://econtents.bc.unicamp.br/inpec/index.php/joss/article/view/14960
Publisher: Universidade Federal de Minas Gerais (UFMG), Brasil - FALE - FACULDADE DE LETRAS
Version: Published version
Access rights: Open access
Format: application/pdf