Speaker Diarization using Artificial Intelligence Techniques

Rosário, João Miguel Pinto Carrilho do


Bibliographic details
Main author:	Rosário, João Miguel Pinto Carrilho do
Publication date:	2020
Document type:	Dissertation
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10362/104277
Abstract:	The goal of Speaker Diarization (SD) is to answer the question "Who spoke when?" for an audio recording in which two or more people take turns speaking. The task is important for Automatic Speech Recognition (ASR) applications because it provides structured data that can improve recognition accuracy. Despite decades of research, diarization remains an unsolved problem. Current state-of-the-art methods either design probabilistic models such as Gaussian Mixture Models (GMM), in which embeddings are extracted from feature matrices, or employ deep neural networks such as Recurrent Neural Networks (RNN), which can extract the features needed to differentiate speakers and so provide richer embeddings. The proposed Speaker Diarization system relies on three modules, implemented so that the final system generalizes across different conditions and remains efficient. The first module partitions the input audio into evenly sized utterances and removes any silence. This ensures that the information passed to the next module contains only relevant features of a single speaker. The second module extracts the speaker-specific embeddings using a recurrent neural network with Long Short-Term Memory (LSTM) cells. Experiments with networks of different sizes were conducted to better understand the benefits of recurrent neural networks. Finally, the third module applies a clustering algorithm to the extracted embeddings. At this stage, four clustering algorithms (Spectral Clustering, DBSCAN, Hierarchical Clustering and K-Means) were compared, which provided useful insight into how each algorithm performs on speech data. The results obtained for each individual module and for the system as a whole were satisfactory. The final Speaker Diarization system achieved a Diarization Error Rate (DER) of 7.44% on a test partition of the VoxCeleb2 dataset.
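
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of a frame-level Diarization Error Rate. It is an illustration only, not the scoring procedure used in the dissertation: the record does not say whether a collar or overlap handling was applied. The function name `frame_der`, the use of `None` for non-speech frames, and the brute-force label mapping are all assumptions made for this example.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER sketch (hypothetical helper, not the thesis code).

    Both inputs are equal-length lists with one label per frame.
    A label of None marks a non-speech frame.
    DER = (missed + false alarm + confusion) / reference speech frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    pairs = list(zip(reference, hypothesis))
    ref_speech = sum(1 for r, _ in pairs if r is not None)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    # Speech in the reference that the system labelled as non-speech.
    missed = sum(1 for r, h in pairs if r is not None and h is None)
    # Non-speech in the reference that the system labelled as speech.
    false_alarm = sum(1 for r, h in pairs if r is None and h is not None)

    # Frames where both sides contain speech; confusion is counted here.
    both = [(r, h) for r, h in pairs if r is not None and h is not None]
    ref_labels = sorted({r for r, _ in both})
    hyp_labels = sorted({h for _, h in both})

    # Cluster labels are arbitrary, so search for the one-to-one mapping
    # from hypothesis labels to reference labels with the most agreement.
    # Brute force is only practical for a handful of speakers.
    padding = [None] * max(0, len(hyp_labels) - len(ref_labels))
    targets = ref_labels + padding
    best_correct = 0
    for perm in permutations(targets, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(1 for r, h in both if mapping[h] == r)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / ref_speech


if __name__ == "__main__":
    reference = ["A", "A", "A", "B", "B", None, "B", "B"]
    hypothesis = ["s1", "s1", "s2", "s2", "s2", None, None, "s2"]
    print(f"DER = {frame_der(reference, hypothesis):.4f}")
```

In this toy input, one frame is missed and one is attributed to the wrong speaker out of seven reference speech frames, so the sketch reports 2/7. Production scoring tools typically replace the brute-force search with the Hungarian algorithm and operate on time segments rather than frames.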

Item metadata

id	RCAP_8c68ad983e7acf34dfa0df8f9a8f9964
oai_identifier_str	oai:run.unl.pt:10362/104277
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Speaker Diarization using Artificial Intelligence Techniques
author	Rosário, João Miguel Pinto Carrilho do
author_facet	Rosário, João Miguel Pinto Carrilho do
author_role	author
dc.contributor.none.fl_str_mv	Fonseca, José; Varas Gonzalez, David; RUN
dc.contributor.author.fl_str_mv	Rosário, João Miguel Pinto Carrilho do
dc.subject.por.fl_str_mv	Speaker Diarization; Machine Learning; Deep Learning; d-vector; Spectral Clustering; LSTM; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
topic	Speaker Diarization; Machine Learning; Deep Learning; d-vector; Spectral Clustering; LSTM; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
publishDate	2020
dc.date.none.fl_str_mv	2020-07-06 2020 2020-07-06T00:00:00Z 2023-07-06T00:30:50Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10362/104277
url	http://hdl.handle.net/10362/104277
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799138017365983232
