Speaker Diarization using Artificial Intelligence Techniques

Bibliographic details
Main author: Rosário, João Miguel Pinto Carrilho do
Publication date: 2020
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/104277
Abstract: The goal of Speaker Diarization (SD) is to answer the question "Who spoke when?" for an audio recording in which two or more people take turns speaking. This task is important for Automatic Speech Recognition (ASR) applications, as it provides structured data that can improve recognition accuracy. Despite decades of investigation, diarization remains an unsolved problem. Current state-of-the-art methods either design probabilistic models such as Gaussian Mixture Models (GMM), where embeddings are extracted from feature matrices, or employ Deep Neural Networks such as Recurrent Neural Networks (RNN), which can extract the features needed to differentiate speakers and provide richer embeddings. The proposed Speaker Diarization system relies on three modules, implemented so that the final system generalizes across different conditions and remains efficient. The first module partitions the input audio into utterances of equal length and removes any silence, so that the information passed to the next module contains only relevant features of a single speaker. The second module extracts the speaker-specific embeddings using a recurrent neural network with Long Short-Term Memory (LSTM) cells. Experiments with networks of different sizes were conducted to better understand the benefits of recurrent neural networks. Finally, the third module applies a clustering algorithm to the extracted embeddings. At this stage, four clustering algorithms (Spectral Clustering, DBSCAN, Hierarchical Clustering and K-Means) were compared, which provided useful insight into how each algorithm performs on speech data. The results obtained for each individual module and for the system as a whole were satisfactory. The final Speaker Diarization system achieved a Diarization Error Rate (DER) of 7.44% on a test partition of the VoxCeleb2 dataset.
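Note on the reported metric: DER is commonly defined as the sum of missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. The sketch below illustrates that definition at frame level, assuming one label per fixed-length frame, at most one speaker per frame, and None for non-speech. The function name and the label format are hypothetical, and this record does not state the scoring protocol used in the thesis (for example a collar or overlap handling), so this is an illustration of the metric rather than the author's evaluation code.

```python
from itertools import permutations


def diarization_error_rate(reference, hypothesis):
    """Frame-level DER sketch (hypothetical helper, not the thesis code).

    reference, hypothesis: equal-length lists with one speaker label per
    frame; None marks a non-speech frame.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")

    total_speech = sum(r is not None for r in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech")

    pairs = list(zip(reference, hypothesis))
    # Speech in the reference that the system labelled as non-speech.
    missed = sum(r is not None and h is None for r, h in pairs)
    # Non-speech in the reference that the system labelled as speech.
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    # Frames where both sides contain speech; errors here are confusions.
    both = [(r, h) for r, h in pairs if r is not None and h is not None]

    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})

    # Hypothesis labels are arbitrary, so search every one-to-one mapping
    # from hypothesis speakers to reference speakers and keep the best.
    # Padding with None lets surplus hypothesis speakers stay unmatched.
    padded = ref_speakers + [None] * max(0, len(hyp_speakers) - len(ref_speakers))
    best_correct = 0
    for perm in permutations(padded, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / total_speech


reference = ["A", "A", "A", "B", "B", None, "B", "B"]
hypothesis = ["x", "x", "y", "y", "y", "y", None, "y"]
der = diarization_error_rate(reference, hypothesis)
print(f"DER = {der:.2%}")
```

The exhaustive search over mappings is only practical for a handful of speakers; evaluation toolkits typically solve the same assignment with the Hungarian algorithm and score time intervals rather than frames.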
id RCAP_8c68ad983e7acf34dfa0df8f9a8f9964
oai_identifier_str oai:run.unl.pt:10362/104277
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Speaker Diarization using Artificial Intelligence Techniques
dc.contributor.none.fl_str_mv Fonseca, José
Varas Gonzalez, David
RUN
dc.contributor.author.fl_str_mv Rosário, João Miguel Pinto Carrilho do
dc.subject.por.fl_str_mv Speaker Diarization
Machine Learning
Deep Learning
d-vector
Spectral Clustering
LSTM
Domain/Scientific Area::Engineering and Technology::Electrical, Electronic and Computer Engineering
publishDate 2020
dc.date.none.fl_str_mv 2020-07-06
2020
2020-07-06T00:00:00Z
2023-07-06T00:30:50Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/104277
url http://hdl.handle.net/10362/104277
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799138017365983232