Speaker Diarization using Artificial Intelligence Techniques
Main author: | Rosário, João Miguel Pinto Carrilho do |
---|---|
Publication date: | 2020 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/104277 |
Abstract: | The goal of Speaker Diarization (SD) is to answer the question "Who spoke when?" for a given audio recording in which two or more people take turns speaking. This task is paramount for Automatic Speech Recognition (ASR) applications, as it provides structured data that can improve recognition accuracy. Despite having been investigated for decades, diarization remains an unsolved problem. Current state-of-the-art methods focus either on designing probabilistic models such as Gaussian Mixture Models (GMM), where embeddings are extracted from feature matrices, or on employing Deep Neural Networks such as Recurrent Neural Networks (RNN), which are capable of extracting relevant features to differentiate each speaker and provide richer embeddings. The proposed speaker diarization system relies on three modules. The objective is to implement these so that the final system is able to generalize across different conditions and maintain efficiency. To this end, the first module partitions the input audio into evenly sized utterances and removes any silence that is present. This ensures that the information passed to the next module contains only relevant features of a single speaker. The purpose of the second module is to extract the speaker-specific embeddings. Here, a recurrent neural network with Long Short-Term Memory (LSTM) cells is used. Different experiments, using networks of different sizes, were conducted in order to better understand the benefits of recurrent neural networks. Finally, the third module applies a clustering algorithm to the extracted embeddings. At this stage, a comparison study was performed between four clustering algorithms (Spectral Clustering, DBSCAN, Hierarchical Clustering and K-Means). This provided useful insight into how each algorithm performed when applied to speech data. The results obtained for each individual module and for the system as a whole were satisfactory. The final speaker diarization system achieved a Diarization Error Rate (DER) of 7.44% on a test partition of the VoxCeleb2 dataset. |
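
The abstract reports its result as a Diarization Error Rate. As a rough illustration of what that metric measures, the sketch below computes a simplified frame-level DER: missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. It is not the scoring procedure used in the dissertation. The function name `frame_der`, the per-frame label representation (`None` for non-speech), and the brute-force label mapping are all assumptions made for this example, and overlapping speech and forgiveness collars are ignored.

```python
from collections import Counter
from itertools import permutations

_PAD = object()  # placeholder used when one side has fewer speaker labels


def frame_der(reference, hypothesis):
    """Simplified frame-level DER (hypothetical helper, not the thesis code).

    Both arguments are equal-length sequences with one label per frame;
    None marks a non-speech frame. Hypothesis labels are arbitrary cluster
    ids, so the best one-to-one mapping onto reference speakers is searched
    by brute force (only practical for a handful of speakers).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    pairs = list(zip(reference, hypothesis))
    ref_speech = sum(1 for r, _ in pairs if r is not None)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    missed = sum(1 for r, h in pairs if r is not None and h is None)
    false_alarm = sum(1 for r, h in pairs if r is None and h is not None)

    # Count co-occurrences on frames where both sides contain speech.
    overlap = Counter((r, h) for r, h in pairs if r is not None and h is not None)
    ref_ids = list({r for r, _ in overlap})
    hyp_ids = list({h for _, h in overlap})
    n = max(len(ref_ids), len(hyp_ids))
    ref_ids += [_PAD] * (n - len(ref_ids))
    hyp_ids += [_PAD] * (n - len(hyp_ids))

    # Pick the label assignment that maximises correctly attributed frames.
    best_correct = max(
        sum(overlap.get((r, h), 0) for r, h in zip(ref_ids, perm))
        for perm in permutations(hyp_ids)
    )
    confusion = sum(overlap.values()) - best_correct
    return (missed + false_alarm + confusion) / ref_speech


reference = ["A", "A", "A", "A", None, None, "B", "B", "B", "B"]
hypothesis = [0, 0, 0, 1, None, 1, 1, 1, 1, None]
print(frame_der(reference, hypothesis))  # 0.375
```

In this toy input, one frame is missed, one is a false alarm, and one is attributed to the wrong speaker, out of eight reference speech frames, which gives 3/8. Established scoring tools typically work on timed segments rather than fixed frames and use a more efficient assignment algorithm than the exhaustive search shown here.
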
id |
RCAP_8c68ad983e7acf34dfa0df8f9a8f9964 |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/104277 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Speaker Diarization using Artificial Intelligence Techniques |
author |
Rosário, João Miguel Pinto Carrilho do |
author_facet |
Rosário, João Miguel Pinto Carrilho do |
author_role |
author |
dc.contributor.none.fl_str_mv |
Fonseca, José; Varas Gonzalez, David; RUN |
dc.contributor.author.fl_str_mv |
Rosário, João Miguel Pinto Carrilho do |
dc.subject.por.fl_str_mv |
Speaker Diarization; Machine Learning; Deep Learning; d-vector; Spectral Clustering; LSTM; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática |
topic |
Speaker Diarization; Machine Learning; Deep Learning; d-vector; Spectral Clustering; LSTM; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020-07-06 2020 2020-07-06T00:00:00Z 2023-07-06T00:30:50Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/104277 |
url |
http://hdl.handle.net/10362/104277 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799138017365983232 |