An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset

Bibliographic details
Main author: Alagarsamy, Sandhya
Publication date: 2022
Other authors: James, Visumathi; Raj, Raja Soosaimarian Peter
Document type: Article
Language: eng
Source title: Brazilian Archives of Biology and Technology
Full text: http://old.scielo.br/scielo.php?script=sci_arttext&pid=S1516-89132022000100617
Abstract: Today, a wealth of data is produced over the internet from multiple sources, giving rise to the term big data. Much of this big data is in the form of text. This work focuses on text classification of a movie review dataset using Hybrid Word Embedding (HWE) models and on deriving the optimal text classification model. In text processing, efficient handling of the words and sentences in a document plays a vital role. Traditional methods such as Bag of Words (BoW) do not capture semantic correlation among words. Further, the words in a document are not always processed in order, so certain words are not processed at all, which creates data sparsity problems. To overcome the data sparsity problem, the proposed work applies hybrid word embedding using the WordNet repository. The hybrid model is built with three word embedding methods, namely an embedding layer, Word2Vec, and GloVe, in combination with a deep learning Convolutional Neural Network (CNN). The results obtained for the movie review dataset were compared and the optimal classification model was identified. The evaluation metrics include Log Loss, Area Under the Curve (AUC), Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), Mean Absolute Error (MAE), Error Rate (ERR), Matthews Correlation Coefficient (MCC), Training Accuracy, Test Accuracy, Precision, Recall, and F1 score. Finally, the experimental results showed that Word2Vec is the optimal hybrid word embedding model for classifying the chosen movie review dataset.
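Several of the metrics named in the abstract (Precision, Recall, F1 score, MCC, Log Loss) follow directly from a binary confusion matrix and predicted probabilities. The sketch below illustrates those standard definitions only; it is not the authors' evaluation code, and the labels, probabilities, 0.5 threshold, and function names are hypothetical values chosen for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 marks a positive review."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical labels and predicted probabilities (not taken from the paper).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
loss = log_loss(y_true, y_prob)
print(precision, recall, f1, mcc, loss)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy for classification tasks.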
id TECPAR-1_f10e7ad3c8eb0289858dde831ac2330e
oai_identifier_str oai:scielo:S1516-89132022000100617
network_acronym_str TECPAR-1
network_name_str Brazilian Archives of Biology and Technology
repository_id_str
spelling Journal: https://www.scielo.br/j/babt/ | OAI endpoint: https://old.scielo.br/oai/scielo-oai.php | ISSN: 1678-4324, 1516-8913 | Record date: 2022-08-17T00:00:00Z | opendoar:2022-08-17T00:00
dc.subject.por.fl_str_mv Hybrid Word Embedding
Natural Language Processing
Deep Neural Network
Text Classification
CNN
publishDate 2022
dc.date.none.fl_str_mv 2022-01-01
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.relation.none.fl_str_mv 10.1590/1678-4324-2022210830
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv text/html
dc.publisher.none.fl_str_mv Instituto de Tecnologia do Paraná - Tecpar
dc.source.none.fl_str_mv Brazilian Archives of Biology and Technology v.65 2022
reponame:Brazilian Archives of Biology and Technology
instname:Instituto de Tecnologia do Paraná (Tecpar)
instacron:TECPAR
repository.name.fl_str_mv Brazilian Archives of Biology and Technology - Instituto de Tecnologia do Paraná (Tecpar)
repository.mail.fl_str_mv babt@tecpar.br
_version_ 1750318281679437824