An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset
Main Author: | Alagarsamy, Sandhya |
---|---|
Publication Date: | 2022 |
Other Authors: | James, Visumathi; Raj, Raja Soosaimarian Peter |
Format: | Article |
Language: | eng |
Source: | Brazilian Archives of Biology and Technology |
Download full: | http://old.scielo.br/scielo.php?script=sci_arttext&pid=S1516-89132022000100617 |
Summary: | Abstract: Today, a wealth of data is produced over the internet from multiple sources, giving rise to the term big data, and much of it takes the form of text. This work focuses on text classification of a movie review dataset using Hybrid Word Embedding (HWE) models and on deriving the optimal text classification model. In text processing, efficient handling of the words and sentences in a document plays a vital role. Traditional methods such as Bag of Words (BoW) do not capture semantic correlation among words. Further, the words in a document are not always processed in order, so certain words are not processed at all, which creates data sparsity problems. To overcome the data sparsity problem, the proposed work applies hybrid word embedding using the WordNet repository. The hybrid model is built with three word embedding methods, namely an embedding layer, Word2Vec and GloVe, in combination with a deep learning Convolutional Neural Network (CNN). The results obtained for the movie review dataset were compared and the optimal classification model was identified. The evaluation metrics considered include Log loss, Area Under the Curve (AUC), Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), Mean Absolute Error (MAE), Error Rate (ERR), Matthews Correlation Coefficient (MCC), Training Accuracy, Test Accuracy, Precision, Recall and F1 score. Finally, the experimental results showed that Word2Vec is the optimal hybrid word embedding model for classification of the chosen movie review dataset. |
id |
TECPAR-1_f10e7ad3c8eb0289858dde831ac2330e |
---|---|
oai_identifier_str |
oai:scielo:S1516-89132022000100617 |
network_acronym_str |
TECPAR-1 |
network_name_str |
Brazilian Archives of Biology and Technology |
repository_id_str |
|
spelling |
An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset; Hybrid Word Embedding; Natural Language Processing; Deep Neural Network; Text Classification; CNN; Instituto de Tecnologia do Paraná - Tecpar; 2022-01-01; info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; text/html; http://old.scielo.br/scielo.php?script=sci_arttext&pid=S1516-89132022000100617; Brazilian Archives of Biology and Technology v.65 2022; reponame:Brazilian Archives of Biology and Technology; instname:Instituto de Tecnologia do Paraná (Tecpar); instacron:TECPAR; 10.1590/1678-4324-2022210830; info:eu-repo/semantics/openAccess; Alagarsamy, Sandhya; James, Visumathi; Raj, Raja Soosaimarian Peter; eng; 2022-08-17T00:00:00Z; oai:scielo:S1516-89132022000100617; Revista; https://www.scielo.br/j/babt/; https://old.scielo.br/oai/scielo-oai.php; babt@tecpar.br||babt@tecpar.br; 1678-4324; 1516-8913; opendoar:2022-08-17T00:00; Brazilian Archives of Biology and Technology - Instituto de Tecnologia do Paraná (Tecpar); false |
dc.title.none.fl_str_mv |
An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset |
title |
An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset |
spellingShingle |
An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset; Alagarsamy, Sandhya; Hybrid Word Embedding; Natural Language Processing; Deep Neural Network; Text Classification; CNN |
title_short |
An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset |
title_full |
An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset |
title_fullStr |
An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset |
title_full_unstemmed |
An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset |
title_sort |
An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset |
author |
Alagarsamy, Sandhya |
author_facet |
Alagarsamy, Sandhya; James, Visumathi; Raj, Raja Soosaimarian Peter |
author_role |
author |
author2 |
James, Visumathi; Raj, Raja Soosaimarian Peter |
author2_role |
author; author |
dc.contributor.author.fl_str_mv |
Alagarsamy, Sandhya; James, Visumathi; Raj, Raja Soosaimarian Peter |
dc.subject.por.fl_str_mv |
Hybrid Word Embedding; Natural Language Processing; Deep Neural Network; Text Classification; CNN |
topic |
Hybrid Word Embedding; Natural Language Processing; Deep Neural Network; Text Classification; CNN |
description |
Abstract: Today, a wealth of data is produced over the internet from multiple sources, giving rise to the term big data, and much of it takes the form of text. This work focuses on text classification of a movie review dataset using Hybrid Word Embedding (HWE) models and on deriving the optimal text classification model. In text processing, efficient handling of the words and sentences in a document plays a vital role. Traditional methods such as Bag of Words (BoW) do not capture semantic correlation among words. Further, the words in a document are not always processed in order, so certain words are not processed at all, which creates data sparsity problems. To overcome the data sparsity problem, the proposed work applies hybrid word embedding using the WordNet repository. The hybrid model is built with three word embedding methods, namely an embedding layer, Word2Vec and GloVe, in combination with a deep learning Convolutional Neural Network (CNN). The results obtained for the movie review dataset were compared and the optimal classification model was identified. The evaluation metrics considered include Log loss, Area Under the Curve (AUC), Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), Mean Absolute Error (MAE), Error Rate (ERR), Matthews Correlation Coefficient (MCC), Training Accuracy, Test Accuracy, Precision, Recall and F1 score. Finally, the experimental results showed that Word2Vec is the optimal hybrid word embedding model for classification of the chosen movie review dataset. |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-01-01 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://old.scielo.br/scielo.php?script=sci_arttext&pid=S1516-89132022000100617 |
url |
http://old.scielo.br/scielo.php?script=sci_arttext&pid=S1516-89132022000100617 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
10.1590/1678-4324-2022210830 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
text/html |
dc.publisher.none.fl_str_mv |
Instituto de Tecnologia do Paraná - Tecpar |
publisher.none.fl_str_mv |
Instituto de Tecnologia do Paraná - Tecpar |
dc.source.none.fl_str_mv |
Brazilian Archives of Biology and Technology v.65 2022; reponame:Brazilian Archives of Biology and Technology; instname:Instituto de Tecnologia do Paraná (Tecpar); instacron:TECPAR |
instname_str |
Instituto de Tecnologia do Paraná (Tecpar) |
instacron_str |
TECPAR |
institution |
TECPAR |
reponame_str |
Brazilian Archives of Biology and Technology |
collection |
Brazilian Archives of Biology and Technology |
repository.name.fl_str_mv |
Brazilian Archives of Biology and Technology - Instituto de Tecnologia do Paraná (Tecpar) |
repository.mail.fl_str_mv |
babt@tecpar.br||babt@tecpar.br |
_version_ |
1750318281679437824 |