Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing

Bibliographic details
Main author: Roy, Bhupendra
Publication date: 2020
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/101187
Abstract: Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
id RCAP_7eb4c49a9b4e34429d7e20139647b8d5
oai_identifier_str oai:run.unl.pt:10362/101187
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing
Online deception
Deep Learning
Natural Language Processing
Neural Network
Logistic Regression
Naïve Bayes
Support Vector Machine
Random Forest
Extreme Gradient Boosting
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
Customers increasingly rate, review and research products online (Jansen 2010). Consequently, websites containing consumer reviews are becoming targets of opinion spam. Nowadays, people are paid to write fake positive reviews online in order to mislead customers and boost sales revenue. People are also paid to pose as customers and post fake negative reviews with the aim of damaging competitors. These practices have become a menace on social media and often leave customers confused. In this study, we explore multiple aspects of deception classification. We apply four kinds of Natural Language Processing treatment to the input reviews: lemmatization, stemming, POS tagging, and a mix of lemmatization and POS tagging. We also examine how each of these inputs responds to different machine learning models: Logistic Regression, Naïve Bayes, Support Vector Machine, Random Forest, Extreme Gradient Boosting and a Deep Learning Neural Network. We use the gold-standard hotel reviews dataset created by (Ott, Choi, et al. 2011) and (Ott, Cardie and Hancock, Negative Deceptive Opinion Spam 2013), together with the restaurant reviews dataset and the doctors' reviews dataset used by (Li, et al. 2014). We explore the usability of these models within the same domain as well as across different domains.
We trained our models on 75% of the hotel reviews dataset and checked classification accuracy both on the same domain (the remaining 25% of unseen hotel reviews) and on different domains (unseen restaurant reviews and unseen doctors' reviews). We did this to build a robust model that can be applied within the same domain and across different domains. The best accuracy we achieved on the hotel test set was 91%, using the Deep Learning Neural Network. Logistic Regression, Support Vector Machine and Random Forest produced results similar to those of the neural network. Naïve Bayes also had similar accuracy; however, its cross-domain accuracy was more volatile. Extreme Gradient Boosting had the weakest accuracy of all the models we explored. Our results are comparable to, and at times exceed, those reported by other researchers. Additionally, we explored the various models (Logistic Regression, Naïve Bayes, Support Vector Machine, Random Forest, Extreme Gradient Boosting, Neural Network) vis-à-vis the various input transformation methods using Natural Language Processing (lemmatized unigrams, stemming, POS tagging, and a mix of lemmatization and POS tagging).
Bação, Fernando José Ferreira Lucas
RUN
Roy, Bhupendra
2020-07-21T15:08:05Z
2020-06-04
2020-06-04T00:00:00Z
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
application/pdf
http://hdl.handle.net/10362/101187
TID:202501906
eng
info:eu-repo/semantics/openAccess
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
2024-03-11T04:47:24Z
oai:run.unl.pt:10362/101187
Portal Agregador
ONG
https://www.rcaap.pt/oai/openaire
opendoar:7160
2024-03-20T03:39:30.262993
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
false
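The evaluation protocol described in the abstract (train on 75% of the hotel reviews, then test on the held-out 25% and on reviews from another domain) can be illustrated with a minimal sketch. This is not the dissertation's implementation. It assumes a plain multinomial Naïve Bayes classifier written with the Python standard library only, a punctuation-stripping tokenizer as a stand-in for the lemmatized-unigram step, and a handful of invented placeholder reviews in place of the real datasets. All function and variable names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase unigrams; a simple stand-in for lemmatization (assumption)."""
    tokens = (tok.strip(".,!?;:\"'()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def split_75_25(examples, seed=0):
    """Shuffle (text, label) pairs and split them 75% train / 25% test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    return shuffled[:cut], shuffled[cut:]

def train_naive_bayes(examples):
    """Count classes and per-class unigrams for multinomial Naive Bayes."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # skip words never seen during training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, examples):
    """Fraction of (text, label) pairs classified correctly."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)

# Invented placeholder reviews, NOT taken from the datasets cited in the abstract.
hotel_reviews = [
    ("Amazing luxury hotel, my husband and I loved everything!", "deceptive"),
    ("The most amazing stay of my life, truly perfect experience.", "deceptive"),
    ("Absolutely perfect luxury experience, will return!", "deceptive"),
    ("I loved this amazing hotel, everything was perfect.", "deceptive"),
    ("Room was small and the elevator was slow.", "truthful"),
    ("Check-in took 20 minutes; the bathroom was clean.", "truthful"),
    ("Street noise at night, but the location was close to the metro.", "truthful"),
    ("Small room on the 4th floor, breakfast was fine.", "truthful"),
]
restaurant_reviews = [
    ("Amazing food, perfect experience, loved everything!", "deceptive"),
    ("The table was small and service was slow.", "truthful"),
]

train, test = split_75_25(hotel_reviews)
model = train_naive_bayes(train)
print("same-domain accuracy (hotel test split):", accuracy(model, test))
print("cross-domain accuracy (restaurants):", accuracy(model, restaurant_reviews))
```

With data this small the printed accuracies carry no meaning; the sketch only shows where the same-domain and cross-domain scores come from. Swapping in stemming or POS-tag features would change only `tokenize`, which is why the input treatment and the classifier are kept separate here.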
dc.title.none.fl_str_mv Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing
title Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing
spellingShingle Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing
Roy, Bhupendra
Online deception
Deep Learning
Natural Language Processing
Neural Network
Logistic Regression
Naïve Bayes
Support Vector Machine
Random Forest
Extreme Gradient Boosting
title_short Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing
title_full Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing
title_fullStr Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing
title_full_unstemmed Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing
title_sort Identifying Deception in Online Reviews: Application of Machine Learning, Deep Learning and Natural Language Processing
author Roy, Bhupendra
author_facet Roy, Bhupendra
author_role author
dc.contributor.none.fl_str_mv Bação, Fernando José Ferreira Lucas
RUN
dc.contributor.author.fl_str_mv Roy, Bhupendra
dc.subject.por.fl_str_mv Online deception
Deep Learning
Natural Language Processing
Neural Network
Logistic Regression
Naïve Bayes
Support Vector Machine
Random Forest
Extreme Gradient Boosting
topic Online deception
Deep Learning
Natural Language Processing
Neural Network
Logistic Regression
Naïve Bayes
Support Vector Machine
Random Forest
Extreme Gradient Boosting
description Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
publishDate 2020
dc.date.none.fl_str_mv 2020-07-21T15:08:05Z
2020-06-04
2020-06-04T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/101187
TID:202501906
url http://hdl.handle.net/10362/101187
identifier_str_mv TID:202501906
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799138011482423296