Classification of opinionated texts by analogy

Bibliographic details
Main author: Pais, Sebastião
Publication date: 2008
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10400.6/3714
Abstract: With the rapid growth of the World Wide Web and of the number and availability of information services, documents of every kind have accumulated in excessive quantities. Despite the benefits and potential this brings, a new problem arises: we need tools and methodologies capable of classifying a document according to its quality. Assessing the quality of a Web page is not easy. Many works have addressed the technical evaluation of the structure of Web pages. This thesis follows a different course: it seeks to evaluate the content of pages according to the opinions and feelings they express. The basic criterion adopted to assess the quality of Web pages is the absence of opinions and feelings in their texts. When we consult information from the Web, how do we know that it is reliable and does not merely convey the opinions or feelings of whoever made it available to the public? How can we ensure, when we read a text, that we are not being misled by an author who is expressing his opinion or, once again, his feelings? How can we ensure that our own assessment is free from any value judgment we may hold? Because of these questions, the area known as "Opinion Mining", "Opinion Retrieval", or "Sentiment Analysis" is worth investigating, as we believe there is still much to discover. After extensive research and reading, we concluded that we did not want to follow the methodology proposed so far by other researchers. Essentially, they work with manually annotated objective and subjective corpora. We consider this a disadvantage because such corpora are limited: they are small and cover a restricted number of subjects. We also disagree on another point: some researchers use only one or a few morphological classes, or specific words, as predefined attributes.
As we want to identify the degree of objectivity/subjectivity of sentences rather than documents, the more attributes we have, the more accurate we expect our classification to be. A further innovation is to make our method as automatic as possible or, at least, as little supervised as possible. Having assessed some gaps in the area, we define our line of intervention for this dissertation. As already mentioned, the corpora used in the area of opinions are, as a rule, manually annotated and not very comprehensive. To tackle this problem, we propose to replace these corpora with texts taken from Wikipedia and texts extracted from Weblogs, which are accessible to any researcher in the area. Thus, Wikipedia represents objective texts and Weblogs (which we can regard as a repository of opinions) represent subjective texts. These new corpora bring great advantages: they are obtained automatically, they are not manually annotated, they can be built at any time, and they are very comprehensive. To support the claim that Wikipedia may represent objective texts and Weblogs may represent subjective texts, we assess their similarity, at various morphological levels, to manually annotated objective/subjective corpora. To evaluate this similarity, we use two different methodologies, the Rocchio Method and the Language Model, on a cross-validation basis. These two methodologies yield similar results, which confirm our hypothesis. Following the success of this step, we propose to automatically classify sentences (at various morphological levels) by analogy. At this stage, we use different SVM classifiers, with training and test sets built over several corpora on a cross-validation basis, so as to obtain, once again, several results to compare before drawing our final conclusions.
This new concept of assessing the quality of a Web page through the absence of opinions offers the scientific community another line of research in the area of opinions. Users in general also benefit: when consulting a Web page or using a search engine, they can know with some certainty whether the information is true or merely a set of opinions/sentiments expressed by the authors, thus setting aside their own value judgments about what they see.
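Illustrative note: the abstract names the Rocchio Method as one way of measuring how close a text is to the objective or subjective corpora. The following is a minimal sketch of a Rocchio-style (nearest-centroid) classifier, not the implementation used in the thesis. The bag-of-words features, the cosine similarity, the function names, and the four toy sentences are all assumptions made for illustration; the record does not specify the features or data actually used.

```python
import math
from collections import Counter


def tf_vector(text):
    """L2-normalised bag-of-words term-frequency vector (assumed feature set)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def centroid(texts):
    """Mean of the normalised vectors of one class: the Rocchio prototype."""
    total = Counter()
    for t in texts:
        total.update(tf_vector(t))  # Counter.update adds the float weights
    return {w: v / len(texts) for w, v in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rocchio_classify(sentence, prototypes):
    """Return the label whose prototype is most similar to the sentence."""
    vec = tf_vector(sentence)
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))


# Hypothetical toy corpora standing in for Wikipedia (objective)
# and Weblog (subjective) sentences.
objective = [
    "the river flows through the city",
    "the bridge was built in 1990",
]
subjective = [
    "i love this wonderful city",
    "i think the bridge is ugly",
]

prototypes = {
    "objective": centroid(objective),
    "subjective": centroid(subjective),
}

label_a = rocchio_classify("i think the river is lovely", prototypes)
label_b = rocchio_classify("the bridge was built in the city", prototypes)
print(label_a, label_b)
```

In this toy setting each class is summarised by a single prototype vector, so a new sentence is labelled by comparing it with two centroids only. A real experiment along the lines of the abstract would build the prototypes from large corpora and evaluate them by cross-validation.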
id RCAP_cf0ac4c52b7c1273fc5fb1300b2c499f
oai_identifier_str oai:ubibliorum.ubi.pt:10400.6/3714
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Classification of opinionated texts by analogy
author Pais, Sebastião
author_facet Pais, Sebastião
author_role author
dc.contributor.none.fl_str_mv Dias, Gaël Harry Adélio André
uBibliorum
dc.contributor.author.fl_str_mv Pais, Sebastião
dc.subject.por.fl_str_mv Web pages - Quality assessment
Web pages - Content assessment
Natural language - Processing
Opinion mining
Opinion retrieval
Sentiment analysis
Information retrieval - Web - Evaluation
publishDate 2008
dc.date.none.fl_str_mv 2008
2008-08
2008-01-01T00:00:00Z
2015-07-14T16:21:41Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.6/3714
url http://hdl.handle.net/10400.6/3714
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799136347453128704