Using linguistic information to classify Portuguese text documents

Bibliographic details
Main author: Teresa, Gonçalves
Publication date: 2008
Other authors: Paulo, Quaresma
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10174/1410
Abstract: This paper examines the role of various linguistic structures in text classification, applying the study to the Portuguese language. Besides using a bag-of-words representation, where we evaluate different measures and use linguistic knowledge for term selection, we run several experiments using syntactic information, representing documents as strings of words and strings of syntactic parse trees. To build the classifier we use the Support Vector Machine (SVM) algorithm, which is known to produce good results on text classification tasks, and apply the study to a dataset of articles from the Público newspaper. The results show that sentences' syntactic structure is not useful for text classification (as initially expected), but part-of-speech information can be used as a term selection technique to construct the bag-of-words representation of documents.
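The term selection idea in the abstract can be illustrated with a minimal sketch. It assumes documents have already been part-of-speech tagged; the toy sentences, the tag names (DET, NOUN, VERB) and the choice to keep only nouns and verbs are hypothetical and are not taken from the paper. The sketch stops at the bag-of-words vectors; in the paper these would be passed to an SVM classifier.

```python
from collections import Counter

# Hypothetical POS-tagged documents as (word, tag) pairs.
# The tag set is illustrative only, not the one used in the paper.
TAGGED_DOCS = [
    [("o", "DET"), ("governo", "NOUN"), ("aprovou", "VERB"),
     ("o", "DET"), ("orçamento", "NOUN")],
    [("a", "DET"), ("equipa", "NOUN"), ("venceu", "VERB"),
     ("o", "DET"), ("jogo", "NOUN")],
]

# Assumption: keep only nouns and verbs as candidate terms.
KEEP_TAGS = {"NOUN", "VERB"}


def select_terms(tagged_doc, keep_tags=KEEP_TAGS):
    """Return the lowercased words whose POS tag is in keep_tags."""
    return [word.lower() for word, tag in tagged_doc if tag in keep_tags]


def bag_of_words(tagged_docs, keep_tags=KEEP_TAGS):
    """Build a sorted vocabulary and one term-frequency vector per document."""
    selected = [select_terms(doc, keep_tags) for doc in tagged_docs]
    vocabulary = sorted({term for doc in selected for term in doc})
    vectors = []
    for doc in selected:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


if __name__ == "__main__":
    vocabulary, vectors = bag_of_words(TAGGED_DOCS)
    print(vocabulary)
    for vector in vectors:
        print(vector)
```

Filtering by tag before counting shrinks the vocabulary (the determiners never become features), which is the sense in which part-of-speech information acts as a term selection step here.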
id RCAP_da7c374d7c698e6a2d98a88da1121b09
oai_identifier_str oai:dspace.uevora.pt:10174/1410
network_acronym_str RCAP
repository_id_str 7160
dc.subject.por.fl_str_mv Text classification
Support vector machines
Linguistic Information
publishDate 2008
dc.date.none.fl_str_mv 2008-10-01T00:00:00Z
2009-04-06T15:49:04Z
2009-04-06
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10174/1410
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 94-100
978-0-7695-3441-1
1
restrito_ue
tcg@di.uevora.pt
pq@di.uevora.pt
7th Mexican International Conference on Artificial Intelligence
Gelbukh, Alexander
Morales, Eduardo
283
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv 251581 bytes
application/pdf
dc.publisher.none.fl_str_mv IEEE Computer Society
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP