Using linguistic information to classify Portuguese text documents
Main author: | Gonçalves, Teresa |
---|---|
Publication date: | 2008 |
Other authors: | Quaresma, Paulo |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10174/1410 |
Abstract: | This paper examines the role of various linguistic structures in text classification, applying the study to the Portuguese language. Besides using a bag-of-words representation, where we evaluate different measures and use linguistic knowledge for term selection, we run several experiments using syntactic information, representing documents as strings of words and strings of syntactic parse trees. To build the classifier we use the Support Vector Machine (SVM) algorithm, which is known to produce good results on text classification tasks, and apply the study to a dataset of articles from the Público newspaper. The results show that sentences' syntactic structure is not useful for text classification (as initially expected), but part-of-speech information can be used as a term selection technique to construct the bag-of-words representation of documents. |
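The abstract's idea of using part-of-speech information as a term selection step before building a bag-of-words representation can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the toy lexicon `TOY_POS`, the tag set `KEPT_TAGS`, the function names, and the two example sentences are all hypothetical, and a real system would obtain tags from a Portuguese tagger or parser rather than a lookup table.

```python
from collections import Counter

# Hypothetical toy POS lexicon (assumption); stands in for a real Portuguese tagger.
TOY_POS = {
    "o": "DET", "a": "DET", "governo": "NOUN", "aprovou": "VERB",
    "novo": "ADJ", "orçamento": "NOUN", "equipa": "NOUN",
    "venceu": "VERB", "jogo": "NOUN", "ontem": "ADV",
}

# Assumption: which tags to keep is a tunable choice, not taken from the paper.
KEPT_TAGS = {"NOUN", "VERB", "ADJ"}


def select_terms(tokens, pos_lexicon=TOY_POS, kept_tags=KEPT_TAGS):
    """Keep only tokens whose POS tag is in kept_tags; unknown tokens are dropped."""
    return [t for t in tokens if pos_lexicon.get(t) in kept_tags]


def bag_of_words(docs):
    """Return (vocabulary, vectors) of raw term counts after POS-based selection."""
    # Lowercase and split on whitespace, then filter by POS tag.
    selected = [select_terms(doc.lower().split()) for doc in docs]
    # The vocabulary is the sorted set of all surviving terms.
    vocabulary = sorted({t for doc in selected for t in doc})
    vectors = []
    for doc in selected:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocabulary])
    return vocabulary, vectors


docs = ["O governo aprovou o novo orçamento", "A equipa venceu o jogo ontem"]
vocabulary, vectors = bag_of_words(docs)
```

The resulting count vectors could then be passed to any SVM implementation. Raw counts are used here only for simplicity; the abstract mentions evaluating different measures without naming them.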
id |
RCAP_da7c374d7c698e6a2d98a88da1121b09 |
---|---|
oai_identifier_str |
oai:dspace.uevora.pt:10174/1410 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Using linguistic information to classify Portuguese text documents |
title |
Using linguistic information to classify Portuguese text documents |
title_short |
Using linguistic information to classify Portuguese text documents |
title_full |
Using linguistic information to classify Portuguese text documents |
title_fullStr |
Using linguistic information to classify Portuguese text documents |
title_full_unstemmed |
Using linguistic information to classify Portuguese text documents |
title_sort |
Using linguistic information to classify Portuguese text documents |
author |
Gonçalves, Teresa |
author_facet |
Gonçalves, Teresa; Quaresma, Paulo |
author_role |
author |
author2 |
Quaresma, Paulo |
author2_role |
author |
dc.contributor.author.fl_str_mv |
Gonçalves, Teresa; Quaresma, Paulo |
dc.subject.por.fl_str_mv |
Text classification; Support vector machines; Linguistic Information |
topic |
Text classification; Support vector machines; Linguistic Information |
description |
This paper examines the role of various linguistic structures in text classification, applying the study to the Portuguese language. Besides using a bag-of-words representation, where we evaluate different measures and use linguistic knowledge for term selection, we run several experiments using syntactic information, representing documents as strings of words and strings of syntactic parse trees. To build the classifier we use the Support Vector Machine (SVM) algorithm, which is known to produce good results on text classification tasks, and apply the study to a dataset of articles from the Público newspaper. The results show that sentences' syntactic structure is not useful for text classification (as initially expected), but part-of-speech information can be used as a term selection technique to construct the bag-of-words representation of documents. |
publishDate |
2008 |
dc.date.none.fl_str_mv |
2008-10-01T00:00:00Z 2009-04-06T15:49:04Z 2009-04-06 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10174/1410 |
url |
http://hdl.handle.net/10174/1410 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
94-100; 978-0-7695-3441-1; 1; restrito_ue; tcg@di.uevora.pt; pq@di.uevora.pt; 7th Mexican International Conference on Artificial Intelligence; Gelbukh, Alexander; Morales, Eduardo; 283 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
251581 bytes; application/pdf |
dc.publisher.none.fl_str_mv |
IEEE Computer Society |
publisher.none.fl_str_mv |
IEEE Computer Society |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799136458677682176 |