Android app for Automatic Web Page Classification : Analysis of Text and Visual Features
Main author: | Ugalde, Diego Salas |
---|---|
Publication date: | 2015 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10316/41703 |
Abstract: | The Internet keeps growing every day, and with it the number of new web pages. As a result, web pages can be found in many different categories, such as News, Sports or Business. This has led researchers to the concept of web page classification: assigning one or more category labels to a web page. In recent years, some research has used text and visual content extracted from web pages for classification. However, to the best of our knowledge, doing so in an Android app has not yet been investigated. Consequently, this thesis focuses on the development of an Android app that is able to classify web pages. First, text and visual features are extracted from each web page. Four types of visual features were extracted from each web page to construct a visual feature vector of 160 attributes. A text feature vector of 160 attributes was also built for each web page, using a “Bag-of-Words” of 160 words set up from the extracted and filtered HTML code. This yields a full vector of 320 attributes for each web page. A binary classification was then performed to distinguish web pages for adults from web pages for kids. Good results were obtained, especially with the AdaBoost classifier using both text and visual features, which achieved an accuracy of 94.44%. |
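
The feature construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the vocabulary words, the use of raw word counts (rather than, say, binary presence), the tokenizer, and the zero-filled visual vector are all assumptions made for the example. Only the sizes (160 text attributes, 160 visual attributes, 320 in total) come from the abstract.

```python
import re
from collections import Counter

def bag_of_words_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the page text."""
    # Assumed tokenizer: lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def combined_vector(text_features, visual_features):
    """Concatenate the text and visual feature vectors into one vector."""
    return list(text_features) + list(visual_features)

# Hypothetical 160-word vocabulary; the actual word list is not given in the record.
vocabulary = ["game", "toy", "school"] + [f"word{i}" for i in range(157)]

# Stand-in for text already extracted and filtered from a page's HTML.
page_text = "Toy game for school. Another game!"
text_vec = bag_of_words_vector(page_text, vocabulary)

# Placeholder for the 160 visual attributes (the four visual feature types
# are not detailed in the abstract).
visual_vec = [0.0] * 160

full_vec = combined_vector(text_vec, visual_vec)
print(len(text_vec), len(visual_vec), len(full_vec))
```

A vector like `full_vec` is what a binary classifier such as AdaBoost would receive for each page, with one label per page (adults or kids).
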
id | RCAP_636a30bc8ff825125b3692b9514a10bb |
---|---|
oai_identifier_str | oai:estudogeral.uc.pt:10316/41703 |
network_acronym_str | RCAP |
network_name_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str | 7160 |
dc.title.none.fl_str_mv | Android app for Automatic Web Page Classification : Analysis of Text and Visual Features |
author | Ugalde, Diego Salas |
author_facet | Ugalde, Diego Salas |
author_role | author |
dc.contributor.author.fl_str_mv | Ugalde, Diego Salas |
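dc.subject.por.fl_str_mv | Páginas web (web pages); Classificação (classification); Aplicação Android (Android application); Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática (Scientific domain/area::Engineering and Technology::Electrical, Electronic and Computer Engineering) |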
publishDate | 2015 |
dc.date.none.fl_str_mv | 2015-07 |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/masterThesis |
format | masterThesis |
status_str | publishedVersion |
dc.identifier.uri.fl_str_mv | http://hdl.handle.net/10316/41703 |
url | http://hdl.handle.net/10316/41703 |
dc.language.iso.fl_str_mv | eng |
language | eng |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
eu_rights_str_mv | openAccess |
dc.source.none.fl_str_mv | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str | Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str | RCAAP |
institution | RCAAP |
reponame_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv | |
_version_ | 1799133873023483904 |