Geographical partition for distributed web crawling
Main author: | Exposto, José |
---|---|
Publication date: | 2005 |
Other authors: | Macedo, Joaquim; Pina, António Manuel Silva; Alves, Albano Agostinho Gomes; Amaro, José Carlos Rufino |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/1822/6321 |
Abstract: | This paper evaluates scalable distributed crawling by means of the geographical partition of the Web. The approach is based on the existence of multiple distributed crawlers, each one responsible for the pages belonging to one or more previously identified geographical zones. The work considers a distributed crawler where the assignment of pages to visit is based on the geographical scope of the page content. For the initial assignment of a page to a partition, we use a simple heuristic that marks a page as within the same scope as the geographical location of the hosting web server. During download, if the analysis of a page's contents recommends a different geographical scope, the page is forwarded to the well-located web server. A sample of Portuguese Web pages, extracted during the year 2005, was used to evaluate: a) page download communication times and b) the overhead of page exchange among servers. The evaluation results make it possible to compare our approach with conventional hash partitioning strategies. |
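
The assignment scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the host table, the keyword lists, the zone names, and the keyword-counting "content analysis" are all hypothetical stand-ins, and `hash_partition` is only one plausible form of the conventional hash baseline the abstract mentions.

```python
import hashlib

# Hypothetical data, invented for illustration only.
HOST_ZONE = {"www.example-porto.pt": "north", "www.example-lisboa.pt": "south"}
ZONE_KEYWORDS = {"north": ["porto", "braga"], "south": ["lisboa", "faro"]}


def hash_partition(host: str, n_crawlers: int) -> int:
    """Baseline: pick a crawler from a stable hash of the host name."""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_crawlers


def initial_zone(host: str, default: str = "north") -> str:
    """Initial heuristic: a page starts in the zone of its hosting server."""
    return HOST_ZONE.get(host, default)


def content_zone(text: str):
    """Toy content analysis: zone whose place names occur most often, or None."""
    lowered = text.lower()
    counts = {
        zone: sum(lowered.count(word) for word in words)
        for zone, words in ZONE_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def assign(host: str, text: str):
    """Return (initial zone, final zone, whether the page was forwarded)."""
    start = initial_zone(host)
    final = content_zone(text) or start  # keep the initial zone if no evidence
    return start, final, final != start
```

Under these assumptions, `assign("www.example-porto.pt", "Hotels in Lisboa and Faro")` starts the page in the zone of its host and forwards it because the content points elsewhere. Counting such forwards over a crawl is one way to picture the page-exchange overhead the paper evaluates.
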
id |
RCAP_a548e47d3a27e7fe54214e64612a07a7 |
---|---|
oai_identifier_str |
oai:repositorium.sdum.uminho.pt:1822/6321 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Geographical partition for distributed web crawling; Web Mining; Parallel Crawling; Web Partitioning; This paper evaluates scalable distributed crawling by means of the geographical partition of the Web. The approach is based on the existence of multiple distributed crawlers, each one responsible for the pages belonging to one or more previously identified geographical zones. The work considers a distributed crawler where the assignment of pages to visit is based on the geographical scope of the page content. For the initial assignment of a page to a partition, we use a simple heuristic that marks a page as within the same scope as the geographical location of the hosting web server. During download, if the analysis of a page's contents recommends a different geographical scope, the page is forwarded to the well-located web server. A sample of Portuguese Web pages, extracted during the year 2005, was used to evaluate: a) page download communication times and b) the overhead of page exchange among servers. The evaluation results make it possible to compare our approach with conventional hash partitioning strategies.; Fundação para a Ciência e a Tecnologia (FCT) - POSI/CHS/41739/2001; Association for Computing Machinery; Universidade do Minho; Exposto, José; Macedo, Joaquim; Pina, António Manuel Silva; Alves, Albano Agostinho Gomes; Amaro, José Carlos Rufino; 2005; 2005-01-01T00:00:00Z; conference paper; info:eu-repo/semantics/publishedVersion; application/pdf; http://hdl.handle.net/1822/6321; eng; Herzog, Otthein [et al.], ed. lit. – “Proceedings of the 2005 ACM CIKM : International Conference on Information and Knowledge Management, Bremen, Germany, 2005.” New York : ACM Press, 2005. ISBN 1-59593-140-6.; 1-59593-140-6; 10.1145/1096985.1096999; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2024-05-11T07:29:02Z; oai:repositorium.sdum.uminho.pt:1822/6321; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; mluisa.alvim@gmail.com; opendoar:7160; 2024-05-11T07:29:02; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
Geographical partition for distributed web crawling |
title |
Geographical partition for distributed web crawling |
spellingShingle |
Geographical partition for distributed web crawling Exposto, José Web Mining Parallel Crawling Web Partitioning |
title_short |
Geographical partition for distributed web crawling |
title_full |
Geographical partition for distributed web crawling |
title_fullStr |
Geographical partition for distributed web crawling |
title_full_unstemmed |
Geographical partition for distributed web crawling |
title_sort |
Geographical partition for distributed web crawling |
author |
Exposto, José |
author_facet |
Exposto, José Macedo, Joaquim Pina, António Manuel Silva Alves, Albano Agostinho Gomes Amaro, José Carlos Rufino |
author_role |
author |
author2 |
Macedo, Joaquim Pina, António Manuel Silva Alves, Albano Agostinho Gomes Amaro, José Carlos Rufino |
author2_role |
author author author author |
dc.contributor.none.fl_str_mv |
Universidade do Minho |
dc.contributor.author.fl_str_mv |
Exposto, José Macedo, Joaquim Pina, António Manuel Silva Alves, Albano Agostinho Gomes Amaro, José Carlos Rufino |
dc.subject.por.fl_str_mv |
Web Mining Parallel Crawling Web Partitioning |
topic |
Web Mining Parallel Crawling Web Partitioning |
description |
This paper evaluates scalable distributed crawling by means of the geographical partition of the Web. The approach is based on the existence of multiple distributed crawlers, each one responsible for the pages belonging to one or more previously identified geographical zones. The work considers a distributed crawler where the assignment of pages to visit is based on the geographical scope of the page content. For the initial assignment of a page to a partition, we use a simple heuristic that marks a page as within the same scope as the geographical location of the hosting web server. During download, if the analysis of a page's contents recommends a different geographical scope, the page is forwarded to the well-located web server. A sample of Portuguese Web pages, extracted during the year 2005, was used to evaluate: a) page download communication times and b) the overhead of page exchange among servers. The evaluation results make it possible to compare our approach with conventional hash partitioning strategies. |
publishDate |
2005 |
dc.date.none.fl_str_mv |
2005 2005-01-01T00:00:00Z |
dc.type.driver.fl_str_mv |
conference paper |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1822/6321 |
url |
http://hdl.handle.net/1822/6321 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
Herzog, Otthein [et al.], ed. lit. – “Proceedings of the 2005 ACM CIKM : International Conference on Information and Knowledge Management, Bremen, Germany, 2005.” New York : ACM Press, 2005. ISBN 1-59593-140-6. 1-59593-140-6 10.1145/1096985.1096999 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Association for Computing Machinery |
publisher.none.fl_str_mv |
Association for Computing Machinery |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
mluisa.alvim@gmail.com |
_version_ |
1817545333230534656 |