Geographical partition for distributed web crawling

Bibliographic details
Main author: Exposto, José
Publication date: 2005
Other authors: Macedo, Joaquim; Pina, António Manuel Silva; Alves, Albano Agostinho Gomes; Amaro, José Carlos Rufino
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/1822/6321
Abstract: This paper evaluates scalable distributed crawling by means of the geographical partition of the Web. The approach is based on the existence of multiple distributed crawlers, each one responsible for the pages belonging to one or more previously identified geographical zones. The work considers a distributed crawler where the assignment of pages to visit is based on the geographical scope of the page content. For the initial assignment of a page to a partition, we use a simple heuristic that marks a page as within the same scope as the geographical location of the hosting web server. During download, if the analysis of a page's contents recommends a different geographical scope, the page is forwarded to the well-located web server. A sample of Portuguese Web pages, extracted during the year 2005, was used to evaluate: a) page download communication times and b) the overhead of page exchange among servers. The evaluation results allow our approach to be compared with conventional hash partitioning strategies.
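The assignment scheme summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the zone names, the `SERVER_ZONE` and `PLACE_ZONE` tables, the default zone, and the keyword-counting `content_zone` function are all hypothetical stand-ins for whatever server geolocation and content analysis the paper actually uses. `hash_partition` shows one common form of the hash baseline the abstract mentions.

```python
import hashlib
import re
from collections import Counter
from typing import Optional, Tuple

# All names and data below are hypothetical, chosen only for illustration.
SERVER_ZONE = {
    "www.example-porto.pt": "north",
    "www.example-lisboa.pt": "centre",
}
PLACE_ZONE = {"porto": "north", "braga": "north", "lisboa": "centre", "faro": "south"}
DEFAULT_ZONE = "centre"


def hash_partition(host: str, n_crawlers: int) -> int:
    """Baseline: pick a crawler by hashing the host name."""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_crawlers


def initial_zone(host: str) -> str:
    """Initial heuristic: a page inherits the zone of its hosting server."""
    return SERVER_ZONE.get(host, DEFAULT_ZONE)


def content_zone(text: str) -> Optional[str]:
    """Toy content analysis: the zone whose place names occur most often."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(PLACE_ZONE[w] for w in words if w in PLACE_ZONE)
    return counts.most_common(1)[0][0] if counts else None


def route(host: str, text: str) -> Tuple[str, bool]:
    """Return (zone, forwarded) once a page has been downloaded and analysed."""
    assigned = initial_zone(host)
    inferred = content_zone(text)
    if inferred is not None and inferred != assigned:
        return inferred, True  # forward to the crawler of the inferred zone
    return assigned, False


print(route("www.example-porto.pt", "Praias de Faro"))
```

In this sketch a page hosted in the "north" zone whose text mentions only Faro is forwarded to the "south" crawler. The count of such forwards corresponds to the page-exchange overhead that the abstract lists as an evaluation criterion.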
id RCAP_a548e47d3a27e7fe54214e64612a07a7
oai_identifier_str oai:repositorium.sdum.uminho.pt:1822/6321
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
funding Fundação para a Ciência e a Tecnologia (FCT) - POSI/CHS/41739/2001
dc.title.none.fl_str_mv Geographical partition for distributed web crawling
dc.contributor.none.fl_str_mv Universidade do Minho
dc.contributor.author.fl_str_mv Exposto, José
Macedo, Joaquim
Pina, António Manuel Silva
Alves, Albano Agostinho Gomes
Amaro, José Carlos Rufino
dc.subject.por.fl_str_mv Web Mining
Parallel Crawling
Web Partitioning
dc.date.none.fl_str_mv 2005
2005-01-01T00:00:00Z
dc.type.driver.fl_str_mv conference paper
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1822/6321
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv Herzog, Otthein [et al.], eds. "Proceedings of the 2005 ACM CIKM: International Conference on Information and Knowledge Management, Bremen, Germany, 2005." New York: ACM Press, 2005. ISBN 1-59593-140-6.
1-59593-140-6
10.1145/1096985.1096999
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Association for Computing Machinery
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.mail.fl_str_mv mluisa.alvim@gmail.com