Declarative approach to data extraction of web pages

Detalhes bibliográficos
Autor(a) principal: Alves, Ricardo João de Freitas
Data de Publicação: 2009
Tipo de documento: Dissertação
Idioma: eng
Título da fonte: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Texto Completo: http://hdl.handle.net/10362/5822
Resumo: Thesis submitted to Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa, in partial fulfilment of the requirements for the degree of Master in Computer Science
id RCAP_0c3a067e877fd143c6f7c17d43632f62
oai_identifier_str oai:run.unl.pt:10362/5822
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Declarative approach to data extraction of web pagesWeb wrappersWeb data extractionGraphical wrapper constructionThesis submitted to Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa, in partial fulfilment of the requirements for the degree of Master in Computer ScienceIn the last few years, we have been witnessing a noticeable WEB evolution with the introduction of significant improvements at technological level, such as the emergence of XHTML, CSS,Javascript, and Web2.0, just to name ones. This, combined with other factors such as physical expansion of the Web, as well as its low cost, have been the great motivator for the organizations and the general public to join, with a consequent growth in the number of users and thus influencing the volume of the largest global data repository. In consequence, there was an increasing need for regular data acquisition from the WEB, and because of its frequency, length or complexity, it would only be viable to obtain through automatic extractors. However, two main difficulties are inherent to automatic extractors. First, much of the Web's information is presented in visual formats mainly directed for human reading. Secondly, the introduction of dynamic webpages, which are brought together in local memory from different sources, causing some pages not to have a source file. Therefore, this thesis proposes a new and more modern extractor, capable of supporting the Web evolution, as well as being generic, so as to be able to be used in any situation, and capable of being extended and easily adaptable to a more particular use. This project is an extension of an earlier one which had the capability of extractions on semi-structured text files. However it evolved to a modular extraction system capable of extracting data from webpages, semi-structured text files and be expanded to support other data source types. It also contains a more complete and generic validation system and a new data delivery system capable of performing the earlier deliveries as well as new generic ones. A graphical editor was also developed to support the extraction system features and to allow a domain expert without computer knowledge to create extractions with only a few simple and intuitive interactions on the rendered webpage.Faculdade de Ciências e TecnologiaPires, JoãoRUNAlves, Ricardo João de Freitas2011-06-20T15:22:10Z20092009-01-01T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttp://hdl.handle.net/10362/5822enginfo:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2024-03-11T03:36:43Zoai:run.unl.pt:10362/5822Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-20T03:16:34.675218Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse
dc.title.none.fl_str_mv Declarative approach to data extraction of web pages
title Declarative approach to data extraction of web pages
spellingShingle Declarative approach to data extraction of web pages
Alves, Ricardo João de Freitas
Web wrappers
Web data extraction
Graphical wrapper construction
title_short Declarative approach to data extraction of web pages
title_full Declarative approach to data extraction of web pages
title_fullStr Declarative approach to data extraction of web pages
title_full_unstemmed Declarative approach to data extraction of web pages
title_sort Declarative approach to data extraction of web pages
author Alves, Ricardo João de Freitas
author_facet Alves, Ricardo João de Freitas
author_role author
dc.contributor.none.fl_str_mv Pires, João
RUN
dc.contributor.author.fl_str_mv Alves, Ricardo João de Freitas
dc.subject.por.fl_str_mv Web wrappers
Web data extraction
Graphical wrapper construction
topic Web wrappers
Web data extraction
Graphical wrapper construction
description Thesis submitted to Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa, in partial fulfilment of the requirements for the degree of Master in Computer Science
publishDate 2009
dc.date.none.fl_str_mv 2009
2009-01-01T00:00:00Z
2011-06-20T15:22:10Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/5822
url http://hdl.handle.net/10362/5822
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Faculdade de Ciências e Tecnologia
publisher.none.fl_str_mv Faculdade de Ciências e Tecnologia
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799137814900637696