Syntactic similarity of web documents.
Autor(a) principal: | |
---|---|
Data de Publicação: | 2003 |
Outros Autores: | |
Tipo de documento: | Artigo de conferência |
Idioma: | eng |
Título da fonte: | Repositório Institucional da UFOP |
Texto Completo: | http://www.repositorio.ufop.br/handle/123456789/1682 |
Resumo: | This paper presents and compares two methods for evaluating the syntactic similarity between documents. The first method uses the Patricia tree, constructed from the original document, and the similarity is computed searching the text of each candidate document in the tree. The second method uses shingles concept to obtain the similarity measure for every document pairs, and each shingle from the original document is inserted in a hash table, where shingles of each candidate document are searched. Given an original doc-ument and some candidates, two methods find documents that have some similarity relationship with the original doc-ument. Experimental results were obtained by using a pla-giarized documents generator system, from 900 documents collected from the Web. Considering the arithmetic ave rage of the absolute differences between the expected and ob-tained similarity, the algorithm that uses shingles obtained a performance of 4,13 % and the algorithm that uses Patricia tree a performance 7.50% |
id |
UFOP_e659dd1e82938aeddd8e9b83596abce0 |
---|---|
oai_identifier_str |
oai:localhost:123456789/1682 |
network_acronym_str |
UFOP |
network_name_str |
Repositório Institucional da UFOP |
repository_id_str |
3233 |
spelling |
Pereira Junior, Álvaro RodriguesZiviani, Nivio2012-10-18T20:49:54Z2012-10-18T20:49:54Z2003PEREIRA JUNIOR, A. R.; ZIVIANI, N. Syntactic similarity of web documents. In. Latin American Web Congress, 1 . 2003. Santiago . Anais... Santiago: Latin American Web Congress, 2003. v. 1. p. 194-200. Disponível em: <http://www.cwr.cl/la-web/2003/stamped/23_pereira_a.pdf>. Acesso em: 18 out. 2012.http://www.repositorio.ufop.br/handle/123456789/1682This paper presents and compares two methods for evaluating the syntactic similarity between documents. The first method uses the Patricia tree, constructed from the original document, and the similarity is computed searching the text of each candidate document in the tree. The second method uses shingles concept to obtain the similarity measure for every document pairs, and each shingle from the original document is inserted in a hash table, where shingles of each candidate document are searched. Given an original doc-ument and some candidates, two methods find documents that have some similarity relationship with the original doc-ument. Experimental results were obtained by using a pla-giarized documents generator system, from 900 documents collected from the Web. Considering the arithmetic ave rage of the absolute differences between the expected and ob-tained similarity, the algorithm that uses shingles obtained a performance of 4,13 % and the algorithm that uses Patricia tree a performance 7.50%Syntactic similarity of web documents.info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/conferenceObjectengreponame:Repositório Institucional da UFOPinstname:Universidade Federal de Ouro Preto (UFOP)instacron:UFOPinfo:eu-repo/semantics/openAccessLICENSElicense.txtlicense.txttext/plain; charset=utf-81748http://www.repositorio.ufop.br/bitstream/123456789/1682/5/license.txt8a4605be74aa9ea9d79846c1fba20a33MD55ORIGINALEVENTO_SyntacticSimilarityWeb.pdfEVENTO_SyntacticSimilarityWeb.pdfapplication/pdf226706http://www.repositorio.ufop.br/bitstream/123456789/1682/1/EVENTO_SyntacticSimilarityWeb.pdf9fde87a086007c7f651421fb4a8d3aaeMD51123456789/16822017-01-05 07:57:11.525oai:localhost:123456789/1682Tk9URTogUExBQ0UgWU9VUiBPV04gTElDRU5TRSBIRVJFClRoaXMgc2FtcGxlIGxpY2Vuc2UgaXMgcHJvdmlkZWQgZm9yIGluZm9ybWF0aW9uYWwgcHVycG9zZXMgb25seS4KCk5PTi1FWENMVVNJVkUgRElTVFJJQlVUSU9OIExJQ0VOU0UKCkJ5IHNpZ25pbmcgYW5kIHN1Ym1pdHRpbmcgdGhpcyBsaWNlbnNlLCB5b3UgKHRoZSBhdXRob3Iocykgb3IgY29weXJpZ2h0Cm93bmVyKSBncmFudHMgdG8gRFNwYWNlIFVuaXZlcnNpdHkgKERTVSkgdGhlIG5vbi1leGNsdXNpdmUgcmlnaHQgdG8gcmVwcm9kdWNlLAp0cmFuc2xhdGUgKGFzIGRlZmluZWQgYmVsb3cpLCBhbmQvb3IgZGlzdHJpYnV0ZSB5b3VyIHN1Ym1pc3Npb24gKGluY2x1ZGluZwp0aGUgYWJzdHJhY3QpIHdvcmxkd2lkZSBpbiBwcmludCBhbmQgZWxlY3Ryb25pYyBmb3JtYXQgYW5kIGluIGFueSBtZWRpdW0sCmluY2x1ZGluZyBidXQgbm90IGxpbWl0ZWQgdG8gYXVkaW8gb3IgdmlkZW8uCgpZb3UgYWdyZWUgdGhhdCBEU1UgbWF5LCB3aXRob3V0IGNoYW5naW5nIHRoZSBjb250ZW50LCB0cmFuc2xhdGUgdGhlCnN1Ym1pc3Npb24gdG8gYW55IG1lZGl1bSBvciBmb3JtYXQgZm9yIHRoZSBwdXJwb3NlIG9mIHByZXNlcnZhdGlvbi4KCllvdSBhbHNvIGFncmVlIHRoYXQgRFNVIG1heSBrZWVwIG1vcmUgdGhhbiBvbmUgY29weSBvZiB0aGlzIHN1Ym1pc3Npb24gZm9yCnB1cnBvc2VzIG9mIHNlY3VyaXR5LCBiYWNrLXVwIGFuZCBwcmVzZXJ2YXRpb24uCgpZb3UgcmVwcmVzZW50IHRoYXQgdGhlIHN1Ym1pc3Npb24gaXMgeW91ciBvcmlnaW5hbCB3b3JrLCBhbmQgdGhhdCB5b3UgaGF2ZQp0aGUgcmlnaHQgdG8gZ3JhbnQgdGhlIHJpZ2h0cyBjb250YWluZWQgaW4gdGhpcyBsaWNlbnNlLiBZb3UgYWxzbyByZXByZXNlbnQKdGhhdCB5b3VyIHN1Ym1pc3Npb24gZG9lcyBub3QsIHRvIHRoZSBiZXN0IG9mIHlvdXIga25vd2xlZGdlLCBpbmZyaW5nZSB1cG9uCmFueW9uZSdzIGNvcHlyaWdodC4KCklmIHRoZSBzdWJtaXNzaW9uIGNvbnRhaW5zIG1hdGVyaWFsIGZvciB3aGljaCB5b3UgZG8gbm90IGhvbGQgY29weXJpZ2h0LAp5b3UgcmVwcmVzZW50IHRoYXQgeW91IGhhdmUgb2J0YWluZWQgdGhlIHVucmVzdHJpY3RlZCBwZXJtaXNzaW9uIG9mIHRoZQpjb3B5cmlnaHQgb3duZXIgdG8gZ3JhbnQgRFNVIHRoZSByaWdodHMgcmVxdWlyZWQgYnkgdGhpcyBsaWNlbnNlLCBhbmQgdGhhdApzdWNoIHRoaXJkLXBhcnR5IG93bmVkIG1hdGVyaWFsIGlzIGNsZWFybHkgaWRlbnRpZmllZCBhbmQgYWNrbm93bGVkZ2VkCndpdGhpbiB0aGUgdGV4dCBvciBjb250ZW50IG9mIHRoZSBzdWJtaXNzaW9uLgoKSUYgVEhFIFNVQk1JU1NJT04gSVMgQkFTRUQgVVBPTiBXT1JLIFRIQVQgSEFTIEJFRU4gU1BPTlNPUkVEIE9SIFNVUFBPUlRFRApCWSBBTiBBR0VOQ1kgT1IgT1JHQU5JWkFUSU9OIE9USEVSIFRIQU4gRFNVLCBZT1UgUkVQUkVTRU5UIFRIQVQgWU9VIEhBVkUKRlVMRklMTEVEIEFOWSBSSUdIVCBPRiBSRVZJRVcgT1IgT1RIRVIgT0JMSUdBVElPTlMgUkVRVUlSRUQgQlkgU1VDSApDT05UUkFDVCBPUiBBR1JFRU1FTlQuCgpEU1Ugd2lsbCBjbGVhcmx5IGlkZW50aWZ5IHlvdXIgbmFtZShzKSBhcyB0aGUgYXV0aG9yKHMpIG9yIG93bmVyKHMpIG9mIHRoZQpzdWJtaXNzaW9uLCBhbmQgd2lsbCBub3QgbWFrZSBhbnkgYWx0ZXJhdGlvbiwgb3RoZXIgdGhhbiBhcyBhbGxvd2VkIGJ5IHRoaXMKbGljZW5zZSwgdG8geW91ciBzdWJtaXNzaW9uLgo=Repositório InstitucionalPUBhttp://www.repositorio.ufop.br/oai/requestrepositorio@ufop.edu.bropendoar:32332017-01-05T12:57:11Repositório Institucional da UFOP - Universidade Federal de Ouro Preto (UFOP)false |
dc.title.pt_BR.fl_str_mv |
Syntactic similarity of web documents. |
title |
Syntactic similarity of web documents. |
spellingShingle |
Syntactic similarity of web documents. Pereira Junior, Álvaro Rodrigues |
title_short |
Syntactic similarity of web documents. |
title_full |
Syntactic similarity of web documents. |
title_fullStr |
Syntactic similarity of web documents. |
title_full_unstemmed |
Syntactic similarity of web documents. |
title_sort |
Syntactic similarity of web documents. |
author |
Pereira Junior, Álvaro Rodrigues |
author_facet |
Pereira Junior, Álvaro Rodrigues Ziviani, Nivio |
author_role |
author |
author2 |
Ziviani, Nivio |
author2_role |
author |
dc.contributor.author.fl_str_mv |
Pereira Junior, Álvaro Rodrigues Ziviani, Nivio |
description |
This paper presents and compares two methods for evaluating the syntactic similarity between documents. The first method uses the Patricia tree, constructed from the original document, and the similarity is computed searching the text of each candidate document in the tree. The second method uses shingles concept to obtain the similarity measure for every document pairs, and each shingle from the original document is inserted in a hash table, where shingles of each candidate document are searched. Given an original doc-ument and some candidates, two methods find documents that have some similarity relationship with the original doc-ument. Experimental results were obtained by using a pla-giarized documents generator system, from 900 documents collected from the Web. Considering the arithmetic ave rage of the absolute differences between the expected and ob-tained similarity, the algorithm that uses shingles obtained a performance of 4,13 % and the algorithm that uses Patricia tree a performance 7.50% |
publishDate |
2003 |
dc.date.issued.fl_str_mv |
2003 |
dc.date.accessioned.fl_str_mv |
2012-10-18T20:49:54Z |
dc.date.available.fl_str_mv |
2012-10-18T20:49:54Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/conferenceObject |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
PEREIRA JUNIOR, A. R.; ZIVIANI, N. Syntactic similarity of web documents. In. Latin American Web Congress, 1 . 2003. Santiago . Anais... Santiago: Latin American Web Congress, 2003. v. 1. p. 194-200. Disponível em: <http://www.cwr.cl/la-web/2003/stamped/23_pereira_a.pdf>. Acesso em: 18 out. 2012. |
dc.identifier.uri.fl_str_mv |
http://www.repositorio.ufop.br/handle/123456789/1682 |
identifier_str_mv |
PEREIRA JUNIOR, A. R.; ZIVIANI, N. Syntactic similarity of web documents. In. Latin American Web Congress, 1 . 2003. Santiago . Anais... Santiago: Latin American Web Congress, 2003. v. 1. p. 194-200. Disponível em: <http://www.cwr.cl/la-web/2003/stamped/23_pereira_a.pdf>. Acesso em: 18 out. 2012. |
url |
http://www.repositorio.ufop.br/handle/123456789/1682 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFOP instname:Universidade Federal de Ouro Preto (UFOP) instacron:UFOP |
instname_str |
Universidade Federal de Ouro Preto (UFOP) |
instacron_str |
UFOP |
institution |
UFOP |
reponame_str |
Repositório Institucional da UFOP |
collection |
Repositório Institucional da UFOP |
bitstream.url.fl_str_mv |
http://www.repositorio.ufop.br/bitstream/123456789/1682/5/license.txt http://www.repositorio.ufop.br/bitstream/123456789/1682/1/EVENTO_SyntacticSimilarityWeb.pdf |
bitstream.checksum.fl_str_mv |
8a4605be74aa9ea9d79846c1fba20a33 9fde87a086007c7f651421fb4a8d3aae |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFOP - Universidade Federal de Ouro Preto (UFOP) |
repository.mail.fl_str_mv |
repositorio@ufop.edu.br |
_version_ |
1801685729631272960 |