Optimizing ahocorasick for word counting.
Main author: | LUCENA, Emerson Leonardo |
---|---|
Publication date: | 2020 |
Document type: | Undergraduate final project (Trabalho de conclusão de curso) |
Language: | eng |
Source title: | Biblioteca Digital de Teses e Dissertações da UFCG |
Full text: | http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128 |
Abstract: | The Aho-Corasick algorithm is used to recognize all occurrences of a set of strings in a text. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑜), where 𝑚 is the sum of the lengths of the keywords, 𝑛 is the text length and 𝑜 is the number of occurrences of all keywords in the text. However, when the input contains a large number of matches with the dictionary patterns, the algorithm's performance degrades. In many domains, such as information retrieval, natural language processing and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm to count the number of occurrences of a set of words in a text. The new algorithm works offline and does not depend on the frequencies of the dictionary words. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑢), where 𝑢 is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100MB and dictionaries of sizes 1KB, 1MB and 10MB. The new algorithm performed better in every experiment, running from 50% to 300% faster than Aho-Corasick. |
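
The record does not describe the thesis's algorithm itself, so the following is only a minimal sketch of one well-known way to count words offline without paying for every individual occurrence. It assumes a dictionary-based trie, and the names `build_automaton` and `count_words` are hypothetical. Instead of following output links at each text position, it records how often each automaton state is visited and then pushes those totals along the failure links once, deepest states first.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links (illustrative helper, not from the thesis)."""
    goto = [{}]       # goto[s][ch] -> child state
    fail = [0]        # failure (longest proper suffix) link of each state
    word_state = {}   # distinct word -> state that spells it
    for w in set(words):
        if not w:
            continue  # ignore empty words
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        word_state[w] = s
    order = []                        # non-root states in BFS (depth) order
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        order.append(s)
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            queue.append(t)
    return goto, fail, word_state, order


def count_words(words, text):
    """Return {word: occurrences} for the dictionary words found in text."""
    goto, fail, word_state, order = build_automaton(words)
    visits = [0] * len(goto)
    s = 0
    for ch in text:                   # one pass over the text, no output-link walks
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        visits[s] += 1
    for s in reversed(order):         # deepest states first
        visits[fail[s]] += visits[s]  # every visit to s also ends with fail[s]'s string
    return {w: visits[t] for w, t in word_state.items() if visits[t]}


if __name__ == "__main__":
    print(count_words(["he", "she", "his", "hers"], "ushers"))
```

In this sketch the scan costs time proportional to the text length and the propagation costs time proportional to the number of trie states, so the total no longer depends on how many times each word repeats. Whether this matches the thesis's 𝑂(𝑚 + 𝑛 + 𝑢) construction is an assumption; the full text linked above is the authority.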
id |
UFCG_68a2dcf2d82779f2982edf19ff17f643 |
---|---|
oai_identifier_str |
oai:localhost:riufcg/20128 |
network_acronym_str |
UFCG |
network_name_str |
Biblioteca Digital de Teses e Dissertações da UFCG |
repository_id_str |
4851 |
dc.title.none.fl_str_mv |
Optimizing ahocorasick for word counting. Otimizando ahocorasick para contagem de palavras. |
author |
LUCENA, Emerson Leonardo. |
author_role |
author |
dc.contributor.none.fl_str_mv |
GHEYI, Rohit; GHEYI, R.; http://lattes.cnpq.br/2931270888717344; MONTEIRO, João Arthur Brunet; MASSONI, Tiago Lima |
dc.contributor.author.fl_str_mv |
LUCENA, Emerson Leonardo. |
dc.subject.por.fl_str_mv |
Aho-Corasick algorithm; Pattern matching; Correspondência de padrões; Filtrage; Coincidencia de patrones; Word counting; Recuento de palabras; Comptage de mots; Contagem de palavras; Algoritmo offline; Algorithme hors ligne; Algoritmo sin conexión; Offline algorithm; Processamento de textos; Processing of texts; Procesamiento de textos; Traitement des textes; Ciência da Computação |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020 2021-07-20T13:15:32Z 2021-07-20 2021-07-20T13:15:32Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/bachelorThesis |
format |
bachelorThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128; LUCENA, E. L. Optimizing ahocorasick for word counting. 12 leaves. Undergraduate final project (Trabalho de Conclusão de Curso) - Article (Bachelor's Degree in Computer Science), Undergraduate Program in Computer Science, Centro de Engenharia Elétrica e Informática, Universidade Federal de Campina Grande, Paraíba, Brazil, 2020. Available at: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128 |
dc.language.iso.fl_str_mv |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Campina Grande; Brazil; Centro de Engenharia Elétrica e Informática - CEEI; UFCG |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da UFCG instname:Universidade Federal de Campina Grande (UFCG) instacron:UFCG |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da UFCG - Universidade Federal de Campina Grande (UFCG) |
repository.mail.fl_str_mv |
bdtd@setor.ufcg.edu.br |