Optimizing Aho-Corasick for word counting.

Bibliographic details
Main author: LUCENA, Emerson Leonardo.
Publication date: 2020
Document type: Undergraduate thesis (Trabalho de conclusão de curso)
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da UFCG
Full text: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
Abstract: The Aho-Corasick algorithm is used to recognize all occurrences of a set of strings in a text. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑜), where 𝑚 is the sum of the lengths of the keywords, 𝑛 is the text length and 𝑜 is the number of occurrences of all keywords in the text. When the input contains a large number of matches with the dictionary patterns, the 𝑜 term grows and the algorithm's performance degrades. In many domains, such as information retrieval, natural language processing and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm to count the number of occurrences of a set of words in a text. The new algorithm works offline and does not depend on the frequencies of the dictionary words. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑢), where 𝑢 is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100 MB and dictionaries of 1 KB, 1 MB and 10 MB. The new algorithm performed better in every experiment, running from 50% to 300% faster than Aho-Corasick.
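The record does not reproduce the thesis's algorithm, so the following is only a minimal sketch of one standard way to count keyword occurrences offline with an Aho-Corasick automaton, and is not necessarily the author's method. Instead of reporting every match during the scan, it counts how often each automaton state is visited and then pushes those counts along the failure links once the whole text has been read, so the work after the scan depends on the dictionary size rather than on the number of matches. The function name `count_words` and the list-based trie layout are assumptions made for this illustration.

```python
from collections import deque


def count_words(keywords, text):
    """Count (possibly overlapping) occurrences of each keyword in text.

    Illustrative sketch only: visit counting plus failure-link propagation,
    a standard offline technique, not taken from the thesis itself.
    """
    # Trie stored as parallel lists; state 0 is the root.
    goto = [{}]        # goto[s][ch] -> child state
    fail = [0]         # failure (longest proper suffix) link per state
    terminal = [None]  # keyword ending exactly at this state, if any

    for word in keywords:
        if not word:
            continue  # empty keywords are ignored in this sketch
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                terminal.append(None)
            state = nxt
        terminal[state] = word

    # Breadth-first traversal computes failure links and records the
    # visiting order (shallow states first).
    order = []
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        order.append(state)
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            queue.append(child)

    # Scan the text once, counting how often each state is reached.
    visits = [0] * len(goto)
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        visits[state] += 1

    # Offline step: every visit to a state is also a visit to its suffix
    # states, so add counts along failure links from deep to shallow.
    for state in reversed(order):
        visits[fail[state]] += visits[state]

    return {
        terminal[s]: visits[s]
        for s in range(1, len(goto))
        if terminal[s] is not None
    }
```

For instance, `count_words(["he", "she", "his", "hers"], "ushers")` returns `{"he": 1, "she": 1, "his": 0, "hers": 1}`. Because the propagation step needs the complete visit counts, the result is only available after the entire text has been processed, which is what makes this style of counting offline.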
id UFCG_68a2dcf2d82779f2982edf19ff17f643
oai_identifier_str oai:localhost:riufcg/20128
network_acronym_str UFCG
network_name_str Biblioteca Digital de Teses e Dissertações da UFCG
repository_id_str 4851
dc.title.none.fl_str_mv Optimizing Aho-Corasick for word counting.
Otimizando ahocorasick para contagem de palavras.
dc.contributor.none.fl_str_mv GHEYI, Rohit.
GHEYI, R.
http://lattes.cnpq.br/2931270888717344
MONTEIRO, João Arthur Brunet.
MASSONI, Tiago Lima.
dc.contributor.author.fl_str_mv LUCENA, Emerson Leonardo.
dc.subject.por.fl_str_mv Aho-Corasick algorithm
Pattern matching
Correspondência de padrões
Filtrage
Coincidencia de patrones
Word counting
Recuento de palabras
Comptage de mots
Contagem de palavras
Algoritmo offline
Algorithme hors ligne
Algoritmo sin conexión
Offline algorithm
Processamento de textos
Processing of texts
Procesamiento de textos
Traitement des textes
Ciência da Computação
publishDate 2020
dc.date.none.fl_str_mv 2020
2021-07-20T13:15:32Z
2021-07-20
2021-07-20T13:15:32Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/bachelorThesis
format bachelorThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
LUCENA, E. L. Optimizing Aho-Corasick for word counting. 12 pp. Undergraduate thesis - Article (Bachelor's Program in Computer Science), Undergraduate Program in Computer Science, Centro de Engenharia Elétrica e Informática, Universidade Federal de Campina Grande, Paraíba, Brazil, 2020. Available at: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
url http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Campina Grande
Brazil
Centro de Engenharia Elétrica e Informática - CEEI
UFCG
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da UFCG
instname:Universidade Federal de Campina Grande (UFCG)
instacron:UFCG
instname_str Universidade Federal de Campina Grande (UFCG)
instacron_str UFCG
institution UFCG
reponame_str Biblioteca Digital de Teses e Dissertações da UFCG
collection Biblioteca Digital de Teses e Dissertações da UFCG
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da UFCG - Universidade Federal de Campina Grande (UFCG)
repository.mail.fl_str_mv bdtd@setor.ufcg.edu.br
_version_ 1809744501020819456