Analysis of bug localization performance supported by dataset dissection (Análise de performance na localização de bugs apoiada pela dissecção de conjuntos de dados)

Sobreira, Victor

Bibliographic details
Main author:	Sobreira, Victor
Publication date:	2022
Document type:	Thesis
Language:	eng
Source title:	Repositório Institucional da UFU
Full text:	https://repositorio.ufu.br/handle/123456789/35654 http://doi.org/10.14393/ufu.te.2022.60
Abstract:	Finding and fixing software bugs remains a major challenge. These tasks demand as much effort and experience from developers as building new functionality does. Over the last decades, the research community has actively produced approaches to support the debugging process. Bug Localization (BL) is an essential step regardless of the software repair approach applied (automated or manual), and automated BL techniques are critical to making the process more effective and efficient. There are many approaches to automated BL, and all of them share a common goal: to improve accuracy in ranking the software components suspected of containing bugs. A recurrent issue is the lack of clarity about why an approach succeeds or fails on the assessed bug dataset, since most methods do not consider the nature and intrinsic characteristics of the bugs. The discussion is still too focused on performance gains over the previous state of the art. This work aims to contribute to software repair tasks, focusing primarily on support for automated BL. First, we explore the characteristics of bugs commonly used in the assessment of localization strategies (which also extends to automated program repair). Then, we analyze the relationships between these bug characteristics and how they influence the performance of localization strategies. We start from a BL approach based on static information and on Learning to Rank (LtR) algorithms, with bug reports as input to the localization process. Initially, we analyze a well-known bug dataset, Defects4J, from which we extract various bug characteristics. Next, we analyze these characteristics in a larger dataset, referred to as the LR-dataset. We then identify several strategies and alternatives to improve the rankings of suspected buggy files generated by BL approaches. Examples include the use of new features (e.g., Code Entropy), hyperparameter tuning and data balancing for training in Machine Learning (ML) based approaches, and, finally, bug sampling guided by patch analysis. We tested these alternatives in an environment built for experimenting with and reproducing BL strategies. We show that pre-processing strategies applied to bug reports and to the dataset, together with the tuning of different LtR algorithms, can produce different ranking results even with past BL approaches. Moreover, the characteristics of the bugs sampled for assessment can influence the ranking scores of suspected buggy files, e.g., depending on the types of repair patterns and repair actions required to fix the bugs. This is the case for the Missing Not-Null Check repair pattern, whose presence in an experimental sample produced a ranking of suspected files that scored 27.22 percentage points above the baseline, in which the presence (or absence) of the pattern is not considered. These results point to opportunities to revisit past BL approaches through the lens of dissecting the datasets used in their assessment, with the potential for new insights, interpretations, and compositions of BL strategies.

Item metadata

id	UFU_8263ef07c36a8000cdf1b2571a34aec0
oai_identifier_str	oai:repositorio.ufu.br:123456789/35654
network_acronym_str	UFU
network_name_str	Repositório Institucional da UFU
repository_id_str
dc.title.none.fl_str_mv	Análise de performance na localização de bugs apoiada pela dissecção de conjuntos de dados Analysis of bug localization performance supported by dataset dissection
title	Análise de performance na localização de bugs apoiada pela dissecção de conjuntos de dados
author	Sobreira, Victor
author_facet	Sobreira, Victor
author_role	author
dc.contributor.none.fl_str_mv	Maia, Marcelo de Almeida (http://lattes.cnpq.br/4915659948263445); Figueiredo, Eduardo (http://lattes.cnpq.br/1265706528850746); Dorça, Fabiano Azevedo (http://lattes.cnpq.br/3944579737930998); Silva, Flávio de Oliveira (http://lattes.cnpq.br/3190608911887258); Kulesza, Uirá (http://lattes.cnpq.br/0189095897739979)
dc.contributor.author.fl_str_mv	Sobreira, Victor
dc.subject.por.fl_str_mv	Localização de Bug; Bug Localization; Reparo Automático de Software; Automatic Program Repair; Dissecção de Conjuntos de Dados de Bugs; Bugs’ Dataset Dissection; Depuração de Software; Debugging; Análise de Reparos; Patch Analysis; Aprendizado de Rankings; Learn-to-Rank; CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::METODOLOGIA E TECNICAS DA COMPUTACAO::ENGENHARIA DE SOFTWARE; Computação; Software - Manutenção; Falhas de sistemas de computação; Conjunto de caracteres (Processamento de dados)
publishDate	2022
dc.date.none.fl_str_mv	2022-08-22T19:40:30Z 2022-08-22T19:40:30Z 2022-01-24
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	SOBREIRA, Victor. Analysis of bug localization performance supported by dataset dissection. 2022. 222 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal de Uberlândia, Uberlândia, 2022. DOI http://doi.org/10.14393/ufu.te.2022.60 https://repositorio.ufu.br/handle/123456789/35654 http://doi.org/10.14393/ufu.te.2022.60
identifier_str_mv	SOBREIRA, Victor. Analysis of bug localization performance supported by dataset dissection. 2022. 222 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal de Uberlândia, Uberlândia, 2022. DOI http://doi.org/10.14393/ufu.te.2022.60
url	https://repositorio.ufu.br/handle/123456789/35654 http://doi.org/10.14393/ufu.te.2022.60
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	http://creativecommons.org/licenses/by-nc-nd/3.0/us/ info:eu-repo/semantics/openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc-nd/3.0/us/
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade Federal de Uberlândia Brasil Programa de Pós-graduação em Ciência da Computação
publisher.none.fl_str_mv	Universidade Federal de Uberlândia Brasil Programa de Pós-graduação em Ciência da Computação
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFU instname:Universidade Federal de Uberlândia (UFU) instacron:UFU
instname_str	Universidade Federal de Uberlândia (UFU)
instacron_str	UFU
institution	UFU
reponame_str	Repositório Institucional da UFU
collection	Repositório Institucional da UFU
repository.name.fl_str_mv	Repositório Institucional da UFU - Universidade Federal de Uberlândia (UFU)
repository.mail.fl_str_mv	diinf@dirbi.ufu.br
_version_	1813711547105542144
