From source code identifiers to natural language terms
Main author: | Carvalho, Nuno Alexandre Ramos |
---|---|
Publication date: | 2015 |
Other authors: | Almeida, J. J.; Henriques, Pedro Rangel; Varanda, Maria João |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/1822/53525 |
Abstract: | Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single or multiple domain concepts in order to increase programming linguistic efficiency (convey more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, prone to include application domain terms and specific abbreviations. This approach was applied to two software packages written in C, achieving an F-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented. |
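
The distinction the abstract draws between hard splitting (explicit marks such as CamelCase or underscores) and dictionary-based splitting with abbreviation expansion can be illustrated with a minimal Python sketch. This is not the Lingua::IdSplitter implementation or its API: the function names `hard_split`, `soft_split`, and `split_and_expand`, the tiny `WORDS` and `ABBREVIATIONS` dictionaries, and the "fewest terms wins" rule are all assumptions made purely for illustration.

```python
import re

# Hypothetical toy dictionaries (assumptions for this sketch only);
# a real tool would load far larger general and project-specific ones.
WORDS = {"time", "sort", "file", "name", "get", "user"}
ABBREVIATIONS = {"usr": "user", "fn": "function", "str": "string"}


def hard_split(identifier):
    """Split on explicit marks only: underscores and CamelCase boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts if p]


def soft_split(token):
    """Segment a token that has no explicit marks into dictionary terms.

    best[i] holds the shortest list of known terms covering token[:i],
    or None when no such segmentation exists.
    """
    known = WORDS | set(ABBREVIATIONS)
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in known:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    # Fall back to the unsplit token when the dictionaries do not cover it.
    return best[-1] if best[-1] is not None else [token]


def split_and_expand(identifier):
    """Hard split first, then soft split each token and expand abbreviations."""
    terms = []
    for token in hard_split(identifier):
        for piece in soft_split(token):
            terms.append(ABBREVIATIONS.get(piece, piece))
    return terms
```

With these toy dictionaries, `split_and_expand("timesort")` yields `["time", "sort"]` even though the identifier carries no explicit marks, and `split_and_expand("getUsrName")` yields `["get", "user", "name"]`. A token the dictionaries do not cover is returned unchanged.
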
id |
RCAP_c68e92d4386a23afa3578b768dc7981d |
---|---|
oai_identifier_str |
oai:repositorium.sdum.uminho.pt:1822/53525 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
From source code identifiers to natural language terms; Program comprehension; Natural language processing; Identifier splitting; Science & Technology; Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single or multiple domain concepts in order to increase programming linguistic efficiency (convey more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, prone to include application domain terms and specific abbreviations. This approach was applied to two software packages written in C, achieving an F-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented. This work is funded by National Funds through the FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project PEst-OE/EEI/UI0752/2014. info:eu-repo/semantics/publishedVersion; Elsevier; Universidade do Minho; Carvalho, Nuno Alexandre Ramos; Almeida, J. J.; Henriques, Pedro Rangel; Varanda, Maria João; 2015; 2015-01-01T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; application/pdf; http://hdl.handle.net/1822/53525; eng; 0164-1212; 1873-1228; 10.1016/j.jss.2014.10.013; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2023-07-21T11:59:30Z; oai:repositorium.sdum.uminho.pt:1822/53525; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-19T18:49:17.963952; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
From source code identifiers to natural language terms |
title |
From source code identifiers to natural language terms |
spellingShingle |
From source code identifiers to natural language terms; Carvalho, Nuno Alexandre Ramos; Program comprehension; Natural language processing; Identifier splitting; Science & Technology |
title_short |
From source code identifiers to natural language terms |
title_full |
From source code identifiers to natural language terms |
title_fullStr |
From source code identifiers to natural language terms |
title_full_unstemmed |
From source code identifiers to natural language terms |
title_sort |
From source code identifiers to natural language terms |
author |
Carvalho, Nuno Alexandre Ramos |
author_facet |
Carvalho, Nuno Alexandre Ramos; Almeida, J. J.; Henriques, Pedro Rangel; Varanda, Maria João |
author_role |
author |
author2 |
Almeida, J. J.; Henriques, Pedro Rangel; Varanda, Maria João |
author2_role |
author author author |
dc.contributor.none.fl_str_mv |
Universidade do Minho |
dc.contributor.author.fl_str_mv |
Carvalho, Nuno Alexandre Ramos; Almeida, J. J.; Henriques, Pedro Rangel; Varanda, Maria João |
dc.subject.por.fl_str_mv |
Program comprehension; Natural language processing; Identifier splitting; Science & Technology |
topic |
Program comprehension; Natural language processing; Identifier splitting; Science & Technology |
description |
Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single or multiple domain concepts in order to increase programming linguistic efficiency (convey more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, prone to include application domain terms and specific abbreviations. This approach was applied to two software packages written in C, achieving an F-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented. |
publishDate |
2015 |
dc.date.none.fl_str_mv |
2015; 2015-01-01T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1822/53525 |
url |
http://hdl.handle.net/1822/53525 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
0164-1212; 1873-1228; 10.1016/j.jss.2014.10.013 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Elsevier |
publisher.none.fl_str_mv |
Elsevier |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799132257376534528 |