Parallelizing Git Checkout: a case study of I/O parallelism on desktop applications
Main author: | Bernardino, Matheus Tavares |
---|---|
Publication date: | 2022 |
Document type: | Master's thesis |
Language: | eng |
Source title: | Biblioteca Digital de Teses e Dissertações da USP |
Full text: | https://www.teses.usp.br/teses/disponiveis/45/45134/tde-31082022-210254/ |
Abstract: | A version control system (VCS) is a tool that tracks and manages the changes made to a set of files over time. More broadly, VCS tools can also help to shape and manage collaboration flows, find and fix bugs, and recall the motivation behind a given code change. Although these tools can typically track any type of data, they bring substantial benefits to software projects and, as a result, have become standard practice in this field. Among the VCS tools available today, Git is the most popular among developers. It is currently used to version a wide variety of repositories, from small personal projects of a few megabytes to massive corporate repositories with more than 300 GB and 3.5 million files. For that reason, speed and scalability are among the top priorities for the Git development community. However, the performance of the tool sometimes falls short of what is desired on networked file systems (NFS), where input and output (I/O) operations tend to be more costly. In particular, one Git operation that suffers from these costs is checkout, which is responsible for restoring files from specific versions of a project. Various optimizations have been applied to the code related to the checkout operation over the years, but the sequential processing of files still carried a large time penalty on NFS and was suboptimal for local file systems on SSDs. In this project, we worked to parallelize the Git checkout machinery, resulting in speedups of up to 4.5x on NFS and 3.6x on SSDs. We also study how parallelism affects the I/O tasks performed by the checkout operation on different machines and storage devices. The parallel checkout feature was incorporated into the upstream Git repository and has been available to all users of the tool since version 2.32.0, which was released in June 2021. |
id |
USP_1827fa060adce38f54ffaf1a35e7517c |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-31082022-210254 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
dc.title.none.fl_str_mv |
Parallelizing Git Checkout: a case study of I/O parallelism on desktop applications / Paralelizando o Git Checkout: um estudo de caso sobre paralelismo de E/S em aplicações desktop |
title |
Parallelizing Git Checkout: a case study of I/O parallelism on desktop applications |
author |
Bernardino, Matheus Tavares |
author_facet |
Bernardino, Matheus Tavares |
author_role |
author |
dc.contributor.none.fl_str_mv |
Lejbman, Alfredo Goldman Vel |
dc.contributor.author.fl_str_mv |
Bernardino, Matheus Tavares |
dc.subject.por.fl_str_mv |
Git; Network file systems; Parallel I/O; Parallel programming; Version control systems |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-07-13 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/45/45134/tde-31082022-210254/ |
url |
https://www.teses.usp.br/teses/disponiveis/45/45134/tde-31082022-210254/ |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
Release the content for public access. info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Release the content for public access. |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br; atendimento@aguia.usp.br |
_version_ |
1815257518168866816 |