Using Git: Tales of Peril, Pain and Protection

GitHub Punk OctoCat

Talk given at the 2022 Annual Conference of the American Society of Agronomy, the Crop Science Society of America, and the Soil Science Society of America, held in Baltimore, Maryland.

Abstract

Git is widely used by scientists and programmers. It is one of a family of version control tools that track file changes through time. These tools are enormously useful for enabling collaboration, both within a small group and across many individuals. Version control helps prevent data loss through off-site storage and by recording changes in data files over time. GitHub and other remote repository hosts such as GitLab are useful for accessing other people’s data, downloading software, finding documentation and source code, and interacting with the people who generate and maintain that data and code. These cloud repositories also offer extensive functionality for additional tasks such as building and deploying websites. However, if you have ever used git, you have likely learned what a humbling experience version control can be. The point is to track your changes through time and/or share data, but achieving that involves learning a somewhat bewildering array of new terms and concepts (e.g., “personal access token”, “rebase”, “working tree”, or “git revert”, whose action does not fully align with its name). Furthermore, if you have used git for any amount of time, you have likely encountered problems such as merge conflicts. Despite these challenges, I have found that the advantages of using git with GitHub or another remote repository are worth the occasional inconvenience. In this talk, I’ll review how git has supported my work as a data scientist and statistical consultant, and provide guidance on how to start incorporating git into your regular workflow.