5 Version Control & Collaboration

Good Software Engineering Practice for R Packages

Liming

August 1, 2024

Disclaimer




Any opinions expressed in this presentation and on the following slides are solely those of the presenter and not necessarily those of their respective employer or company.

  • Overview, demo, practical
  • Can only scratch surface
  • More resources on website

Trade-offs in code development


Working alone

  • no coordination overhead
  • no review
  • lack of diversity
  • can slack on documentation
  • fragile long-term maintenance

Working in a team

  • coordination overhead
  • mutual review of code
  • different approaches
  • forced to document
  • more robust long-term maintenance

Key issue:
Manage complexity over time or between people

Version control systems (VCS)

  • Manage different versions of a piece of work
  • Compare and merge diverged versions effectively

flowchart LR
  A[Liming v1] --> B[Liming v2]
  B --> C[Liming v3]
  B --> D[Joe v1]
  D --> E[Joe + Liming v4]
  C --> E

  • Code is a complex system \(\leadsto\) ideal application of VCS
  • Compounded by multiple people ‘fiddling’ with it!

git basics

Enter git: the ‘Latin of data science’

  • Created by Linus Torvalds for work on the Linux kernel
  • Essentially a database with snapshots of a monitored ‘repository’ (directory)
  • Optimized to compute line-based changes
  • Integrated in RStudio IDE, Visual Studio Code
  • De facto standard not just in the R world
  • Alternatives: mercurial, SVN, …

Stage & commit

gitGraph
   commit
   commit
   commit
   commit
   commit

  1. ‘Stage’ changes for inspection
    • lets you inspect proposed changes before locking them in
  2. Permanently ‘commit’ changes to git

\(\leadsto\) Chain of versions with incremental changes
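The two steps can be tried in a throwaway repository. This is a minimal sketch, not part of the original slides: the temporary directory, the file name `add.R`, and the e-mail address are illustrative assumptions, and `git init -b` assumes git 2.28 or newer.

```shell
# Throwaway repository in a temporary directory (hypothetical names)
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main                    # -b needs git >= 2.28
git config user.name "Liming"
git config user.email "liming@example.com"

echo 'add <- function(x, y) x + y' > add.R

git status --short                     # '?? add.R': untracked, nothing staged yet
git add add.R                          # 1. stage the change
git diff --staged                      # inspect exactly what would be committed
git commit -q -m "Add add() helper"    # 2. commit the staged snapshot

git log --oneline                      # one line per commit in the chain
```

Repeating the edit, `git add`, `git commit` cycle extends the chain by one commit each time.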

Line-based differences - the ‘diff’

  • Changes in git are line-based
  • Additions (green) & deletions (red) between commits
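To see a line-based diff on the command line, the following sketch commits two versions of a small file and compares them. File name and contents are made up for illustration; in a terminal, git colours the `+` and `-` lines green and red.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main                    # -b needs git >= 2.28
git config user.name "Liming"
git config user.email "liming@example.com"

printf 'x <- 1\ny <- 2\n' > script.R
git add script.R
git commit -q -m "Initial script"

printf 'x <- 1\ny <- 3\n' > script.R   # change the second line only
git commit -q -am "Change y"

# Compare the previous commit (HEAD~1) with the latest one (HEAD):
# lines starting with '-' were deleted, lines starting with '+' were added
git diff HEAD~1 HEAD
```

Because the diff is line-based, changing a single character still shows up as one deleted line plus one added line.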

Going back in time

  • Every commit has a unique hash value
  • Can ‘checkout’ old commit (browse history)
git checkout [commit hash to browse]
  • Can ‘reset’ changes
git reset --hard [commit hash to reset to]
  • Removes need for my-file_final_v2_2019.R
  • Time travelling has its dangers…
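A small sketch of both commands, assuming a throwaway repository with a hypothetical file `notes.txt`. It also shows one of the dangers: after `git reset --hard`, the branch no longer contains the later commit.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main                    # -b needs git >= 2.28
git config user.name "Liming"
git config user.email "liming@example.com"

echo 'version 1' > notes.txt
git add notes.txt
git commit -q -m "First version"
first="$(git rev-parse HEAD)"          # hash of the first commit

echo 'version 2' > notes.txt
git commit -q -am "Second version"

git checkout -q "$first"               # browse the old state ('detached HEAD')
cat notes.txt                          # file content as of the first commit
git checkout -q main                   # return to the tip of the branch

# DANGER: moves main back to the first commit and discards uncommitted changes;
# the second commit is no longer on any branch afterwards
git reset -q --hard "$first"
cat notes.txt
```

`checkout` only looks at the past and is safe to undo; `reset --hard` rewrites where the branch points, so it should be used with care, especially on shared branches.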

Branching

gitGraph
   commit
   commit
   branch feature
   checkout feature
   commit
   commit
   checkout main
   commit

  • Variations of repository: ‘branches’
git checkout -b [my new branch]
  • Quick switching between branches
git checkout [branch name]
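The sketch below creates a branch, commits on it, and switches back. Branch and file names are illustrative assumptions. Newer git versions also offer `git switch` as an alternative to `git checkout` for changing branches.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main                    # -b needs git >= 2.28
git config user.name "Liming"
git config user.email "liming@example.com"

echo 'base' > base.txt
git add base.txt
git commit -q -m "Initial commit"

git checkout -q -b feature             # create a new branch and switch to it
echo 'draft' > feature.txt
git add feature.txt
git commit -q -m "Start feature"

git checkout -q main                   # switch back to main
ls                                     # feature.txt is not part of main
git branch                             # lists both branches, '*' marks the current one
```

Switching branches swaps the files in the working directory, so the same folder can hold several variations of the project.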

‘Merging’ two branches

gitGraph
   commit
   commit
   branch feature
   checkout feature
   commit
   commit
   checkout main
   commit
   merge feature

  • Consolidate diverged ‘branches’
  • Usually merged automatically
  • Conflicting changes need manual resolution
    • Same line edited in both source and target branch: which version to keep?
  • Resolving merge conflicts beyond today’s scope
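A sketch of the automatic case: both branches diverge but touch different files, so git can merge them without help. All names are hypothetical.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main                    # -b needs git >= 2.28
git config user.name "Liming"
git config user.email "liming@example.com"

echo 'base' > base.txt
git add base.txt
git commit -q -m "Initial commit"

git checkout -q -b feature             # work on the feature branch
echo 'feature work' > feature.txt
git add feature.txt
git commit -q -m "Add feature"

git checkout -q main                   # meanwhile, main moves on too
echo 'main work' > main.txt
git add main.txt
git commit -q -m "Update main"

git merge -q --no-edit feature         # creates a merge commit with two parents
git log --oneline --graph              # shows the diverged and re-joined history
```

If both branches had edited the same line of the same file, `git merge` would stop and report a conflict instead of creating the merge commit.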

Example of ‘gitflow’

gitGraph
   commit tag: "v0.0.1"
   commit
   branch feature-1
   checkout feature-1
   commit
   commit
   checkout main
   branch feature-2
   checkout feature-2
   commit
   checkout feature-1
   commit
   checkout main
   commit tag: "bugfix"
   merge feature-1 tag: "v0.1.0"
   checkout feature-2
   commit

  • ‘gitflow’: specific workflow for git repositories
  • features developed on branches, then merged into ‘main’

Version Control & Collaboration

  • git itself is a command-line tool for version control
  • git platforms add UI for collaboration
  • git + GitHub
    • VCS (git)
    • Web hosting of code (GitHub)
    • Organisation with issues, discussions (GitHub)
    • Automation of checks/tests (GitHub)

git platforms

GitHub.com

  • Huge number of R packages developed there:
  • 100 million developers on GitHub.com (Jan ’23)
  • 372 million repositories, 28 million public (Jan ’23)
  • ‘Facebook’ of developers / social coding
  • Discuss problems / propose changes

Branches & pull requests

  • Branches are a git concept
  • Git platforms add the concept of a ‘pull request’ (PR)
    • PR is ‘suggested merge’ from branch A to B
    • Usually from ‘feature A’ to ‘main’
  • Allows previewing problems and discussing changes before the merge
  • Once everyone is happy, a pull request can be merged
  • Every PR has an associated branch, but not every branch has a PR
  • More in the demo!

Automating things with GitHub

  • GitHub provides GitHub Actions
  • Allows task automation, e.g.
    • run unit tests
    • build & host documentation
    • static code analysis (linting)
  • Most important actions for R: github.com/r-lib/actions
  • Extremely useful to enforce best-practices & quality
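As an illustration, the sketch below writes a minimal workflow file that runs R CMD check on every push to main and on every pull request. It is modelled on the example workflows in github.com/r-lib/actions; the file name, the trigger branches, and the action version tags (`@v4`, `@v2`) are assumptions that may need adjusting for a real package.

```shell
pkg="$(mktemp -d)"                     # stands in for the package root directory
cd "$pkg"
mkdir -p .github/workflows

# Quoted 'EOF' keeps the shell from expanding ${{ ... }} in the YAML
cat > .github/workflows/R-CMD-check.yaml <<'EOF'
on:
  push:
    branches: [main]
  pull_request:

name: R-CMD-check

jobs:
  R-CMD-check:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
          needs: check
      - uses: r-lib/actions/check-r-package@v2
EOF

cat .github/workflows/R-CMD-check.yaml
```

Once such a file is committed and pushed, GitHub runs the job automatically and shows the result next to each commit and pull request.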

A typical GitHub workflow

sequenceDiagram
    participant S as Liming
    participant GH as GitHub server
    participant J as Joe
    S->>S: make change locally & commit to <feature>
    S->>GH: push commit
    S->>GH: open pull request
    GH->>GH: run automated checks
    S->>J: request review
    J->>J: review code
    J->>S: request changes
    S->>S: implement changes locally & commit
    S->>GH: push commit
    GH->>GH: run automated checks
    S->>J: request review
    J->>J: review code
    J->>GH: approve changes, unblocking merge
    GH->>GH: merge <feature> into <main>
    GH->>GH: run automated checks on <main>
    GH->>J: pull newest version of <main>
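The git side of this sequence can be rehearsed locally. In the sketch below a bare repository stands in for the GitHub server, and two working copies stand in for Liming and Joe. Pull request, automated checks, and review only exist on the platform, so they appear as comments, and a local merge plus push stands in for pressing the merge button. All paths and file names are hypothetical.

```shell
work="$(mktemp -d)"
cd "$work"
git init -q --bare -b main server.git  # stands in for the GitHub server (git >= 2.28)

# Liming: create the repository and publish main
git init -q -b main liming
cd liming
git config user.name "Liming"
git config user.email "liming@example.com"
git remote add origin "$work/server.git"
echo '# demo package' > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q -u origin main

# Joe: get a copy of the repository
git clone -q "$work/server.git" "$work/joe"

# Liming: make a change locally, commit to a feature branch, push it
git checkout -q -b feature
echo 'hello <- function() "hi"' > hello.R
git add hello.R
git commit -q -m "Add hello()"
git push -q -u origin feature

# On GitHub: open the pull request, wait for checks, get Joe's approval.
# The next three commands stand in for pressing 'Merge' on the platform.
git checkout -q main
git merge -q --no-ff --no-edit feature
git push -q origin main

# Joe: pull the newest version of main
cd "$work/joe"
git pull -q --ff-only
ls                                     # hello.R has arrived in Joe's copy
```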

Looks awfully complicated, why?

  • Efficient collaboration with novice/untrusted contributors
    • Maintainer: automated checks reduce review burden
    • Contributor: no need to check manually
  • Branching promotes asynchronous work on features
  • Full history - can always go back

\(\leadsto\) making code-collaboration scalable

Demo

Practical - collaboration on GitHub

  • Work in teams of ~ 3 or 4
  • Go to https://github.com/kkmann/simulatr and read through the instructions in the README.md file
  • The repository is a template to practice collaboration on GitHub
  • Only one member per team needs to use the template and invite the others as collaborators!
  • Take some time to check out the README.md file and set up your environment in Posit Cloud
  • Can you fix the errors with some pull requests?
  • The purpose of this exercise is to explore the collaboration functionality of GitHub - not to produce a perfect package ;)

Open sourcing and tagging on GitHub

Open Sourcing

  • The easiest way to “open source” your R package is to make the GitHub repository public
  • This allows for easy open source contributions from other developers via pull requests
  • Please check with your organization first:
    • Are they ok to publish the software?
    • What is the appropriate copyright holder?
  • A public repository also lets users file bugs as GitHub issues; link the issues page in the BugReports field of the package DESCRIPTION

Versioning

  • The Version field in the DESCRIPTION file defines the package version
  • Syntax: at least two, usually three, non-negative integers separated by . or -
    • Canonical form is: x.y-z, equivalent to x.y.z
  • Useful conventions of “semantic versioning”:
    • x is major: Increment this for breaking changes
    • y is minor: Increment this for new features
    • z is patch: Increment this for bug fixes only
    • Development versions: use x.y.z.9000 and count up during development
    • usethis::use_version() can help with this
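The increment rules can be sketched as a small shell function. `bump_version` is a hypothetical helper written for this illustration only, not part of usethis or any other tool. It assumes exactly three components without leading zeros and resets the lower components when a higher one is incremented, as semantic versioning prescribes.

```shell
# Hypothetical helper: bump one component of an x.y.z (or x.y-z) version string
bump_version() {
  local version="$1" part="$2" major minor patch
  # Split the version at '.' or '-' into its three integers
  IFS='.-' read -r major minor patch <<EOF
$version
EOF
  case "$part" in
    major) major=$((major + 1)); minor=0; patch=0 ;;  # breaking change
    minor) minor=$((minor + 1)); patch=0 ;;           # new feature
    patch) patch=$((patch + 1)) ;;                    # bug fix only
    dev)   echo "$major.$minor.$patch.9000"; return ;; # start of development
    *)     echo "unknown part: $part" >&2; return 1 ;;
  esac
  echo "$major.$minor.$patch"
}

bump_version 1.2.3 major   # 2.0.0
bump_version 1.2.3 minor   # 1.3.0
bump_version 1.2-3 patch   # 1.2.4
bump_version 1.2.3 dev     # 1.2.3.9000
```

In practice `usethis::use_version()` does this bookkeeping inside the package itself, so the function above is only meant to make the convention concrete.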

Tags

  • Tags are a feature of Git, i.e., not specific to GitHub
  • Git can tag specific points in the code history as being important
  • Typically, for each release, create a tag vx.y.z
  • The value here is that users can later check out the package in the state of this release version
    • Download in R: remotes::install_github("org/package", ref = "vx.y.z")
    • Comparisons between versions are also possible, etc.
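A sketch of tagging two releases in a throwaway repository, then comparing them and checking out the older one. The version numbers and the one-line DESCRIPTION file are illustrative assumptions.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main                    # -b needs git >= 2.28
git config user.name "Liming"
git config user.email "liming@example.com"

echo 'Version: 0.1.0' > DESCRIPTION
git add DESCRIPTION
git commit -q -m "Release 0.1.0"
git tag -a v0.1.0 -m "Release 0.1.0"   # annotated tag on the release commit

echo 'Version: 0.2.0' > DESCRIPTION
git commit -q -am "Release 0.2.0"
git tag -a v0.2.0 -m "Release 0.2.0"

git tag --list                         # all tags in the repository
git diff v0.1.0 v0.2.0                 # compare two released versions

git checkout -q v0.1.0                 # package in the state of release 0.1.0
cat DESCRIPTION
git checkout -q main                   # back to the latest state

# Tags are not pushed by default; with a remote called 'origin' one would run:
# git push origin v0.2.0
```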

Tags: Example

Releases

  • Based on Git tags, and a feature of GitHub
  • Are “deployable software packages to make them available for a wider audience to download and use”
  • Contain release notes and links to the binary package files for download
    • However, for R packages these tar.gz package files are rarely used directly

Releases: Example

License information