Use A Version Control System (VCS)

Recipes for Repeatable Experiments

Recipe:

  1. Place initial/input, output, and script files under version control.
  2. Ignore irrelevant artifacts, e.g., via .gitignore and .hgignore. Typically, such artifacts are intermediate auto-generated files. However, if such files depend on aspects of the environment that can influence the output files of the experiment, then place them under version control and record those aspects of the environment in a file that is also under version control.
  3. Ensure relevant artifacts are being versioned. It is common to assume that version control tools will automagically track or ignore newly created files. Instead, make sure that every file that should be versioned is indeed versioned, e.g., by reviewing the output of git status or hg status before each commit.
  4. Commit artifacts only after successful execution of the master script. It is common to make tweaks to the master script and individual scripts as we progress through the steps of an experiment. If the tweaks are not captured, then the experiment may not be repeatable. So, always execute the master script and check for its successful completion before committing the changes.
  5. Use tags to identify artifacts from different runs/versions of the same experiment. Suppose you run an experiment twice with different parameters. Assuming the results are different and you plan to contrast them, place both runs of the experiment under version control (as different commits) and tag these commits.
  6. Use tags to refer to artifacts from a specific run/version of an experiment. With most version control systems, tags are more human friendly than commit identifiers. So, use tags to communicate information about versions of an experiment.
  7. Use Git-LFS for large binary files. If you have large binary files and are using Git as your VCS, then use Git-LFS to store such files and Git to store other files. Text files that are large, auto-generated, and seldom updated are also good candidates for Git-LFS. If you do not use Git, then look for a similar option in your VCS and use it.
  8. Store hashes (and not content) of huge files from public repositories. When huge data files (e.g., 10+ GBs) from public repositories are used, do not place the contents of the data files under version control. Instead, to ensure veracity,
    a) record the source of the huge data files in a file that is placed under version control, typically a README.md file, and
    b) record the relative paths of the huge data files along with their hash/checksum (e.g., md5) in a file that is placed under version control.

    This will help other experimenters ensure that they are using the same data files as you did. Further, it also serves as a map of how to lay out the files to repeat the experiment.
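
As a minimal sketch of step 8(b), the Python snippet below writes a checksum manifest for a data directory and later checks a local copy against it. The function names (file_digest, write_manifest, verify_manifest), the "digest, two spaces, relative path" line format, and the default of SHA-256 are assumptions made for illustration and are not prescribed by the recipe; md5 or any other algorithm supported by hashlib can be passed instead. The sketch also assumes the manifest is stored outside the data directory (e.g., next to README.md), so that it does not end up hashing itself.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so huge files never need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_path, algorithm="sha256"):
    """Record '<digest>  <relative path>' for every file under data_dir."""
    data_dir = Path(data_dir)
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # Relative POSIX-style paths double as a map of the expected layout.
        relative = path.relative_to(data_dir).as_posix()
        lines.append(f"{file_digest(path, algorithm)}  {relative}")
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(data_dir, manifest_path, algorithm="sha256"):
    """Return relative paths that are missing or whose digest differs."""
    data_dir = Path(data_dir)
    mismatches = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, relative = line.split("  ", 1)
        target = data_dir / relative
        if not target.is_file() or file_digest(target, algorithm) != expected:
            mismatches.append(relative)
    return mismatches
```

In this sketch, you would call write_manifest once and commit the resulting manifest file; anyone repeating the experiment would download the data, lay it out under the same relative paths, and call verify_manifest, which returns an empty list when every file matches.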

Written by

Programming, experimenting, writing | Past: SWE, Researcher, Professor | Present: SWE
