Use Consistent and Intuitive Data Layout

Recipes for Repeatable Experiments

  • Some actions will be performed on the data as part of an experiment. These actions may need to be performed in a specific order, and they may consume initial data or data generated by other actions. A script describes such actions along with any constraints on their ordering and the data flow between them, e.g., calculate the mean of field f1, calculate the mean of field f2, and compare the calculated means via a t-test.
  • The execution of scripts generates data that is based on the initial data or on data generated by other scripts. Let’s refer to such generated data as output data.
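
To make the idea of a script concrete, here is a minimal sketch of one such action: computing the mean of a named field. It assumes the data is a comma-separated file with a header row, and the helper name mean_of_field and the file names in the usage comment are hypothetical, not part of the original recipe.

```shell
#!/bin/sh
# Sketch of a single data processing step.
# Assumption: the data is a CSV file whose first row holds column names.
# mean_of_field is a hypothetical helper name chosen for illustration.
mean_of_field() {
    # $1 = CSV file, $2 = name of the column to average
    awk -F, -v name="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col     { sum += $col; n++ }
        END     { if (n > 0) printf "%.4f\n", sum / n; else exit 1 }
    ' "$1"
}

# Example use inside a step script (file names are illustrative):
#   mean_of_field input/data.csv f1 > output/mean-f1.txt
```

The function fails with a non-zero exit status when the column is missing or the file has no data rows, so a calling script can stop before a later step consumes an empty result.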

Recipe

  1. Have a dedicated folder for each experiment, e.g., evaluate-tools.
  2. Place initial/bootstrap/input data under an input folder, e.g., evaluate-tools/input. The contents of this folder should not change during the experiment.
  3. Capture each data processing step in a separate script. Place the script in a scripts folder, e.g., evaluate-tools/scripts/calculate-means.sh.
  4. Place data generated by executing scripts under an output folder, e.g., evaluate-tools/output.
  5. Create a master script that executes various scripts (steps) to orchestrate/run the experiment, e.g., evaluate-tools/masterScript.sh.
  6. Document the experiment in a README.md file, e.g., evaluate-tools/README.md.
  7. Use a version control system to store the artifacts of the experiment.
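
The recipe above could be wired together roughly as follows. This is only a sketch of what a master script might contain, under some assumptions: the function name run_experiment is hypothetical, step scripts are plain sh scripts that use paths relative to the experiment folder, and the step name compare-means.sh in the usage comment is invented for illustration.

```shell
#!/bin/sh
# Sketch of a master script for the folder layout described in the recipe.
# Assumptions: each step lives in <experiment>/scripts/, reads from input/
# or output/, and writes only to output/. Steps run in the order given.
run_experiment() {
    exp_dir=$1      # experiment folder, e.g., evaluate-tools
    shift           # remaining arguments are step script names, in order
    mkdir -p "$exp_dir/output" || return 1
    for step in "$@"; do
        echo "Running $step"
        # Run from the experiment folder so steps can use relative paths.
        (cd "$exp_dir" && sh "scripts/$step") || {
            echo "Step failed: $step" >&2
            return 1
        }
    done
}

# Example (compare-means.sh is a hypothetical second step):
#   run_experiment evaluate-tools calculate-means.sh compare-means.sh
```

Listing the steps explicitly keeps the ordering constraints visible in one place, and stopping at the first failing step avoids producing output data from incomplete intermediate results.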

Written by

Programming, experimenting, writing | Past: SWE, Researcher, Professor | Present: SWE
