reproducibility-workflow's Introduction

Reproducibility workflow

Every project should consist of a single well-structured directory with meaningful subdirectories. Every project should be its own git repository that is hosted on GitHub.

Data cleaning and analyses should be carefully documented in a Jupyter Notebook or R Markdown file and should be created with reproducibility in mind. Everyone on the team (and future you) should be able to re-create what you did.

The overall purpose is to have an organized project structure in place so that the project is easily approachable to many different individuals.

"Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why." - Bill Noble

An example of a reproducible project that follows this workflow lives in the project_example/ folder.

Project Structure

The project will typically consist of the following subdirectories:

Data

Original raw data files should be backed up on something like Google Drive, Dropbox or Box. The raw data itself should never be touched manually. Instead, you should have scripts or notebooks that load the raw data into an R or Python environment for in-environment data manipulation (this will not modify the raw data files themselves).

Any data that is produced by code should be saved in the data/processed_data/ subdirectory.
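As a rough illustration of this rule, the sketch below reads a raw file and writes a cleaned copy to data/processed_data/ without ever opening the raw file for writing. The file names follow the example layout in this README; the function names (clean_rows, process_raw_data) and the cleaning steps themselves are placeholders made up for this example, not part of the workflow.

```python
import csv
from pathlib import Path


def clean_rows(rows):
    """Return cleaned copies of the rows; the input rows are left unchanged."""
    cleaned = []
    for row in rows:
        # Placeholder cleaning step: strip stray whitespace from names and values.
        new_row = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
        # Placeholder cleaning step: drop rows where every field is empty.
        if any(new_row.values()):
            cleaned.append(new_row)
    return cleaned


def process_raw_data(project_dir):
    """Read data/raw_data/data_orig.csv and write data/processed_data/data_clean.csv."""
    project_dir = Path(project_dir)
    raw_path = project_dir / "data" / "raw_data" / "data_orig.csv"
    out_path = project_dir / "data" / "processed_data" / "data_clean.csv"

    # The raw file is only ever opened in read mode.
    with raw_path.open("r", newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = [name.strip() for name in (reader.fieldnames or [])]
        rows = list(reader)

    cleaned = clean_rows(rows)

    # Everything produced by code goes to the processed_data folder.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)
    return out_path
```

Calling process_raw_data(".") from the project root would regenerate data_clean.csv from scratch, so the processed file can always be deleted and rebuilt.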

Documents

This is a good place to keep meeting notes, data dictionaries, and any other associated materials.

Code

There are three types of code documents:

  1. Function scripts (.R, .py): scripts that contain reusable functions that will be called in the action scripts below (and possibly in the exploration notebooks). By convention, function scripts are given the name xx_funs_yy.R, where xx is a number and yy describes what the functions are for (e.g. 01_funs_clean_data.R).

  2. Action scripts (.R, .py): scripts that perform activities such as a detailed data cleaning pipeline, or running many models. Often these scripts will load in data, do something to it (e.g. clean it or fit a model to it) and will then save a new object (such as a cleaned dataset or model results). By convention, action scripts are given the name xx_do_yy.R, where xx is a number and yy describes what action is undertaken by running the script (e.g. 01_do_clean_data.R).

  3. Exploration notebooks (.Rmd, .ipynb): R Markdown or Jupyter notebook files that contain figures and explanations of the data cleaning steps and the results of analyses. These are the files that an external viewer would find useful to understand your data and analysis.
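One practical wrinkle if the naming convention is carried over to Python (the names above are given for .R files, so a file called 01_funs_clean_data.py is an assumption here): a module name that starts with a digit cannot be used in a plain import statement. A possible workaround, sketched below with a hypothetical helper called load_function_script, is to load the function script by its file path.

```python
import importlib.util
from pathlib import Path


def load_function_script(path):
    """Load a function script as a module, even if its name starts with a digit."""
    path = Path(path)
    # "01_funs_clean_data" is not a valid Python identifier, so
    # `import 01_funs_clean_data` is a syntax error; load by location instead.
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Hypothetical use inside an action script such as 01_do_clean_data.py
# (clean_data is a made-up function name):
#
#   funs = load_function_script(
#       Path(__file__).parent / "functions" / "01_funs_clean_data.py"
#   )
#   cleaned = funs.clean_data(raw)
```

In R the equivalent step is simply source("functions/01_funs_clean_data.R"), which has no such restriction on file names.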

Scripts that are run sequentially are numbered accordingly. An example of a project structure is shown below. Note that in this example the functions folder is nested inside the scripts folder, which contains the action scripts. This makes sense when the functions are called only in the action scripts (and not in the exploration notebooks).

project
│   README.md
├───data/
│   ├───raw_data/
│   │       data_orig.csv
│   ├───processed_data/
│   │       data_clean.csv
│   └───results/
│           model_results.csv
├───documents/
│       meeting_notes.md
│       data_dictionary.md
└───code/
    ├───exploration/
    │       01_data_exploration.Rmd
    │       02_model_results.Rmd
    └───scripts/
        │   01_do_clean_data.R
        │   02_do_model_data.R
        └───functions/
                01_funs_clean_data.R
                02_funs_model_data.R
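The layout above could be created for a new project with a few lines of code. The following is only a sketch: the directory names come from the example, while the function name scaffold_project and the decision to write a one-line README.md are assumptions made for illustration.

```python
from pathlib import Path

# Subdirectories taken from the example layout above.
PROJECT_DIRS = [
    "data/raw_data",
    "data/processed_data",
    "data/results",
    "documents",
    "code/exploration",
    "code/scripts/functions",
]


def scaffold_project(root):
    """Create the example directory layout under `root` and return the paths."""
    root = Path(root)
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        # exist_ok=True makes the function safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)

    # Add a stub README only if the project does not have one yet.
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.resolve().name}\n")
    return created
```

Running scaffold_project("my_project") a second time leaves existing files and folders untouched.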

Syntax and conventions

All filenames are lowercase and use underscores to separate words.
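A convention like this is easy to check automatically. The sketch below is one possible checker, not part of the workflow itself. It makes several assumptions: the file extension may keep its usual capitalisation (.R, .Rmd), README.md is exempt, and hidden files and folders such as .git are skipped. The names follows_convention and find_nonconforming are made up for this example.

```python
import re
from pathlib import Path

# Lowercase words separated by underscores, followed by an extension of any case.
NAME_PATTERN = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*\.[A-Za-z0-9]+")

# Assumed exception: README.md is conventionally upper case.
EXEMPT = {"README.md"}


def follows_convention(filename):
    """Return True if the file name is lowercase with underscores between words."""
    if filename in EXEMPT:
        return True
    return NAME_PATTERN.fullmatch(filename) is not None


def find_nonconforming(root):
    """Return sorted relative paths of files under `root` that break the convention."""
    root = Path(root)
    bad = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # Skip hidden files and anything inside hidden folders such as .git.
        if any(part.startswith(".") for part in rel.parts):
            continue
        if path.is_file() and not follows_convention(path.name):
            bad.append(rel.as_posix())
    return sorted(bad)
```

For example, find_nonconforming(".") run from the project root lists every file whose name contains spaces, hyphens or capital letters in the stem.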

Code should follow an appropriate style guide, for example the tidyverse style guide for R or PEP 8 for Python.

Resources

Acknowledgements

Thanks very much to Ciera Martinez for sharing her project workflow.

I'd also like to acknowledge the Meta Research and Best Practices working group (formerly the Reproducibility working group) at the Berkeley Institute for Data Science (BIDS) for insightful discussions that have helped me form my own workflow over the years.

reproducibility-workflow's Issues

Add notes and drafts/temp folders

When doing literate programming, one often creates drafts or temporary documents to test hypotheses or just to store some data.

In a typical folder structure, temporary files are quickly erased from the main project, so these folders could also be listed in the .gitignore file.
