Universal data analysis pipeline

A Nextflow-based pipeline to run and deploy reproducible analyses. Alongside the pipeline, I developed the toolbox reportsrender to execute notebooks, but the two can also be used independently of each other.

Features

Render Jupyter notebooks or Rmarkdown notebooks (papermill/knitr).
Ensure reproducible analyses.
Deploy reports to GitHub Pages.

Structure

analyses: The actual analysis steps (e.g. Jupyter notebooks, Rmarkdown documents, bash scripts) go here.
bin: scripts that can be called from nextflow directly (nextflow will add them to the PATH for commands run from a process).
data: input data for the notebooks. I often replace this with a symlink to some data storage.
deploy: final reports. This directory is filled by the deploy process, which copies all HTML reports into it and creates an index file. A great way to share the final reports is to push this directory to GitHub Pages.
envs: conda environment files go here. Create one file per notebook, or re-use environments for multiple notebooks -- it's up to you.
lib: put custom libraries (e.g. python modules) here.
results: final results generated by the pipeline go here. Concept: one can always delete the results directory and re-generate it from data using the pipeline.
tables: manually created input data that I want to be under version control, e.g. the list of samples and the associated patient data that you had to compile manually from three Excel sheets because the biologists encoded data as background-color.
main.nf: The nextflow workflow that ties everything together.
nextflow.config: Contains configuration options for the pipeline (e.g. output directory). You can also set options here to run the pipeline on an HPC grid engine (e.g. SGE or SLURM).
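
To make the role of the deploy directory more concrete, here is a minimal shell sketch of what "copy all HTML reports and create an index file" could look like. It is not the deploy process from main.nf; the function name collect_reports and the layout of the generated index.html are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT the pipeline's actual deploy process:
# copy every *.html report found under a results directory into a deploy
# directory and write a simple index.html that links to each copied report.
collect_reports() {
  local results_dir="$1" deploy_dir="$2" report name
  mkdir -p "$deploy_dir"

  # Copy reports; files sharing a basename would overwrite each other.
  find "$results_dir" -type f -name '*.html' -exec cp {} "$deploy_dir"/ \;

  # Build the index from whatever ended up in the deploy directory.
  {
    printf '<!DOCTYPE html>\n<html><body>\n<h1>Reports</h1>\n<ul>\n'
    for report in "$deploy_dir"/*.html; do
      [ -e "$report" ] || continue             # glob matched nothing
      name="$(basename "$report")"
      [ "$name" = "index.html" ] && continue   # do not link the index itself
      printf '  <li><a href="%s">%s</a></li>\n' "$name" "$name"
    done
    printf '</ul>\n</body></html>\n'
  } > "$deploy_dir/index.html"
}
```

After sourcing the snippet, calling collect_reports results deploy from the repository root would fill deploy with the reports and an index page, assuming the reports live somewhere under results.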

How to run.

Install nextflow. In this case, we use conda. Check the nextflow website for other options.

conda create -n nextflow -c conda-forge -c bioconda nextflow
conda activate nextflow

Clone this repository

git clone git@github.com:grst/universal_analysis_pipeline.git
cd universal_analysis_pipeline

Run the pipeline

./main.nf

Share the results. You can zip and email the deploy folder, but sharing the results via GitHub Pages is even better.

To set up GitHub Pages, initialize a repository in the deploy folder and push to the gh-pages branch:

cd deploy
git init
git remote add origin <YOUR_REMOTE>
git checkout --orphan gh-pages
git add -A .
git commit -m "Initial deploy on gh-pages"
git push -u origin gh-pages

It can take a few minutes, but eventually your reports will be available at https://<yourgithubuser>.github.io/<yourrepo>
You might want to "password protect" your pages. This is not natively supported by GitHub Pages, but a workaround is to put all files in a cryptic subfolder, e.g. rBymGubVBBrdHtGo6Of35E3uI. As GitHub Pages doesn't list directories, a visitor needs to know the precise URL to access the folder. You can adjust the deploy dir in nextflow.config.
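
One way to come up with such a cryptic folder name is to draw it from the system's random source. The sketch below assumes a Unix-like system with /dev/urandom, od and tr available; the function name random_dirname is made up for this example. Keep in mind that this only obscures the URL: anyone who learns the link can still open the reports.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a random, URL-safe folder name.
# Reads 16 bytes from /dev/urandom and prints them as 32 hex characters.
random_dirname() {
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
  printf '\n'
}

# Example: store a fresh name in a variable for later use.
subdir="$(random_dirname)"
echo "Cryptic subfolder: $subdir"
```

The printed name can then be used as the subfolder part of the deploy directory configured in nextflow.config.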

How to use

This repository is meant as a template. You can fork or clone it and expand from there. At a minimum, you have to change two things:

Add your notebooks to the analyses folder
Edit main.nf to wire your notebooks together the right way. You can use reportsrender to execute the notebooks.

Ideas for the future:

Convert conda envs to Singularity containers to ensure reproducibility.
