A Nextflow-based pipeline to run and deploy reproducible analyses. Alongside the pipeline I developed the toolbox reportsrender to execute notebooks, but the pipeline can also be used without it.
- render Jupyter notebooks or Rmarkdown notebooks (papermill/knitr)
- ensure reproducible analyses
- deploy reports to GitHub Pages
analyses
: The actual analysis steps (e.g. Jupyter notebooks, Rmarkdown documents, bash scripts) go here.
bin
: Scripts that can be called from Nextflow directly (Nextflow adds them to the PATH for commands run from a process).
data
: Input data for the notebooks. I often replace this with a symlink to some data storage.
deploy
: Final reports. Filled by the deploy process, which copies all HTML reports to that directory and creates an index file. A great way to share the final reports is to push this directory to GitHub Pages.
envs
: Conda environment files go here. Create one file per notebook, or reuse environments for multiple notebooks; it's up to you.
lib
: Put custom libraries (e.g. Python modules) here.
results
: Final results generated by the pipeline go here. Concept: one can always delete the results directory and regenerate it from data using the pipeline.
tables
: Manually created input data that I want to be under version control, e.g. the list of samples and the associated patient data that you had to compile manually from three Excel sheets because the biologists encoded data as background color.
main.nf
: The Nextflow workflow that ties everything together.
nextflow.config
: Contains configuration options for the pipeline (e.g. output directory). You can also set options here to run the pipeline on an HPC grid engine (e.g. SGE or SLURM).
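If you start from an empty directory instead of cloning the template, the layout described above could be recreated roughly as follows. This is only a sketch: the project name `my_analysis` and the `PROJECT_DIR` variable are assumptions made for illustration, and the two workflow files are created as empty placeholders.

```shell
#!/bin/sh
set -eu

# Hypothetical project root (an assumption of this sketch); override via PROJECT_DIR.
project_dir="${PROJECT_DIR:-my_analysis}"

# One directory per entry in the layout described above.
for d in analyses bin data deploy envs lib results tables; do
    mkdir -p "$project_dir/$d"
done

# Empty placeholders for the workflow and its configuration.
touch "$project_dir/main.nf" "$project_dir/nextflow.config"

# Alternative for data: point it at external storage instead of a real directory, e.g.
#   rmdir "$project_dir/data" && ln -s /path/to/storage "$project_dir/data"
```

Because results only holds generated output, it can be removed at any time and rebuilt by rerunning the pipeline.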
- Install Nextflow. In this case, we use conda. Check the Nextflow website for other options.
conda create -n nextflow -c conda-forge -c bioconda nextflow
conda activate nextflow
- Clone this repository
git clone git@github.com:grst/universal_analysis_pipeline.git
cd universal_analysis_pipeline
- Run the pipeline
./main.nf
- Share the results. You can zip and email the deploy folder. Even better is to share the results using GitHub Pages:
- To set up GitHub Pages, initialize a repository in the deploy folder and push to the gh-pages branch:
cd deploy
git init
git remote add origin <YOUR_REMOTE>
git checkout --orphan gh-pages
git add -A .
git commit -m "Initial deploy on gh-pages"
git push -u origin gh-pages
- It can take a few minutes, but eventually your reports will be available at https://<yourgithubuser>.github.io/<yourrepo>
- You might want to "password protect" your pages. This is not natively supported by GitHub Pages, but a workaround is to put all files in a cryptic subfolder, e.g. rBymGubVBBrdHtGo6Of35E3uI. As GitHub Pages doesn't list directories, you need to know the precise URL to access the folder. You can adjust the deploy dir in nextflow.config.
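A cryptic subfolder name like the one mentioned above does not have to be invented by hand. The following is a minimal sketch that draws one from /dev/urandom; the length of 25 characters simply mirrors the example name, and the variable names `token` and `deploy_dir` are assumptions. The resulting path would then be entered as the deploy directory in nextflow.config (the exact option name is not shown here).

```shell
#!/bin/sh
set -eu

# Read random bytes, keep only letters and digits, and stop after 25 characters.
token="$(LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 25)"

# Hypothetical layout: publish the reports in deploy/<token>/ instead of deploy/.
deploy_dir="deploy/$token"
echo "Use this as deploy directory: $deploy_dir"
```

Keep in mind that this only hides the reports. Anyone who learns the URL can read them.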
This repository is meant as a template. You can fork/clone this repository and expand from there. At a minimum, you have to change two things:
- Add your notebooks to the analyses folder
- Edit main.nf to wire your notebooks together the right way. You can use reportsrender to execute the notebooks.
- convert conda envs to singularity containers to ensure reproducibility.
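After adding notebooks to analyses, it is easy to forget to wire one of them into main.nf. The helper below is a hypothetical sanity check and not part of the template: it assumes that main.nf refers to each notebook by its file name and prints every file in analyses whose name does not occur there.

```shell
#!/bin/sh
set -eu

# Print every file in <project>/analyses whose name does not occur in <project>/main.nf.
# Assumption: main.nf mentions notebooks by file name.
list_unwired_notebooks() {
    project="$1"
    for nb in "$project"/analyses/*; do
        [ -f "$nb" ] || continue    # skip if the folder is empty or missing
        name="$(basename "$nb")"
        if ! grep -qF -- "$name" "$project/main.nf"; then
            echo "$name"
        fi
    done
}

# Usage: run from the repository root.
if [ -f main.nf ]; then
    list_unwired_notebooks .
fi
```

An empty output means every notebook name appears somewhere in main.nf. It does not prove that the notebooks are wired together correctly.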