Hands-on workshop material for using Jupyter Notebooks on the CloudOS platform
Time | Programme |
---|---|
9:00-9:15 | Welcome Address - Dr. Anne Deslattes Mays / Dr. Maria Chatzou Dunford |
9:15-9:30 | Demoing the end result |
9:30-10:00 | 1. Introduction to literate programming workflows with Jupyter Notebooks |
10:00-10:10 | Configuration of Jupyter Notebooks on Lifebit CloudOS |
10:10-11:00 | 2. Exploratory Data Analysis Fundamentals with Pandas in Jupyter Notebooks |
11:00-11:20 | Coffee Break |
11:20-13:00 | 3. Interactive plotting with R in Jupyter Notebooks |
13:00-14:00 | Lunch |
14:00-15:00 | 4. Will it reproduce? Revisiting a publication authored with Jupyter Notebooks |
15:00-15:20 | Coffee Break |
16:00-16:45 | 5. Integrative Genomics Viewer on Lifebit CloudOS |
16:50-17:00 | Closing Remarks |
After you have forked the repository to `<USER_aka_you>/jax-jupyter`, click the clone button and then click `SSH` to change the URL.
Installing packages
- with `pip`
- with `conda`
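When installing from inside a notebook, it helps to make sure the package lands in the same environment the kernel is running in. The sketch below is a minimal illustration and not part of the workshop material: the helper names `is_installed` and `pip_install_command` are hypothetical, and the install command is only built, never executed.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if a top-level module can be imported by the current kernel."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package):
    """Build a pip command tied to the interpreter running this kernel.

    Calling pip as `sys.executable -m pip` avoids installing into a
    different Python environment than the one the notebook is using.
    """
    return [sys.executable, "-m", "pip", "install", package]


# Only report an install command when the module is missing.
if not is_installed("pandas"):
    print("Would run:", " ".join(pip_install_command("pandas")))
```

To actually install the package, the returned list could be passed to `subprocess.run(...)`.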
To add conda channels on our machine, go to CloudOS and initialise a Jupyter Notebook session.
- Click `New analysis` > `Jupyter Session`
- Create a new Project by clicking `New` and name it `reproducible-paper`.
- Choose an instance, for example `m1.xlarge`, which has 4 CPUs and 15 GB RAM (it should be more than enough).
Once the instance has initialised, go to your Jupyter Notebook session and, in the top left corner, click `File` > `New` > `Terminal`.
This will open a terminal. Type the following commands to add conda channels, so that we can subsequently install libraries with conda:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
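The order of these commands matters, because `conda config --add` places each new channel at the top of the channel list, so the channel added last gets the highest priority. The following Python sketch only simulates that behaviour to make the resulting order explicit; `add_channel` is a hypothetical helper and not part of conda.

```python
def add_channel(channels, name):
    """Mimic `conda config --add channels NAME`: put NAME at the top of the list."""
    return [name] + [c for c in channels if c != name]


channels = []
# Same order as the three commands above.
for name in ["defaults", "bioconda", "conda-forge"]:
    channels = add_channel(channels, name)

print(channels)  # ['conda-forge', 'bioconda', 'defaults']
```

In the terminal, the real configuration can be inspected with `conda config --show channels`.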
WIP: Create markdown table for the most handy ones, here is a good place to start: http://maxmelnick.com/2016/04/19/python-beginner-tips-and-tricks.html
Pandas is an excellent Python library that simplifies basic data handling tasks for dataframes. In this part, we will explore a well-known dataset, `iris`, and we will learn how to perform basic data cleaning and handling operations using pandas utilities along with Python built-in functions. We will use pandas to read tabular files into a dataframe, inspect the dimensions, variable values and number of observations, create subsets of our data based on column and row filtering criteria, and retrieve summary statistics for our dataset. By the end of this part, we will have created a basic exploratory data analysis workflow for performing a minimal data quality control routine.
Basic built-in functions, methods and attributes that we will use:
- `type()`
- `.describe()`
- `.shape`
- `pd.DataFrame()`
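As a rough preview of how these fit together, here is a minimal sketch. It assumes pandas is installed and uses a small hand-typed excerpt of six iris rows instead of the full dataset; the column names and variable names are illustrative choices, not necessarily the ones used in the workshop notebook.

```python
import pandas as pd

# Six rows of iris (two per species), typed in by hand so that the
# example runs without reading a file. Column names are assumptions.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
]
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris = pd.DataFrame(rows, columns=columns)

print(type(iris))   # the class of the object we just created
print(iris.shape)   # (rows, columns); an attribute, so no parentheses

# Summary statistics for the numeric columns.
summary = iris.describe()

# Row filtering: keep only one species.
setosa = iris[iris["species"] == "setosa"]

# Row and column filtering combined.
long_petals = iris.loc[iris["petal_length"] > 4.6, ["species", "petal_length"]]
```

With a real file, the hand-typed rows would be replaced by a call such as `pd.read_csv(...)` pointing at the tabular file.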
For the second part of the workshop, we will switch to R to explore the functionality of `ggplot2` and related R packages for publication-ready plots. We will first create a suitable input table with helper variables to facilitate the plotting. We will then explore the ggplot layers and customise different aesthetics. Finally, we will add interactivity to our plots by using the `plotly` library. We will also explore two extremely handy R packages, `ggpubr`
and
What does it take to submit a fully reproducible research piece? Notebooks, such as Jupyter Notebooks or R Markdown documents, have been proposed as a promising way to bundle all the supplementary data of a published scientific analysis into a single reproducible report. Enabling other researchers to reproduce our published work opens up the possibility of post-publication peer review. During the two-day hackathon at Open Data Day 2017, hosted by SPARC and the NIH, many contributors from the open source community, including Daniel Mietchen, started an attempt to analyse all Jupyter Notebooks mentioned in PubMed Central. Thanks to this initiative, a collection of metadata for publications that include Jupyter Notebooks was created. At the time, approximately 100 publications from EuropePMC that included Jupyter Notebooks to support the reproducibility of their analysis were documented. Our goal for this part of the workshop is to select a publication with a Jupyter Notebook and try to reproduce the analysis. In doing so, we aim to identify the minimal requirements of a reproducible workflow and, hopefully, to explore the literate programming workflow of the publication authors. The repository where this effort was coordinated can be found here: https://github.com/sparcopen/open-research-doathon. A relevant JupyterCon talk by Daniel Mietchen can also be found here: https://youtu.be/Via7gBrjxHI
Instructions:
Go to CloudOS and initialise a Jupyter Notebook session:
- Click `New analysis` > `Jupyter Session`
- Create a new Project by clicking `New` and name it `reproducible-paper`.
- Choose an instance, for example `m1.xlarge`, which has 4 CPUs and 15 GB RAM (it should be more than enough).
We don't need to select data because we are going to bring code and data from GitHub.
Go to `RainCloudPlots/RainCloudPlots` and click the clone button. Copy the link to the repository; we will need it to clone the repository.
a) Go back to your Jupyter Notebook session and, in the top left corner, click `File` > `New` > `Terminal`. In the terminal, clone the repository with `git clone` followed by the link you copied.
b) Navigate to the `RainCloudPlots/tutorial_python/` folder in your Jupyter Lab session.
c) Find and click `raincloud_tutorial_python.ipynb` to launch the Jupyter Notebook.
We are now ready to start reproducing the analysis, by running each cell of the Notebook.
Open an issue and report the problems we have noticed.
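When collecting problems for the issue, it can help to list which cells failed. A saved `.ipynb` file is a JSON document, and in the version 4 notebook format a failed code cell stores an output with `output_type` set to `error`. The sketch below relies on that; the helper names `load_notebook` and `failed_cells` and the tiny in-memory notebook are hypothetical and only serve as an illustration.

```python
import json


def load_notebook(path):
    """Read a saved .ipynb file into a plain dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def failed_cells(notebook):
    """Return (cell index, error name) for each error output found in code cells."""
    failures = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                failures.append((index, output.get("ename", "unknown")))
    return failures


# A tiny made-up notebook: one markdown cell, one code cell that ran
# cleanly and one code cell that failed on a missing module.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Tutorial"]},
        {"cell_type": "code", "source": ["x = 1"], "outputs": []},
        {
            "cell_type": "code",
            "source": ["import some_missing_module"],
            "outputs": [
                {
                    "output_type": "error",
                    "ename": "ModuleNotFoundError",
                    "evalue": "No module named 'some_missing_module'",
                    "traceback": [],
                }
            ],
        },
    ]
}

for index, error_name in failed_cells(example):
    print(f"Cell {index} failed with {error_name}")
```

For the notebook from this part, `failed_cells(load_notebook("raincloud_tutorial_python.ipynb"))` would give the same kind of list, provided the notebook was saved after running it.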
Moving on to the last part of our tutorial, we will learn how to use the Integrative Genomics Viewer (IGV) on the CloudOS platform and easily connect data from your cloud provider. Unlike the usual IGV setup, IGV on CloudOS has no prerequisites or installation steps, and there is no need to copy data. To access data, we will simply link it from our cloud storage provider and continue with the IGV analysis.