stemangiola / bioc_2020_tidytranscriptomics

Workshop on tidytranscriptomics: Performing tidy transcriptomics analyses with tidybulk, tidyverse and tidyheatmap

Home Page: https://stemangiola.github.io/bioc_2020_tidytranscriptomics/

License: Other

Dockerfile 27.66% R 72.34%
tidyverse differential-expression tidybulk heatmap pca mds transcriptomics tidy-data bioconductor r

bioc_2020_tidytranscriptomics's Introduction

A Tidy Transcriptomics introduction to RNA sequencing analyses

Instructor names and contact information

  • Maria Doyle <Maria.Doyle at petermac.org>
  • Stefano Mangiola <mangiola.s at wehi.edu.au>

Syllabus

Material webpage.

Video recording of the workshop.

This material was created for a Bioc2020 conference workshop but it can also be used for self-learning.

More details on the workshop are below.

Workshop package installation

This is necessary in order to reproduce the code shown in the workshop. The workshop is designed for R 4.0 and packages from the 3.12 devel branch of Bioconductor. The workshop package can be installed in one of the two ways below.

Via Docker image

If you're familiar with Docker you could use the Docker image which has all the software pre-configured to the correct versions.

docker run -e PASSWORD=abc -p 8787:8787 stemangiola/bioc_2020_tidytranscriptomics:bioc2020

Once running, navigate to http://localhost:8787/ and then log in with Username: rstudio and Password: abc.

You should see the Rmarkdown file with all the workshop code which you can run.

Via GitHub

Alternatively, you could install the workshop using the commands below in R 4.0.

devtools::install_github("stemangiola/[email protected]")
devtools::install_github("stemangiola/bioc_2020_tidytranscriptomics", build_vignettes = TRUE)
library(bioc2020tidytranscriptomics)
browseVignettes("bioc2020tidytranscriptomics")

To run the code, you could then copy and paste the code from the workshop R markdown file into a new R Markdown file on your computer.

Workshop Description

This workshop will present how to perform analysis of RNA sequencing data following the tidy data paradigm. The tidy data paradigm provides a standard way to organise data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions.
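
As a small illustration of the paradigm described above, the sketch below reshapes a made-up table of counts from a wide layout (one column per sample) into a tidy layout (one row per gene and sample), then summarises it with ordinary tidyverse verbs. The gene names, sample names and counts are invented for the example and are not workshop data.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical wide table: one row per gene, one column per sample
counts_wide <- tibble(
  gene    = c("geneA", "geneB", "geneC"),
  sample1 = c(10, 0, 250),
  sample2 = c(12, 3, 180)
)

# Tidy layout: each variable is a column, each observation
# (a gene measured in a sample) is a row
counts_tidy <- counts_wide %>%
  pivot_longer(cols = -gene, names_to = "sample", values_to = "count")

# The same vocabulary (group_by, summarise) then works on the tidy table,
# here to compute the total count per sample
library_sizes <- counts_tidy %>%
  group_by(sample) %>%
  summarise(total = sum(count))

print(counts_tidy)
print(library_sizes)
```

Because the output of each step is again a tidy table, further steps such as filtering or plotting can be chained on without changing the data structure.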

This can be achieved for RNA sequencing data with the tidybulk, tidyHeatmap and tidyverse packages. The tidybulk package provides a tidy data structure and a modular framework for bulk transcriptional analyses. tidyHeatmap provides a tidy implementation of ComplexHeatmap. These packages are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data.

The topics presented in this workshop will be

  • Data exploration
  • Data dimensionality reduction and clustering
  • Differential gene expression analysis
  • Data visualisation

Pre-requisites

  • Basic knowledge of RStudio
  • Familiarity with tidyverse syntax

Recommended background reading: Introduction to R for Biologists

Workshop Participation

The workshop format is a 55 min session consisting of a 30 min demo followed by a 25 min opportunity for attendees to try out the code and exercises and to ask questions.

R / Bioconductor packages used

  • tidyverse
  • tidybulk
  • tidyHeatmap
  • edgeR
  • ggrepel
  • airway

Time outline

Activity                                          Time
Demo                                              30m
  Introduction and Data preprocessing
  Data dimensionality reduction and clustering
  Differential gene expression
  Data visualisation
Try out code, Exercises, Q&A                      25m

Workshop goals and objectives

In exploring and analysing RNA sequencing data, there are a number of key concepts, such as filtering, scaling, dimensionality reduction, hypothesis testing, clustering and visualisation, that need to be understood. These concepts can be intuitively explained to new users. However, (i) the use of a heterogeneous vocabulary and jargon by methodologies/algorithms/packages, (ii) the complexity of data wrangling, and (iii) the coding burden impede effective learning of the statistics and biology underlying an informed RNA sequencing analysis.

The tidytranscriptomics approach to RNA sequencing data analysis abstracts out the coding-related complexity and provides tools that use an intuitive and jargon-free vocabulary, enabling focus on the statistical and biological challenges.

Learning goals

  • To understand the key concepts and steps of bulk RNA sequencing data analysis
  • To approach data representation and analysis through a tidy data paradigm, integrating tidyverse with tidybulk and tidyHeatmap.

Learning objectives

  • Recall the key concepts of RNA sequencing data analysis
  • Apply the concepts to publicly available data
  • Create plots that summarise the information content of the data and analysis results

bioc_2020_tidytranscriptomics's People

Contributors

  • mblue9
  • stemangiola

bioc_2020_tidytranscriptomics's Issues

Workshop TODOs

  • Update README [https://github.com//issues/13 ]
  • Add License
  • Add bibliography. Should make a bibliography file with refs like tidyverse, edgeR, 1-2-3 limma papers.
  • Maybe add in some comparative examples to show how tidy style can save on writing large code blocks
  • Improve exercises [ https://github.com//issues/12]
  • Could perhaps have an additional_Info.Rmd file in the vignettes folder with things we won't have time to include but that could be useful to show (e.g. how to use tidybulk if starting from tables of counts & sample information, bar plots, pathway/GO analyses...)
  • Shorten to 1 hr & update timings (have shortened it a bit but may need to adjust more)
  • Add image at beginning to show RNA-seq workflow & where this fits

str_replace not found

The test code you gave me

counts_de_pasilla <- tidyHeatmap::pasilla %>%
  mutate(type = str_replace(type, "-", "_")) %>%
  test_differential_abundance(
    .sample = sample,
    .transcript = symbol,
    .abundance = `count normalised adjusted`,
    .formula = ~ 0 + condition + type,
    .contrasts = c("conditiontreated - conditionuntreated")
  )

uses str_replace.

However, stringr is listed under neither Imports nor Suggests in DESCRIPTION.

scale_abundance no longer takes the factor_of_interest argument

When running the following code from the course notebook to scale counts:

counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = dex)

I get the following error:

Error in scale_abundance(., factor_of_interest = dex) : 
  unused argument (factor_of_interest = dex)

This does not occur when using the legacy versions on the docker image provided. Please could you advise whether there is a workaround for the scaling of counts with the up to date version of tidybulk? This would be useful for future projects! Thanks!

keep_variable vs base R

Heya, I just found I do not seem to get the same result with the base R and tidybulk code for the keep_variable genes e.g.

these gene ids

counts_scaled %>%
  # extract 500 most variable genes
  keep_variable(.abundance = counts_scaled, top = 500)

are not the same as these gene ids

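library(edgeR)   # SE2DGEList, filterByExpr, calcNormFactors, cpm
library(airway)
data(airway)

# convert to a DGEList and drop lowly expressed genes
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group=group)
dgList <- dgList[keep.exprs,, keep.lib.sizes=FALSE]
# normalisation factors, then log2 counts per million
dgList <- calcNormFactors(dgList)
logcounts <- cpm(dgList, log=TRUE)
# 500 genes with the highest variance
var_genes <- apply(logcounts, 1, var)
select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
highly_variable_lcpm <- logcounts[select_var,]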

and using your code from the tidybulk website (below) I get the same 500 genes as with base R, but again different from tidybulk

s <- rowMeans((logcounts-rowMeans(logcounts))^2)
o <- order(s,decreasing=TRUE)
x <- logcounts[o[1L:500],,drop=FALSE]

I thought it might be to do with the filtering so I tried adding in filter as below (I should probably add that filter into the workshop code) but I'm still getting about 23 genes different to the base R code

counts_scaled %>%
  filter(!lowly_abundant) %>%
  keep_variable(.abundance = counts_scaled, top = 500)

Do you know why that might be?
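
One way to narrow a discrepancy like this down is to list exactly which genes differ between two selections. The base R sketch below uses a simulated matrix (not the airway data) and a hypothetical helper, compare_sets(), written only for this illustration. It also checks that the two base R selections quoted above should agree: for a fixed number of samples, the sample variance and the mean squared deviation differ only by the constant factor n/(n-1), so they rank genes identically. It does not diagnose what keep_variable does internally.

```r
set.seed(42)
n_genes <- 200
n_samples <- 8
top <- 50

# Simulated log-expression matrix with a different spread per gene
gene_sd <- runif(n_genes, 0.2, 2)
logcounts <- matrix(
  rnorm(n_genes * n_samples, mean = 5, sd = rep(gene_sd, times = n_samples)),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("s", seq_len(n_samples)))
)

# Selection 1: sample variance per gene
var_genes <- apply(logcounts, 1, var)
top_var <- names(sort(var_genes, decreasing = TRUE))[1:top]

# Selection 2: mean squared deviation per gene
s <- rowMeans((logcounts - rowMeans(logcounts))^2)
top_msd <- rownames(logcounts)[order(s, decreasing = TRUE)[1:top]]

# Hypothetical helper: which ids are shared, and which are unique to each set
compare_sets <- function(a, b) {
  list(shared = intersect(a, b), only_a = setdiff(a, b), only_b = setdiff(b, a))
}

# The two base R definitions select the same genes
same_genes <- setequal(top_var, top_msd)
print(same_genes)

# A selection made on a different scale (here, unlogged values) can differ;
# the helper shows which genes account for the difference
var_unlogged <- apply(2^logcounts, 1, var)
top_unlogged <- names(sort(var_unlogged, decreasing = TRUE))[1:top]
cmp <- compare_sets(top_var, top_unlogged)
print(lengths(cmp))
```

Running the same comparison on the tidybulk and base R gene ids would show whether the differing genes sit near the cut-off at rank 500, which would point to small differences in the values being ranked, not to a different set of input genes.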

Improve exercises

As discussed we should see if we can improve the exercises (the current ones are not final just some to get started)

For Tip 1 we could aim for ~6 sets of exercises in total for a 2x50 min workshop (think that's the time we've got) where each set takes 1-2 mins. Maybe 2-3 exercises per set - 1 easy, others harder. For Tip 3 we should aim to have a variety of exercise types.

Tip 1: Use formative assessment every 10–15 minutes
Instructors always want to get through more material than time allows, so we often teach at the speed at which we can talk rather than the speed at which people can learn. Having learners do something every 10–15 minutes slows us down to the speed at which people can learn rather than the speed at which we can talk. It also keeps them engaged and gives us and them feedback on whether they have actually understood.

In-class checks like this are called formative assessments. Good ones take only a minute or two to complete so that they don't derail the flow of the lesson and have an unambiguous correct answer so that they can be checked in large classes. Popular kinds of formative assessment in programming classes include:

  • Answer a multiple choice question.
  • Write a few lines of code.
  • Predict what the code on the screen will do when it runs.
  • Contribute the next line of code.
  • Label a diagram of a data structure.
  • Trace the order in which statements are evaluated.

Starting with a formative assessment that reviews a previous lesson is a good way to signal that class has started, and having learners recall older material before tackling something new improves learning outcomes [11]. Similarly, ending the class with such an exercise gives learners a sense of how far they have progressed.

Some formative assessments should be designed in advance; in fact, they should be designed before the lesson content is written so that they can act as goalposts [6]. However, they can also be created on the fly to incorporate and respond to learners' questions and confusions. For example, after writing and presenting a few lines of code, an instructor can ask what would happen if something was added or modified. If learners make different predictions, the instructor can then ask them to debate the outcomes as a form of ad hoc peer instruction [7].

Tip 3: Use a variety of exercise types
The final rule in [7] said, "Don't just code," and it bears elaboration here. Most programming classes rely primarily on code-and-run exercises in which learners write software that behaves in a tightly specified way. To keep learners engaged and to give them opportunities to practice other skills and higher-level reasoning, you should also use the following:

  • Parsons Problems, which give them the lines of code needed to solve a problem but in jumbled order [12, 13, 14]. Parsons Problems reduce cognitive load by allowing learners to focus on control flow without simultaneously having to recall syntax.
  • Multiple choice questions whose wrong answers have been chosen to probe for specific misconceptions. For example, learners who have worked with spreadsheets may believe that after executing a = 10, b = a, and a = 20, the value of b will be 20.
  • Matching and ranking problems in which they match terms from one column to definitions in another, put predefined labels on a diagram, or sort items according to some criteria (e.g., most likely to least likely).
  • Debugging, completion, and extension exercises in which learners must fix, finish, or extend an existing program. These all model authentic tasks (i.e., the kinds of things programmers spend most of their time doing in real life).
  • Tracing execution order or tracing values, in which the learner lists the order in which the statements in a program are executed or the values that one or more variables take on as the program runs, which are essential program comprehension skills.
  • Code reviews in which learners score a program against a marking guide supplied by the instructor in order to learn how to find flaws in code. Learners start with a perfect score and lose points for false positives, as well as false negatives so that they don't simply mark every statement as being wrong in all possible ways.
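
The spreadsheet misconception used in the multiple choice example above could be turned into a short predict-then-run exercise for an R workshop. A minimal sketch, with the variable names taken from the example in the text:

```r
a <- 10
b <- a    # b receives the current value of a, not a live link to a
a <- 20   # reassigning a does not touch b

print(b)  # 10
```

Asking learners to predict the printed value before running the code takes well under a minute and has an unambiguous answer.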
