example_analysis's Introduction

README

Keith Baggerly 2018-11-12

Overview

We want to illustrate assembly of a reproducible analysis using a dataset we care about. Our workflow closely follows that of Jenny Bryan’s packages-report-EXAMPLE on GitHub.

Several years ago, Potti et al. claimed to have found a way to use microarray profiles of a specific panel of cell lines (the NCI60) to predict cancer patient response to chemotherapeutics from a similar profile of the patient’s tumor. Using different subsets of cell lines, they made predictions for several different drugs. We wanted to apply their method, so we asked them to send us lists of which cell lines were used to make predictions for which drugs. The method doesn’t work; we describe our full analyses here.

The first dataset we received from Potti et al. didn’t have the cell lines labeled. We want to identify where the numbers came from and check for any oddities that should have raised red flags early on. To do this, we examine three datasets:

  • array data we got from Potti et al.
  • array data for the NCI60 cited as the source for the predictors
  • array data for two GEO datasets (GSE349, GSE350) cited as a validation set for the docetaxel signature

Brief Results

  • 01_gather_raw_data downloads the raw datasets used from the web.
  • 02_parse_potti_data reorganizes the array data we got from Potti et al. for later use and runs some basic exploratory data analyses (EDA). Sample correlations show that all samples for Cytoxan and Docetaxel are very different from everything else. The minimum values for these columns are all 5.89822, indicating thresholding.
  • 03_parse_nci60_data reorganizes the NCI60 array data we obtained from the web and runs some basic EDA. Most of the 59 cell lines were profiled in triplicate; one was run twice and four were run four times.
  • 04_parse_geo_data reorganizes the GEO array data we obtained from the web and runs some basic EDA. The minimum values for these arrays are all 5.89822, indicating thresholding.
  • 05_identify_potti_sources checks for matches across datasets. We find perfect matches in the NCI60 data for all non-Cytoxan/Docetaxel Potti samples; in all cases the first (“A”) replicate was used. For the remaining (Cytoxan and Docetaxel) Potti samples, we find perfect matches in the GEO data, but only when matching on row order. Because the rows were not matched by probeset ID, these values are effectively scrambled.
  • 06_report_potti_data_sources summarizes the results of our analyses.
  • 90_kludges_and_warnings notes that the “toc” option doesn’t currently work with github_document and discusses our workaround.
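
Two of the checks described in the list above are easy to express in base R. What follows is a minimal sketch on made-up toy matrices, not the code from the numbered scripts; `shares_floor()` and `match_columns()` are hypothetical helper names, and the numbers are illustrative only. A shared column minimum is suggestive of thresholding rather than proof of it.

```r
# Toy "reference" data: rows are probes, columns are samples (made-up values).
reference <- matrix(
  c(7.1, 6.2, 8.3, 9.4,
    6.5, 7.7, 8.8, 6.9,
    9.0, 8.1, 7.2, 6.3),
  nrow = 4,
  dimnames = list(paste0("probe", 1:4), c("lineA", "lineB", "lineC"))
)

# Toy "received" data: two reference columns, reordered and relabeled.
received <- reference[, c("lineC", "lineA")]
colnames(received) <- c("s1", "s2")

# Toy thresholded data: every value below 6.6 is raised to 6.6.
floored <- pmax(reference, 6.6)

# TRUE if every column has the same minimum, which hints at a common floor.
shares_floor <- function(x) {
  col_mins <- apply(x, 2, min)
  length(unique(col_mins)) == 1
}

# For each column of `query`, return the name of the first column of
# `reference` that agrees with it row by row (within all.equal's default
# tolerance), or NA if no column agrees. Rows are compared by position,
# not by row name.
match_columns <- function(query, reference) {
  apply(query, 2, function(x) {
    hit <- which(apply(reference, 2, function(y) {
      isTRUE(all.equal(unname(x), unname(y)))
    }))
    if (length(hit) == 0) NA_character_ else colnames(reference)[hit[1]]
  })
}

shares_floor(floored)               # all three toy columns bottom out at 6.6
match_columns(received, reference)  # s1 pairs with lineC, s2 with lineA
```

Because `match_columns()` compares rows by position, it would also report a match for data pasted in the wrong row order, which is the situation described for the GEO-derived samples.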

Running the Analysis

Roughly, our analyses involve running the R and Rmd files in the R directory in the order they appear.

Run R/95_make_clean.R to clear out any downstream products.

Run R/99_make_all.R to re-run the analysis from beginning to end, including generating this README.
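
As a rough illustration of what "run the files in order" can look like, here is a minimal sketch. It is not the contents of R/99_make_all.R; `ordered_steps()` and `run_steps()` are hypothetical helpers, and the assumptions that every step file starts with a two-digit prefix and that files numbered 95 and above are driver scripts to skip are ours.

```r
# List the numbered step files in `dir` in the order they should run.
# Assumption: step files are named NN_something.R or NN_something.Rmd,
# and prefixes of 95 or more (make_clean, make_all) are drivers, not steps.
ordered_steps <- function(dir) {
  files <- list.files(dir, pattern = "^[0-9]{2}_.*\\.(R|Rmd)$")
  step <- as.integer(substr(files, 1, 2))
  sort(files[step < 95])
}

# Run each step: render Rmd files with rmarkdown, source plain R scripts.
run_steps <- function(dir) {
  for (f in file.path(dir, ordered_steps(dir))) {
    message("Running ", f)
    if (grepl("\\.Rmd$", f)) {
      rmarkdown::render(f)
    } else {
      source(f, local = new.env())  # keep each script's objects separate
    }
  }
  invisible(NULL)
}

# Example call from the project root (not run here):
# run_steps("R")
```

Sourcing each script in a fresh environment is one way to keep a step from silently depending on objects left behind by an earlier one.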

Raw data from the web is stored in the data directory.

Reports and interim results are stored in the results directory.

Required Libraries

These analyses were performed in RStudio 1.2.1114 using R version 3.5.1 (2018-07-02), and use (in alphabetical order):

  • downloader 0.4
  • dplyr 0.7.6
  • fs 1.2.6
  • here 0.1
  • magrittr 1.5
  • readr 1.1.1
  • rmarkdown 1.10
  • tibble 1.4.2
  • tidyr 0.8.1
  • tools 3.5.1

Many of these packages (dplyr, magrittr, readr, tibble, tidyr) are part of the tidyverse, which I generally just load instead:

  • tidyverse 1.2.1
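
To compare a local setup against the versions listed above, something like the following sketch may help. `check_versions()` is a hypothetical helper, not part of the repository, and treating "installed version at least as new as the listed one" as acceptable is an assumption; newer package versions can still change behavior.

```r
# Versions listed in this README (tools ships with R, so it is omitted).
required <- c(
  downloader = "0.4", dplyr = "0.7.6", fs = "1.2.6", here = "0.1",
  magrittr = "1.5", readr = "1.1.1", rmarkdown = "1.10",
  tibble = "1.4.2", tidyr = "0.8.1"
)

# Report, for each package, the listed version, the installed version
# (NA if the package is not installed), and whether the installed version
# is at least the listed one.
check_versions <- function(required) {
  installed <- vapply(names(required), function(pkg) {
    if (requireNamespace(pkg, quietly = TRUE)) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
  ok <- mapply(function(have, want) {
    !is.na(have) && utils::compareVersion(have, want) >= 0
  }, installed, required)
  data.frame(
    package = names(required),
    listed = unname(required),
    installed = unname(installed),
    ok = unname(ok),
    stringsAsFactors = FALSE
  )
}

print(check_versions(required))
```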
