README

Keith Baggerly 2018-11-12

Overview
Brief Results
Running the Analysis
Required Libraries

Overview

We want to illustrate assembly of a reproducible analysis using a dataset we care about. Our workflow closely follows that of Jenny Bryan’s packages-report-EXAMPLE on GitHub.

Several years ago, Potti et al claimed to have found a way to use microarray profiles of a specific panel of cell lines (the NCI60) to predict cancer patient response to chemotherapeutics from a similar profile of the patient’s tumor. Using different subsets of cell lines, they made predictions for several different drugs. We wanted to apply their method, so we asked them to send us lists of which cell lines were used to make predictions for which drugs. The method doesn’t work; we describe our full analyses here.

The first dataset we received from Potti et al didn’t have the cell lines labeled. We want to identify where the numbers came from and check whether there were any oddities that should have raised red flags early on. To do this, we examine three datasets:

array data we got from Potti et al
array data for the NCI60 cited as the source for the predictors
array data for two GEO datasets (GSE349, GSE350) cited as a validation set for the docetaxel signature

Brief Results

01_gather_raw_data downloads the raw datasets used from the web.
02_parse_potti_data reorganizes array data we got from Potti et al for later use and runs some basic exploratory data analyses (EDA). Sample correlations show all samples for Cytoxan and Docetaxel are very different from everything else. The minimum values for these columns are all 5.89822, indicating thresholding.
03_parse_nci60_data reorganizes the NCI60 array data we obtained from the web and runs some basic EDA. Most of the 59 cell lines were profiled in triplicate; one was run twice and four were run four times.
04_parse_geo_data reorganizes the GEO array data we obtained from the web and runs some basic EDA. The minimum values for these arrays are all 5.89822, indicating thresholding.
05_identify_potti_sources checks for matches across datasets. We find perfect matches in the NCI60 data for all non-Cytoxan/Docetaxel Potti samples; in all cases the first (“A”) replicate was used. We find perfect matches in the GEO data for the other (Cytoxan and Docetaxel) Potti samples when matching on row order rather than on probeset id. Because the rows were not matched by probeset id, these values are effectively scrambled.
06_report_potti_data_sources summarizes the results of our analyses.
90_kludges_and_warnings notes “toc” doesn’t currently work with github_document and discusses our workaround.
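
The two checks described above (a shared minimum value as a sign of thresholding, and matching by row order versus by probeset id) can be sketched in base R. This is a minimal illustration on toy matrices, not the code in the numbered reports; the helper names `has_common_floor`, `match_by_row_order`, and `match_by_probeset` and all of the toy data are assumptions made for the example.

```r
# Toy reference data (hypothetical): rows are probesets, columns are arrays.
reference <- matrix(sin(1:40) + 8, nrow = 10,
                    dimnames = list(paste0("probe", 1:10), paste0("ref", 1:4)))

# Toy thresholded data: values below the floor are raised to the floor.
floor_value <- 5.89822
thresholded <- pmax(matrix(seq(5, 9, length.out = 30), nrow = 10, byrow = TRUE),
                    floor_value)

# Hypothetical helper: TRUE when every column has the same minimum,
# which is what a shared floor (thresholding) would produce.
has_common_floor <- function(x) {
  length(unique(apply(x, 2, min))) == 1
}

# Hypothetical helper: names of the reference columns equal to `query`
# when compared position by position (row order, ignoring probeset ids).
match_by_row_order <- function(query, reference) {
  same <- apply(reference, 2, function(col) {
    isTRUE(all.equal(unname(col), unname(query)))
  })
  colnames(reference)[same]
}

# Hypothetical helper: the same comparison after aligning rows on probeset id.
# Assumes `query` is a vector named by probeset id.
match_by_probeset <- function(query, reference) {
  match_by_row_order(query[rownames(reference)], reference)
}

# A query holding the values of column "ref3" in row order,
# but labeled with probeset ids in a different order.
query <- reference[, "ref3"]
names(query) <- rev(rownames(reference))

print(has_common_floor(thresholded))
print(match_by_row_order(query, reference))
print(match_by_probeset(query, reference))
```

In this toy case the positional comparison finds the source column while the id-aligned comparison finds nothing, which is the pattern a row-order match with mismatched probeset ids would show.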

Running the Analysis

Roughly, our analyses involve running the R and Rmd files in the R directory in the order they appear.

Run R/95_make_clean.R to clear out any downstream products.

Run R/99_make_all.R to re-run the analysis from beginning to end, including generating this README.

Raw data from the web is stored in the data directory.

Reports and interim results are stored in the results directory.
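
As a rough illustration of "running the files in order", the sketch below builds an ordered run list from numeric file prefixes and dispatches on file type. It is not the contents of R/99_make_all.R; the helper names `list_analysis_steps` and `run_step`, the two-digit prefix pattern, and the placeholder file names and extensions in the demonstration are assumptions.

```r
# Hypothetical helper: list the numbered analysis files under `dir` in run
# order, skipping the housekeeping scripts (prefixes 95 and 99).
list_analysis_steps <- function(dir = "R") {
  files <- list.files(dir, pattern = "^[0-9]{2}_.*\\.(R|Rmd)$", full.names = TRUE)
  files <- files[!grepl("^(95|99)_", basename(files))]
  files[order(basename(files))]
}

# Hypothetical helper: run one step. R scripts are sourced in a fresh
# environment; Rmd files are rendered, which assumes rmarkdown is installed.
run_step <- function(path) {
  if (grepl("\\.Rmd$", path)) {
    rmarkdown::render(path, quiet = TRUE)
  } else {
    source(path, local = new.env())
  }
  invisible(path)
}

# Demonstration on a throwaway directory holding empty placeholder files.
demo_dir <- file.path(tempdir(), "demo_R")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- c("02_parse_potti_data.Rmd", "01_gather_raw_data.R",
                "95_make_clean.R", "99_make_all.R")
invisible(file.create(file.path(demo_dir, demo_files)))

steps <- basename(list_analysis_steps(demo_dir))
print(steps)
```

With a real project, something like `for (f in list_analysis_steps("R")) run_step(f)` would execute the steps in sequence; sourcing each script in its own environment keeps one step's objects from leaking into the next.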

Required Libraries

These analyses were performed in RStudio 1.2.1114 using R version 3.5.1 (2018-07-02), and use (in alphabetical order):

downloader 0.4
dplyr 0.7.6
fs 1.2.6
here 0.1
magrittr 1.5
readr 1.1.1
rmarkdown 1.10
tibble 1.4.2
tidyr 0.8.1
tools 3.5.1

Many of these packages (dplyr, magrittr, readr, tibble, tidyr) are part of the tidyverse, and I generally just load that instead:

tidyverse 1.2.1
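
To compare a local setup against the versions listed above, a check along the following lines may help. It is a sketch using base R only; the helper name `check_versions` and the "at least the listed version" criterion are assumptions, and newer package versions may still behave differently from those used here.

```r
# Versions as listed in this README.
required <- c(downloader = "0.4", dplyr = "0.7.6", fs = "1.2.6", here = "0.1",
              magrittr = "1.5", readr = "1.1.1", rmarkdown = "1.10",
              tibble = "1.4.2", tidyr = "0.8.1", tools = "3.5.1")

# Hypothetical helper: report the installed version of each package (NA if
# missing) and whether it is at least the listed version.
check_versions <- function(required) {
  installed <- vapply(names(required), function(pkg) {
    if (requireNamespace(pkg, quietly = TRUE)) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
  ok <- vapply(seq_along(required), function(i) {
    !is.na(installed[[i]]) &&
      package_version(installed[[i]]) >= package_version(required[[i]])
  }, logical(1))
  data.frame(package = names(required), listed = unname(required),
             installed = unname(installed), at_least_listed = ok,
             stringsAsFactors = FALSE)
}

report <- check_versions(required)
print(report)
```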
