compgenomr / book Goto Github PK


Home Page: http://compgenomr.github.io/book

R 0.37% HTML 0.07% CSS 0.88% TeX 98.39% Shell 0.29%


book's Issues

Pdf is not available

Pdf is not available when I clicked the rightmost icon.

expression data section 4.1.1

I cannot find the code used to create 'df' (table for expression data of leukemia patients) in clustering section 4.1.1.

Typos & wording

5.1 Enumeration sometimes has periods at the end, sometimes not:

Define a prediction function or method f(X)
Devise a function (called loss or cost function) to optimize the difference between your predictions and observed values, such as ∑(Y−f(X))²
Apply mathematical optimization methods to find best parameter values for f(X) in relation to the cost/loss function.

5.1.1 "However, while doing so[, the] field of statistics developed..."
5.2 Enumeration sometimes has periods at the end, sometimes not.
"algorithm becomes relevant.[ ]“Training” generally"
5.4.2: library(caret) is loaded too late/inconsistently.
5.4.3: "Removing genes or samples have both downsides." Please ask a native English speaker. Recommendation:
"Both removing genes and removing samples have downsides."
5.5.1: "For starters, we will split the 30% of the data as test." Which the?
5.7: "Accuracy is the first metric to look at. This metric is is simply..." Double is.
# gte k-NN prediction on the training data itself, with k=5. gte?
5.12 "Another variable we can tune is the minimum node size of terminal nodes in the trees (min.node.size). This controls the depth of the trees grown. Setting this to larger numbers might cost a small loss in accuracy but the algorithm will run faster." Shouldn't it mean smaller numbers?

problems on R 4.1.0 on a DELL laptop

Text was 'cut and pasted' from electronic version of book at https://compgenomr.github.io/book/

A small amount of follow-up was carried out to see if there was a simple explanation or workaround, but no attempt was made to go much beyond what someone fairly new to R might achieve.

p182

# fit logistic regression model
# method and family defines the type of regression
# in this case these arguments mean that we are doing logistic
# regression
lrFit = train(subtype ~ PDPN,
           data=training, trControl=trainControl("none"),
           method="glm", family="binomial")

Error in eval(predvars, data, env) : object 'PDPN' not found

Strangely, while this did not work with PDPN, it did work with other genes, e.g. CBLN1 and DDX3Y.
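One way to narrow this down is to check whether the gene is still a column of `training` before calling `train()`, since an earlier filtering step may have dropped it. The sketch below is only an illustration: the data frame, its values, and the helper `missing_predictors` are made up, and base R `glm()` stands in for caret.

```r
# Toy stand-in for the book's `training` data frame (values are made up).
set.seed(1)
training <- data.frame(
  subtype = factor(rep(c("CIMP", "noCIMP"), each = 10)),
  CBLN1   = rnorm(20),
  DDX3Y   = rnorm(20)
)

# Hypothetical helper: report which requested predictors are missing.
missing_predictors <- function(data, predictors) {
  setdiff(predictors, colnames(data))
}

missing_predictors(training, c("PDPN", "CBLN1"))

# Fit only when the predictor is present, so the failure is explicit.
gene <- "CBLN1"
if (length(missing_predictors(training, gene)) == 0) {
  fit <- glm(reformulate(gene, response = "subtype"),
             data = training, family = "binomial")
}
```

If the gene is reported as missing, the preprocessing steps that precede the model fit are the place to look.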

P209

require(rtracklayer)
session <- browserSession("UCSC",url = 'http://genome-euro.ucsc.edu/cgi-bin/')
genome(session) <- "mm9"
# choose CpG island track on chr12
query <- ucscTableQuery(session, track="CpG Islands",table="cpgIslandExt",
    range=GRangesForUCSCGenome("mm9", "chr12"))

Error in GRangesForGenome(genome, chrom = chrom, ranges = ranges, method = "UCSC", :
  Failed to obtain information for genome 'mm9'

# get the GRanges object for the track
track(query)

Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'object' in selecting a method for function 'track': object 'query' not found

P211

library(genomation)

Warning message:
replacing previous import ‘Biostrings::pattern’ by ‘grid::pattern’ when loading ‘genomation’

filePathPeaks=system.file("extdata",
          "wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz",
                  package="compGenomRData")
# read the peaks from a bed file
pk1.gr=readBroadPeak(filePathPeaks)

Error: No such process

# get the peaks that overlap with CpG islands
subsetByOverlaps(pk1.gr,cpgi.gr)

Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'subsetByOverlaps': object 'pk1.gr' not found
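`system.file()` returns an empty string when the package or file cannot be found, so one thing worth ruling out is a missing compGenomRData installation. The sketch below assumes nothing about the actual cause of the error above; the package name `notAnInstalledPackage` and the helper `check_path` are placeholders.

```r
# system.file() returns "" when the package or file does not exist.
missingPath <- system.file("extdata", "some.file.gz",
                           package = "notAnInstalledPackage")
nzchar(missingPath)   # FALSE: nothing was found

# Hypothetical guard to run before handing a path to a reader function.
check_path <- function(path) {
  if (!nzchar(path) || !file.exists(path)) {
    stop("file not found; is the data package installed?")
  }
  invisible(path)
}

# A file that certainly exists: the DESCRIPTION of the base package.
okPath <- system.file("DESCRIPTION", package = "base")
check_path(okPath)
```

Running the same guard on `filePathPeaks` before `readBroadPeak()` would separate a path problem from a problem in the reader itself.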

P217

library(rtracklayer)
# File from ENCODE ChIP-seq tracks
bwFile=system.file("extdata","wgEncodeHaibTfbsA549.chr21.bw",package="compGenomRData")
bw.gr=import(bwFile, which=promoter.gr) # get coverage vectors

Error in .local(con, format, text, ...) : UCSC library operation failed
In addition: Warning message:
In .local(con, format, text, ...) : Invalid argument
lseek(3, 844957, invalid 'whence' value (1822621639)) failed

Leading to subsequent errors in rest of section

P225

gene.track <- BiomartGeneRegionTrack(genome = "hg19",
                                chromosome = "chr21",
                                start = 27698681, end = 28083310,
                                name = "ENSEMBL")

Error in gzfile(file, mode) : cannot open the connection

Leading to subsequent errors in rest of section

P239

library(Rqc)
folder = system.file(package="ShortRead", "extdata/E-MTAB-1147")
# feeds fastq.qz files in "folder" to quality check function
qcRes=rqc(path = folder, pattern = ".fastq.gz", openBrowser=FALSE)

Error in file(file, ifelse(append, "a", "w")) :
  cannot open the connection
In addition: Warning messages:
1: In normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="C:\Users\david\AppData\Local\Temp\Rtmpg7tnGG": The system cannot find the file specified
2: In (function (filename = if (onefile) "Rplots.svg" else "Rplot%03d.svg", :
  cairo error 'error while writing to output stream'
3: In file(file, ifelse(append, "a", "w")) :
  cannot open file 'C:\Users\david\AppData\Local\Temp\Rtmpg7tnGG/rqc_report.md': No such file or directory

rqcCycleQualityBoxPlot(qcRes)

Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'perCycleQuality': object 'qcRes' not found

Leading to subsequent errors in rest of section
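The warnings above suggest that the session's temporary directory (Rtmpg7tnGG) no longer existed when rqc() tried to write its report. That is a guess from the log, not a confirmed diagnosis. A base R sketch for checking and recreating the directory follows; `tempdir(check = TRUE)` is available from R 3.5.0, and the variable names are illustrative.

```r
# Does the per-session temporary directory still exist?
dir.exists(tempdir())

# Recreate it if it is no longer valid (for example after a disk cleanup).
tmp <- tempdir(check = TRUE)

# Confirm that files can be written there again, then clean up.
probe <- tempfile(tmpdir = tmp, fileext = ".md")
writeLines("test", probe)
file.exists(probe)
unlink(probe)
```

The same directory name appears in the install.packages() warnings further down, so restarting R or recreating the directory may be worth trying before rerunning either step.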

P243

install.packages("astqcr")

Installing package into 'C:/Users/david/Documents/R/win-library/4.1'
(as 'lib' is unspecified)
Warning: unable to access index for repository https://cran.ma.imperial.ac.uk/src/contrib:
  cannot open destfile 'C:\Users\david\AppData\Local\Temp\Rtmpg7tnGG\file1eb47b7e36f1', reason 'No such file or directory'
Warning: unable to access index for repository https://cran.ma.imperial.ac.uk/bin/windows/contrib/4.1:
  cannot open destfile 'C:\Users\david\AppData\Local\Temp\Rtmpg7tnGG\file1eb47c802d57', reason 'No such file or directory'
Warning message:
package 'astqcr' is not available for this version of R

p245

# write out fastq file with only reads where all
# quality scores per base are above 20.
writeFastq(fq[qcount == 0],
       paste(fastqFile, "Qfiltered", sep="_"))

Error: UserArgumentMismatch

P270

plotPCA(countsNormalized[selectedGenes,],
    col = as.numeric(colData$group), adj = 0.5,
    xlim = c(-0.5, 0.5), ylim = c(-0.5, 0.6))

Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function 'plotPCA' for signature '"matrix"'
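The error says that the plotPCA generic visible in the session has no method for a plain matrix; the report does not show which package supplied that generic. As a package-independent alternative (not the book's method), a PCA of the samples can be sketched with base R `prcomp()`. The matrix `counts` and the factor `group` below are made-up stand-ins for countsNormalized[selectedGenes,] and colData$group.

```r
set.seed(42)
# Made-up normalized counts: 100 genes (rows) x 6 samples (columns).
counts <- matrix(rpois(600, lambda = 50), nrow = 100,
                 dimnames = list(paste0("gene", 1:100),
                                 paste0("sample", 1:6)))
group <- factor(rep(c("CASE", "CTRL"), each = 3))

# PCA on samples: transpose so that samples are rows, log-transform counts.
pca <- prcomp(t(log2(counts + 1)), center = TRUE)

# Scatter plot of the first two principal components, colored by group.
plot(pca$x[, 1:2], col = as.numeric(group), pch = 19,
     xlab = "PC1", ylab = "PC2")
text(pca$x[, 1:2], labels = colnames(counts), pos = 3)
```

This will not reproduce the book's figure exactly, but it gives a comparable view of how the samples separate.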

Figure 11.9 is mis-attributed

FIGURE 11.9: A heatmap of NMF factors shows the separability of tumors into subtype clusters. This plot is more useful than a scatter plot when there are more than two factors.

This figure is misattributed as per the code shown

Screenshot attached

Typos on chapter 3

Hi,
I just want to report typos you may have missed:
Chapter 3 > 3.1.2 Describing the spread: measurements of variation, in the probability section.
You have written:

In this case, what we want is the are under the curve shaded in blue. To be able to that we need to integrate the probability density function but we will usually let

And then in the following paragraph:

After calculating the Z-score, we can go look up in a table, that contains the area under the curve for the left and right side of the Z-score, but again we use software for that tables are outdated.

Thank you so so much for such useful content!

Matching corrplot and pheatmap in Section 8.3.6.3

I came across your bookdown book while searching for how to display a correlation matrix with a hierarchical clustering tree. I noticed that your corrplot(correlationMatrix, order = 'hclust', addrect = 2) plot doesn't match the pheatmap below it in terms of variable order and clustering. This is because corrplot treats the correlation matrix as a distance matrix and runs hclust directly on it, whereas pheatmap treats the correlation matrix as an ordinary data set and recalculates a distance matrix before feeding it into hclust.

To make the two plots consistent with each other, I suggest adding two arguments (clustering_distance_rows and clustering_distance_cols) to the pheatmap call. These tell pheatmap to derive the distance matrix from the current correlation matrix. The 1 - ensures that a perfect positive correlation (1) is treated as the minimum distance and a perfect negative correlation (-1) as the maximum distance.

pheatmap(correlationMatrix, 
         clustering_distance_rows = as.dist(1 - correlationMatrix), 
         clustering_distance_cols = as.dist(1 - correlationMatrix))
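The difference between the two clustering inputs can be reproduced with base R alone. The sketch below uses `mtcars` as stand-in data and calls `hclust()` directly rather than corrplot or pheatmap, so it illustrates only the choice of distance, assuming both functions hand their distances to `hclust()` as described above.

```r
# Stand-in correlation matrix (the book's correlationMatrix is not used here).
cm <- cor(mtcars[, 1:7])

# Correlation-based distance: 1 - r, so r = 1 gives 0 and r = -1 gives 2.
hcCor <- hclust(as.dist(1 - cm))

# Euclidean distance between the rows of the correlation matrix,
# i.e. treating the matrix as an ordinary data set.
hcEuc <- hclust(dist(cm))

# Leaf orders of the two dendrograms; these are not guaranteed to agree.
rownames(cm)[hcCor$order]
rownames(cm)[hcEuc$order]
```

Comparing the two printed orders shows whether the distance choice alone changes the layout for a given matrix.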

Error in validObject(.Object) : invalid class "ScoreMatrix" object: superclass "mMatrix" not defined in the environment of the object's class

Hello

The code on lines 530-558 of 06-genomicIntervals.Rmd does not run successfully:

# get transcription start sites on chr20
library(genomation)
transcriptFile=system.file("extdata",
                      "refseq.hg19.chr20.bed",
                      package="compGenomRData")
feat=readTranscriptFeatures(transcriptFile,
                            remove.unusual = TRUE,
                            up.flank = 500, down.flank = 500)
prom=feat$promoters # get promoters from the features


# get for H3K4me3 values around TSSes
# we use strand.aware=TRUE so - strands will
# be reversed
H3K4me3File=system.file("extdata",
                      "H1.ESC.H3K4me3.chr20.bw",
                      package="compGenomRData")
sm=ScoreMatrix(H3K4me3File, prom,
               type="bigWig", strand.aware = TRUE)

Error in validObject(.Object) : 
  invalid class "ScoreMatrix" object: superclass "mMatrix" not defined in the environment of the object's class

How should I solve this?
Thanks

Missing license in repository root

While the license is clearly indicated on the landing page of the book, it is missing from the repository root. Please consider adding it there.

Download PDF version

I am trying to download the PDF version of the book, but I keep getting a 404 page saying

There isn't a GitHub Pages site here.

edit stats exercise question for clarity

Stats chapter:
How does the estimate from the random samples change if we simulate more data with data=matrix(rnorm(6000,mean=200,sd=70),ncol=6)

should be

How does the estimate from the random samples change if we simulate more data with data=matrix(rnorm(6000,mean=200,sd=70),ncol=6) keeping the number of samples per dataset constant, as n=6.
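To make the clarified wording concrete, here is a minimal sketch of the simulation. It takes the row mean as "the estimate", which is an assumption; the exercise may refer to a different statistic.

```r
set.seed(100)
# 1000 simulated datasets (rows), each with n = 6 samples (columns).
data <- matrix(rnorm(6000, mean = 200, sd = 70), ncol = 6)
dim(data)

# Per-dataset estimate of the mean, and its spread across datasets.
est <- rowMeans(data)
mean(est)   # close to the true mean of 200
sd(est)     # close to 70 / sqrt(6): set by n, not by the number of datasets
```

With ncol=6 fixed, simulating more values adds rows (datasets) rather than samples per dataset, which is the point the reworded question makes explicit.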

unsupervised learning chapter
reconstruction question should be:
Our next tasks are to remove eigenvectors and reconstruct the matrix using SVD, then calculate the reconstruction error as the difference between original and reconstructed matrix. Remove a few eigenvectors, reconstruct the matrix and calculate the reconstruction error. Reconstruction error can be euclidean distance between original and reconstructed matrices.
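A minimal base R sketch of the task described in this wording, using a made-up matrix rather than the chapter's data. The helper names `reconstruct` and `recon_error` are illustrative.

```r
set.seed(7)
# Made-up data matrix: 20 observations x 5 variables.
X <- matrix(rnorm(100), nrow = 20)
s <- svd(X)

# Reconstruct X from the first k singular vectors only.
reconstruct <- function(s, k) {
  s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k) %*%
    t(s$v[, 1:k, drop = FALSE])
}

# Euclidean (Frobenius) distance between original and reconstruction.
recon_error <- function(X, Xhat) sqrt(sum((X - Xhat)^2))

recon_error(X, reconstruct(s, 5))  # all components kept: numerically zero
recon_error(X, reconstruct(s, 3))  # last two removed
sqrt(sum(s$d[4:5]^2))              # same value, from the dropped singular values
```

The last two lines agree because the Euclidean reconstruction error equals the square root of the sum of the squared singular values that were dropped.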

used d() instead of fread(), change this to fread()

book/02-intro2R.Rmd

Line 359 in 08d3028

df.f=d(enhancerFilePath, header = FALSE,data.table=FALSE)
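For reference, a sketch of the same call with `fread()` from the data.table package (which must be installed). The file written below is a made-up stand-in for the book's enhancerFilePath.

```r
library(data.table)

# Made-up tab-separated file standing in for the book's enhancerFilePath.
enhancerFilePath <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t200", "chr2\t300\t400"), enhancerFilePath)

# fread() with data.table = FALSE returns a plain data.frame.
df.f <- fread(enhancerFilePath, header = FALSE, data.table = FALSE)
class(df.f)
```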

Unable to Open in PDF Format

Hello! I'm unable to open this in a PDF format, and instead have to access the book through the web interface. This is fine, but there is a PDF link/icon at the top right of the text, and it takes you to a 404 error.
