
long_read_processing_tutorial's People

Contributors

jordenrabasco


long_read_processing_tutorial's Issues

[Section] - Graph of ASVs over time

```{r Pacbio 11) Ecoli: over time Pacbio, message = FALSE, warning = FALSE}
# Look up per-sample metadata by sample name (rownames of df)
ecm$SampleOrder <- df[ecm$Sample, "SampleOrder"]
ecm$Timepoint <- paste("Timepoint", ecm$SampleOrder)
ecm$Subject <- df[ecm$Sample, "Subject"]
ecm$SubjectLabel <- paste("Subject", ecm$Subject)
ecmp <- ecm[ecm$Sample %in% c("R_3.1", "R_3.2", "R_3.3", "R_9.1", "R_9.1B", "R_9.2"),]
ecmp$SequenceVariant <- as.character(ecmp$SequenceVariant)
ecmp <- ecmp[ecmp$Abundance > 0,]
xx.R3 <- c(Ec1=1, Ec2=2, Ec3=3, Ec4=4, Ec5=5) # Strain 1
PAD <- 8 # Between-strain pad
xx.R9 <- c(Ec2=1+PAD, Ec12=2+PAD, Ec13=3+PAD, Ec14=4+PAD, # Strain 2
           Ec6=1+2*PAD, Ec7=2+2*PAD, Ec8=3+2*PAD, Ec9=4+2*PAD, Ec10=5+2*PAD, Ec11=6+2*PAD) # Strain 3
ecmp$X <- 0
is.R3 <- ecmp$Subject == "R3"; ecmp$X[is.R3] <- xx.R3[ecmp$SequenceVariant[is.R3]]
is.R9 <- ecmp$Subject == "R9"; ecmp$X[is.R9] <- xx.R9[ecmp$SequenceVariant[is.R9]]
# Force desired facet ymax limits with a custom data.frame
dflim <- data.frame(X=c(1, 1), Abundance=c(3000, 1500),
                    SequenceVariant=c("Ec1", "Ec1"),
                    SubjectLabel=c("Subject R3", "Subject R9"), Timepoint=c("Timepoint 1", "Timepoint 1"))
p.ecoli <- ggplot(data=ecmp, aes(x=X, y=Abundance, color=SequenceVariant)) + geom_point() +
  facet_grid(SubjectLabel~Timepoint, scales="free_y") +
  xlab("E. coli ASVs") + theme(axis.ticks.x=element_blank(), axis.text.x=element_blank()) +
  theme(panel.grid.major.x=element_blank(), panel.grid.minor.x=element_blank()) +
  geom_blank(data=dflim) +
  guides(color="none")
p.ecoli
```

Describing sample data

Here I talk about the sequencing data used in the tutorial. Is this an okay way to do this?

In the sequencing world there are many different technologies, each with advantages and disadvantages that should be weighed before a sequencing run begins. Short-read technologies, such as Illumina paired-end sequencing, have led the field for a number of years. Illumina sequencing can give an accurate sequence but is constrained in length, with reads typically 200-300 bp. The advent of long-read sequencing attempts to remove this length constraint. PacBio, one of the leaders in the long-read sequencing field, can generate reads much longer than traditional short-read sequencing, but with an increased per-base error rate. To solve this, techniques such as LoopSeq have been implemented, which sequence a genetic region many times and build a consensus sequence to bring down the per-base error rate. However, this again limits the size of the locus that can be sequenced. To address these and other limitations, bioinformatic tools and workflows, such as those included here, have been created. This tutorial is intended for those who wish to analyze long reads generated with PacBio. The output of this workflow consists of taxonomic assignments based on the ASVs generated by dada2. This tutorial assumes a working installation of R 4.2.0.
Let's begin.
This tutorial uses data from "High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution" by Callahan et al., Nucleic Acids Research, 2019, which can be found here: <https://academic.oup.com/nar/article/47/18/e103/5527971?login=true>. The data consist of sequencing measurements of nine human fecal samples from 3 different subjects. Additionally, PacBio sequencing of a Zymo mock community was included as a benchmark.

Pathing set up

As requested, here is the base path for the whole tutorial. All the user has to do is change that path and they should be good. All subsequent pathing is included in the block of code where it is used.

This section sets up the pathing for the tutorial and its output. When you first download the tutorial zip file, you will need to change the "base_path" variable to wherever the tutorial zip file was downloaded.
Note: when running your own data through this tutorial, it would be prudent to place your data in the "data" folder of the tutorial and remove the test data present there. This will prevent needless errors caused by altered pathing.
```{r Pathing Set-up Change, message = FALSE, warning = FALSE}
base_path <- "C:/Users/jorde/OneDrive/Desktop/github/Long_read_processing_tutorial" #Informs all other pathing in the tutorial
```
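
The downstream chunks refer to a vector of raw CCS fastq file paths ("fn_r"), which is not defined in the excerpts above. Here is a minimal sketch of how it could be built, assuming the CCS files sit in the tutorial's "data" folder; the folder name and file pattern are assumptions and should be adjusted to match the actual tutorial layout.
```{r Pathing file listing sketch, message = FALSE, warning = FALSE}
# Assumed location of the raw PacBio CCS fastq files; adjust if your layout differs
fn_r <- list.files(file.path(base_path, "long_read_tutorial/data"),
                   pattern = "fastq.gz$", full.names = TRUE)
fn_r
```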

ending link

I linked your other tutorial that does a more in-depth, specific analysis of the fecal dataset, in case that is more in line with the user's needs.

Other detailed plots and visualizations can be seen here: <https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html>

Description of Pacbio reads and why we need filtering

Here is the description of Pacbio reads and why the filtering/trimming steps are important

Here we will remove the primers and filter out poor-quality reads. We want to remove the primers because they were not part of the original template sequence and can therefore lead to incorrect ASVs or taxonomic assignments. Furthermore, when your reads are in fastq format, which is what this workflow assumes, each read carries both an ID and a string of per-base quality scores. These scores indicate how confident we are that each base was called correctly; low-quality bases carry inherent ambiguity about the identity of the base at that position. We therefore want to filter out the poor-quality reads and keep the higher-quality ones to ensure the results are accurate. Together, these two steps ensure the quality of the reads and assist downstream taxonomic assignment. Additionally, HiFi read orientation is mixed, meaning that some reads are in the 3' to 5' orientation and some are in the 5' to 3' orientation. This is corrected in the primer-removal step, where all reads are put into the same orientation. Because of this mixed orientation we also need to reverse-complement the reverse primer, which can be seen in the "removePrimers" step: when the amplicon is generated, the reverse primer anneals to the opposite strand, which is in the reverse-complement orientation, so to find the reverse primer we must search for its reverse complement. To ensure quality and to act as a sanity check, we will then visualize the lengths of the sequences after trimming to confirm they are predominantly of the expected length.
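
The primer-removal chunk itself is not shown in this issue, so below is a hedged sketch of what the "removePrimers" step and the post-trimming length check could look like. The primer sequences (standard full-length 16S F27/R1492), the "noprimers" output folder, and the object names "nop" and "prim" are assumptions chosen to match the objects used in later chunks; check them against the actual tutorial code.
```{r Pacbio 1) Primer removal sketch, message = FALSE, warning = FALSE}
# Assumed full-length 16S primers (forward F27, reverse R1492)
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
# Assumed output paths for the primer-free reads
nop <- file.path(base_path, "long_read_tutorial/Pacbio/noprimers", basename(fn_r))
if(!dir.exists(dirname(nop[1]))) dir.create(dirname(nop[1]), recursive=TRUE)
# orient=TRUE puts mixed-orientation HiFi reads into a common orientation;
# the reverse primer is reverse-complemented, as discussed above
prim <- removePrimers(fn_r, nop, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE)
# Sanity check: length distribution of the reads after primer removal
lens.fn <- lapply(nop, function(fn) nchar(getSequences(fn)))
lens <- do.call(c, lens.fn)
hist(lens, 100)
```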

Library loading and checking

When I load the dada2 package I check its version in the tutorial. The remaining packages, used for graphs and whatnot, are loaded but not displayed in the printout. Is this okay?

```{r, message = FALSE, warning = FALSE}
library(dada2);packageVersion("dada2")
```
```{r, message = FALSE, warning = FALSE, include = FALSE}
library(Biostrings);packageVersion("Biostrings")
library(ShortRead);packageVersion("ShortRead")
library(ggplot2);packageVersion("ggplot2")
library(reshape2);packageVersion("reshape2")
library(RColorBrewer);packageVersion("RColorBrewer")
library(BiocGenerics);packageVersion("BiocGenerics")
library(S4Vectors);packageVersion("S4Vectors")
library(Biobase);packageVersion("Biobase")
library(MatrixGenerics);packageVersion("MatrixGenerics")
```

[Section] Dereplication and sanity check

This is the introduction to the dada2 method and an explanation and sanity check of the dereplication. Let me know what you think!

The next step is arguably the most important part of the tutorial: denoising via the dada2 package. This was already loaded into the R environment at the beginning of the workflow, so there is no need to worry. We will also want to produce error plots and an ASV table, which we can then use for taxonomic assignment. The first step in this process is to dereplicate the amplicon sequences. Then we will learn the error rates, save them in an R object, and plot the resulting error graph. If for any reason you need to stop the tutorial, that R object can be loaded back in and the workflow continued from this step.
To check whether the sequences were dereplicated correctly, we can look at some of the unique sequence counts in one of the samples.
```{r Pacbio 3A) Denoising Pacbio pt1, message = FALSE, warning = FALSE}
drp <- derepFastq(filt, verbose=TRUE)
head(drp$R11_1_P3C3.fastq.gz$uniques)
```
As we can see, the list of unique sequences and their counts was generated, and these unique sequences have substantial relative abundance. This allows the "learnErrors" step to run appropriately; if there were not enough abundance behind each unique amplicon, the error model would not fit correctly. That does not seem to be the case here, so we can assume the dereplication procedure was a success! If you wish to check the other samples, switch the sample name in "head(drp$R11_1_P3C3.fastq.gz$uniques)" to whichever sample you would like to view.
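
If you are unsure which sample names are available in "drp", a quick illustrative check (not part of the original code) is to list them:
```{r Pacbio 3A2) Derep sample names, message = FALSE, warning = FALSE}
# The names of the per-sample derep objects are the filtered file names
names(drp)
```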

[Section] Taxonomy identification

After the dada2 algorithm has run, I move into assigning taxonomies to the output ASVs. I include the link to download the training data as well as a mention of why the training data is important.

## Assigning Taxonomy
Now that we have processed our data via dada2, we can plot and pull some interesting conclusions from the data. The first thing to be done is assign taxonomic assignments to the generated ASVs. This is done by utilizing the assignTaxonomy method innate to the dada2 package. This method uses trained data to inform the algorithm which taxa belong to which taxonomic assignments. To accomplish this we need a source taxonomy file which for this sample dataset can be found here: <https://zenodo.org/record/801832/files/silva_nr_v128_train_set.fa.gz?download=1>. You will then have to move this file into the tax folder found within the tutorial folder.
Then we run the main function and voilà! We have taxonomies!
```{r Pacbio 6) Assigning Taxonomy Pacbio, message = FALSE, warning = FALSE}
tax_path <- file.path(base_path, "long_read_tutorial/tax/silva_nr_v128_train_set.fa.gz")
tax <- assignTaxonomy(st, tax_path, multithread=TRUE)
tax[,"Genus"] <- gsub("Escherichia/Shigella", "Escherichia", tax[,"Genus"]) # Reformat to be compatible with other data sources
head(unname(tax))
```
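
As an additional check (not in the original chunk), you can tabulate the genus-level assignments to see which genera dominate and how many ASVs were left unclassified:
```{r Pacbio 6B) Taxonomy summary, message = FALSE, warning = FALSE}
# ASV counts per assigned genus; NA = unclassified at genus level
sort(table(tax[,"Genus"], useNA="ifany"), decreasing=TRUE)
```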

[Section] Denoising

Here I talk about the main dada2 algorithm as well as special options that the user can change. I also include some read tracking after the dada2 algorithm so that the user can keep track of where all of their reads are going

Now that the error rates are learned, we will finally run the star of the show, dada2, using both of the R objects generated in our previous steps. There is a myriad of options in the dada2 package; the most important ones are described below in their default, turned-off state. To turn on an option, follow the directions included in its description.
The first of the aforementioned options is "pool", which decides whether or not to pool samples when running dada2. Pooling shares information across samples and can increase the sensitivity of the dada2 algorithm to ASVs that may be in low abundance. To turn this option on, change "pool=FALSE" to "pool=TRUE". You can also use "pool='pseudo'", which takes all ASVs detected in at least two samples during a first processing pass and uses them as priors for a second pass.
Another option worth mentioning is "OMEGA_A", which sets the threshold at which DADA2 calls a unique sequence significantly overabundant and therefore creates a new partition with that sequence as its center. The default is 1e-40, a conservative setting that avoids false-positive inferences at the cost of reducing the ability to identify some rare variants. To make this parameter less conservative and more sensitive to rare variants, change the value in "OMEGA_A=(1*10^-40)" to a larger number.
The last option we will mention here is "DETECT_SINGLETONS", which toggles reporting of ASVs supported by a single read (the default is to only report ASVs supported by 2 or more reads). Turning this on increases the sensitivity of the dada2 algorithm, particularly in low-abundance environments. To turn this option on, change "DETECT_SINGLETONS=FALSE" to "DETECT_SINGLETONS=TRUE".
For more in-depth information on these options, see the dada2 documentation: <https://www.bioconductor.org/packages/devel/bioc/manuals/dada2/man/dada2.pdf>
For examples of how these options work, see the pseudo-pooling dada2 workflow:
<https://benjjneb.github.io/dada2/pseudo.html>
```{r Pacbio 4) Denoising Pacbio pt3, message = FALSE, warning = FALSE}
dd <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE, pool=FALSE, OMEGA_A=(1*10^-40), DETECT_SINGLETONS=FALSE) # seconds
saveRDS(dd, file.path(base_path, "long_read_tutorial/Pacbio/RDS/Fecal_dd.rds"))
cbind(ccs=prim[,1], primers=prim[,2], filtered=track[,2], denoised=sapply(dd, function(x) sum(x$denoised)))
st <- makeSequenceTable(dd); dim(st)
```
As you can see, the output table above tracks reads through the steps of the workflow up to this point: the number of CCS reads, primer-containing reads, filtered reads, and denoised reads.
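
If you want to try the pseudo-pooling behavior described above, a sketch of the alternative call is shown below (illustrative only; the tutorial's default run uses pool=FALSE):
```{r Pacbio 4B) Pseudo-pooling sketch, message = FALSE, warning = FALSE, eval = FALSE}
# Alternative: ASVs seen in at least two samples are used as priors for a second pass
dd.pseudo <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE, pool="pseudo")
```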

[Section] Ending Statement

Temporary ending statement. I was unsure of how to end it, what are your thoughts?

This concludes our tutorial! Throughout this workflow we have taken raw sequencing data, processed it using cutting-edge methods, and visualized the results in a clear and nuanced manner. Now you are ready to go out into the world of long-read analysis and try this workflow on your own data. Good luck out there!

[Section] Data management

This is a data management section. I essentially massage the data to make it more readable for later visualizations. Should this be its own section, or should I merge it with the later sections, since there is no real biological significance to changing titles and such?

## Data Management
In this next section we will extract the sample names from the file names for downstream visualizations and then save the processed data as R objects for future analysis. This allows easy access for future workflows.
```{r Pacbio 7) Data clean up Pacbio, message = FALSE, warning = FALSE}
sample.names <- sapply(strsplit(fn_r, "/"), function(x) x[length(x)]) # file name from each path
sample.names <- sapply(strsplit(sample.names, "_"), function(x) paste(x[1], x[2], sep="_")) # keep first two underscore-separated fields
sample.names <- gsub("_", ".", sample.names)
sample.names <- gsub("^R", "R_", sample.names)
sample.names <- gsub(".fastq.gz", "", sample.names)
rownames(st) <- sample.names
sample.names
#save R objects
saveRDS(st, file.path(base_path, "long_read_tutorial/Pacbio/RDS/Fecal_st.rds"))
saveRDS(tax, file.path(base_path, "long_read_tutorial/Pacbio/RDS/Fecal_tax_Silva128.rds"))
#reload R objects
st <- readRDS(file.path(base_path, "long_read_tutorial/Pacbio/RDS/Fecal_st.rds"))
tax <- readRDS(file.path(base_path, "long_read_tutorial/Pacbio/RDS/Fecal_tax_Silva128.rds"))
```

[Section] Error Model

This section explains the error model needed for the dada2 algorithm, makes an error plot, and explains that plot.

The DADA2 algorithm makes use of a parametric error model (err), and every amplicon dataset has a different set of error rates. The "learnErrors" method learns this error model from the data by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates for this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
```{r Pacbio 3B) Denoising Pacbio pt2, message = FALSE, warning = FALSE}
err <- learnErrors(drp, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE) # 10s of seconds
saveRDS(err, file.path(base_path, "long_read_tutorial/Pacbio/RDS/Fecal_err.rds"))
plotErrors(err)
```
The error rates for each possible transition (A→C, A→G, …) are shown in the plot above. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.
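
If you stopped the workflow after this step, the saved error model can be reloaded before continuing, using the same path as in the chunk above (a minimal resume sketch):
```{r Pacbio 3C) Reload error model, message = FALSE, warning = FALSE, eval = FALSE}
# Resume here by reloading the saved error model instead of re-running learnErrors
err <- readRDS(file.path(base_path, "long_read_tutorial/Pacbio/RDS/Fecal_err.rds"))
```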

Ecoli - one sample graph

Here I show a graph for just one sample so that the user can see all the E. coli ASVs identified in that sample.

Let's take a closer look at R_9.1, shall we?
```{r Pacbio 9) Ecoli: In depth view Pacbio, message = FALSE, warning = FALSE}
isam <- "R_9.1"
plot <- ggplot(data=ecm[ecm$Sample == isam,], aes(x=SequenceVariant, y=Abundance, fill=SequenceVariant)) +
  geom_col(width=0.4) + scale_y_continuous(breaks=seq(0,250,10)) +
  ggtitle(paste("Sample", isam))
plot + theme(axis.text = element_text(size = 6))
```
With high confidence, this suggests a high-abundance strain with a 4:1:1:1 full complement of Ec2:Ec12:Ec13:Ec14. With moderate confidence, it suggests a lower-abundance strain with a 2:1:1:1:1:1 full complement of Ec6:Ec7:Ec8:Ec9:Ec10:Ec11. And, in fact, that is the strain that subject R9 has in the next time point!

Ecoli-visualization

This section visualizes the identified E. coli ASVs and the different strains therein. There is also an explanation after the generated graph.

Now we can visualize the distribution of the E. coli variants:
```{r Pacbio 8) Visualize Ecoli distribution Pacbio, message = FALSE, warning = FALSE}
ecdf <- data.frame(st[,is.ecoli])
names(sqec) <- paste0("Ec", seq(ncol(ecdf)))
ecnames <- names(sqec); names(ecnames) <- sqec # map to ecnames from sequences
colnames(ecdf) <- names(sqec)
ecdf <- cbind(ecdf, Sample=rownames(st))
ecm <- melt(ecdf, id.vars="Sample", value.name="Abundance", variable.name="SequenceVariant")
plot <- ggplot(data=ecm, aes(x=Sample, y=Abundance, color=SequenceVariant)) + geom_point()
plot + theme(axis.text = element_text(size = 8))
```
Very clear 3:1:1:1:1 full complement signal in R_3, consistent over the time-course from R_3.1 to R_3.2 to R_3.3. R_9.1B also has a clear 2:1:1:1:1:1 full complement. R_9.1 is less clear because of the lower abundances, but may have 2 distinct strains given the 10 total E. coli variants in that sample (see previous code block). Note that R_9.1 precedes R_9.1B from the same subject.

[Section] Ecoli ASV identification

Here I show how to identify which ASVs belong to a specific taxon. In this case E. coli was chosen since that is what you used in your previous tutorial.

## Inspect E.coli
In this tutorial dataset, let's see if we can reconstruct E. coli strains. This is done by retrieving the sequences of the ASVs and checking those sequences against their taxonomic assignments. The result is printed in tabular format.
```{r Pacbio 9) Identify Ecoli sequences Pacbio, message = FALSE, warning = FALSE}
sq <- getSequences(st)
is.ecoli <- tax[,"Genus"] %in% "Escherichia" # the Genus column was reformatted from "Escherichia/Shigella" in the taxonomy step
sqec <- sq[is.ecoli]
which(is.ecoli)
rowSums(st[,is.ecoli]>0)
```
There are several ASVs identified as E. coli, but since we expect 3-6 unique alleles per strain, there are probably only 3-4 strains present.
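
A quick follow-up (not in the original chunk) is to look at the total read count of each E. coli ASV across all samples, which helps when reasoning about the allele ratios discussed in the visualization sections:
```{r Pacbio 9C) Ecoli ASV totals, message = FALSE, warning = FALSE}
# Total abundance of each E. coli ASV summed across samples
colSums(st[, is.ecoli])
```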

[Section] Chimeras

Here I talk about chimera removal from the sample and why we need it. I also include an output table and a sanity check after the table.

## Identifying and Removing chimeras
While the core dada method corrects substitution and indel errors, chimeras remain. Fortunately, the accuracy of the sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. First we identify how many bimeras there are (if any), as well as what fraction of the total reads they account for:
```{r Pacbio 5) Bimera identification, message = FALSE, warning = FALSE}
bim <- isBimeraDenovo(st, minFoldParentOverAbundance=3.5) # higher MFPOA to avoid flagging intra-genomic variants
table(bim)
sum(st[,bim])/sum(st)
```
As you can see, there are 38 bimeras present in the dataset, which comes out to a 0.6316% bimera rate in the reads. A relatively low bimera rate like this is expected; you should only worry if the bimera rate is relatively high.
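
The chunk above only flags bimeras; whether to remove them is a judgment call, since even with a high MFPOA some flagged sequences may be genuine intra-genomic variants. If you do want to drop them from the sequence table, a minimal sketch using the "bim" vector computed above:
```{r Pacbio 5B) Bimera removal sketch, message = FALSE, warning = FALSE, eval = FALSE}
# Optionally drop the flagged bimeric ASVs from the sequence table
st.nochim <- st[, !bim, drop=FALSE]
dim(st.nochim)
```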

Metadata - fecal

Here I load the metadata for the fecal dataset... I don't use the loaded data after this section, but I keep it in case we want to add something later that would utilize the metadata... should I keep it in?

## Sample metadata loading
Import the metadata for the samples to facilitate downstream processing and visualizations
```{r Pacbio 8) Sample Metadata Pacbio, message = FALSE, warning = FALSE}
pac_path_metadata <- file.path(base_path, "long_read_tutorial/fecal_metadata.csv")
ft <- sweep(st, 1, rowSums(st), "/") # per-sample relative abundances (frequency table)
df <- read.csv(file = pac_path_metadata)
df$SampleID <- gsub("_", ".", df$X)
df$X <- gsub("_", ".", df$X)
df$SampleID <- gsub("^R", "R_", df$SampleID)
df$X <- gsub("^R", "R_", df$X)
rownames(df) <- df$SampleID
#df <- df[sample.names2,-1]
head(df)
```
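
As a quick sanity check (not in the original chunk), you can confirm that every sample in the sequence table has a matching metadata row, since later visualizations look up metadata by sample name:
```{r Pacbio 8C) Metadata check, message = FALSE, warning = FALSE}
# Every row of the sequence table should have a matching metadata entry
all(rownames(st) %in% rownames(df))
```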

Ecoli- abundance counts

This section prints a table with the abundance counts of the identified E. coli ASVs in a particular sample. I included this to show the user how to pull out exact counts, as I thought that might be useful.

The exact abundance of each E. coli ASV in the sample can then be viewed in tabular form:
```{r Pacbio 10) Ecoli: In depth view Pacbio, message = FALSE, warning = FALSE}
ecdf["R_9.1",]
```

Primer removal and filtering sanity check

These lines and the generated table are meant as a sanity check after the filtering/primer-removal steps. Do they need to be more expansive and descriptive of what the user should expect, or are they fine as is?

There is a strong peak around 1450 bp, so that looks good. Next we are going to filter the reads on quality.
```{r Pacbio 2) read tracking, message = FALSE, warning = FALSE}
filt <- file.path(base_path,"long_read_tutorial/Pacbio/filtered", basename(fn_r)) #generates file in the filtered folder
track <- filterAndTrim(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2)
track
```
The table gives the number of reads fed into the filtering step and the number of reads output after it. If the numbers in each column are the same, then no reads were removed for quality; however, it is not uncommon for a few reads to be removed due to quality concerns. Note: when running your own data you will need to change the "minLen" and "maxLen" parameters to encompass the length range of your expected amplicon. This filters out reads with significant replication errors or large changes in amplicon content.
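
A quick follow-up check (not part of the original chunk) is the fraction of primer-trimmed reads that survived filtering in each sample:
```{r Pacbio 2B) Fraction of reads kept, message = FALSE, warning = FALSE}
# Fraction of reads retained by filterAndTrim, per sample
round(track[,2] / track[,1], 3)
```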
