Yes I spent a couple of days thinking about that last year, going through the Salmon and Kallisto preprints to try to see how their models would relate to the UMI-Tags case.
One thing that is immediate is that for tag counting a lot of the coverage parameters become simpler, since we only expect the cDNA end tags to be sampled, and that case seems pretty simple to implement. You can probably sort of hack it in by changing your reference transcriptome to just the last ~200 bases of each transcript before you make the index.
That does not use any information from UMIs though. The computer science problem that one essentially solves when counting UMIs (assuming only unique mappings) is the Count-distinct problem. To solve this problem, I think one needs to find a way to combine the relative abundance estimation problem with the count-distinct problem. I can't figure out how to do that in a sensible way.
You have to solve a discrete problem using continuous information I think.
Another way of looking at it would be to make a Salmon-like algorithm and use the UMIs to estimate some PCR duplication auxiliary parameter per transcript. Then at the end, multiply the relative abundance measure (now corrected for duplication) by the total observed/possible number of UMIs.
I'm copying in @rob-p because he might have some ideas and I think it's a discussion that would be interesting to have in public where other people can chime in.
I'm also attaching my (kind of legible... ) notes on this in the notation of the Salmon preprint (I think)
from umis.
Hi @vals,
What do you think about pre-filtering the BAMs to drop all PCR duplicates? Then there wouldn't be a UMI problem anymore and we could maybe split the BAM by cellular barcode and run salmon on the split BAMs?
from umis.
When you're mapping the read, in essence, you are assigning the UMI connected to that read to the transcript it maps to.
Say you have two reads, and the mapping assigns the UMI to transcripts in this way:
(read_1, umi_1) -> (tx_1, umi_1), (tx_2, umi_1), (tx_3, umi_1)
(read_2, umi_1) -> (tx_2, umi_1), (tx_3, umi_1), (tx_4, umi_1)
We don't know from where umi_1 was duplicated. When you filter this, you would remove one copy of each of the reads due to the tx_2 or tx_3 evidence, but not due to the other two transcripts. If you converted this deduped BAM back to FASTQ, the two reads would end up being four reads...
This mapping ambiguity is why you need to solve the relative abundance problem. In Ntranos et al they get around this problem by considering all the transcripts a read can map to (a mapping equivalence class) as a single thing.
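To make the equivalence-class idea concrete, here is a minimal sketch (using the made-up read/transcript names from the example above) that groups reads by the set of transcripts they map to and counts distinct UMIs per class:

```python
from collections import defaultdict

# Each (read, UMI) pair maps ambiguously to a set of transcripts,
# as in the example above.
alignments = {
    ("read_1", "umi_1"): {"tx_1", "tx_2", "tx_3"},
    ("read_2", "umi_1"): {"tx_2", "tx_3", "tx_4"},
}

# Group by mapping equivalence class (the frozenset of compatible
# transcripts) and collect the distinct UMIs observed in each class.
umis_per_class = defaultdict(set)
for (read, umi), transcripts in alignments.items():
    umis_per_class[frozenset(transcripts)].add(umi)

for eq_class, umis in umis_per_class.items():
    print(sorted(eq_class), len(umis))
```

Note that here the two reads land in two different classes, so each class reports one distinct UMI even though it is the same molecule; that is exactly the ambiguity being discussed.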
from umis.
Hi guys,
I feel like, perhaps, I still don't understand the formal problem definition well enough. We are given a collection of N fragments containing N' UMI tags where N' <= N; is that right? Now what, precisely, is our goal? Is it simply that we want to assign UMIs (uniquely or proportionally) to cDNAs? If so, what is the meaning of the duplicated UMIs (i.e. if we observe the same UMI in two or more different reads)? For example, in the example given above, do we want to consider (read_1, umi_1) and (read_2, umi_1) as one occurrence of umi_1 for the purposes of allocation? Sorry for my naïveté on this front, but I just want to understand the formal problem solidly before trying to think about a solution.
from umis.
Hey Rob,
I'll try to provide some background.
I think the general reason for sequencing end tags rather than fragments from the entire transcript is to save sequencing cost: the libraries are much less complex, so you need to sequence less per sample to reach saturation. Unlike the standard RNA-seq case, the number of fragments arising from a transcript does not scale with its length.
The problem comes from sequencing end tags from small amounts of starting material, where a very large number of PCR cycles (or IVT rounds) is needed, and some cDNA tags will be preferentially amplified. Fragment amplification bias is quite a large concern in RNA-seq, but even more so when there is such low complexity in the starting material, as in the tag-sequencing case.
To try to counteract this, each cDNA tag is labeled with a random k-mer called a UMI (where k ranges from 4 to 32 depending on the implementation). If k = 6, every cDNA tag in the sample can be labeled in 4^6 = 4,096 ways. During amplification, the (UMI, cDNA-tag) tuples get amplified together. So, say that we sequence this amplified library of (UMI, cDNA-tag) tuples. If we observe 1,000 cDNA-tag fragments, and they carry 10 unique UMIs, each observed 100 times, we would infer that we actually started with 10 cDNA tags, each of which was amplified 100-fold.
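As a sanity check, the inference in that last sentence is just a count-distinct over (UMI, tag) pairs; a toy simulation (numbers made up to match the example):

```python
# Toy model of the inference above: 10 starting cDNA tags, each labeled
# with a distinct UMI, then "amplified" 100-fold before sequencing.
library = [("umi_%02d" % i, "cDNA_tag") for i in range(10)]
sequenced = [pair for pair in library for _ in range(100)]

print(len(sequenced))            # 1,000 observed fragments
molecules = len(set(sequenced))  # count-distinct over (UMI, tag) pairs
print(molecules)                 # 10 inferred starting molecules
```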
In an ideal world, we could solve this by going through the fragments in a FASTQ file and removing every duplicated (UMI, fragment) pair. But errors can be introduced into the fragments both during amplification and during sequencing, so exact matching is the wrong thing to do. Also, the RT step which produces the tags is not exact, and the cDNA tags might be shifted a bit from the very end of the transcript.
So for that strategy we would have to do some clustering / fuzzy matching of fragments, and then count unique UMIs in each cluster of fragments.
The strategy here is that, since we know the cDNA tags arise from the transcriptome, the fuzzy matching is handled by the mapping to transcripts. So we transfer a UMI from (umi, fragment) to (umi, transcript). Now the counting is easy, since we know the transcripts a priori. However, because of mapping ambiguity, the UMI gets transferred to multiple transcripts.
The problem is: given that ambiguity, how do we allocate these counts to the individual transcripts (which later get collapsed to "genes")?
In the example above we want to consider (read_1, umi_1) and (read_2, umi_1) as one occurrence if the sequence of read_1 is sufficiently similar to that of read_2. If the situation was
(read_1, umi_1) -> (tx_1, umi_1)
(read_2, umi_1) -> (tx_1, umi_1)
we would count it as one occurrence of tx_1 in the sample. But if
(read_1, umi_1) -> (tx_1, umi_1)
(read_2, umi_2) -> (tx_1, umi_2)
we would count it as two occurrences. Similarly, if
(read_1, umi_1) -> (tx_1, umi_1)
(read_2, umi_1) -> (tx_2, umi_1)
we would infer that both tx_1 and tx_2 had been observed once.
In the current counting implementation in this package, if we have
(read_1, umi_1) -> (tx_1, umi_1), (tx_2, umi_1)
(read_2, umi_1) -> (tx_1, umi_1), (tx_2, umi_1)
each of tx_1 and tx_2 gets 0.5 "observations" or "evidence". If
(read_1, umi_1) -> (tx_1, umi_1), (tx_2, umi_1)
(read_2, umi_1) -> (tx_2, umi_1), (tx_3, umi_1)
then each of tx_1, tx_2 and tx_3 gets 0.5 "evidence". This is where things go wrong, I think. And the problem is: how do we allocate the umi_1 observation evidence to tx_1, tx_2 and tx_3?
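For concreteness, the behaviour described above can be reproduced like this (a hypothetical re-sketch of the scheme, not the package's actual code): each distinct (UMI, transcript) pair contributes evidence once, weighted by the number of transcripts its read was compatible with.

```python
from collections import defaultdict

# The second example above: both reads carry umi_1 and each maps
# ambiguously to two transcripts.
mappings = [
    ("read_1", "umi_1", ["tx_1", "tx_2"]),
    ("read_2", "umi_1", ["tx_2", "tx_3"]),
]

evidence = defaultdict(float)
seen = set()  # (umi, transcript) pairs already counted
for read, umi, transcripts in mappings:
    weight = 1.0 / len(transcripts)
    for tx in transcripts:
        if (umi, tx) not in seen:
            seen.add((umi, tx))
            evidence[tx] += weight

print(dict(evidence))  # each of tx_1, tx_2, tx_3 gets 0.5
```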
from umis.
Thanks Valentine,
This explanation is very clear, and I'll think about this more deeply. My first thought, though, is: would it be possible to do what Sailfish / Salmon normally do in terms of equivalence classes, but simply "adjust" the equivalence class counts by removing duplicate UMI tags? Basically, the idea is that the count of reads in an equivalence class would be equal to the number of distinct UMIs in that class, rather than the actual count of fragments. Given this information, you could then run the EM (or VBEM) "normally" to allocate these counts to the transcripts. One caveat here is that, since the equivalence classes forget the sequences of the reads they contain, you could have a situation where two different reads having the same UMI fall into the same equivalence class and therefore get discounted. Otherwise, though, I think the general approach would do the "right thing".
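A toy sketch of that proposal (made-up class counts and uniform effective lengths; not Salmon's actual implementation): set each equivalence class's count to its number of distinct UMIs, then run a plain EM to allocate those counts to transcripts.

```python
# Equivalence classes -> distinct UMIs observed in them (made-up data).
eq_classes = {
    frozenset({"tx_1", "tx_2", "tx_3"}): {"umi_1"},
    frozenset({"tx_2", "tx_3", "tx_4"}): {"umi_1", "umi_2"},
}
# UMI-deduplicated class counts, as proposed above.
counts = {ec: len(umis) for ec, umis in eq_classes.items()}

transcripts = sorted(set().union(*eq_classes))
theta = {tx: 1.0 / len(transcripts) for tx in transcripts}  # uniform start

for _ in range(50):
    # E-step: split each class's count among its transcripts in
    # proportion to the current abundance estimates.
    alloc = {tx: 0.0 for tx in transcripts}
    for ec, n in counts.items():
        z = sum(theta[tx] for tx in ec)
        for tx in ec:
            alloc[tx] += n * theta[tx] / z
    # M-step: renormalise to get updated relative abundances.
    total = sum(alloc.values())
    theta = {tx: a / total for tx, a in alloc.items()}

print(theta)
```

With this data the EM concentrates mass on tx_2 and tx_3, which together can explain both classes; that is the intended behaviour, not a bug of the sketch.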
--Rob
from umis.
Yes, I think a mode like that for Sailfish / Salmon would be very appropriate, together with the data-motivated assumption that "effectiveLength" should be set to a constant value for each transcript in the model. And to get absolute counts at the end, numReads can be based on the number of deduplicated reads rather than all input reads.
from umis.
@rob-p ... I just realised that such a mode in Sailfish / Salmon could be referred to as a "caviar" mode.. (because... single cells...)
from umis.
😄 --- loves it!
from umis.
here is some prior art: https://github.com/CGATOxford/UMI-tools
details here
from umis.
@flying-sheep I'm aware of those, I wrote these scripts because aligning to a genome is a waste of time and resources. And the script to extract barcodes and UMIs was not general enough to handle all the different kinds of data sets I was working on.
But this issue in particular is about considering things which those tools completely ignore, which is uncertainty of the source of a tag.
from umis.
oh, sorry, i didn't read the whole thing and didn't check where "weighting by number of hits" comes into play. thanks for the clarification. maybe you should make it clear in the issue title?
thanks for building this, its functionality is perfect for me! might have to make it faster though.
your choice of tools and libraries seems to match mine pretty well, so if you thought about writing in a faster language, you might also enjoy the Rust programming language.
from umis.
@flying-sheep Sorry for the dismissive tone that I'm catching rereading my response...
There's another issue (#5) which speaks about the UMI-merging problem.
Let me know if you manage to make it faster. I benchmarked several SAM parsing libraries to find the fastest one (hts-python, simplesam, and pysam), and found pysam, which wraps the C library htslib, to be the fastest. And generally, profiling indicates it's IO bound.
I'm learning Julia at the moment, Rust seems a bit strict to me.
from umis.
Sorry for the dismissive tone that I'm catching rereading my response...
none perceived, don't worry.
Let me know if you manage to make it faster
if anything, i'll rewrite it using rust -- eventually, it's not a priority right now. handling every single read with python is slooow.
I'm learning Julia at the moment, Rust seems a bit strict to me.
also a good choice i'm sure! i gathered it's basically a faster, non-ugly, open source MATLAB, or an R without the lazy evaluation insanity and the feeling of wading through tar. ugh, MATLAB... seriously, if a language requires me to write argument parsers instead of using keyword arguments, someone didn't have a sense for priorities.
from umis.
Closing this as it is basically handled by the kallisto single cell mode.
from umis.