Comments (12)
That is a little surprising, but I think it's a product of the branching alignment algorithm. You would hope the runtime would only double when the signal length doubles, but it's not necessarily linear because of the branching. I mostly benchmarked signal in 4,000-sample chunks, so I wonder if the potentially exponential branching levels off at some point between 1,000 and 4,000 samples? That would definitely be reference-specific, but your reference isn't very long.
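A toy model of that intuition (purely illustrative; this is not UNCALLED's actual algorithm or data structures): if each signal event can extend every active path into several candidates, the path count grows near-exponentially at first, but the number of *distinct* paths is bounded by the number of reference positions, so the work per event saturates once that cap is hit.

```python
# Toy model of branching search: active paths multiply per event but are
# capped by the reference size. Branch factor and reference size are made-up
# numbers for illustration only.
def active_paths_over_time(n_events, branch_factor=1.5, ref_positions=2000):
    paths = 1.0
    history = []
    for _ in range(n_events):
        # Each event may branch, but distinct paths cannot exceed the
        # number of reference positions, so growth levels off.
        paths = min(paths * branch_factor, ref_positions)
        history.append(paths)
    return history

hist = active_paths_over_time(4000)
print(hist[-1])  # saturated at the reference-size cap (2000)
```

With a short reference the cap is low, so the "exponential" phase ends early and the per-event cost becomes roughly constant after the first chunk or two.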
from uncalled.
@skovaka
I agree with your point, and I think your answer mostly explains my observation below.
One other thing I observed: for average read lengths (in raw samples) from 1,000 to 10,000 in steps of 500, the throughput in Msamples/sec increases monotonically.
@skovaka
I'm wondering if UNCALLED has a one-time fixed cost per read (possibly memory related, roughly 0.1 s or less), say path buffer allocation or something like that? I have excluded the FAST5 loading time while observing this.
This would help explain my last noted observation.
Sorry for the delay, I understand your question better now. Longer reads do take longer to align, since each event in an entire read is iterated over with the default "uncalled map" command. Additionally, each event iteration is not constant time, since the branching search algorithm considers many paths per event. Repetitive sequences can make the branching worse, so the read length/time relationship is not even consistent between reads. You could set the "max chunks" option to limit the number of chunks it attempts to align, which would lessen the effect of read length and is how UNCALLED works in realtime mode, but time per read still wouldn't necessarily be constant. Hope that clears things up!
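A minimal sketch of what a chunk cap like that does to per-read work (the chunk size and function here are assumptions for illustration, not UNCALLED's real constants or API):

```python
# Hypothetical model of capping work per read with a "max chunks" option.
# CHUNK_EVENTS is an assumed number of events per signal chunk.
CHUNK_EVENTS = 100

def events_processed(read_events, max_chunks=None):
    if max_chunks is None:
        return read_events                      # default: iterate the whole read
    return min(read_events, max_chunks * CHUNK_EVENTS)  # capped by the option

print(events_processed(5000))                # 5000: work grows with read length
print(events_processed(5000, max_chunks=8))  # 800: capped, regardless of length
print(events_processed(300, max_chunks=8))   # 300: short reads are unaffected
```

The cap bounds the worst case, but as noted above the per-event cost still varies with branching, so time per read is bounded rather than constant.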
Thank you for your answer, @skovaka. I really appreciate you taking the time to clarify my doubt.
I understand your point that read length/time relation is not consistent between reads. However, I'm still not sure I found the answer I was looking for.
I don't think I made the following point clear: I am comparing the number of reads processed per unit time between the same set of reads, not different reads. Additionally, all my reads are non-targets (human), and I try to map them to a ~100 kb microbial target reference.
Let me explain my experiment a bit.
1. I have N reads which are all at least 10,000 raw samples long (subsetted using ont_fast5_api).
2. I then used fast5_research to trim the reads and create sub-datasets from the same N reads. Each sub-dataset has the same fixed number of raw samples per read: 500, 1000, 1500, 2000, 2500, ..., 10000. I have 20 sub-datasets now.
3. I ran UNCALLED on these 20 sub-datasets, each containing the same N reads (at different lengths), and measured the average time per read. The average time per read was in the range 100.1 to 101.16 ms on an Intel Cascade Lake core. So increasing the sample length of the same set of reads (by 20×) does not seem to have increased the mapping time significantly.
4. I see your point that mapping time per event can differ due to branching, but the trend in point 3 seems to point to some kind of fixed one-time cost that UNCALLED pays for mapping the first 500 samples or so.
I'm just looking for your expert opinion on what that cost might be, and whether point 3 makes sense to you in some way.
I see, so the question is why it does NOT take longer to align longer reads? The main per-read constant cost would be loading from fast5s; everything else should be allocated once. I bet it's because most reads align within the first few thousand samples, and UNCALLED will not continue aligning after a confident alignment is identified, so additional signal won't make much difference. This is not how an aligner like minimap works, since we're not aiming for full-length alignments. Sorry for the confusion, I realize I was wrong yesterday when I said it iterates over the entire read; it has been a while since I worked on the algorithm.
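That early-termination explanation can be checked against the trimming experiment with a toy simulation (all numbers here are made-up assumptions, not measurements of UNCALLED): if each read reaches a decision within its first ~500-2000 samples, then trimming reads to anything at or above that range barely changes the signal actually processed.

```python
import random

random.seed(0)

def samples_processed(read_len, decision_point):
    # The mapper stops at its decision point if the read is long enough;
    # shorter reads are processed in full.
    return min(read_len, decision_point)

# Assume each read maps (or is given up on) after 500-2000 samples.
decision_points = [random.randint(500, 2000) for _ in range(1000)]

averages = {}
for read_len in (500, 1000, 2000, 5000, 10000):
    averages[read_len] = sum(samples_processed(read_len, d)
                             for d in decision_points) / 1000
    print(read_len, round(averages[read_len]))
```

Under this model the average work rises for trims below the decision range and then plateaus: trimming from 10,000 down to 2,000 samples changes nothing, which matches the near-constant ~100 ms per read observed above.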
@skovaka
Yes, you got my question. Thank you so much.
I may just be seeing FAST5 loading time.
Is UNCALLED done loading FAST5s at this step? I see the file read happens here: Line 227 in 74a5d4e
If not, then I am worried there may be something else we are missing. I'm trying to map all non-target reads. UNCALLED does this very well, mapping only 2.5% of the non-target reads (in the 10,000-samples-per-read sub-dataset) and even fewer of them with fewer samples per read. And I don't think UNCALLED's default configuration has an early-termination heuristic that saves time for non-target reads that don't map (for fewer than 10,000 samples), right?
I may be wrong, but looking at the code, I think you are loading fast5s only at this step. So the thread pool is basically loading and mapping, not just mapping, and my timing included the file loading time.
Thank you.
Are you profiling the C++ binary with a single thread? Python would have additional overhead, and I wouldn't be surprised if my thread pool is not the most efficient. The thread pool loads fast5s in the main thread, then mapping happens within the worker threads, so it's quite possible time is dominated by main thread fast5 loading. You could actually try running the "uncalled_map_ord" binary, which loads all fast5s initially and then aligns reads in the order they were sequenced. That's not the default mode since it requires more memory, but could be helpful for you
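The pattern described there can be sketched with a minimal producer/consumer model (a generic sketch in Python, not UNCALLED's actual C++ thread pool): one main thread loads reads serially into a queue, and worker threads pull them off to map. If the serial loading step dominates, adding workers does not improve throughput.

```python
import queue
import threading

read_queue = queue.Queue(maxsize=8)   # buffers between loader and mappers
results = []
results_lock = threading.Lock()

def worker():
    # Worker threads: pull loaded reads off the queue and "map" them.
    while True:
        read = read_queue.get()
        if read is None:              # sentinel: no more reads
            read_queue.task_done()
            break
        mapped = len(read)            # stand-in for the mapping step
        with results_lock:
            results.append(mapped)
        read_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

# "Main thread" loading loop: this part is inherently serial, so it can
# dominate wall-clock time even with many workers.
for i in range(20):
    read_queue.put(b"x" * (i + 1))    # pretend this is fast5 loading

for _ in threads:
    read_queue.put(None)              # one sentinel per worker
for t in threads:
    t.join()

print(sorted(results))                # all 20 reads handled by the workers
```

This is why profiling with a single worker can end up measuring mostly the loading side, and why a mode that front-loads all fast5 reads (like map_ord below) separates the two costs.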
@skovaka
Yes, I am trying uncalled_map (the C++ binary) with a single thread.
Am I correct that FAST5 loading happens here for every thread? (Line 48 in 74a5d4e)
Let me try uncalled_map_ord; it seems to suit my use case. Thank you.
Fast5 loading does happen there for "uncalled_map", but map_ord uses a different thread pool. To be clear, that function is executed on the main thread, so the fast5 loading is always single-threaded, but it's loading reads into buffers which are processed by the worker threads here: Line 130 in 74a5d4e
Thank you, Sam Kovaka.
UNCALLED is an excellent tool. Thank you for maintaining the repo.
I really enjoyed looking into the code and the paper.