Comments (5)
Sorry for the trouble! It looks like it failed when trying to determine the repetitiveness of the reference via self-alignment, which can fail if the genome is too repetitive. Did you run any repeat masking, namely the "internal masking" described here? You could also try reducing the number of alignments checked via the "--max-samples" option: try adding --max-samples 100000 to the index command (the default is 1000000).
from uncalled.
Please check if I did it right. I am still getting the same error:
$ ./mask_internal.sh chr22.fa 10 30 chr22_internal_masked.fa
Iteration 0: masked 51812 occurences of TTTTTTTTTT
Iteration 1: masked 51562 occurences of AAAAAAAAAA
Iteration 2: masked 9296 occurences of TGTAATCCCA
Iteration 3: masked 9198 occurences of TGGGATTACA
Iteration 4: masked 8816 occurences of GGAGGCTGAG
Iteration 5: masked 8752 occurences of CTCAGCCTCC
Iteration 6: masked 7462 occurences of CACACACACA
Iteration 7: masked 7046 occurences of CCCAGGCTGG
Iteration 8: masked 6902 occurences of CCAGCCTGGG
Iteration 9: masked 6797 occurences of TGTGTGTGTG
Iteration 10: masked 6211 occurences of TGCAGTGAGC
Iteration 11: masked 6135 occurences of GCTCACTGCA
Iteration 12: masked 6018 occurences of AAAATACAAA
Iteration 13: masked 5924 occurences of TTTGTATTTT
Iteration 14: masked 4984 occurences of TCCCAAAGTG
Iteration 15: masked 4982 occurences of AGTGCAGTGG
Iteration 16: masked 4922 occurences of CCACTGCACT
Iteration 17: masked 4912 occurences of CACTTTGGGA
Iteration 18: masked 4707 occurences of GCAGGAGAAT
Iteration 19: masked 4690 occurences of AGGTCAGGAG
Iteration 20: masked 4621 occurences of ATTCTCCTGC
Iteration 21: masked 4536 occurences of CTCCTGACCT
Iteration 22: masked 4095 occurences of TATATATATA
Iteration 23: masked 4022 occurences of AGACCAGCCT
Iteration 24: masked 4013 occurences of AGGCTGGTCT
Iteration 25: masked 3850 occurences of CCCAGCTACT
Iteration 26: masked 3757 occurences of GGTGAAACCC
Iteration 27: masked 3713 occurences of AGTAGCTGGG
Iteration 28: masked 3654 occurences of GGGTTTCACC
Iteration 29: masked 3339 occurences of GTCTCTACTA
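(For readers unfamiliar with the masking step: the loop in the log above can be sketched roughly as follows. This is a hypothetical minimal re-implementation for illustration only, not the actual mask_internal.sh; the function names are made up, and the real script operates on FASTA files rather than a single string.)

```python
from collections import Counter

def mask_top_kmer(seq, k=10):
    """Mask the most frequent k-mer in seq with Ns.

    Counts overlapping k-mer windows, then replaces non-overlapping
    occurrences (str.replace semantics). Returns (masked_seq, n_masked).
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Skip windows that are already (partly) masked
    counts = {kmer: c for kmer, c in counts.items() if "N" not in kmer}
    if not counts:
        return seq, 0
    kmer = max(counts, key=counts.get)
    n = seq.count(kmer)  # non-overlapping count, matching str.replace below
    return seq.replace(kmer, "N" * k), n

def iterative_mask(seq, k=10, iterations=30):
    """Greedily mask the top k-mer, printing one line per iteration."""
    for i in range(iterations):
        seq, n = mask_top_kmer(seq, k)
        print(f"Iteration {i}: masked {n} occurrences of top {k}-mer")
    return seq
```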
$ uncalled index chr22_internal_masked.famask30.fa --max-samples 100000
[bwa_index] Pack FASTA... 0.53 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=101636936, availableWord=19151484
[BWTIncConstructFromPacked] 10 iterations done. 31590664 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 58359704 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 82148056 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 101636936 characters processed.
[bwt_gen] Finished constructing BWT in 40 iterations.
[bwa_index] 37.85 seconds elapse.
[bwa_index] Update BWT... 0.30 sec
[bwa_index] Pack forward-only FASTA... 0.39 sec
[bwa_index] Construct SA from BWT and Occ... 11.30 sec
Initializing parameter search
Traceback (most recent call last):
File "/home/hasindu/uncalled/bin/uncalled", line 339, in <module>
index_cmd(args)
File "/home/hasindu/uncalled/bin/uncalled", line 56, in index_cmd
p = unc.index.IndexParameterizer(args)
File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 62, in __init__
self.calc_map_stats(args)
File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 82, in calc_map_stats
fmlens = unc.self_align(args.bwa_prefix, sample_dist)
MemoryError: std::bad_alloc
Command exited with non-zero status 1
It looks like the issue is the high frequency of Ns in the reference (chromosome 22 is 22% Ns!). BWA indexes are supposed to replace Ns with random bases, but it appears to repeat a pseudo-random sequence when that many Ns occur, which blows up the repeat finding process. The reference I have contains 50 bases per line, and I simply removed all lines which are entirely Ns, which fixed the problem. I will work on a less hacky way of fixing this soon.
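(The line-dropping workaround can be expressed as a short filter. This is an illustrative sketch with a made-up function name, assuming a fixed-width FASTA where runs of Ns occupy whole lines, as described above:)

```python
def drop_all_n_lines(in_path, out_path):
    """Copy a FASTA file, skipping sequence lines that are entirely N/n.

    Header lines (starting with '>') are always kept; blank lines are
    dropped. Note: this shifts all downstream mapping coordinates, so
    basecalled reads must be realigned to the modified reference before
    comparing alignments against it.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            # Keep the line if it is a header, or contains any non-N base
            if line.startswith(">") or set(line.strip()) - {"N", "n"}:
                fout.write(line)
```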
from uncalled.
OK I see. But if those lines are removed, then the mapping coordinates will change, right? And that will give incorrect results when I run uncalled pafstats?
I also have a few questions about pafstats:
- Does it handle supplementary and secondary alignments in the truthset?
- What minimap2 parameters will you recommend to generate the truthset?
I am thinking of doing: minimap2 -c -x map-ont ref.fa reads.fastq --secondary=no > ref.paf
Yes, unfortunately that will cause incorrect pafstats results, unless you realign the basecalled reads to the modified reference. Sorry, I know that's not ideal. I will work on a better solution.
We do consider supplementary alignments. We count an alignment as "true" if it matches any alignment in the truth set. We only used default parameters besides "-x map-ont". "-c" isn't required since we only consider the endpoints, and neither is "--secondary=no".
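(The endpoint-only matching described above might be sketched like this. The function name and the overlap criterion — any positional overlap on the same reference — are assumptions for illustration, not pafstats' actual code:)

```python
def matches_truth(aln, truth_alns):
    """Return True if aln overlaps any truth-set alignment of the read.

    Alignments are (ref_name, start, end) tuples; only the endpoints are
    compared, which is why CIGAR strings (minimap2 -c) are not needed in
    the truth PAF. The "any overlap" rule is an illustrative assumption.
    """
    ref, start, end = aln
    for t_ref, t_start, t_end in truth_alns:
        if ref == t_ref and min(end, t_end) > max(start, t_start):
            return True
    return False
```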