Thank you for writing UNCALLED. It was a seamless install and worked quite well with s

MemoryError: std::bad_alloc (uncalled issue, closed)

skovaka commented on July 17, 2024

MemoryError: std::bad_alloc

from uncalled.

Comments (5)

skovaka commented on July 17, 2024

Sorry for the trouble! It looks like it failed when trying to determine the repetitiveness of the reference via self-alignment, which could fail if the genome is too repetitive. Did you run any repeat masking, namely the "internal masking" described here? You could also try reducing the number of alignments checked via the "--max-samples" option. Try adding --max-samples 100000 to the index command (the default is 1000000).

hasindu2008 commented on July 17, 2024

Could you please check whether I did it right? I am still getting the same error:

$ ./mask_internal.sh chr22.fa 10 30 chr22_internal_masked.fa
Iteration 0: masked 51812 occurences of TTTTTTTTTT
Iteration 1: masked 51562 occurences of AAAAAAAAAA
Iteration 2: masked 9296 occurences of TGTAATCCCA
Iteration 3: masked 9198 occurences of TGGGATTACA
Iteration 4: masked 8816 occurences of GGAGGCTGAG
Iteration 5: masked 8752 occurences of CTCAGCCTCC
Iteration 6: masked 7462 occurences of CACACACACA
Iteration 7: masked 7046 occurences of CCCAGGCTGG
Iteration 8: masked 6902 occurences of CCAGCCTGGG
Iteration 9: masked 6797 occurences of TGTGTGTGTG
Iteration 10: masked 6211 occurences of TGCAGTGAGC
Iteration 11: masked 6135 occurences of GCTCACTGCA
Iteration 12: masked 6018 occurences of AAAATACAAA
Iteration 13: masked 5924 occurences of TTTGTATTTT
Iteration 14: masked 4984 occurences of TCCCAAAGTG
Iteration 15: masked 4982 occurences of AGTGCAGTGG
Iteration 16: masked 4922 occurences of CCACTGCACT
Iteration 17: masked 4912 occurences of CACTTTGGGA
Iteration 18: masked 4707 occurences of GCAGGAGAAT
Iteration 19: masked 4690 occurences of AGGTCAGGAG
Iteration 20: masked 4621 occurences of ATTCTCCTGC
Iteration 21: masked 4536 occurences of CTCCTGACCT
Iteration 22: masked 4095 occurences of TATATATATA
Iteration 23: masked 4022 occurences of AGACCAGCCT
Iteration 24: masked 4013 occurences of AGGCTGGTCT
Iteration 25: masked 3850 occurences of CCCAGCTACT
Iteration 26: masked 3757 occurences of GGTGAAACCC
Iteration 27: masked 3713 occurences of AGTAGCTGGG
Iteration 28: masked 3654 occurences of GGGTTTCACC
Iteration 29: masked 3339 occurences of GTCTCTACTA

$ uncalled index chr22_internal_masked.famask30.fa --max-samples 100000
[bwa_index] Pack FASTA... 0.53 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=101636936, availableWord=19151484
[BWTIncConstructFromPacked] 10 iterations done. 31590664 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 58359704 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 82148056 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 101636936 characters processed.
[bwt_gen] Finished constructing BWT in 40 iterations.
[bwa_index] 37.85 seconds elapse.
[bwa_index] Update BWT... 0.30 sec
[bwa_index] Pack forward-only FASTA... 0.39 sec
[bwa_index] Construct SA from BWT and Occ... 11.30 sec
Initializing parameter search
Traceback (most recent call last):
  File "/home/hasindu/uncalled/bin/uncalled", line 339, in <module>
    index_cmd(args)
  File "/home/hasindu/uncalled/bin/uncalled", line 56, in index_cmd
    p = unc.index.IndexParameterizer(args)
  File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 62, in __init__
    self.calc_map_stats(args)
  File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 82, in calc_map_stats
    fmlens = unc.self_align(args.bwa_prefix, sample_dist)
MemoryError: std::bad_alloc
Command exited with non-zero status 1

skovaka commented on July 17, 2024

It looks like the issue is the high frequency of Ns in the reference (chromosome 22 is 22% Ns!). BWA indexing is supposed to replace Ns with random bases, but it appears to repeat a pseudo-random sequence when that many Ns occur, which blows up the repeat-finding process. The reference I have contains 50 bases per line, and I simply removed all lines that consist entirely of Ns, which fixed the problem. I will work on a less hacky way of fixing this soon.
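
The line-removal workaround described above could be sketched roughly as follows. This is only an illustration, not code from UNCALLED: the function names and the file names in the usage note are hypothetical, and it assumes a plain-text FASTA with fixed-width sequence lines, as in the 50-bases-per-line reference mentioned above.

```python
def strip_all_n_lines(lines):
    """Yield FASTA lines, skipping sequence lines made up entirely of N/n."""
    for line in lines:
        seq = line.strip()
        is_sequence = bool(seq) and not seq.startswith(">")
        if is_sequence and set(seq.upper()) == {"N"}:
            continue  # drop lines that contain nothing but Ns
        yield line


def n_fraction(lines):
    """Return the fraction of sequence characters that are N/n."""
    total = 0
    n_count = 0
    for line in lines:
        seq = line.strip()
        if not seq or seq.startswith(">"):
            continue  # ignore blank lines and headers
        total += len(seq)
        n_count += seq.upper().count("N")
    return n_count / total if total else 0.0
```

To apply it to a file, something like `with open("chr22.fa") as src, open("chr22_noN.fa", "w") as dst: dst.writelines(strip_all_n_lines(src))` would do (both file names are placeholders). Note that this changes the reference coordinates, since the removed bases no longer count toward positions.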

hasindu2008 commented on July 17, 2024

OK I see. But if those are removed, then the mapping coordinates will change, right? And that will give bad results when I call uncalled pafstats?

I also have a few questions about pafstats:

Does it handle supplementary and secondary alignments in the truth set?
Which minimap2 parameters would you recommend for generating the truth set?
I am thinking of running minimap2 -c -x map-ont ref.fa reads.fastq --secondary=no > ref.paf

skovaka commented on July 17, 2024

Yes, unfortunately that will cause incorrect pafstats results, unless you realign the basecalled reads to the modified reference. Sorry, I know that's not ideal. I will work on a better solution.
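
As an alternative to realigning, coordinates on the N-stripped reference could in principle be translated back to the original reference. The sketch below is purely illustrative and is not something UNCALLED provides: `build_blocks` and `to_original` are hypothetical helpers, and the code assumes a single contig whose all-N lines were removed whole.

```python
import bisect


def build_blocks(seq_lines):
    """Map runs of kept lines to their offsets.

    seq_lines: sequence lines of ONE contig from the original FASTA
    (header excluded). Returns a sorted list of (new_start, orig_start)
    pairs, one per contiguous run of lines that survive N-line removal.
    """
    blocks = []
    new_pos = 0   # offset in the stripped reference
    orig_pos = 0  # offset in the original reference
    prev_kept = False
    for line in seq_lines:
        seq = line.strip()
        keep = bool(seq) and set(seq.upper()) != {"N"}
        if keep:
            if not prev_kept:
                blocks.append((new_pos, orig_pos))
            new_pos += len(seq)
        orig_pos += len(seq)
        prev_kept = keep
    return blocks


def to_original(pos, blocks):
    """Translate a 0-based position on the stripped reference."""
    i = bisect.bisect_right(blocks, (pos, float("inf"))) - 1
    new_start, orig_start = blocks[i]
    return orig_start + (pos - new_start)
```

For example, with the original lines `NNNN`, `ACGT`, `NNNN`, `NNNN`, `GGCC`, position 4 on the stripped sequence (the first `G`) corresponds to position 16 on the original. An alignment that spans a removed run would still need care, since its two endpoints fall in different blocks.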

We do consider supplementary alignments. We count an alignment as "true" if it matches any alignment in the truth set. We only used default parameters besides "-x map-ont". "-c" isn't required since we only consider the endpoints, and neither is "--secondary=no".
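
The "matches any alignment in the truth set" rule above could look something like the following sketch. It is not the actual pafstats implementation: the overlap test, the same-strand requirement, and all names here are assumptions made for illustration. The only part taken from the PAF format itself is the column layout (query name, strand, target name, target start, and target end in columns 1, 5, 6, 8, and 9).

```python
from collections import defaultdict, namedtuple

Aln = namedtuple("Aln", "read strand ref start end")


def parse_paf_line(line):
    """Extract read name, strand, and reference endpoints from a PAF line."""
    f = line.rstrip("\n").split("\t")
    return Aln(read=f[0], strand=f[4], ref=f[5], start=int(f[7]), end=int(f[8]))


def index_truth(paf_lines):
    """Group truth-set alignments by read, keeping supplementary ones too."""
    truth = defaultdict(list)
    for line in paf_lines:
        aln = parse_paf_line(line)
        truth[aln.read].append(aln)
    return truth


def is_true_positive(aln, truth):
    """Assumed rule: same reference and strand, and overlapping endpoints."""
    for t in truth.get(aln.read, []):
        if (t.ref == aln.ref and t.strand == aln.strand
                and aln.start < t.end and t.start < aln.end):
            return True
    return False
```

Because only the endpoints are compared, base-level alignment (minimap2's "-c") adds nothing to this kind of check, which is consistent with the comment above.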
