Giter VIP home page Giter VIP logo

Comments (5)

skovaka avatar skovaka commented on July 17, 2024

Sorry for the trouble! It looks like it failed when trying to determine the repetitiveness of the reference via self-alignment, which could fail if the genome is too repetitive. Did you run any repeat masking, namely the "internal masking" described here? You could also try reducing the number of alignments checked via the "--max-samples" option. Try adding --max-samples 100000 to the index command (the default is 1000000).

from uncalled.

hasindu2008 avatar hasindu2008 commented on July 17, 2024

I please check if I did it right. I am still getting the same error:

$ ./mask_internal.sh chr22.fa 10 30 chr22_internal_masked.fa
Iteration 0: masked 51812 occurences of TTTTTTTTTT
Iteration 1: masked 51562 occurences of AAAAAAAAAA
Iteration 2: masked 9296 occurences of TGTAATCCCA
Iteration 3: masked 9198 occurences of TGGGATTACA
Iteration 4: masked 8816 occurences of GGAGGCTGAG
Iteration 5: masked 8752 occurences of CTCAGCCTCC
Iteration 6: masked 7462 occurences of CACACACACA
Iteration 7: masked 7046 occurences of CCCAGGCTGG
Iteration 8: masked 6902 occurences of CCAGCCTGGG
Iteration 9: masked 6797 occurences of TGTGTGTGTG
Iteration 10: masked 6211 occurences of TGCAGTGAGC
Iteration 11: masked 6135 occurences of GCTCACTGCA
Iteration 12: masked 6018 occurences of AAAATACAAA
Iteration 13: masked 5924 occurences of TTTGTATTTT
Iteration 14: masked 4984 occurences of TCCCAAAGTG
Iteration 15: masked 4982 occurences of AGTGCAGTGG
Iteration 16: masked 4922 occurences of CCACTGCACT
Iteration 17: masked 4912 occurences of CACTTTGGGA
Iteration 18: masked 4707 occurences of GCAGGAGAAT
Iteration 19: masked 4690 occurences of AGGTCAGGAG
Iteration 20: masked 4621 occurences of ATTCTCCTGC
Iteration 21: masked 4536 occurences of CTCCTGACCT
Iteration 22: masked 4095 occurences of TATATATATA
Iteration 23: masked 4022 occurences of AGACCAGCCT
Iteration 24: masked 4013 occurences of AGGCTGGTCT
Iteration 25: masked 3850 occurences of CCCAGCTACT
Iteration 26: masked 3757 occurences of GGTGAAACCC
Iteration 27: masked 3713 occurences of AGTAGCTGGG
Iteration 28: masked 3654 occurences of GGGTTTCACC
Iteration 29: masked 3339 occurences of GTCTCTACTA

$uncalled index chr22_internal_masked.famask30.fa --max-samples 100000
[bwa_index] Pack FASTA... 0.53 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=101636936, availableWord=19151484
[BWTIncConstructFromPacked] 10 iterations done. 31590664 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 58359704 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 82148056 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 101636936 characters processed.
[bwt_gen] Finished constructing BWT in 40 iterations.
[bwa_index] 37.85 seconds elapse.
[bwa_index] Update BWT... 0.30 sec
[bwa_index] Pack forward-only FASTA... 0.39 sec
[bwa_index] Construct SA from BWT and Occ... 11.30 sec
Initializing parameter search
Traceback (most recent call last):
  File "/home/hasindu/uncalled/bin/uncalled", line 339, in <module>
    index_cmd(args)
  File "/home/hasindu/uncalled/bin/uncalled", line 56, in index_cmd
    p = unc.index.IndexParameterizer(args)
  File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 62, in __init__
    self.calc_map_stats(args)
  File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 82, in calc_map_stats
    fmlens = unc.self_align(args.bwa_prefix, sample_dist)
MemoryError: std::bad_alloc
Command exited with non-zero status 1

from uncalled.

skovaka avatar skovaka commented on July 17, 2024

It looks like the issue is the high frequency of Ns in the reference (chromosome 22 is 22% Ns!). BWA indexes are supposed to replace Ns with random bases, but it appears to repeat a pseudo-random sequence when that many Ns occur, which blows up the repeat finding process. The reference I have contains 50 bases per line, and I simply removed all lines which are entirely Ns, which fixed the problem. I will work on a less hacky way of fixing this soon.

from uncalled.

hasindu2008 avatar hasindu2008 commented on July 17, 2024

OK I see. But if those are removed, then the mapping coordinates will change, right? And that will give bad results when I call uncalled pafstats?

I am also having a few queries about pafstats.

  1. Does it handle supplementary and secondary alignments in the truthset?
  2. What minimap2 parameters will you recommend to generate the truthset?
    I am thinking of doing minimap2 -c -x map-ont ref.fa reads.fastq --secondary=no > ref.paf

from uncalled.

skovaka avatar skovaka commented on July 17, 2024

Yes, unfortunately that will cause incorrect pafstats results, unless you realign the basecalled reads to the modified reference. Sorry, I know that's not ideal. I will work on a better solution.

We do consider supplementary alignments. We count an alignment as "true" if it matches any alignment in the truth set. We only used default parameters besides "-x map-ont". "-c" isn't required since we only consider the endpoints, and neither is "--secondary=no".

from uncalled.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.