
Comments (18)

rob-p avatar rob-p commented on June 10, 2024

Strange --- were there any complaints during index creation? Was the index created successfully? Since there's no core dump, the only thought is that I could try to re-create on a small sample (the reference plus a small set of reads).

from salmon.

mdshw5 avatar mdshw5 commented on June 10, 2024

Confirmed with v0.6.0:

Version Info: Could not resolve upgrade information in the alotted time.
Check for upgrades manually at https://combine-lab.github.io/salmon
# salmon (mapping-based) v0.6.0
# [ program ] => salmon
# [ command ] => quant
# [ index ] => { ... }
# [ libType ] => { IU }
# [ mates1 ] => { ... }
# [ mates2 ] => { ... }
# [ output ] => {... }
# [ threads ] => { 16 }
Logs will be written to ...
there is 1 lib
[2016-01-22 17:59:17.894] [jointLog] [info] parsing read library format
Loading 32-bit quasi index
[2016-01-22 17:59:18.735] [stderrLog] [info] Loading Suffix Array
[2016-01-22 17:59:18.736] [stderrLog] [info] Loading Position Hash
[2016-01-22 17:59:18.731] [jointLog] [info] Loading Quasi index
[2016-01-22 18:00:59.879] [stderrLog] [info] Loading Transcript Info
[2016-01-22 18:01:25.157] [stderrLog] [info] Loading Rank-Select Bit Array
[2016-01-22 18:01:30.642] [stderrLog] [info] There were 552702 set bits in the bit array
[2016-01-22 18:01:31.487] [stderrLog] [info] Computing transcript lengths
[2016-01-22 18:01:31.491] [stderrLog] [info] Waiting to finish loading hash
Index contained 552702 targets
[2016-01-22 18:04:43.717] [jointLog] [info] done
[2016-01-22 18:04:43.717] [stderrLog] [info] Done loading index

I'll check the index creation logs, but didn't notice anything out of the ordinary...


rob-p avatar rob-p commented on June 10, 2024

It seems there are a bunch of targets --- ~500k. That's not a problem, but does that sound right for this reference?


mdshw5 avatar mdshw5 commented on June 10, 2024

Yes :)


rob-p avatar rob-p commented on June 10, 2024

One more question --- what is the approximate size (in nucleotides) of the reference? If it's greater than ~2.14 billion, then it should be using the 64-bit index, which this one is not.

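The threshold rob-p describes can be sketched in a few lines: sum the sequence lengths in the FASTA and pick the offset width accordingly. This is an illustrative sketch of the decision rule described above (offsets into the concatenated reference must fit in a signed 32-bit integer), not salmon's actual implementation; the function names are made up.

```python
# Sketch of the index-width decision: use a 32-bit index when the total
# reference length fits below 2^31 nucleotides, otherwise fall back to 64-bit.
# FASTA parsing here is deliberately minimal (no gzip, no wrapped-line edge cases).

def total_fasta_length(lines):
    """Sum sequence lengths over an iterable of FASTA lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total

def index_width(total_nt):
    """Offset width (in bits) needed to address every reference position."""
    return 32 if total_nt < 2**31 else 64

fasta = [">tx1", "ACGT" * 10, ">tx2", "GGCC"]
print(index_width(total_fasta_length(fasta)))  # prints 32: a tiny reference
```

The reference discussed later in the thread (1,486,025,420 nt) sits below 2^31 - 1 = 2,147,483,647, so the 32-bit index seen in the quant log is the expected choice.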

mdshw5 avatar mdshw5 commented on June 10, 2024

That was going to be my next question!


rob-p avatar rob-p commented on June 10, 2024

If it is a large (i.e. > 2^31 nucleotide reference) then it should trigger the 64-bit index automatically. If there's a failure to do that, it's a bug I have to fix in RapMap. Admittedly, I've not tried to map to many transcriptomes that large, so I'd be much obliged if you could provide me with an example to trigger that behavior :).


mdshw5 avatar mdshw5 commented on June 10, 2024

The indexing log shows nothing out of the ordinary:

[2016-01-22 15:11:57.283] [jointLog] [info] building index
[2016-01-22 15:40:12.318] [jointLog] [info] done building index

There was actually a blank line at the very end of the transcriptome FASTA, which I thought might be related to #22, so I removed this line and re-indexed, but I see the same behavior. I'll check the nucleotide size of the transcriptome now.

cc @jmerkin


mdshw5 avatar mdshw5 commented on June 10, 2024

The nucleotide size is 1486025420, so we are using the correct bit depth. I'm checking the FASTA headers for strange characters that might cause a parsing issue. Anything that I should look for?

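A quick diagnostic for the header check mentioned above: scan each FASTA header for bytes outside a conservative character set (tabs, control characters, and other non-printables are the usual suspects). This is purely an illustrative sketch of such a scan, not anything from salmon; the "safe" set here is an assumption and may be stricter than what the parser actually tolerates.

```python
# Flag FASTA headers containing characters outside a conservative whitelist.
import string

SAFE = set(string.ascii_letters + string.digits + " _.|:-")

def odd_header_chars(lines):
    """Yield (line_number, header, offending_chars) for suspicious headers."""
    for n, line in enumerate(lines, 1):
        if line.startswith(">"):
            bad = sorted(set(line[1:].rstrip("\n")) - SAFE)
            if bad:
                yield n, line.rstrip("\n"), bad

lines = [">tx1 gene=ABC", "ACGT", ">tx2\tweird\x07name", "ACGT"]
for n, header, bad in odd_header_chars(lines):
    print(n, bad)  # reports the header with a tab and a control character
```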

rob-p avatar rob-p commented on June 10, 2024

Yes; that should definitely be identified as 32-bit. The way the parser works is that it "chops" the header at the first whitespace character. I can't think of anything that would cause failure during mapping (but bugs come from exactly the kind of thing you can't think of). Something that might cause an issue, now that I think about it, is a complete poly-A transcript. The indexer will attempt to clip poly-A tails (if a transcript ends with > 10 A's, it will clip all of the trailing A's). If this causes the entire sequence to disappear, it might cause an issue. Also, I hadn't given deep consideration to what might happen if a transcript is shorter than the k-mer size (default 31) used for hashing --- so I might also check for very short transcripts.

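The two checks rob-p suggests can be sketched directly from his description: clip a trailing poly-A run when it exceeds 10 A's, then flag transcripts that end up shorter than k (default 31), since those cannot contribute a single k-mer. This mirrors the behavior as described in the comment above, not salmon's code; function names are illustrative.

```python
# Sketch of the indexing-time checks described above: poly-A clipping and
# a shorter-than-k screen. A fully poly-A transcript clips down to length 0.

K = 31           # default k-mer size mentioned above
MIN_POLYA = 10   # clip only when the trailing run exceeds this

def clip_polya(seq):
    """If the sequence ends in more than MIN_POLYA A's, drop all trailing A's."""
    stripped = seq.rstrip("A")
    if len(seq) - len(stripped) > MIN_POLYA:
        return stripped
    return seq

def problem_transcripts(records):
    """Yield names of transcripts that vanish or fall below k after clipping."""
    for name, seq in records:
        if len(clip_polya(seq)) < K:
            yield name

records = [("tx_polyA", "A" * 50), ("tx_short", "ACGT"), ("tx_ok", "ACGT" * 20)]
print(list(problem_transcripts(records)))  # the all-A and too-short transcripts
```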

mdshw5 avatar mdshw5 commented on June 10, 2024

Thanks for the suggestions. I'm building the quasi index using RapMap now. If I get the same behavior I'll try to debug on my end before leaning more on you.


mdshw5 avatar mdshw5 commented on June 10, 2024

RapMap works fine with this set of transcripts. Indexing:

$ rapmap pseudoindex -k 31 -i /path/to/output -t /path/to/transcripts.fa
RapMap Indexer

[Step 1 of 4] : counting k-mers
counted k-mers for 550000 transcripts
Elapsed time: 3526.23s
Clipped poly-A tails from 2375 transcripts

[Step 2 of 4] : marking k-mers
marked kmers for 550000 transcripts
Elapsed time: 1295.67s

[Step 3 of 4] : building k-mers equivalence classes
done! There were 5077370 classes
Elapsed time: 1351.53s

[Step 4 of 4] : finalizing index
finalized kmers for 550000 transcripts
Elapsed time: 4424.16s
Writing the index to test3/
transcriptIDs.size() = 1419746642
parsed 552702 transcripts
There were 1015977902 distinct k-mers (canonicalized)

which looks fine, and then alignments are generated and RapMap exits with no errors.


rob-p avatar rob-p commented on June 10, 2024

That is . . . strange! Salmon literally uses the RapMap index (and the RapMap functions) directly to obtain the quasi-mappings. One thing I noticed is that you seem to be using pseudoindex which is our independent re-implementation of pseudo-alignment. However, Salmon (and Sailfish) use quasi-mapping (RapMap's quasiindex and quasimap commands, as we found this to be more accurate). I presume that if you used the quasi-mapping functionality, you might observe the bug. If you don't (i.e. if RapMap performs quasi-mapping properly), then this is a real thinker (and I'd be happy to take a look myself if you can share the file).

P.S. The same caveat I mentioned above may apply. That is, it is possible that a poly-A transcript that is completely removed from the input could cause a problem unless we check for it in the quasi-index, but may not affect the pseudo-index. This is because the quasi-index relies on a packed representation of the transcriptome and an associated sparse bit-vector to perform the mapping, and it assumes that all of the transcripts will have a non-zero length (if this is the culprit, it is, of course, easy to fix with an explicit check). You could also test this hypothesis by generating the quasi-index with the --noClip option, which will disable poly-A clipping when building the index.

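A toy model helps make the zero-length hazard concrete. In the packed representation described above, transcripts are concatenated into one string and a bit vector marks the end position of each transcript, so a rank-style query maps a global offset back to its transcript. The sketch below is an assumption-laden simplification (a plain list stands in for the real rank-select bit vector, and all names are invented), but it shows how an empty transcript produces two boundary marks at the same position, violating exactly the non-zero-length assumption mentioned in the comment.

```python
# Toy packed transcriptome: concatenated text plus end-position "bits".

def pack(transcripts):
    """Concatenate sequences; record each transcript's end offset."""
    text, bits = "", []
    for seq in transcripts:
        text += seq
        bits.append(len(text) - 1)  # end offset; collides if seq is empty
    return text, bits

def transcript_of(bits, pos):
    """Rank-style query: index of the transcript containing global offset pos."""
    return sum(1 for b in bits if b < pos)

text, bits = pack(["ACGT", "GG", "TTTT"])
print(transcript_of(bits, 5))   # prints 1: offset 5 lies in the second transcript

# A transcript clipped to length zero duplicates a boundary mark:
_, bad_bits = pack(["ACGT", "", "GG"])
print(bad_bits)                 # two entries collide at the same offset
```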

mdshw5 avatar mdshw5 commented on June 10, 2024

Sorry for the delay in responding to this, and thanks for your support on debugging potential issues. I ran RapMap (nice tool) successfully and then realized that I was just encountering the linux OOM killer... So I'm closing this issue as there really was no issue. Thanks again.


rob-p avatar rob-p commented on June 10, 2024

No problem! We're actually working now on an optional use of a perfect hash in the quasi-index. It increases index construction times, but provides the same speed of lookup as the current hash. Also, it reduces the memory usage by a factor of ~2. We just have to figure out how to implement this cleanly in the code base.


mdshw5 avatar mdshw5 commented on June 10, 2024

BIGGG +1 for this
cc @jmerkin


mdshw5 avatar mdshw5 commented on June 10, 2024

If you need testers for this I'm glad to help.


rob-p avatar rob-p commented on June 10, 2024

It's currently being developed here — https://github.com/COMBINE-lab/RapMap/tree/quasi-mph. Once we're convinced RapMap still behaves correctly when using the perfect hash index, then I have some (not too much) work to do to propagate the necessary changes to Sailfish & Salmon. The option is currently functional. If you grab this branch and build a quasi index using the -p option, it will use the emphf library to build the hash rather than a google dense hash (with a concordant decrease in memory usage).


