Comments (8)

rob-p avatar rob-p commented on June 10, 2024

Hi @nicolasstransky --- thanks for reporting this. Now the question is, how should this be handled? I see at least two obvious possibilities:

  1. Assume that the transcript name should be split at the first whitespace character or |. Currently,
    it is only split at the first whitespace.
  2. If a gtf is provided for gene-level quantification, ensure that some non-trivial number of genes (e.g.
    more than half?) have at least 1 transcript in the index corresponding to them. If not, then complain.
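
As a rough illustration of what the check in option 2 might look like, here is a small Python sketch. It is not salmon's code: the function names, the dictionary layout, and the 0.5 threshold (taken from the "more than half?" suggestion) are all assumptions made for the example.

```python
def fraction_of_genes_covered(gene_to_transcripts, indexed_names):
    """Fraction of genes that have at least one transcript name in the index.

    gene_to_transcripts: dict mapping gene id -> list of transcript ids (from the GTF).
    indexed_names: iterable of transcript names exactly as stored in the index.
    """
    if not gene_to_transcripts:
        return 0.0
    indexed = set(indexed_names)
    covered = sum(
        1
        for transcripts in gene_to_transcripts.values()
        if any(t in indexed for t in transcripts)
    )
    return covered / len(gene_to_transcripts)


def check_gene_map(gene_to_transcripts, indexed_names, min_fraction=0.5):
    """Complain (raise ValueError) unless more than min_fraction of genes are covered."""
    frac = fraction_of_genes_covered(gene_to_transcripts, indexed_names)
    if frac <= min_fraction:
        raise ValueError(
            f"only {frac:.0%} of genes have a transcript in the index; "
            "do the transcript names in the GTF match the indexed names?"
        )
    return frac
```

With names such as `T1|G1|` in the index and plain `T1` in the GTF, no gene would be covered and the check would raise, which is the situation reported here.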

Of course, there are also potentially other, better solutions, so I'm open to suggestions. The problem with 1 is that de novo assemblers may produce transcript names that are not unique up to the first |, so that the whole name needs to be taken into account. The problem with 2 is that it alerts the user to this potential issue, but doesn't resolve it. In the latter case, the user could provide the transcript-to-gene mapping using the provided transcript names in the "simple" format, i.e.

a simple tab-delimited format where each line contains the name of a transcript and the gene to which it belongs separated by a tab

which is also accepted by the --geneMap option. I sort of lean toward 2, but, as I said, am happy to consider other suggestions.
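
For anyone wanting to try that workaround, a minimal sketch of building such a "simple" map from FASTA headers is below. It assumes the headers are pipe-delimited with the transcript ID in the first field and the gene ID in the second (the layout I believe Gencode transcript FASTA files use); the function name and the example IDs are made up for illustration.

```python
def simple_gene_map_lines(fasta_lines):
    """Return 'transcript<TAB>gene' lines, keyed by the full name up to the first whitespace.

    Assumption: header fields are separated by '|' and the second field is the gene id.
    Headers that do not fit this layout are skipped.
    """
    lines = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split()
        if not parts:
            continue  # empty header
        full_name = parts[0]  # name as it would be stored when splitting on whitespace only
        fields = full_name.split("|")
        if len(fields) < 2 or not fields[1]:
            continue  # no gene field to use
        lines.append(f"{full_name}\t{fields[1]}")
    return lines


# Usage with made-up IDs: write the map to a file and pass that file to --geneMap.
headers = [">ENST0001.1|ENSG0001.1|extra| more text", "ACGT"]
with open("tx2gene.tsv", "w") as out:
    out.write("\n".join(simple_gene_map_lines(headers)) + "\n")
```

Because the left-hand column is the full indexed name, the mapping matches regardless of how the names relate to the IDs in the GTF.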

from salmon.

nicolasstransky avatar nicolasstransky commented on June 10, 2024

Fair points. There are potentially a lot of special cases, but since Gencode is widely used, it would be great to have a way to handle its format natively (i.e., consider | in addition to whitespace).
I think the problem with 2 is not a real problem, because if you can't match transcript names in the GTF file that is provided, it's likely that there is a problem with the input.

mdshw5 avatar mdshw5 commented on June 10, 2024

This issue reminds me to ask: what is the best way to ingest a GTF plus reference FASTA file and produce a transcript FASTA file ready for salmon indexing? I see that there may be some issues with using cufflinks gtf-to-fasta tool: https://groups.google.com/forum/#!msg/sailfish-users/oNVLlxJzgv4/nQYt9m4BBOcJ

rob-p avatar rob-p commented on June 10, 2024

@nicolasstransky --- Ok, so, while I'm generally reluctant to adopt special cases, Gencode may warrant one. Or, a more general solution would be to allow the user to specify a list of "separator" characters while indexing (which defaults to \s+). I think that, so far, I actually like this option the best. Also, this isn't mutually exclusive with 2. The ideal thing would be to (1) allow arbitrary separators defined by the user and (2) warn the user if many genes seem to have no transcripts in the index.
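
To make the separator idea concrete, here is a hedged Python sketch of a name parser with user-defined separators. It only illustrates the behaviour discussed in this thread and is not how salmon implements it; `make_name_parser` and the example headers are hypothetical.

```python
import re


def make_name_parser(separators=None):
    """Build a function that truncates a FASTA header at the first separator.

    separators: string of single characters to split on; None means the
    default of splitting on whitespace only (\\s+).
    """
    if separators is None:
        pattern = re.compile(r"\s+")
    else:
        # Escape each character so that '|', ']' or '^' are treated literally.
        pattern = re.compile("[" + "".join(re.escape(c) for c in separators) + "]")

    def parse(header):
        name = header.lstrip(">").strip()
        return pattern.split(name, maxsplit=1)[0]

    return parse


default_parse = make_name_parser()        # whitespace only
pipe_parse = make_name_parser("| \t")     # '|' plus space and tab

header = ">ENST0001.1|ENSG0001.1| some description"
print(default_parse(header))  # ENST0001.1|ENSG0001.1|
print(pipe_parse(header))     # ENST0001.1
```

The same sketch also shows the concern about assembler output: made-up names such as `asm|c1|i1` and `asm|c1|i2` would both collapse to `asm` under `pipe_parse`, which is why the separator list should stay under the user's control.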

rob-p avatar rob-p commented on June 10, 2024

@mdshw5, the best option I've found so far is actually rsem-prepare-reference. It's a bit slower than gtf-to-fasta, but, so far, seems to do a better job producing a usable transcriptome in the general case.

nicolasstransky avatar nicolasstransky commented on June 10, 2024

@rob-p Using a list of "separator" characters is a nice idea. I think that's the best solution so far. However, it would also be good if Gencode files worked "out of the box", since they are so commonly used.

mdshw5 avatar mdshw5 commented on June 10, 2024

Thanks, @rob-p. In the same vein, have you considered taking a GTF + FASTA for salmon index? It seems this might even solve @nicolasstransky's issue here.

rob-p avatar rob-p commented on June 10, 2024

The gencode option behaves as described above, and is implemented as of commit d44df88, so it should make it into the next tagged release.
