Comments (8)
Hi @nicolasstransky --- thanks for reporting this. Now the question is, how should this be handled? I see at least 2 obvious possibilities:
- Assume that the transcript name should be split at the first whitespace character or |. Currently, it is only split at the first whitespace.
- If a gtf is provided for gene-level quantification, ensure that some non-trivial number of genes (e.g. more than half?) have at least 1 transcript in the index corresponding to them. If not, then complain.
Of course, there are also potentially other, better solutions, so I'm open to suggestions. The problem with 1 is that de novo assemblers may have transcript names that are not unique up to the first |, so that the whole name needs to be taken into account. The problem with 2 is that it alerts the user to this potential issue, but doesn't resolve it. In the latter case, the user could provide the transcript-to-gene mapping using the provided transcript names in the "simple" format (i.e. a simple tab-delimited format where each line contains the name of a transcript and the gene to which it belongs, separated by a tab), which is also accepted by the --geneMap option. I sort of lean toward 2, but, as I said, am happy to consider other suggestions.
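To make possibility 2 concrete, here is a minimal sketch of the check in Python. It is not salmon's implementation: the function names, the sample transcript and gene names, and the one-half threshold are all assumptions taken from the description above. It reads a transcript-to-gene map in the "simple" tab-delimited format and reports the fraction of genes that have at least one transcript whose name appears in the index.

```python
import csv
import io


def load_simple_gene_map(handle):
    """Read the "simple" format: transcript name, a tab, then gene name, one pair per line."""
    tx_to_gene = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            tx_to_gene[row[0]] = row[1]
    return tx_to_gene


def fraction_of_genes_covered(tx_to_gene, indexed_names):
    """Fraction of genes with at least one transcript whose name is in the index."""
    genes = set(tx_to_gene.values())
    covered = {gene for tx, gene in tx_to_gene.items() if tx in indexed_names}
    return len(covered) / len(genes) if genes else 0.0


# Hypothetical data: the map uses bare transcript IDs, while the index kept
# the whole Gencode-style name up to the first whitespace.
gene_map = io.StringIO(
    "ENST0001.1\tGENE_A\n"
    "ENST0002.1\tGENE_A\n"
    "ENST0003.1\tGENE_B\n"
)
indexed = {"ENST0001.1|ENSG0001.1|GENE_A|", "ENST0003.1|ENSG0002.1|GENE_B|"}

tx_to_gene = load_simple_gene_map(gene_map)
coverage = fraction_of_genes_covered(tx_to_gene, indexed)
if coverage <= 0.5:  # assumed threshold ("more than half?")
    print(f"warning: only {coverage:.0%} of genes have a transcript in the index")
```

With these made-up names no gene is matched, so the warning fires. As noted above, the check only flags the mismatch and does not repair it.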
from salmon.
Fair points. There are potentially a lot of special cases, but since Gencode is widely used, it would be great to have a way to handle its format natively (i.e. consider | in addition to whitespace).
I think the problem with 2 is not a real problem, because if you can't match transcript names in the gtf file that is provided, it's likely that there is a problem with the input.
This issue reminds me to ask: what is the best way to ingest a GTF plus reference FASTA file and produce a transcript FASTA file ready for salmon indexing? I see that there may be some issues with using the cufflinks gtf-to-fasta tool: https://groups.google.com/forum/#!msg/sailfish-users/oNVLlxJzgv4/nQYt9m4BBOcJ
@nicolasstransky --- Ok, so, while I'm generally reluctant to adopt special cases, Gencode may warrant one. Or, a more general solution would be to allow the user to specify a list of "separator" characters while indexing (which defaults to \s+). I think that, so far, I actually like this option the best. Also, this isn't mutually exclusive with 2. The ideal thing would be to (1) allow arbitrary separators defined by the user and (2) warn the user if many genes seem to have no transcripts in the index.
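The user-defined separator idea could look roughly like the Python sketch below. This is an illustration only, not salmon's code: `truncate_name` and its `separators` parameter are hypothetical names, and the headers are invented. With no separators given it falls back to splitting at the first run of whitespace, matching the \s+ default mentioned above.

```python
import re


def truncate_name(header, separators=None):
    """Return the part of a FASTA header before the first separator.

    separators: a string of single characters; None or empty means any whitespace.
    """
    if not separators:
        pattern = r"\s+"
    else:
        # Build a character class such as [\ \|] from the requested characters.
        pattern = "[" + "".join(re.escape(ch) for ch in separators) + "]"
    return re.split(pattern, header, maxsplit=1)[0]


gencode_header = "ENST0001.1|ENSG0001.1|GENE_A| extra"

# Default: only whitespace ends the name, so the | fields are kept.
default_name = truncate_name(gencode_header)

# Adding | as a separator recovers the bare transcript ID.
gencode_name = truncate_name(gencode_header, separators=" \t|")

# The caveat raised earlier in the thread: de novo assembler names may stop
# being unique once they are cut at the first |.
denovo = ["comp1|seq1 len=300", "comp1|seq2 len=450"]
denovo_names = [truncate_name(h, separators=" \t|") for h in denovo]
```

Here `default_name` keeps everything up to the space, `gencode_name` is just the transcript ID, and both de novo headers collapse to the same name, which is why the separator list is better left as an opt-in setting.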
@mdshw5, the best option I've found so far is actually rsem-prepare-reference. It's a bit slower than gtf-to-fasta, but, so far, seems to do a better job producing a usable transcriptome in the general case.
@rob-p Using a list of "separator" characters is a nice idea. I think that's the best solution so far. However, it would also be good if Gencode files worked "out of the box", since they are so commonly used.
Thanks, @rob-p. In the same vein, have you considered taking a GTF + FASTA for salmon index? It seems this might even solve @nicolasstransky's issue here.
The gencode option behaves as described above, and is implemented as of commit d44df88, so it should make it into the next tagged release.