combine-lab / salmon

๐ŸŸ ๐Ÿฃ ๐Ÿฑ Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment

Home Page: https://combine-lab.github.io/salmon

License: GNU General Public License v3.0

Languages: C++ 97.56%, C 1.56%, CMake 0.53%, Shell 0.21%, Python 0.08%, Dockerfile 0.03%, Nextflow 0.02%, Standard ML 0.01%
Topics: quasi-mapping, bioinformatics, rna-seq, rnaseq, salmon, quantification, sailfish, c-plus-plus, gene-expression, scrna-seq

salmon's Introduction


Try out the new alevin-fry framework for single-cell analysis; tutorials can be found here!

Help guide the development of Salmon by taking our survey.

What is Salmon?

Salmon is a wicked-fast program for producing highly accurate, transcript-level quantification estimates from RNA-seq data. Salmon achieves its accuracy and speed via a number of different innovations, including the use of selective alignment (accurate but fast-to-compute proxies for traditional read alignments) and massively parallel stochastic collapsed variational inference. The result is a versatile tool that fits nicely into many different pipelines. For example, you can choose to make use of our selective-alignment algorithm by providing Salmon with raw sequencing reads, or, if it is more convenient, you can provide Salmon with regular alignments (e.g. an unsorted BAM file with alignments to the transcriptome produced with your favorite aligner), and it will use the same wicked-fast, state-of-the-art inference algorithm to estimate transcript-level abundances for your experiment.

Give salmon a try! You can find the latest binary releases here.
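For orientation, a minimal mapping-based invocation looks like the sketch below. The index directory and read file names are placeholders, not files from this repository, and the command is only echoed (a dry run) so nothing needs to be on your PATH; `-l A` asks Salmon to auto-detect the library type.

```shell
# Dry-run sketch of a basic mapping-based `salmon quant` call.
# File/index names are placeholders; drop the `echo` wrapper
# (and build an index first) to actually run it.
cmd="salmon quant -i salmon_index -l A \
  -1 reads_1.fastq.gz -2 reads_2.fastq.gz \
  -p 8 -o quant_out"
echo "$cmd"
```

Alignment-based mode instead takes `-a alignments.bam` together with `-t transcripts.fa` in place of `-i`.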

The current version number of the master branch of Salmon can be found here.

Documentation

The documentation for Salmon is available on ReadTheDocs; check it out here.

Salmon is, and will continue to be, freely and actively supported on a best-effort basis. If you need industrial-grade technical support, please consider the options at oceangenomics.com/contact.

Decoy sequences in transcriptomes

tl;dr: fast is good, but fast and accurate is better! Alignment and mapping methodology influence transcript abundance estimation, and accounting for fragments of unexpected origin can improve transcript quantification. To this end, salmon provides the ability to index both the transcriptome and decoy sequence that can be considered during mapping and quantification. The decoy sequence accounts for reads that might otherwise be (spuriously) attributed to some annotated transcript. This tutorial provides a step-by-step guide on how to efficiently index the reference transcriptome and genome to produce a decoy-aware index. Specifically, there are 3 possible ways in which the salmon index can be created:

  • cDNA-only index : salmon_index - https://combine-lab.github.io/salmon/getting_started/. This method will result in the smallest index and require the least resources to build, but it is the most prone to spurious alignments.

  • SA mashmap index: salmon_partial_sa_index - (regions of the genome that have high sequence similarity to the transcriptome) - Details can be found in this README and using this script. While running mashmap can require considerable resources, the resulting decoy files are fairly small. This will result in an index bigger than the cDNA-only index, but still much smaller than the full genome index below. It will confer many, though not all, of the benefits of using the entire genome as decoy sequence.

  • SAF genome index: salmon_sa_index - (the full genome is used as decoy) - The tutorial for creating such an index can be found here. This will result in the largest index, but likely does the best job in avoiding spurious alignments to annotated transcripts.
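As a concrete sketch of preparing inputs for the full-genome decoy case: `decoys.txt` lists the names of the genome sequences, and `gentrome.fa` is the transcriptome concatenated with the genome. The tiny FASTA files below are invented stand-ins for a real transcriptome and genome.

```shell
# Toy stand-in FASTA files (a real run would use your reference files).
printf '>tx1\nACGTACGT\n>tx2\nGGGGCCCC\n' > transcriptome.fa
printf '>chr1 desc\nACGTACGTACGT\n>chr2\nTTTTAAAA\n' > genome.fa

# decoys.txt: one genome sequence name per line (text after the first
# whitespace in a header is dropped).
grep '^>' genome.fa | sed 's/^>//; s/[[:space:]].*//' > decoys.txt

# gentrome.fa: transcriptome first, then the genome (decoys last).
cat transcriptome.fa genome.fa > gentrome.fa

head -2 decoys.txt

# With salmon on your PATH, the decoy-aware index is then built with:
# salmon index -t gentrome.fa -d decoys.txt -i salmon_index -k 31
```

The ordering matters: salmon expects the decoy sequences to come after the real transcripts in the combined FASTA.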

Facing problems with indexing? Check whether anyone else has already had this problem in the issues section, or fill out the index-generation request form.

NOTE:

If you are generating an index to be used for single-cell or single-nucleus quantification with alevin-fry, then we recommend you consider building a spliced+intron (splici) reference. This serves much of the purpose of a decoy-aware index when quantifying with alevin-fry, while also providing the capability to attribute splicing status to mapped fragments. More details about the splici reference and the Unspliced/Spliced/Ambiguous quantification mode it enables can be found here.

Chat live about Salmon

You can chat with the Salmon developers and other users via Gitter (Note: Gitter is much less frequently monitored than GitHub, so if you have an important problem or question, please consider opening an issue here on GitHub)!

Join the chat at https://gitter.im/COMBINE-lab/salmon

salmon's People

Contributors

a-n-other, alanhoyle, ctb, emollier, fataltes, flyingdeveloper, gaberoo, gaura, geetduggal, gitter-badger, gmarcais, hiraksarkar, jdrnevich, k3yavi, kai2june, mbergins, mdshw5, mikelove, mohsenzakeri, molecules, mr-c, mribeirodantas, pabloaledo, patrickvdb, petehaitch, rob-p, roryk, serverhorror, sjackman, vals


salmon's Issues

Question: does salmon use "single-mapped" reads for quant?

When doing paired mapping, we get a lot of single-mapped reads (probably indicative of a crappy transcriptome - we know, we know). Do these reads count towards quantification?

salmon quant -i cDNA_transcripts.index --libType ISR -1 L101_resync_R1.fastq -2 L101_resync_R2.fastq -o L101.quant

[2015-11-03 14:08:26.861] [jointLog] [info] Mapping rate = 26.4703%

[2015-11-03 14:08:26.861] [jointLog] [info] finished quantifyLibrary()
[2015-11-03 14:08:26.861] [jointLog] [info] Starting optimizer
[2015-11-03 14:08:26.878] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2015-11-03 14:08:26.889] [jointLog] [info] iteration = 0 | max rel diff. = 499.998
[2015-11-03 14:08:27.264] [jointLog] [info] iteration = 100 | max rel diff. = 0.216535
[2015-11-03 14:08:27.647] [jointLog] [info] iteration = 200 | max rel diff. = 0.123972
[2015-11-03 14:08:28.032] [jointLog] [info] iteration = 300 | max rel diff. = 0.108899
[2015-11-03 14:08:28.403] [jointLog] [info] iteration = 400 | max rel diff. = 0.0560467
[2015-11-03 14:08:28.772] [jointLog] [info] iteration = 500 | max rel diff. = 0.0441335
[2015-11-03 14:08:29.147] [jointLog] [info] iteration = 600 | max rel diff. = 0.0387812
[2015-11-03 14:08:29.524] [jointLog] [info] iteration = 700 | max rel diff. = 0.0333914
[2015-11-03 14:08:29.901] [jointLog] [info] iteration = 800 | max rel diff. = 0.0282682
[2015-11-03 14:08:30.279] [jointLog] [info] iteration = 900 | max rel diff. = 0.0253705
[2015-11-03 14:08:30.659] [jointLog] [info] iteration = 1000 | max rel diff. = 0.0229764
[2015-11-03 14:08:31.040] [jointLog] [info] iteration = 1100 | max rel diff. = 0.0223721
[2015-11-03 14:08:31.414] [jointLog] [info] iteration = 1200 | max rel diff. = 0.0202505
[2015-11-03 14:08:31.792] [jointLog] [info] iteration = 1300 | max rel diff. = 0.0186214
[2015-11-03 14:08:32.177] [jointLog] [info] iteration = 1400 | max rel diff. = 0.0181308
[2015-11-03 14:08:32.568] [jointLog] [info] iteration = 1500 | max rel diff. = 0.0159512
[2015-11-03 14:08:32.948] [jointLog] [info] iteration = 1600 | max rel diff. = 0.0156004
[2015-11-03 14:08:33.323] [jointLog] [info] iteration = 1700 | max rel diff. = 0.0134322
[2015-11-03 14:08:33.700] [jointLog] [info] iteration = 1800 | max rel diff. = 0.0131983
[2015-11-03 14:08:34.077] [jointLog] [info] iteration = 1900 | max rel diff. = 0.0123282
[2015-11-03 14:08:34.453] [jointLog] [info] iteration = 2000 | max rel diff. = 0.0123282
[2015-11-03 14:08:34.835] [jointLog] [info] iteration = 2100 | max rel diff. = 0.0105099
[2015-11-03 14:08:35.213] [jointLog] [info] iteration = 2200 | max rel diff. = 0.0105132
[2015-11-03 14:08:35.439] [jointLog] [info] iteration = 2261 | max rel diff. = 0.00994575
[2015-11-03 14:08:35.441] [jointLog] [info] Finished optimizer
[2015-11-03 14:08:35.441] [jointLog] [info] writing output

[2015-11-03 14:08:35.481] [jointLog] [warning] NOTE: Read Lib [( L101_resync_R1.fastq, L101_resync_R2.fastq )] :

Greater than 5% of the alignments (but not, necessarily reads) disagreed with the provided library type; check the file: L101.quant/libFormatCounts.txt for details




L101.quant/libFormatCounts.txt

========
Read library consisting of files: ( L101_resync_R1.fastq, L101_resync_R2.fastq )

Expected format: Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) }

# of consistent alignments: 6342265
# of inconsistent alignments: 2277243

========
---- counts for each format type ---
Library format { type:single end, relative orientation:matching, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:matching, strandedness:(sense, antisense) } : 0
Library format { type:single end, relative orientation:outward, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:outward, strandedness:(sense, antisense) } : 43
Library format { type:single end, relative orientation:inward, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:inward, strandedness:(sense, antisense) } : 146846
Library format { type:single end, relative orientation:none, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:none, strandedness:(sense, antisense) } : 0
Library format { type:single end, relative orientation:matching, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:matching, strandedness:(antisense, sense) } : 0
Library format { type:single end, relative orientation:outward, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:outward, strandedness:(antisense, sense) } : 21990
Library format { type:single end, relative orientation:inward, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) } : 6342265
Library format { type:single end, relative orientation:none, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:none, strandedness:(antisense, sense) } : 0
Library format { type:single end, relative orientation:matching, strandedness:sense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:sense } : 9
Library format { type:single end, relative orientation:outward, strandedness:sense } : 0
Library format { type:paired end, relative orientation:outward, strandedness:sense } : 0
Library format { type:single end, relative orientation:inward, strandedness:sense } : 0
Library format { type:paired end, relative orientation:inward, strandedness:sense } : 0
Library format { type:single end, relative orientation:none, strandedness:sense } : 1028091
Library format { type:paired end, relative orientation:none, strandedness:sense } : 0
Library format { type:single end, relative orientation:matching, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:antisense } : 430
Library format { type:single end, relative orientation:outward, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:outward, strandedness:antisense } : 0
Library format { type:single end, relative orientation:inward, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:inward, strandedness:antisense } : 0
Library format { type:single end, relative orientation:none, strandedness:antisense } : 1079834
Library format { type:paired end, relative orientation:none, strandedness:antisense } : 0
Library format { type:single end, relative orientation:matching, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:matching, strandedness:unstranded } : 0
Library format { type:single end, relative orientation:outward, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:outward, strandedness:unstranded } : 0
Library format { type:single end, relative orientation:inward, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:inward, strandedness:unstranded } : 0
Library format { type:single end, relative orientation:none, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:none, strandedness:unstranded } : 0
------------------------------------
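From the counts in libFormatCounts.txt above, the share of alignments consistent with the declared ISR library type can be worked out directly (a small awk sketch; the two totals are copied from the report):

```shell
# Consistent vs. inconsistent alignment counts from the report above.
consistent=6342265
inconsistent=2277243
awk -v c="$consistent" -v i="$inconsistent" \
  'BEGIN { printf "%.1f%% of alignments agree with ISR\n", 100 * c / (c + i) }'
# -> 73.6% of alignments agree with ISR
```

Note that most of the disagreement in this report comes from the two `type:single end, relative orientation:none` rows, i.e. orphaned mates, which is exactly the "single-mapped reads" situation the question asks about.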

Segmentation fault on some data when using `--useFSPD`

I've been running Salmon on some data from accession SRP034543. For some it works fine, but some there is segmentation fault when using the --useFSPD. For an example run where it fails, have a look at accession SRR2048254.

Here is the command I ran along with output

$ salmon quant \
> -i /nfs/research2/teichmann/reference/mus-musculus/salmon/mouse_cdna_38.p4.83_repbase20.11_ercc_SIRV.fa \
> -l IU \
> -1 <(zcat /nfs/research2/teichmann/valentine/data/SRP034543/SRR2048254_1.fastq.gz) \
> -2 <(zcat /nfs/research2/teichmann/valentine/data/SRP034543/SRR2048254_2.fastq.gz) \
> -o /tmp/SRR2048254_salmon_out \
> --biasCorrect \
> --useFSPD
Version Info: This is the most recent version of Salmon.
# salmon (mapping-based) v0.6.0
# [ program ] => salmon
# [ command ] => quant
# [ index ] => { /nfs/research2/teichmann/reference/mus-musculus/salmon/mouse_cdna_38.p4.83_repbase20.11_ercc_SIRV.fa }
# [ libType ] => { IU }
# [ mates1 ] => { /dev/fd/63 }
# [ mates2 ] => { /dev/fd/62 }
# [ output ] => { /tmp/SRR2048254_salmon_out }
# [ biasCorrect ] => { }
# [ useFSPD ] => { }
Logs will be written to /tmp/SRR2048254_salmon_out/logs
[2016-06-21 10:04:29.524] [jointLog] [info] parsing read library format
there is 1 lib
Loading 32-bit quasi index
tcmalloc: large alloc 4294967296 bytes == 0x4d084000 @
[2016-06-21 10:04:30.159] [stderrLog] [info] Loading Suffix Array
[2016-06-21 10:04:30.159] [stderrLog] [info] Loading Position Hash
[2016-06-21 10:04:30.158] [jointLog] [info] Loading Quasi index
[2016-06-21 10:04:32.681] [stderrLog] [info] Loading Transcript Info
[2016-06-21 10:04:33.686] [stderrLog] [info] Loading Rank-Select Bit Array
[2016-06-21 10:04:34.050] [stderrLog] [info] There were 115426 set bits in the bit array
[2016-06-21 10:04:34.376] [stderrLog] [info] Computing transcript lengths
[2016-06-21 10:04:34.377] [stderrLog] [info] Waiting to finish loading hash
Index contained 115426 targets
[2016-06-21 10:04:47.033] [jointLog] [info] done
[2016-06-21 10:04:47.033] [stderrLog] [info] Done loading index



processed 6500000 fragments
hits: 13927069, hits per frag:  2.15389
[2016-06-21 10:05:13.847] [jointLog] [info] Computed 165969 rich equivalence classes for further processing
[2016-06-21 10:05:13.847] [jointLog] [info] Counted 6514601 total reads in the equivalence classes
[2016-06-21 10:05:13.893] [jointLog] [info] Mapping rate = 95.0922%

[2016-06-21 10:05:13.893] [jointLog] [info] finished quantifyLibrary()
[2016-06-21 10:05:13.894] [jointLog] [info] Starting optimizer
Segmentation fault (core dumped)

It runs fine when disabling --useFSPD

Segmentation fault when reporting results (0.6.0)

I was just trying out the new version. It seems to be working, but it fails at the time of printing the results.

Here's a tail of the text output:

[2016-01-02 00:08:48.445] [jointLog] [info] Computed 66783 rich equivalence classes for further processing
[2016-01-02 00:08:48.445] [jointLog] [info] Counted 2977936 total reads in the equivalence classes
[2016-01-02 00:08:54.862] [jointLog] [warning] Only 2977936 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

[2016-01-02 00:08:54.862] [jointLog] [warning] Since only 2977936 (< 5000000) fragments were observed, modeling of the fragment start position distribution has been disabled
[2016-01-02 00:08:54.862] [jointLog] [info] Mapping rate = 48.8134%

[2016-01-02 00:08:54.862] [jointLog] [info] finished quantifyLibrary()
[2016-01-02 00:08:54.863] [jointLog] [info] Starting optimizer
[2016-01-02 00:08:54.918] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2016-01-02 00:08:54.921] [jointLog] [info] iteration = 0 | max rel diff. = 48.4964
[2016-01-02 00:08:55.024] [jointLog] [info] iteration 50, recomputing effective lengths
[2016-01-02 00:08:57.626] [jointLog] [info] iteration = 100 | max rel diff. = 0.157189
[2016-01-02 00:08:57.835] [jointLog] [info] iteration = 200 | max rel diff. = 0.0984302
[2016-01-02 00:08:58.048] [jointLog] [info] iteration = 300 | max rel diff. = 0.0774471
[2016-01-02 00:08:58.265] [jointLog] [info] iteration = 400 | max rel diff. = 0.0866256
[2016-01-02 00:08:58.472] [jointLog] [info] iteration 500, recomputing effective lengths
[2016-01-02 00:09:00.486] [jointLog] [info] iteration = 500 | max rel diff. = 0.0216284
[2016-01-02 00:09:00.696] [jointLog] [info] iteration = 600 | max rel diff. = 0.0269734
[2016-01-02 00:09:00.905] [jointLog] [info] iteration = 700 | max rel diff. = 0.0166003
[2016-01-02 00:09:01.113] [jointLog] [info] iteration = 800 | max rel diff. = 0.0136659
[2016-01-02 00:09:01.334] [jointLog] [info] iteration = 900 | max rel diff. = 0.0114614
[2016-01-02 00:09:01.542] [jointLog] [info] iteration 1000, recomputing effective lengths
[2016-01-02 00:09:03.495] [jointLog] [info] iteration = 1000 | max rel diff. = 0.0102234
[2016-01-02 00:09:03.716] [jointLog] [info] iteration = 1100 | max rel diff. = 0.0202324
[2016-01-02 00:09:03.929] [jointLog] [info] iteration = 1200 | max rel diff. = 0.010957
[2016-01-02 00:09:03.946] [jointLog] [info] iteration = 1209 | max rel diff. = 0.00996627
[2016-01-02 00:09:03.952] [jointLog] [info] Finished optimizer
[2016-01-02 00:09:03.952] [jointLog] [info] writing output

Computing gene-level abundance estimates
[2016-01-02 00:09:04.141] [jointLog] [warning] NOTE: Read Lib [( /nfs/research2/teichmann/valentine/detection-comparison/salmon0.4.2-comparison/mouse/ERP009633_cell20_1.fastq, /nfs/research2/teichmann/valentine/detection-comparison/salmon0.4.2-comparison/mouse/ERP009633_cell20_2.fastq )] :

Greater than 5% of the alignments (but not, necessarily reads) disagreed with the provided library type; check the file: /tmp/ERP009633_cell20_salmon_out/libFormatCounts.txt for details

There were 104534 transcripts mapping to 44034 genes
Parsed 104000 expression lines
done
Aggregating expressions to gene level . . . done
Segmentation fault (core dumped)

Remove antiquated and potentially confusing reporting of "overall" mapping rate.

The "effective" mapping rate is the true mapping rate used in, e.g., the TPM / estimated-number-of-reads calculations. The overall mapping rate, in addition to being poorly named, is a historical artifact. The proposal here (which should be implemented in the next release) is to report only a single "mapping rate" (equal to what is currently the "effective" mapping rate), representing the quantities used in computing the TPM and estimated-number-of-reads columns.
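The single reported mapping rate is just the fraction of fragments that contribute to the equivalence classes. With hypothetical counts (not taken from any run in this thread), the arithmetic is:

```shell
# Hypothetical fragment counts, chosen only to illustrate the formula.
mapped=4881340      # fragments counted in equivalence classes
total=10000000      # fragments processed
awk -v m="$mapped" -v t="$total" \
  'BEGIN { printf "Mapping rate = %.4f%%\n", 100 * m / t }'
# -> Mapping rate = 48.8134%
```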

Write non-error output to stdout

As suggested by Nick Schurch, we should be writing non-error output (including simple logging and informative messages) to stdout rather than stderr.

Error building salmon (possible issue with Boost's iostreams)

Hi,

Building salmon with -DFETCH_BOOST=TRUE ends with an error that might be caused by the build process looking for Boost's iostreams library in the wrong place.

The end of the output on the terminal is:

[ 80%] Built target salmon_core
make -f src/CMakeFiles/salmon.dir/build.make src/CMakeFiles/salmon.dir/depend
make[2]: Entering directory `/opt/local/salmon-index/resources/salmon-0.4.2'
cd /opt/local/salmon-index/resources/salmon-0.4.2 && /usr/bin/cmake -E cmake_depends "Unix Makefiles" /opt/local/salmon-index/resources/salmon-0.4.2 /opt/local/salmon-index/resources/salmon-0.4.2/src /opt/local/salmon-index/resources/salmon-0.4.2 /opt/local/salmon-index/resources/salmon-0.4.2/src /opt/local/salmon-index/resources/salmon-0.4.2/src/CMakeFiles/salmon.dir/DependInfo.cmake --color=
make[2]: Leaving directory `/opt/local/salmon-index/resources/salmon-0.4.2'
make -f src/CMakeFiles/salmon.dir/build.make src/CMakeFiles/salmon.dir/build
make[2]: Entering directory `/opt/local/salmon-index/resources/salmon-0.4.2'
make[2]: *** No rule to make target `/usr/lib/libboost_iostreams-mt.a', needed by `src/salmon'.  Stop.
make[2]: Leaving directory `/opt/local/salmon-index/resources/salmon-0.4.2'
make[1]: *** [src/CMakeFiles/salmon.dir/all] Error 2
make[1]: Leaving directory `/opt/local/salmon-index/resources/salmon-0.4.2'
make: *** [all] Error 2

Earlier in the output I can spot this:

cc1plus: warning: unrecognized command line option "-Wno-deprecated-register" [enabled by default]
make[2]: *** No rule to make target `/usr/lib/libboost_iostreams-mt.a', needed by `src/salmon'.  Stop.
make[2]: Leaving directory `/opt/local/salmon-index/resources/salmon-0.4.2'
make[1]: *** [src/CMakeFiles/salmon.dir/all] Error 2
make[1]: Leaving directory `/opt/local/salmon-index/resources/salmon-0.4.2'
make: *** [all] Error 2

`Error: no such instruction: 'xtest'` when building salmon

Hi,

I am getting the following when building Salmon (gcc/g++ 4.8.2, and boost from -DFETCH_BOOST=TRUE when calling cmake)

(...)
[ 64%] Performing build step for 'libtbb'
cd /opt/local/salmon-index/resources/salmon-0.4.2/external/tbb43_20140724oss && make "CXXFLAGS= -UDO_ITT_NOTIFY" lambdas=1 compiler=gcc cfg=release tbb_build_prefix=LIBS
make[3]: Entering directory `/opt/local/salmon-index/resources/salmon-0.4.2/external/tbb43_20140724oss'
Created ./build/LIBS_release and ..._debug directories
make -C "./build/LIBS_debug"  -r -f ../../build/Makefile.tbb cfg=debug
make[4]: Entering directory `/opt/local/salmon-index/resources/salmon-0.4.2/external/tbb43_20140724oss/build/LIBS_debug'
../../build/Makefile.tbb:31: CONFIG: cfg=debug arch=intel64 compiler=gcc target=linux runtime=cc4.8_libc2.15_kernel4.0.9
g++ -o x86_rtm_rw_mutex.o -c -MMD -DTBB_USE_DEBUG -DDO_ITT_NOTIFY -g -O0 -DUSE_PTHREAD -m64 -mrtm -fPIC -D__TBB_BUILD=1 -Wall -Wno-parentheses -Wno-non-virtual-dtor -UDO_ITT_NOTIFY -std=c++0x -D_TBB_CPP0X  -I../../src -I../../src/rml/include -I../../include ../../src/tbb/x86_rtm_rw_mutex.cpp
/tmp/ccgalJzL.s: Assembler messages:
/tmp/ccgalJzL.s:628: Error: no such instruction: `xtest'
/tmp/ccgalJzL.s:656: Error: no such instruction: `xabort $255'
/tmp/ccgalJzL.s:665: Error: no such instruction: `xabort $255'
/tmp/ccgalJzL.s:671: Error: no such instruction: `xend'
/tmp/ccgalJzL.s:840: Error: no such instruction: `xbegin .L56'
/tmp/ccgalJzL.s:1012: Error: no such instruction: `xbegin .L73'
/tmp/ccgalJzL.s:1269: Error: no such instruction: `xabort $255'
make[4]: *** [x86_rtm_rw_mutex.o] Error 1
make[4]: Leaving directory `/opt/local/salmon-index/resources/salmon-0.4.2/external/tbb43_20140724oss/build/LIBS_debug'
make[3]: *** [tbb] Error 2
make[3]: Leaving directory `/opt/local/salmon-index/resources/salmon-0.4.2/external/tbb43_20140724oss'
make[2]: *** [libtbb-prefix/src/libtbb-stamp/libtbb-build] Error 2
make[2]: Leaving directory `/opt/local/salmon-index/resources/salmon-0.4.2'
make[1]: *** [CMakeFiles/libtbb.dir/all] Error 2
make[1]: Leaving directory `/opt/local/salmon-index/resources/salmon-0.4.2'
make: *** [all] Error 2

Build failure (libtbb)

My first attempt at building salmon failed during what appears to be the tbblib build step. I'm going to continue by building tbb on my own and starting over, but wanted to post notes here. Steps:

$ uname -a
Linux jorvisvm-lx 2.6.18-371.1.2.el5 #1 SMP Mon Oct 7 16:34:35 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

$ pwd
/tmp/salmon-0.6.1-pre

$ mkdir build && cd build

## set up environment:
export PATH=/usr/local/packages/gcc-5.3.0/bin:$PATH

$ cmake -DFETCH_BOOST=TRUE -DCMAKE_INSTALL_PREFIX=/usr/local/packages/salmon-v0.6.1-pre ../
$ make

build.failure.txt

Full length cDNA distribution as a guide to abundances?

A lot of the time, when we are assessing our samples before moving on to fragmenting the cDNA, we look at the distribution of full-length cDNA using a Bioanalyzer.

See for example panel a of this figure

With the reference transcriptome, we know the distribution of transcripts with given lengths.

We can view the reference transcript length distribution as an unweighted distribution of lengths, and the electropherogram as the distribution obtained when weighting transcript lengths by their relative abundances.

Thus it seems the distribution of full length cDNA could be informative when inferring the TPMs (relative abundances) in a sample.

Do you think it could be possible to integrate this with the quantification model?
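The proposal can be sketched numerically: weight each transcript length by its relative abundance and compare the weighted distribution against the electropherogram. A toy awk version with invented (length, TPM) pairs:

```shell
# Toy (length, TPM) pairs; a real run would use the reference
# transcript lengths and the estimated abundances.
printf '1000 50\n2000 30\n500 20\n' |
awk '{ wl += $1 * $2; w += $2 }
     END { printf "abundance-weighted mean length = %.0f\n", wl / w }'
# -> abundance-weighted mean length = 1200
```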

Salmon indexing on UCSC intron.fa files fail for hg19

Hi all,

So I'm having a similar problem to what's found in issue #49. I want to see how much intronic RNA is quantified in my experiment, but it seems like the number of "transcripts" (i.e. introns) in the fasta is too large. Any ideas as to how to approach this?

Best

/labseq/tools/SalmonBeta-0.5.0_DebianSqueeze/bin/salmon index -t /labseq/Genomes/introns/hg19.introns.fasta   -i /labseq/Genomes/salmon/SalmonBeta-0.5.0_DebianSqueeze.quasi.intron/  --type quasi 
Version Info: ### A newer version of Salmon is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases 
contains **important bug fixes**; please upgrade at your
earliest convenience.
###
index ["/labseq/Genomes/salmon/SalmonBeta-0.5.0_DebianSqueeze.quasi.intron/"] did not previously exist  . . . creating it
[2016-03-28 13:52:29.407] [jointLog] [info] computeBiasFeatures( {[/labseq/Genomes/introns/hg19.introns.fasta] , /labseq/Genomes/salmon/SalmonBeta-0.5.0_DebianSqueeze.quasi.intron/bias_feats.txt, 1, 24)

readFile: /labseq/Genomes/introns/hg19.introns.fasta, 
file /labseq/Genomes/introns/hg19.introns.fasta: 
processed 659300 transcripts (4515 transcripts/s)
[2016-03-28 13:54:55.658] [jointLog] [info] building index
RapMap Indexer

[Step 1 of 4] : counting k-mers
counted k-mers for 650000 transcripts
Elapsed time: 128.964s

Clipped poly-A tails from 0 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 5.40482s
Writing sequence data to file . . . done
Elapsed time: 40.1748s
[info] Building 64-bit suffix array (length of generalized text is 4128215243 )
Building suffix array . . . 
success
saving to disk . . . done
Elapsed time: 358.792s
done
Elapsed time: 3065.92s
processed 3810000000 positions
salmon: /home/vagrant/salmon/external/install/include/sparsehash/internal/densehashtable.h:782: void google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::clear_to_size(google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type) [with Value = std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> >; Key = long unsigned int; HashFcn = rapmap::utils::KmerKeyHasher; ExtractKey = google::dense_hash_map<long unsigned int, rapmap::utils::SAInterval<long int>, rapmap::utils::KmerKeyHasher, std::equal_to<long unsigned int>, google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > > >::SelectKey; SetKey = google::dense_hash_map<long unsigned int, rapmap::utils::SAInterval<long int>, rapmap::utils::KmerKeyHasher, std::equal_to<long unsigned int>, google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > > >::SetKey; EqualKey = std::equal_to<long unsigned int>; Alloc = google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > >; google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type = long unsigned int]: Assertion `table' failed.


Aborted

head of fasta file:

>uc001aaa.3_intron_0_0_chr1_12228_f| chr1:12227-12612
GTAAGTAGTGCTTGTGCTCATCTCCTTGGCTGTGATACGTGGCCGGCCCTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGCTGCCATCGGAGCCCAAAGCCGGGCTGTGACTGCTCAGACCAGCCGGCTGGAGGGAGGGGCTCAGCAGGTCTGGCTTTGGCCCTGGGAGAGCAGGTGGAAGATCAGGCAGGCCATCGCTGCCACAGAACCCAGTGGATTGGCCTAGGTGGGATCTCTGAGCTCAACAAGCCCTCTCTGGGTGGTAGGTGCAGAGACGGGAGGGGCAGAGCCGCAGGCACAGCCAAGAGGGCTGAAGAAATGGTAGAACGGAGCAGCTGGTGATGTGTGGGCCCACCGGCCCCAGGCTCCTGTCTCCCCCCAG
>uc001aaa.3_intron_1_0_chr1_12722_f| chr1:12721-13220
GTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGCGTCTCCTGTCTCCTGGAGAGGCTTCGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCATCTGGAGCCCTGCTGCTTGCGGTGGCCTATAAAGCCTCCTAGTCTGGCTCCAAGGCCTGGCAGAGTCTTTCCCAGGGAAAGCTACAAGCAGCAAACAGTCTGCATGGGTCATCCCCTTCACTCCCAGCTCAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCATGAAGGCTGTCAACCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGTTGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTGGGGAGGCAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAG
>uc010nxr.1_intron_0_0_chr1_12228_f| chr1:12227-12645
GTAAGTAGTGCTTGTGCTCATCTCCTTGGCTGTGATACGTGGCCGGCCCTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGCTGCCATCGGAGCCCAAAGCCGGGCTGTGACTGCTCAGACCAGCCGGCTGGAGGGAGGGGCTCAGCAGGTCTGGCTTTGGCCCTGGGAGAGCAGGTGGAAGATCAGGCAGGCCATCGCTGCCACAGAACCCAGTGGATTGGCCTAGGTGGGATCTCTGAGCTCAACAAGCCCTCTCTGGGTGGTAGGTGCAGAGACGGGAGGGGCAGAGCCGCAGGCACAGCCAAGAGGGCTGAAGAAATGGTAGAACGGAGCAGCTGGTGATGTGTGGGCCCACCGGCCCCAGGCTCCTGTCTCCCCCCAGGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAG
>uc010nxr.1_intron_1_0_chr1_12698_f| chr1:12697-13220
GTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGCGTCTCCTGTCTCCTGGAGAGGCTTCGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCATCTGGAGCCCTGCTGCTTGCGGTGGCCTATAAAGCCTCCTAGTCTGGCTCCAAGGCCTGGCAGAGTCTTTCCCAGGGAAAGCTACAAGCAGCAAACAGTCTGCATGGGTCATCCCCTTCACTCCCAGCTCAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCATGAAGGCTGTCAACCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGTTGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTGGGGAGGCAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAG
>uc010nxq.1_intron_0_0_chr1_12228_f| chr1:12227-12594
GTAAGTAGTGCTTGTGCTCATCTCCTTGGCTGTGATACGTGGCCGGCCCTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGCTGCCATCGGAGCCCAAAGCCGGGCTGTGACTGCTCAGACCAGCCGGCTGGAGGGAGGGGCTCAGCAGGTCTGGCTTTGGCCCTGGGAGAGCAGGTGGAAGATCAGGCAGGCCATCGCTGCCACAGAACCCAGTGGATTGGCCTAGGTGGGATCTCTGAGCTCAACAAGCCCTCTCTGGGTGGTAGGTGCAGAGACGGGAGGGGCAGAGCCGCAGGCACAGCCAAGAGGGCTGAAGAAATGGTAGAACGGAGCAGCTGGTGATGTGTGGGCCCACCGGCCCCAG

I know it might not be doable in this manner based off what I read in issue #49 but any help would be much appreciated

Salmon indexing on UCSC genome.fa files fails for mm9

I am intending to run Salmon on a set of RNA-Seq data that has been lying in our lab for a long time. The data are for mm9, and since there are >50 samples I was planning to use Salmon version 0.6.0. I have used earlier versions of Salmon on hg19 data from both UCSC and NCBI (spiked-in and non-spiked-in) without alignment mode and have run them successfully. Recently I downloaded and compiled the latest version and am now trying to run indexing on the UCSC mm9 genome.fa file, so that I can build quasi-mapping indexes and then run quant for my samples downstream, getting read counts as well as TPM much faster than with any other tool. Can you tell me what the problem is?

Command line used:
salmon index -t /path_to/genome.fa -i salmonquasi-indexes --type quasi -k 31

Here is the error message produced during the RapMap indexing step:

Version Info: This is the most recent version of Salmon.
index ["salmonquasi-indexes"] did not previously exist  . . . creating it
[2016-03-17 10:41:34.655] [jointLog] [info] building index
RapMap Indexer

[Step 1 of 4] : counting k-mers
Elapsed time: 53.9731s

Replaced 96385738 non-ATCG nucleotides
Clipped poly-A tails from 0 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.196609s
Writing sequence data to file . . . done
Elapsed time: 1.56391s
[info] Building 64-bit suffix array (length of generalized text is 2654911539 )
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 126.003s
done
Elapsed time: 883.472s
processed 615000000 positionssalmon: /home/vagrant/salmon/external/install/include/sparsehash/internal/densehashtable.h:782: void google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::clear_to_size(google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type) [with Value = std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> >; Key = long unsigned int; HashFcn = rapmap::utils::KmerKeyHasher; ExtractKey = google::dense_hash_map<long unsigned int, rapmap::utils::SAInterval<long int>, rapmap::utils::KmerKeyHasher, std::equal_to<long unsigned int>, google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > > >::SelectKey; SetKey = google::dense_hash_map<long unsigned int, rapmap::utils::SAInterval<long int>, rapmap::utils::KmerKeyHasher, std::equal_to<long unsigned int>, google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > > >::SetKey; EqualKey = std::equal_to<long unsigned int>; Alloc = google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > >; google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type = long unsigned int]: Assertion table' failed.
Aborted

I also checked the log file, and it shows nothing except the following:

more indexing.log
[2016-03-17 10:41:34.655] [jointLog] [info] building index

output:

-rw-r--r-- 1 vdas DPT          59 Mar 17 10:41 indexing.log
-rw-r--r-- 1 vdas DPT   331863951 Mar 17 10:42 rsd.bin
-rw-r--r-- 1 vdas DPT  2654912013 Mar 17 10:43 txpInfo.bin
-rw-r--r-- 1 vdas DPT 21239292320 Mar 17 10:59 sa.bin

So can you give me a workaround or any input to solve this issue? Thanks

Remove FPKM from results

R/FPKM are widely misused in practice and in the literature. It has been known since at least 2011 (the RSEM paper) that they are not suitable for comparisons between samples, due to the library-specific normalisation factor. Unfortunately, most people use them that way.

TPM has all the benefits of R/FPKM, with the added advantage that the normalisation factor (1,000,000) is stable between samples. There is therefore no reason to use R/FPKM.

Salmon already promotes good practice in many ways, including reporting TPM. It should further promote good practice by not including FPKM in its results.
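Since the argument hinges on how the two units are normalised, here is a minimal sketch in plain Python (hypothetical counts and effective lengths, using the textbook definitions of FPKM and TPM rather than Salmon's internal code) of why TPM sums to a fixed 1,000,000 in every sample while FPKM does not:

```python
# Hypothetical counts and effective lengths (kb) for three transcripts.
counts = [100.0, 400.0, 500.0]
lengths_kb = [2.0, 4.0, 1.0]

# FPKM: reads per kilobase per million mapped reads; the "million mapped
# reads" factor is library-specific, so FPKM totals differ between samples.
total_reads = sum(counts)
fpkm = [1e6 * c / (l * total_reads) for c, l in zip(counts, lengths_kb)]

# TPM: length-normalise first, then rescale so each sample sums to 1e6;
# the scaling constant is identical in every sample.
rate = [c / l for c, l in zip(counts, lengths_kb)]
tpm = [1e6 * r / sum(rate) for r in rate]

print(sum(fpkm))        # 650000.0 here; varies with library composition
print(round(sum(tpm)))  # 1000000 by construction, in every sample
```

TPM is just FPKM rescaled so the per-sample total is constant, which is exactly why it is comparable across libraries.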

[bns_restore_core] Parse error reading

Hi folks,

I'm trying to run the DebianSqueeze build of Salmon v0.4.2 and am hitting an issue in the 'quant' step. Here is the skinny:

  • ERCC + latest human ensembl transcriptome
  • Index builds fine -- no apparent issues
  • Quant step fails with the following output:
LD_LIBRARY_PATH=~/software/SalmonBeta-0.4.2_DebianSqueeze/lib; ~/software/SalmonBeta-0.4.2_DebianSqueeze/bin/salmon quant -i index/hs_ens_ercc.sidx --libType IU --output output/salmon -1 reads_1.fastq -2 reads_2.fastq
Version Info: This is the most recent version of Salmon.
# salmon (smem-based) v0.4.2
# [ program ] => salmon
# [ command ] => quant
# [ index ] => { index/hs_ens_ercc.sidx }
# [ libType ] => { IU }
# [ output ] => { output/salmon }
# [ mates1 ] => { reads_1.fastq }
# [ mates2 ] => { reads_2.fastq }
Logs will be written to output/salmon/logs
there is 1 lib
[2015-08-23 21:58:57.438] [jointLog] [info] parsing read library format
[bns_restore_core] Parse error reading index/hs_ens_ercc.sidx/bwaidx.amb

I've provided a reproducible, self-contained Snakefile that depends only on the binaries being dumped in ~/software and the reads_*.fastq files below.

Let me know if there is anything I can do to help.

Thanks a bunch!

Harold


ercc_fa = 'index/ERCC.fa'
ens_fa = 'index/Homo_sapiens.GRCh38.cdna.all.fa'
ens_ercc_fa = 'index/hs_ens_ercc.fa'
ens_ercc_sidx = 'index/hs_ens_ercc.sidx'

SALMON_PRE = '~/software/SalmonBeta-0.4.2_DebianSqueeze'
SALMON = 'LD_LIBRARY_PATH={0}/lib; {0}/bin/salmon'.format(SALMON_PRE)

rule all:
    input:
        ens_ercc_fa,
        ens_ercc_sidx,
        'output/salmon/quant.sf'

rule download_ens:
    output:
        ens_fa
    params:
        dl = 'ftp://ftp.ensembl.org/pub/release-81/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz'
    threads: 1
    shell:
        'curl {params.dl} | zcat > {output}'

rule download_ercc:
    output:
        ercc_fa
    params:
        dl = 'http://bio.math.berkeley.edu/kallisto/transcriptomes/ERCC.fa.gz'
    threads: 1
    shell:
        'curl {params.dl} | zcat > {output}'

rule merge_ercc:
    input:
        ens_fa,
        ercc_fa
    output:
        ens_ercc_fa
    shell:
        'cat {input[0]} {input[1]} > {output}'

rule sal_ens_ercc:
    input:
        ens_ercc_fa
    output:
        ens_ercc_sidx
    threads: 20
    shell:
        '{SALMON} index -i {output} -p {threads} -t {input}'

rule salmon:
    input:
        'reads_1.fastq',
        'reads_2.fastq',
        ens_ercc_sidx
    output:
        'output/salmon',
        'output/salmon/quant.sf'
    shell:
        '{SALMON} quant '
        '-i {ens_ercc_sidx} '
        '--libType IU '
        '--output {output[0]} '
        '-1 {input[0]} -2 {input[1]}'

Finally, here are the reads:

reads_1.fastq:

@SRR896663.1 FCC0AYTACXX:1:1101:1460:1869 length=100
NAAGTGCTTCATTGTCATCCAACTTCAACTCGTTGACTTTATCTATCAGTCCTTCAATGTCGCCCATACCAAGAAGTTTGCTAATAAAAGGCTGTGTTTT
+SRR896663.1 FCC0AYTACXX:1:1101:1460:1869 length=100
BP\accacggggfiihfhifiihhhiiiefhhhbcfhhhhfghhafffgc_fhf]c]edcafbhfhihihfgfX`ddgggddeadb^ZZ_`accaccccb
@SRR896663.2 FCC0AYTACXX:1:1101:1355:1895 length=100
NTTTGTTTTGAGGTTAGTTTGATTAGTCATTGTTGGGTGGTGATTAGTCGGTTGTTGATGAGATATTTGGAGGTGGGGATCAATAGAGGGGGAAATAGAA
+SRR896663.2 FCC0AYTACXX:1:1101:1355:1895 length=100
BP\cceeeggcegegeefghibgfggghhghhifhhicfdcfafgdggegheghhhf\_g\^Z^]bcceeR_^T\^acPW^ab`bbbbaccaOX[_BBBB
@SRR896663.3 FCC0AYTACXX:1:1101:1663:1907 length=100
NGGGGTCCTCCTTGGTGAACACAAAGCCCACATTCCCCCGGATATGAGGCAGCAGTTTCTCCAGAGCTGGGTTGTTTTCCAGGTGCCCTCGGATGGCCTT
+SRR896663.3 FCC0AYTACXX:1:1101:1663:1907 length=100
BP\ccceegggggihegiiiiiiiighiiiihiihhihiiiiiiiiiiiiiiiggggggeeeeeecbdddcbacaaccccdccbccccccccaac[bccc
@SRR896663.4 FCC0AYTACXX:1:1101:1509:1978 length=100
NCTCAAGCGTTGAGCGGAATGCAGCAATCAATGTCGTCGGAAGATCCTGAATAAATCCTACTGTATCTGAAAGAAGAACACTGTAGCCGCTTGGCAGGAC
+SRR896663.4 FCC0AYTACXX:1:1101:1509:1978 length=100
BS\ceeeegcgcgihiihffgfdghhihiibhhfhheghiihiifhfhifcgggad`ddeeeeccccddYb_bb_]bbcccbcbcbccccccc^bbcc_`

reads_2.fastq:

@SRR896663.1 FCC0AYTACXX:1:1101:1460:1869 length=100
TCAGTGCAGTCGCTGCCACAAAAAGTCCGATTATTTTCATTGGTACAGGGGAACATATAGATGACTTTGAACCTTTCAAAACACAGCCTTTTATTAGCAA
+SRR896663.1 FCC0AYTACXX:1:1101:1460:1869 length=100
___eccccggcggiiiihhhhhhhhbegh]gfhhhhiegfhiifgihiiefffhhiiihihggfggggeeeecbbdddbc`acaacaabbcccccedbbb
@SRR896663.2 FCC0AYTACXX:1:1101:1355:1895 length=100
CCCTGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCAGTACTGATCATTCTATTTCCCCCTCT
+SRR896663.2 FCC0AYTACXX:1:1101:1355:1895 length=100
_abeeeeegggfgiiiiiiiiiiiiiihihihhiiihiiifhiihiiiiihhiiifhhhiigggfeeaeccccaccccdccccdcccbddddcbccaccc
@SRR896663.3 FCC0AYTACXX:1:1101:1663:1907 length=100
GTCCCTTCGCGGGAAGGCTGTGGTGCTGATGGGCAAGAACACCATGATGCGCAAGGCCATCCGAGGGCACCTGGAAAACAACCCAGCTCTGGAGAAACTG
+SRR896663.3 FCC0AYTACXX:1:1101:1663:1907 length=100
b_beeeeegggggiiiiiiihhhfgiiihiihiiiiihhiiiiiigggggeeecccaccccccccccccccccccccccccccccccccccbba`acccb
@SRR896663.4 FCC0AYTACXX:1:1101:1509:1978 length=100
AGAAAAATGGTCCTGCCAAGCGGCTACAGTGTTCTTCTTTCAGATACAGTAGGATTTATTCAGGATCTTCCGACGACATTGATTGCTGCATTCCGCTCAA
+SRR896663.4 FCC0AYTACXX:1:1101:1509:1978 length=100
[__ceceegfgggiihiiiiiiihihifhbf^fghhggfhgdhhifhfghaegfaghhhfhhfihihihdgggecccccccbdccccccccdcccccccb

boost math error during EM iteration: Evaluation of function at pole -nan

Hi Rob,

I've been running another group's samples (single-end, second-strand protocol) with a script that iterates through the samples and runs salmon. I'm using the latest version (0.6.0) with the following arguments: salmon quant -i salmon_index --libType SF -r <(gzip -c -d $IN_FILE) -o $OUTPUT --numBootstraps 100 --useVBOpt --useFSPD --geneMap $GENES --biasCorrect -p 59

During the EM iteration step (soon after the 500th round, when salmon recalculates effective lengths), I get this error:

[jointLog] [info] iteration 500, recomputing effective lengths
[jointLog] [info] iteration = 500 | max rel diff. = 64.1299
Exception : [Error in function boost::math::digamma<long double>(long double): Evaluation of function at pole -nan]
salmon quant was invoked improperly.
For usage information, try salmon quant --help
Exiting.

I can't tell whether this is a normal, occasional occurrence with the non-deterministic algorithm or something that should never happen. These particular samples are extremely deep (about 170-190M reads per sample), so that might be the cause, but I don't understand enough of how the algorithm works to troubleshoot it or to put together a toy dataset that reproduces the error. Rerunning the sample that triggers the error often succeeds.

If I can throw in a feature request here: it would be great to be able to set the seed to make runs deterministic. Is that possible?

Library format checking for stranded libraries gives uninformative debugging messages

I've just been testing Salmon using Illumina TruSeq stranded (dUTP) libraries. Using either ISF (correct libType) or ISR I get a ton of messages like this:

expected = Library format { type: }
paired end, relative orientation:inwardexpected = Library format { type:paired end, relative orientation:Library format { type:paired end, relative orientation:inwardpaired endinward, strandedness:, relative orientation:inward, strandedness:(sense, antisense) }
observed = Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) }
(sense, antisense)expected = , strandedness:(sense, antisense), strandedness:(sense, antisense) }
observed =  }
Library format { type:paired end, relative orientation:inward, strandedness:(sense, antisense) }
observed = Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) }
expected = Library format { type:observed = Library format { type: }
paired endLibrary format { type:paired end, relative orientation:, relative orientation:inward, strandedness:(sense, antisense) }
observed = Library format { type:paired end, relative orientation:inward, strandedness:observed = inward, strandedness:(antisense, sense) }
(antisense, sense) }paired end, relative orientation:inward, strandedness:(antisense, sense) }
expected = Library format { type:paired end, relative orientation:inward, strandedness:(sense, antisense)expected = Library format { type:Library format { type:
 }paired endpaired endexpected = Library format { type:paired end, relative orientation:
, relative orientation:inwardinward, strandedness:(sense, antisense)observed = , strandedness:(antisense, sense) }
 }
observed = Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) }expected = Library format { type:, relative orientation:Library format { type:paired end

and so on...

It seems that the LibraryFormat class is performing this check, and that its string-formatting method is writing all of this output to stderr, with messages from different threads interleaved.

I believe that the Salmon index I'm using contains some transcript sequences from the "wrong" strand, and it would be helpful if the program gave information about the observed mapping (maybe just the transcript sequence ID) so that I could track down the error during index generation.

It also seems that all of the above error output contains no line terminators, although maybe this has been fixed in a more recent version.

-bash-4.1$ salmon --version
version : 0.4.0

cc @jmerkin

Significant performance regression in version 0.4.0

I tend to benchmark new versions of software, mostly to check how much better things get over time at solving our problems. My benchmarking strategy is to look at the correlation between spike-ins at known abundances and the expression estimated by the software.

The latest version of Salmon (0.4.0) performs markedly worse than all the previous versions of Salmon on the same data.

[attached plot: salmong-performance, showing the spike-in correlation for each Salmon version]

For running parameters, here is the top part of one of the quant.sf files:

# salmon (smem-based) v0.4.0
# [ program ] => salmon 
# [ command ] => quant 
# [ index ] => { /nfs/research2/teichmann/reference/homo-sapiens/salmon/Homo_sapiens.GRCh38.78.cdna_ERCC }
# [ libType ] => { IU }
# [ threads ] => { 4 }
# [ mates1 ] => { /nfs/research2/teichmann/valentine/detection-comparison/salmon-comparison/human/SRP030617_HCT116_86_1.fastq }
# [ mates2 ] => { /nfs/research2/teichmann/valentine/detection-comparison/salmon-comparison/human/SRP030617_HCT116_86_2.fastq }
# [ output ] => { /tmp/SRP030617_HCT116_86_salmon_out }
# [ geneMap ] => { /nfs/research2/teichmann/reference/homo-sapiens/Homo_sapiens.GRCh38.78.cdna_ERCC.gene_map.txt }
# [ useVBOpt ] => { }
# [ mapping rate ] => { 48.8199% }

Salmon ignores "S" in --libtype ISF

Hi, we're having trouble with Salmon on a stranded library. We're executing:

salmon index --index idx --transcript equCabs.fa --type quasi

and then

salmon quant -i idx -1 leftReads.fq -2 rightReads.fq --libType ISF -o xxx.quant

We then get a warning that there is a large strand bias in an unstranded protocol, despite having specified stranded (S) in the libType. Inspection of libFormatCounts.txt confirms this: the "expected format" line says "strandedness:unstranded".

Gene fusions

Wicked fast indeed! Are there any plans to extend Salmon to also detect gene fusion events? There isn't a fast and accurate way to do that yet; the only approaches require full alignments. Most often a base-perfect breakpoint isn't required; an estimate within a hash length is fine. We are heavy users of bcbio and are also running the full STAR alignment just for gene fusions, which really sucks. Any ideas would be much appreciated.

Segfault

I've switched to using the precompiled binaries, version 0.6.0, and am now working on a new server running CentOS Linux release 7.1.1503. I was able to successfully generate my index, then started the quantification step. Here is my command:

$ /home/jorvis/salmon/bin/salmon quant -p 24 -i transcripts_index -l IU -1 R1.trimmed.PE.fastq -2 R2.trimmed.PE.fastq -o transcripts_quan

This host has 48 cores and 128GB RAM.

And here is the STDOUT

Version Info: This is the most recent version of Salmon.
# salmon (mapping-based) v0.6.0
# [ program ] => salmon
# [ command ] => quant
# [ threads ] => { 24 }
# [ index ] => { transcripts_index }
# [ libType ] => { IU }
# [ mates1 ] => { R1.trimmed.PE.fastq }
# [ mates2 ] => { R2.trimmed.PE.fastq }
# [ output ] => { transcripts_quan }
Logs will be written to transcripts_quan/logs
[2016-03-30 15:50:48.489] [jointLog] [info] parsing read library format
there is 1 lib
Loading 64-bit quasi index[2016-03-30 15:50:48.543] [jointLog] [info] Loading Quasi index
[2016-03-30 15:50:48.544] [stderrLog] [info] Loading Suffix Array
[2016-03-30 15:50:48.544] [stderrLog] [info] Loading Position Hash
[2016-03-30 15:50:58.359] [stderrLog] [info] Loading Transcript Info
[2016-03-30 15:50:59.932] [stderrLog] [info] Loading Rank-Select Bit Array
[2016-03-30 15:51:00.610] [stderrLog] [info] There were 2027284 set bits in the bit array
[2016-03-30 15:51:00.917] [stderrLog] [info] Computing transcript lengths
[2016-03-30 15:51:00.925] [stderrLog] [info] Waiting to finish loading hash
Index contained 2027284 targets
[2016-03-30 15:51:08.499] [jointLog] [info] done
[2016-03-30 15:51:08.499] [stderrLog] [info] Done loading index




Segmentation fault

The only log file I see is this one: transcripts_quan/logs/salmon_quant.log

$ cat salmon_quant.log
[2016-03-30 15:50:48.489] [jointLog] [info] parsing read library format
[2016-03-30 15:50:48.543] [jointLog] [info] Loading Quasi index
[2016-03-30 15:51:08.499] [jointLog] [info] done

Install error: CMake fails at src/CMakeLists.txt line 94; RapMap not found (with workaround)

Hi,

First of all, thank you very much for this tool! I'm excited to try it after reading your bioRxiv paper and your blog post. Following your installation instructions, I got to the step where I run CMake [flags] .., and it failed with the following error:

CMake Error at src/CMakeLists.txt:94 (add_executable):
  Cannot find source file:

    /home/wulab2015linux/warren/Software/salmon/external/install/src/rapmap/RapMapFileSystem.cpp

  Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp
  .hxx .in .txx

I realized that I did not have curl installed (I use wget all the time, on an Ubuntu 14.04.4 system), so the fetchRapMap.sh script failed. I tried rerunning the CMake command after installing curl with apt-get, and it still failed. However, it worked after I manually ran fetchRapMap.sh, and the build was fine after that.

I'm a complete noob when it comes to CMake, so I don't know what happened, but I wanted to bring this to your attention and give users who hit the same issue a workaround.

Salmon fails to match the transcript name between Gencode reference and annotation files

The transcript names in Gencode's reference sequence fasta files have the following format:
ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|OTTHUMT00000250429.1|KLB-001|KLB|6082|UTR5:1-97|CDS:98-3232|UTR3:3233-6082|

In the .gtf gene annotation files, only the transcript name appears:
ENST00000257408.4

As a consequence, Salmon fails to match them and does not report the correct values in quant.genes.sf. The values in quant.sf seem to be correct, though.

Nico
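A common workaround, sketched below in plain Python (this is not part of Salmon; the header is the example from above, and the `|`-delimited layout is the standard Gencode one), is to truncate each FASTA header to its first field before indexing so that it matches the GTF transcript IDs:

```python
def clean_gencode_header(line):
    """Reduce a Gencode FASTA header line to its first '|'-delimited field."""
    if line.startswith(">"):
        return ">" + line[1:].split("|")[0]
    return line  # sequence lines pass through unchanged

header = (">ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|"
          "OTTHUMT00000250429.1|KLB-001|KLB|6082|UTR5:1-97|CDS:98-3232|UTR3:3233-6082|")
print(clean_gencode_header(header))  # -> >ENST00000257408.4
```

Newer Salmon releases also accept a `--gencode` flag to `salmon index` that performs this truncation automatically; if your version has it, that is simpler.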

Very low strandedness reported w.r.t. genomic alignment with RSeQC

Thanks for your work on Salmon- interesting tool!

So I'm running Salmon and comparing it with a more conventional genomic alignment from HISAT2. I have TruSeq stranded data (rat, btw), and if I run HISAT2 and check strandedness using RSeQC's 'infer_experiment.py' with an Ensembl GTF, I get >98% stranded, as I expect.

Here's an example libFormatCounts.txt from running Salmon 0.6 on the same reads, with a FASTA reference combining Ensembl's cDNA and ncRNA files:

Expected format: Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) }

# of consistent alignments: 73805886
# of inconsistent alignments: 34539736

========
---- counts for each format type ---
Library format { type:single end, relative orientation:matching, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:matching, strandedness:(sense, antisense) } : 0
Library format { type:single end, relative orientation:outward, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:outward, strandedness:(sense, antisense) } : 4578
Library format { type:single end, relative orientation:inward, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:inward, strandedness:(sense, antisense) } : 11494138
Library format { type:single end, relative orientation:none, strandedness:(sense, antisense) } : 0
Library format { type:paired end, relative orientation:none, strandedness:(sense, antisense) } : 0
Library format { type:single end, relative orientation:matching, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:matching, strandedness:(antisense, sense) } : 0
Library format { type:single end, relative orientation:outward, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:outward, strandedness:(antisense, sense) } : 64175
Library format { type:single end, relative orientation:inward, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) } : 73805886
Library format { type:single end, relative orientation:none, strandedness:(antisense, sense) } : 0
Library format { type:paired end, relative orientation:none, strandedness:(antisense, sense) } : 0
Library format { type:single end, relative orientation:matching, strandedness:sense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:sense } : 525
Library format { type:single end, relative orientation:outward, strandedness:sense } : 0
Library format { type:paired end, relative orientation:outward, strandedness:sense } : 0
Library format { type:single end, relative orientation:inward, strandedness:sense } : 0
Library format { type:paired end, relative orientation:inward, strandedness:sense } : 0
Library format { type:single end, relative orientation:none, strandedness:sense } : 12094116
Library format { type:paired end, relative orientation:none, strandedness:sense } : 0
Library format { type:single end, relative orientation:matching, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:matching, strandedness:antisense } : 6539
Library format { type:single end, relative orientation:outward, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:outward, strandedness:antisense } : 0
Library format { type:single end, relative orientation:inward, strandedness:antisense } : 0
Library format { type:paired end, relative orientation:inward, strandedness:antisense } : 0
Library format { type:single end, relative orientation:none, strandedness:antisense } : 10875665
Library format { type:paired end, relative orientation:none, strandedness:antisense } : 0
Library format { type:single end, relative orientation:matching, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:matching, strandedness:unstranded } : 0
Library format { type:single end, relative orientation:outward, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:outward, strandedness:unstranded } : 0
Library format { type:single end, relative orientation:inward, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:inward, strandedness:unstranded } : 0
Library format { type:single end, relative orientation:none, strandedness:unstranded } : 0
Library format { type:paired end, relative orientation:none, strandedness:unstranded } : 0

As you can see, the data look much less stranded here!

Is this a feature of how Salmon is operating, or a problem with the FASTA reference? Is it something I should be concerned about?
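To put a number on it, the two dominant rows of the table above can be summarised directly (plain Python, counts copied from the report; this considers only concordant inward pairs, not the orphaned single-end mappings, which add another ~23M reads):

```python
# The two dominant rows from libFormatCounts.txt above (paired end, inward).
isr = 73805886   # strandedness (antisense, sense): matches the expected format
isf = 11494138   # strandedness (sense, antisense): the "wrong" strand

stranded_fraction = isr / (isr + isf)
print(round(stranded_fraction, 3))  # -> 0.865
```

That is roughly 86.5% stranded among concordant pairs, well below the >98% RSeQC reports against the genomic alignment.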

Building index failed [std::bad_alloc]

Hi,

I have been trying to build an index for a transcript set composed of GENCODE plus several unannotated transcripts from different sources (~95k, non-overlapping). I used the following command:

salmon index -t cgm_annotation.fa -i cgm_index --type quasi

Version Info: This is the most recent version of Salmon.
[2016-01-29 13:44:08.001] [jointLog] [info] building index
RapMap Indexer

[Step 1 of 4] : counting k-mers
counted k-mers for 90000 transcriptsElapsed time: 27.9634s

Replaced 4303414 non-ATCG nucleotides
Clipped poly-A tails from 226 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.243976s
Writing sequence data to file . . . done
Elapsed time: 10.3463s
[info] Building 64-bit suffix array (length of generalized text is 2398906742 )
Elapsed time: 0.000125994s
Exception : [std::bad_alloc]
salmon index was invoked improperly.
For usage information, try salmon index --help
Exiting.
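For context, here is a back-of-envelope estimate (a sketch assuming 8 bytes per 64-bit suffix-array entry; peak memory during construction is higher still) of what the suffix array alone requires for this input:

```python
# Size of the final 64-bit suffix array for the generalized text above.
text_len = 2_398_906_742           # generalized text length, from the log
sa_bytes = text_len * 8            # one 8-byte (64-bit) entry per position
print(sa_bytes)                    # 19191253936 bytes
print(round(sa_bytes / 2**30, 1))  # -> 17.9 (GiB) for the finished array alone
```

If the machine has less free RAM than that, std::bad_alloc is a plausible failure mode at exactly this step.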

attempts to switch to different downloaded tarball for jellyfish fail

On the system I'm trying to install Salmon 0.4.2 on, downloads over ftp:// are blocked for security reasons, so the automagic download of the Jellyfish 2.1.3 source tarball fails.

I tried adjusting the CMakeLists.txt files to use the latest Jellyfish 2.2.3 instead, but this leads to an issue with a Jellyfish include file not being found:

checking whether we are using the GNU C compiler... In file included from /tmp/Salmon/0.4.2/intel-2015a/salmon-0.4.2/src/merge_files.cc(17):
/tmp/Salmon/0.4.2/intel-2015a/salmon-0.4.2/include/merge_files.hpp(21): catastrophic error: cannot open source file "jellyfish/err.hpp"
  #include <jellyfish/err.hpp>
                              ^

compilation aborted for /tmp/Salmon/0.4.2/intel-2015a/salmon-0.4.2/src/merge_files.cc (code 4)
make[2]: *** [src/CMakeFiles/salmon_core.dir/merge_files.cc.o] Error 4

This is weird, because the correct include directory is shown in the compiler command, and the file is there!

Here's my patch. Any idea what may be wrong with it, or what different approach I could try to get this to work?
I also tried using the 2.1.3.tar.gz tarball from GitHub, but after adding autoreconf -i to the CONFIGURE_COMMAND, it leads to the same problem.

--- salmon-0.4.2/CMakeLists.txt.orig    2015-06-15 02:31:09.000000000 +0200
+++ salmon-0.4.2/CMakeLists.txt 2015-08-18 21:13:29.684010359 +0200
@@ -357,14 +366,14 @@
 message("==================================================================")
 ExternalProject_Add(libjellyfish
     DOWNLOAD_DIR ${CMAKE_CURRENT_SOURCE_DIR}/external
-    URL ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
-    SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/external/jellyfish-2.1.3
+    URL https://github.com/gmarcais/Jellyfish/releases/download/v2.2.3/jellyfish-2.2.3.tar.gz
+    SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/external/jellyfish-2.2.3
     INSTALL_DIR ${CMAKE_CURRENT_SOURCE_DIR}/external/install
-    CONFIGURE_COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/external/jellyfish-2.1.3/configure --prefix=<INSTALL_DIR> CC=${CMAKE_C_COMPILER} CXX=${CMAKE_CXX_COMPILER} CXXFLAGS=${JELLYFISH_CXX_FLAGS}
+    CONFIGURE_COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/external/jellyfish-2.2.3/configure --prefix=<INSTALL_DIR> CC=${CMAKE_C_COMPILER} CXX=${CMAKE_CXX_COMPILER} CXXFLAGS=${JELLYFISH_CXX_FLAGS}
     BUILD_COMMAND ${MAKE} CC=${CMAKE_C_COMPILER} CXX=${CMAKE_CXX_COMPILER} CXXFLAGS=${JELLYFISH_CXX_FLAGS}
     BUILD_IN_SOURCE 1
     INSTALL_COMMAND make install && 
-                    cp config.h <INSTALL_DIR>/include/jellyfish-2.1.3/jellyfish/ &&
+                    cp config.h <INSTALL_DIR>/include/jellyfish-2.2.3/jellyfish/ &&
                     cp config.h <INSTALL_DIR>/include/
 )

--- salmon-0.4.2/src/CMakeLists.txt.orig    2015-08-18 21:21:14.892734948 +0200
+++ salmon-0.4.2/src/CMakeLists.txt 2015-08-18 21:20:51.292295094 +0200
@@ -42,7 +42,7 @@
 ${GAT_SOURCE_DIR}/external
 ${GAT_SOURCE_DIR}/external/cereal/include
 ${GAT_SOURCE_DIR}/external/install/include
-${GAT_SOURCE_DIR}/external/install/include/jellyfish-2.1.3
+${GAT_SOURCE_DIR}/external/install/include/jellyfish-2.2.3
 ${GAT_SOURCE_DIR}/external/install/include/bwa
 ${ZLIB_INCLUDE_DIR}
 ${TBB_INCLUDE_DIRS}

error building

I'm trying to build version 0.6.0 and I get the following error:

cd /root/soft/salmon/salmon-0.6.0/build/src && /usr/local/bin/cmake -E cmake_link_script CMakeFiles/salmon.dir/link.txt --verbose=1
/opt/rh/devtoolset-2/root/usr/bin/c++   -pthread -funroll-loops -fPIC -fomit-frame-pointer -Ofast -DRAPMAP_SALMON_SUPPORT -DHAVE_ANSI_TERM -DHAVE_SSTREAM -Wall -Wno-reorder -Wno-unused-variable -std=c++11 -Wreturn-type -Werror=return-type -static-libstdc++ -Wno-deprecated-register -Wno-unused-local-typedefs   -L/opt/rh/devtoolset-2/root/usr/lib64 -L/opt/rh/devtoolset-2/root/usr/lib CMakeFiles/salmon.dir/QSufSort.c.o CMakeFiles/salmon.dir/is.c.o CMakeFiles/salmon.dir/bwt_gen.c.o CMakeFiles/salmon.dir/bwtindex.c.o CMakeFiles/salmon.dir/xxhash.c.o CMakeFiles/salmon.dir/CollapsedEMOptimizer.cpp.o CMakeFiles/salmon.dir/CollapsedGibbsSampler.cpp.o CMakeFiles/salmon.dir/Salmon.cpp.o CMakeFiles/salmon.dir/BuildSalmonIndex.cpp.o CMakeFiles/salmon.dir/SalmonQuantify.cpp.o CMakeFiles/salmon.dir/FragmentLengthDistribution.cpp.o CMakeFiles/salmon.dir/FragmentStartPositionDistribution.cpp.o CMakeFiles/salmon.dir/SequenceBiasModel.cpp.o CMakeFiles/salmon.dir/StadenUtils.cpp.o CMakeFiles/salmon.dir/TranscriptGroup.cpp.o CMakeFiles/salmon.dir/GZipWriter.cpp.o CMakeFiles/salmon.dir/__/external/install/src/rapmap/RapMapFileSystem.cpp.o CMakeFiles/salmon.dir/__/external/install/src/rapmap/RapMapSAIndexer.cpp.o CMakeFiles/salmon.dir/__/external/install/src/rapmap/RapMapSAIndex.cpp.o CMakeFiles/salmon.dir/__/external/install/src/rapmap/RapMapSAMapper.cpp.o CMakeFiles/salmon.dir/__/external/install/src/rapmap/RapMapUtils.cpp.o CMakeFiles/salmon.dir/__/external/install/src/rapmap/HitManager.cpp.o CMakeFiles/salmon.dir/__/external/install/src/rapmap/rank9b.cpp.o CMakeFiles/salmon.dir/__/external/install/src/rapmap/bit_array.c.o CMakeFiles/salmon.dir/FASTAParser.cpp.o CMakeFiles/salmon.dir/ErrorModel.cpp.o CMakeFiles/salmon.dir/AlignmentModel.cpp.o CMakeFiles/salmon.dir/SalmonQuantifyAlignments.cpp.o  -o salmon  -L/root/soft/salmon/salmon-0.6.0/lib  -L/root/soft/salmon/salmon-0.6.0/external/install/lib  -L/opt/boost/boost_1_57_0/lib -rdynamic libsalmon_core.a -lgff -lpthread 
/opt/boost/boost_1_57_0/lib/libboost_iostreams.a /opt/boost/boost_1_57_0/lib/libboost_filesystem.a /opt/boost/boost_1_57_0/lib/libboost_system.a /opt/boost/boost_1_57_0/lib/libboost_thread.a /opt/boost/boost_1_57_0/lib/libboost_timer.a /opt/boost/boost_1_57_0/lib/libboost_chrono.a /opt/boost/boost_1_57_0/lib/libboost_program_options.a /opt/boost/boost_1_57_0/lib/libboost_serialization.a ../../external/install/lib/libstaden-read.a -lz ../../external/install/lib/libdivsufsort.a ../../external/install/lib/libdivsufsort64.a ../../external/install/lib/libjellyfish-2.0.a ../../external/install/lib/libbwa.a -lm ../../external/install/lib/liblzma.a -lbz2 -ltbb -ltbbmalloc -lgomp -lrt ../../external/install/lib/libjemalloc.a -Wl,-rpath,"\$ORIGIN/../lib:\$ORIGIN/../../lib:\$ORIGIN/:\$ORIGIN/../../external/install/lib"
../../external/install/lib/libstaden-read.a(libstaden_read_la-open_trace_file.o): In function `find_file_url':
open_trace_file.c:(.text+0xd26): warning: the use of `tempnam' is dangerous, better use `mkstemp'
CMakeFiles/salmon.dir/SalmonQuantifyAlignments.cpp.o: In function `tbb::strict_ppl::internal::concurrent_queue_base_v3<UnpairedRead*>::internal_push(void const*, void (*)(UnpairedRead**, void const*)) [clone .constprop.1465]':
SalmonQuantifyAlignments.cpp:(.text+0x136d): undefined reference to `tbb::internal::throw_exception_v4(tbb::internal::exception_id)'
CMakeFiles/salmon.dir/SalmonQuantifyAlignments.cpp.o: In function `tbb::strict_ppl::internal::concurrent_queue_base_v3<ReadPair*>::internal_push(void const*, void (*)(ReadPair**, void const*)) [clone .constprop.1477]':
SalmonQuantifyAlignments.cpp:(.text+0x161d): undefined reference to `tbb::internal::throw_exception_v4(tbb::internal::exception_id)'

I'm using
gcc 4.8.2
boost 1.57.0
icc 15.0.0 20140723

Silent failure while loading index

I have a salmon index which fails silently when used:

Version Info: Could not resolve upgrade information in the alotted time.
Check for upgrades manually at https://combine-lab.github.io/salmon
# salmon (mapping-based) v0.5.1
# [ program ] => salmon
# [ command ] => quant
# [ index ] => { ... }
# [ libType ] => { IU }
# [ mates1 ] => { ... }
# [ mates2 ] => { ... }
# [ output ] => { ... }
# [ threads ] => { 16 }
Logs will be written to ...
[2016-01-22 16:54:55.564] [jointLog] [info] parsing read library format
there is 1 lib
Loading 32-bit quasi index
[2016-01-22 16:54:56.303] [jointLog] [info] Loading Quasi index
[2016-01-22 16:54:56.320] [stderrLog] [info] Loading Suffix Array
[2016-01-22 16:54:56.321] [stderrLog] [info] Loading Position Hash
[2016-01-22 16:56:17.595] [stderrLog] [info] Loading Transcript Info
[2016-01-22 16:56:36.767] [stderrLog] [info] Loading Rank-Select Bit Array
[2016-01-22 16:56:40.858] [stderrLog] [info] There were 552702 set bits in the bit array
[2016-01-22 16:56:41.758] [stderrLog] [info] Computing transcript lengths
[2016-01-22 16:56:41.761] [stderrLog] [info] Waiting to finish loading hash
Index contained 552702 targets
[2016-01-22 17:00:40.648] [stderrLog] [info] Done loading index
[2016-01-22 17:00:40.648] [jointLog] [info] done

Then the process exits and nothing but the cmd_info.json and log file are written to disk. The sequencing library is not an issue, as I can use several other index files successfully. This is reproducible with ~600 sequencing libraries as well. I believe this also occurs using v0.6.0, but will confirm.

Since there is no core dump, is there any way for me to debug this?

Please support Intel (compiler -- icc >= 14.0)

The CMakeLists.txt refuses to handle anything except GNU and Clang compilers. It should be easy to support Intel as well, since icc understands most (if not all) GNU compiler flags.

CMake reports the compiler ID as "Intel", and C++11 is supported from version 14.0 onward.

My rough wip to get this building uses the following:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index c95f755..30f1223 100755
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -118,8 +118,30 @@ elseif ("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang")
     else()
         set (PTHREAD_LIB "pthread")
     endif()
+elseif ("${CMAKE_CXX_COMPILER_ID}" MATCHES "Intel")
+    execute_process(
+        COMMAND ${CMAKE_CXX_COMPILER} -dumpversion OUTPUT_VARIABLE INTEL_VERSION)
+    if (NOT (INTEL_VERSION VERSION_GREATER 14.0 OR INTEL_VERSION VERSION_EQUAL 14.0))
+        message(FATAL_ERROR "${PROJECT_NAME} requires intel 14.0 or greater.  Found ${INTEL_VERSION}")
+    endif ()
+
+    set (INTEL TRUE)
+    set (PTHREAD_LIB "pthread")
+    set (CMAKE_CXX_FLAGS "-pthread -funroll-loops -fPIC -fomit-frame-pointer -Ofast -DHAVE_ANSI_TERM -DHAVE_SSTREAM -Wall -std=c++11 -Wreturn-type -Werror=return-type")
+
+    # If we're on Linux (i.e. not OSX) and we're using 
+    # gcc, then set the -static-libstdc++ flag
+    if (NOT APPLE) 
+        set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -static-libstdc++")
+    endif()
+
+    set (WARNING_IGNORE_FLAGS "${WARNING_IGNORE_FLAGS} -Wno-unused-local-typedefs")
+    set (BOOST_TOOLSET "intel")
+    set (BOOST_CONFIGURE_TOOLSET "--with-toolset=gcc")
+       set (BCXX_FLAGS "-std=c++11")
+    set (BOOST_EXTRA_FLAGS toolset=gcc cxxflags=${BCXX_FLAGS})
 else ()
-    message(FATAL_ERROR "Your C++ compiler does not support C++11.")
+    message(FATAL_ERROR "Your C++ compiler (${CMAKE_CXX_COMPILER_ID}) does not support C++11.")
 endif ()

 ## TODO: Figure out how to detect this automatically

upgrade CMakeLists.txt to use external copies of dependencies & no downloading

Hello!

I'm packaging salmon and many of its dependencies for Debian in support of blahah/transrate#160

I have a messy patch to enable the use of external libraries instead of bundled or downloaded copies at http://anonscm.debian.org/cgit/debian-med/salmon.git/plain/debian/patches/dependency-fix

As I'm not a CMake expert I was only able to make it work for Debian instead of a generic solution that would fall back to the shipped copies or downloading as it is now.

Perhaps you all are better with CMake than I am? A generic solution would be best so I don't have to adjust the patch with every change to the CMakeLists.txt

Specifically it would be great to support

I also have a patch to support the latest release of jellyfish that I can turn into a pull request, should you want it: http://anonscm.debian.org/cgit/debian-med/salmon.git/plain/debian/patches/jellyfish-update

Thanks!

Throw error if reads are less than k-mer size

If the read length in the FASTQ input files to Salmon is less than the k-mer size used when building the index, no error is produced but a mapping rate of 0% is reported. It would be useful for Salmon to throw an error telling the user that the mapping will surely fail for this reason (as opposed to others, such as the wrong reference transcriptome or wrong library type), and to suggest rebuilding the index with a smaller k-mer size.
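A pre-flight check along these lines could catch the problem before quantification. The sketch below is illustrative (the function name and FASTQ handling are assumptions, not Salmon code): it scans a FASTQ record stream and reports reads shorter than the index k-mer size.

```python
# Sketch: flag FASTQ reads shorter than the index k-mer size, which would
# otherwise map nowhere and yield a silent 0% mapping rate.
# Hypothetical helper, not part of Salmon.

def reads_shorter_than_k(fastq_lines, k):
    """Yield (read_name, length) for every read shorter than k.

    fastq_lines: iterable of FASTQ lines (4 lines per record).
    """
    records = [line.rstrip("\n") for line in fastq_lines]
    for i in range(0, len(records) - 3, 4):
        name, seq = records[i], records[i + 1]
        if len(seq) < k:
            yield name.lstrip("@"), len(seq)
```

Running this over a sample of the input with the k used at indexing time (31 by default) would justify the suggested error message.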

mapping rate calculations warning

Hi,

I am running salmon 0.6.0 on an Ubuntu server. Here is my command as well as the STDOUT output:

salmon quant -p 16 --biasCorrect --libType IU -i ~./Salmon/Salmon.index/Homo_sapiens.GRCh38.rel79/ --numBootstraps 100  -o $base <(zcat ${base}_1.fastq.gz ) <(zcat ${base}_2.fastq.gz)
Version Info: This is the most recent version of Salmon.
# salmon (mapping-based) v0.6.0   
# [ program ] => salmon
# [ command ] => quant
# [ threads ] => { 16 }
# [ biasCorrect ] => { }
# [ libType ] => { IU }
# [ index ] => { ./Salmon/Salmon.index/Homo_sapiens.GRCh38.rel79/ }
# [ numBootstraps ] => { 100 }
# [ output ] => { 61LP1AAXX_8 }   
# [  ] => { /dev/fd/63 }
# [  ] => { /dev/fd/62 }
Logs will be written to 61LP1AAXX_8/logs
[2016-07-11 09:51:45.206] [jointLog] [info] parsing read library format
there is 0 lib
Loading 32-bit quasi index
[2016-07-11 09:51:45.328] [jointLog] [info] Loading Quasi index
[2016-07-11 09:51:45.736] [stderrLog] [info] Loading Suffix Array
[2016-07-11 09:51:45.771] [stderrLog] [info] Loading Position Hash
[2016-07-11 09:52:13.781] [stderrLog] [info] Loading Transcript Info
[2016-07-11 09:52:20.821] [stderrLog] [info] Loading Rank-Select Bit Array
[2016-07-11 09:52:21.877] [stderrLog] [info] There were 173259 set bits in the bit array
[2016-07-11 09:52:22.030] [stderrLog] [info] Computing transcript lengths
[2016-07-11 09:52:22.030] [stderrLog] [info] Waiting to finish loading hash
Index contained 173259 targets
[2016-07-11 09:52:26.970] [jointLog] [info] done
[2016-07-11 09:52:26.970] [stderrLog] [info] Done loading index

[2016-07-11 09:52:27.327] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2016-07-11 09:52:27.327] [jointLog] [info] Counted 0 total reads in the equivalence classes 
[2016-07-11 09:52:39.858] [jointLog] [warning] Only 0 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

**[2016-07-11 09:52:39.858] [jointLog] [warning] Something seems to be wrong with the calculation of the mapping rate.  The recorded ratio is likely wrong.  Please file this as a bug report.**

[2016-07-11 09:52:39.858] [jointLog] [info] Mapping rate = 0%

[2016-07-11 09:52:39.858] [jointLog] [info] finished quantifyLibrary()
[2016-07-11 09:52:39.858] [jointLog] [info] Starting optimizer
[2016-07-11 09:52:39.894] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2016-07-11 09:52:39.895] [jointLog] [info] iteration = 0 | max rel diff. = -1.79769e+308
[2016-07-11 09:52:39.921] [jointLog] [info] iteration = 50 | max rel diff. = -1.79769e+308
[2016-07-11 09:52:39.932] [jointLog] [info] Finished optimizer
[2016-07-11 09:52:39.932] [jointLog] [info] writing output 

I am not exactly sure where the problem is. The output file is created, but all the transcript abundances are 0.

Any ideas, what was done wrong?

thanks

Assa

PS.
this is how I have indexed the data

salmon index -t Homo_sapiens.GRCh38.rel79.cdna.all.fa -i Homo_sapiens.GRCh38.rel79 --type quasi -k 31 -p 24

in case it is also important.

Salmon depends on Staden depends on xz for lzma.h

https://travis-ci.org/Homebrew/homebrew-science/jobs/114404533#L1110-L1115

/bin/sh ../libtool  --tag=CC   --mode=compile /usr/local/Library/ENV/4.3/clang -DHAVE_CONFIG_H -I. -I.. -I..    -I/usr/local/include -L/usr/local/lib -MT libstaden_read_la-cram_io.lo -MD -MP -MF .deps/libstaden_read_la-cram_io.Tpo -c -o libstaden_read_la-cram_io.lo `test -f 'cram_io.c' || echo './'`cram_io.c
libtool: compile:  /usr/local/Library/ENV/4.3/clang -DHAVE_CONFIG_H -I. -I.. -I.. -I/usr/local/include -L/usr/local/lib -MT libstaden_read_la-cram_io.lo -MD -MP -MF .deps/libstaden_read_la-cram_io.Tpo -c cram_io.c -o libstaden_read_la-cram_io.o
cram_io.c:66:10: fatal error: 'lzma.h' file not found
#include <lzma.h>
         ^
1 error generated.
make[5]: *** [libstaden_read_la-cram_io.lo] Error 1
make[4]: *** [all-recursive] Error 1
make[3]: *** [all] Error 2
make[2]: *** [libstadenio-prefix/src/libstadenio-stamp/libstadenio-build] Error 2
make[1]: *** [CMakeFiles/libstadenio.dir/all] Error 2
make: *** [all] Error 2

"you are using an empty matrix" error, seemingly during bias correction (--biasCorrect)

I'm running salmon v0.4.0 (downloaded and compiled today) on GENCODE v22 and got the following error:

Performing PCA decomposition
salmon: /home/merkija1/software/salmon-0.4.0/include/eigen3/Eigen/src/Core/Redux.h:202: static Eigen::internal::redux_impl<Func, Derived, 3, 0>::Scalar Eigen::internal::redux_impl<Func, Derived, 3, 0>::run(const Derived&, const Func&) [with Func = Eigen::internal::scalar_sum_op; Derived = Eigen::Block<const Eigen::Matrix<double, -1, -1>, -1, 1, true>; Eigen::internal::redux_impl<Func, Derived, 3, 0>::Scalar = double]: Assertion `size && "you are using an empty matrix"' failed.
Aborted

The command I ran is:
salmon-0.4.0/src/salmon quant --index gencode.v22.index_0.4.0/ --mates1 <(gunzip -c r1_fq1.gz r1_fq2.gz) --mates2 <(gunzip -c r2_fq1.gz r2_fq2.gz) --output $OUTPUT_DIR --biasCorrect --threads 4 --geneMap gencode.v22.annotation.nochr.gtf --libType "ISF"

If I remove the --biasCorrect flag, it runs without error.

bootstrap results

Hi,

I used the flag --numBootstraps 100, but I don't get bootstrap results in my output folder. There is an aux/bootstrap folder with two gz files, but they do not look like the results.

My command:
{
"salmon_version": "0.6.0",
"index": "index_salmon_quasi",
"libType": "ISF",
"mates1": "R1.fastq",
"mates2": "R2.fastq",
"output": "K562",
"numBootstraps": "100",
"threads": "10"
}

error installing: CMake Error: Problem with archive_write_finish_entry(): Can't restore time

I get the following error when installing salmon:

cd /var/tmp/sjackman/salmon20150803-27338-1qoka2g/salmon-0.4.2/external && /gsc/btl/linuxbrew/Cellar/cmake/3.3.0/bin/cmake -P /var/tmp/sjackman/salmon20150803-27338-1qoka2g/salmon-0.4.2/libcereal-prefix/src/libcereal-stamp/extract-libcereal.cmake
-- extracting...
     src='/var/tmp/sjackman/salmon20150803-27338-1qoka2g/salmon-0.4.2/external/cereal-v1.0.0.tgz'
     dst='/var/tmp/sjackman/salmon20150803-27338-1qoka2g/salmon-0.4.2/external/cereal-1.0.0'
-- extracting... [tar xfz]
CMake Error: Problem with archive_write_finish_entry(): Can't restore time
CMake Error: Problem extracting tar: /var/tmp/sjackman/salmon20150803-27338-1qoka2g/salmon-0.4.2/external/cereal-v1.0.0.tgz
-- extracting... [error clean up]
CMake Error at /var/tmp/sjackman/salmon20150803-27338-1qoka2g/salmon-0.4.2/libcereal-prefix/src/libcereal-stamp/extract-libcereal.cmake:33 (message):
  error: extract of
  '/var/tmp/sjackman/salmon20150803-27338-1qoka2g/salmon-0.4.2/external/cereal-v1.0.0.tgz'
  failed

Is it possible to compile salmon with its external dependencies provided externally rather than vendored into salmon? What is cereal?

Gibbs and Bootstrap issues

Hi, I was trying to investigate a thing using the Gibbs samples (and bootstraps).

I vaguely recall seeing both issues I'm reporting here somewhere, but I can't find those by searching.

I used Salmon 0.6.0, first with this command

salmon quant \
-i /nfs/research2/teichmann/reference/danio-rerio/salmon/zv9_cdna_ercc_gfp_index \
-l IU \
-p 4 \
-1 SRS515267_1.fastq \
-2 SRS515267_2.fastq \
-o /tmp/SRS515267_salmon_out \
--biasCorrect \
--useFSPD \
--numGibbsSamples 100 \
&& mv /tmp/SRS515267_salmon_out .

Then I converted the Gibbs samples bootstrap files to a TSV with the script in this repository.

The problem is that I get rather nonsensical values in the Gibbs samples (attaching a screenshot of my notebook because that is the quickest way...)

image

What I vaguely recall is that Gibbs sampling does not work with one of the bias-modelling options?

I think there should at least be a warning in the documentation entry for --numGibbsSamples

After this I ran the same command, but replaced --numGibbsSamples with --numBootstraps.

Now I get this:

image

The values seem all right, except I'd prefer the values to be TPMs rather than counts. Is the transformation I'm doing in "line 78" fine for converting the counts back to TPM, or is it more complicated than that?
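For reference, the usual counts-to-TPM conversion, given the effective lengths from quant.sf, is the one sketched below. Whether this matches the transformation Salmon applies internally to its bootstrap output is a question for the developers; this is just the standard definition of TPM, not a claim about Salmon's code.

```python
# Standard counts -> TPM conversion: normalize each count by effective
# length, then rescale the per-length rates to sum to one million.

def counts_to_tpm(counts, eff_lengths):
    """counts[i] and eff_lengths[i] belong to transcript i; returns TPMs."""
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    denom = sum(rates)
    return [1e6 * r / denom for r in rates]
```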

Gene counts missing transcripts

Hi,

Fantastic program, so much quicker than our old workflow. I ran into one issue when trying to get the gene counts in addition to the transcript counts (quant.genes.sf & quant.sf). I know Ucp1 should be highly expressed, but didn't see it in quant.genes.sf.

I have Ucp1 in the gene to transcript list (the list I supplied using the -g option):
$ grep Ucp1 ucsc2gene.tsv
uc009mjx.2 Ucp1

I don't have the Ucp1 gene in the Salmon gene output (quant.genes.sf):
$ grep Ucp1 quant.genes.sf
(nothing)

But I do see the corresponding transcript in the Salmon transcript output (quant.sf):
$ grep uc009mjx.2 quant.sf
uc009mjx.2 1644 1456.52 1599.27 51083

Based on this, I would have expected to see Ucp1 in quant.genes.sf, as I thought this file was a summation of the transcript counts corresponding to a given gene. Am I missing something about the functionality of the geneMap option, or should I have seen that gene in my quant.genes.sf output?
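The expected behavior described above amounts to a simple aggregation over the tx-to-gene map. This sketch shows that aggregation under the file formats implied by the issue (the function and its handling of unmapped transcripts are assumptions, not Salmon's actual geneMap implementation):

```python
# Sketch: sum transcript-level NumReads into gene-level counts using a
# tx -> gene map like ucsc2gene.tsv. Hypothetical helper for illustration.

from collections import defaultdict

def sum_gene_counts(tx2gene, quant_rows):
    """tx2gene: dict mapping transcript id -> gene id.
    quant_rows: iterable of (transcript_id, num_reads) from quant.sf.
    Returns dict of gene id -> summed NumReads."""
    genes = defaultdict(float)
    for tx, reads in quant_rows:
        gene = tx2gene.get(tx)
        if gene is not None:  # transcripts absent from the map are dropped
            genes[gene] += reads
    return dict(genes)
```

If Ucp1 is absent from quant.genes.sf while uc009mjx.2 is present in quant.sf and in the map, the summation is not behaving as sketched here.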

Thanks for your help.
Matt

How to convert TPM to FPKM from Salmon output

Hi, I really like Salmon, it's really fast. However, I need FPKM rather than TPM. The output gives me three values: Length, TPM and NumReads.
In RSEM, there is both a length (which equals the length I got from Salmon and is an integer) and an effective length (which is used for the conversion between FPKM and TPM).
It seems that the effective length is the term used to calculate TPM and also for the conversion. This effective length also varies from sample to sample, so I cannot use the value from RSEM to make the conversion. What should I do to get this effective length, or to get the FPKM directly?

Thank you!

l_bar = \sum_i (TPM_i / 10^6) * effective_length_i   (i runs over every transcript)
FPKM_i = (10^3 / l_bar) * TPM_i
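The quoted formulas translate directly into code. The sketch below computes l_bar as the TPM-weighted mean effective length and then rescales each TPM (the function name is illustrative):

```python
def tpm_to_fpkm(tpms, eff_lengths):
    """Convert TPM to FPKM via the quoted formulas:
    l_bar = sum_i (TPM_i / 1e6) * eff_len_i
    FPKM_i = (1e3 / l_bar) * TPM_i
    """
    l_bar = sum(t / 1e6 * l for t, l in zip(tpms, eff_lengths))
    return [1e3 / l_bar * t for t in tpms]
```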

Space in fasta sequence causes quiet failure

I had a FASTA file with a space at the end of a sequence line. This caused salmon to fail somewhat quietly (no output files are produced).

It would be nice to report the specific problem with the input or position of the failing line, ...
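Until such a diagnostic exists, a pre-flight scan of the FASTA can surface the offending line. This is a hypothetical helper, not part of Salmon:

```python
# Sketch: report FASTA sequence lines containing whitespace (the failure
# mode above), with their line numbers, before handing the file to salmon.

def find_whitespace_in_sequences(fasta_lines):
    """Return [(line_number, line)] for sequence lines with trailing or
    internal whitespace; header lines starting with '>' are skipped."""
    bad = []
    for n, raw in enumerate(fasta_lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith(">"):
            continue
        if line != line.strip() or any(c.isspace() for c in line.strip()):
            bad.append((n, line))
    return bad
```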

log:

Logs will be written to output_dir/logs
there is 1 lib
[2015-10-09 15:47:13.170] [jointLog] [info] parsing read library format
[bns_restore_core] Parse error reading ./current_index/bwaidx.amb

Output pseudobams?

Would it be possible to output a pseudobam file, similar to kallisto?

We'd use the pseudobam file to clean PCR duplicates by using unique molecular identifiers (UMI), often used when doing single-cell RNA-seq. The way we are currently doing it is by sticking the UMI for each read in the read name, aligning and then cleaning the BAM file by keeping only alignments with unique (UMI, mapping position) pairs.

I was imagining running salmon twice, feeding back in a UMI-cleaned pseudobam file to do the final quantification.

kallisto doesn't exactly do what we want because the read name is lost in the pseudobam, but there isn't any reason it couldn't be kept, or stuck in the INFO field as a tag.
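The dedup step described above (keep one alignment per unique (UMI, mapping position) pair, with the UMI embedded in the read name) can be sketched as follows. The record layout and naming convention are assumptions for illustration; a real implementation would parse the BAM, e.g. with pysam:

```python
# Sketch: deduplicate alignments by (UMI, reference, position), assuming
# the UMI was appended to the read name as NAME_UMI. Works on parsed
# SAM-style tuples to stay self-contained.

def dedup_by_umi(alignments):
    """alignments: iterable of (read_name, ref, pos). Keeps the first
    alignment seen for each unique (UMI, ref, pos) key."""
    seen = set()
    kept = []
    for name, ref, pos in alignments:
        umi = name.rsplit("_", 1)[-1]
        key = (umi, ref, pos)
        if key not in seen:
            seen.add(key)
            kept.append((name, ref, pos))
    return kept
```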

Automatically infer read orientation

Suggested by @mdshw5: as is done by BWA-MEM, use a small batch of reads to infer the most likely library orientation. The user should also be able to explicitly provide the library type, which will bypass the inference and impose the user's choice on the fragment compatibility function.
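A minimal version of that batch-based inference could vote over the strand bits of a sample of properly paired alignments. The function below is a hypothetical sketch, not Salmon or BWA-MEM code: flags alone distinguish opposite-strand (inward/outward) from same-strand pairs, while telling inward from outward would additionally need mapping positions. The flag bits follow the SAM specification (0x10 = read reverse, 0x20 = mate reverse).

```python
# Sketch: majority-vote strand configuration from SAM FLAG values of
# read-1 records in a small sample of alignments.

from collections import Counter

def infer_orientation(flags, sample_size=1000):
    """flags: iterable of SAM FLAG values for read-1 records."""
    votes = Counter()
    for flag in list(flags)[:sample_size]:
        r1_rev = bool(flag & 0x10)   # this read on reverse strand
        r2_rev = bool(flag & 0x20)   # mate on reverse strand
        if r1_rev != r2_rev:
            votes["opposite strands (I or O)"] += 1
        else:
            votes["same strand (M)"] += 1
    return votes.most_common(1)[0][0] if votes else None
```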

NaNs generated for up to 60% of transcripts with --useFSPD and --biasCorrect turned on

After generating issue #48, I took the recommendation of switching to the standard EM algorithm, but I'm having further problems. However, I don't think the problem was with the variational EM algorithm, but rather with how the dataset is behaving.

This is with the same dataset as before (single end, rRNA-depleted, second-strand protocol, extreme depth of 170M+ reads). I have the options --useFSPD and --biasCorrect turned on with library type "SF"; the full call is

salmon quant -i $SALMON_DIR -l SF -r <(gzip -c -d $IN_FILE) -o $OUTPUT \
                --numBootstraps 100 --useFSPD --geneMap $GENES \
                --biasCorrect -p 59

I had attempted to use wasabi and run sleuth, but I got an error where the number of transcripts passing the initial filter was "NA". I then discovered that for four samples, many of the transcripts had "-nan" generated for the "NumReads" column, and this led to all of them having "-nan" for the TPM column. One sample had ~100 that failed, but the other three had a variable 106K-109K out of 176K total transcripts fail. No warning or error was thrown during the quantifying or EM optimization steps, so I don't know what happened.

Interestingly, I should note that the NaNs are only generated when both --biasCorrect and --useFSPD are turned on. NaNs are not generated when I use only one or neither option (though this was only tested with one sample).

If you have immediate suggestions, that would be great. Otherwise, I can work on generating a test dataset.

Error compiling salmon on Ubuntu 14.04

After starting a blank Ubuntu machine, and then executing:

sudo apt-get update && \
sudo apt-get -y install screen git curl gcc make g++ python-dev unzip \
         default-jre pkg-config libncurses5-dev r-base-core r-cran-gplots \
         python-matplotlib python-pip python-virtualenv sysstat fastqc \
         trimmomatic bowtie samtools blast2
sudo apt-get -y install cmake libboost-all-dev liblzma-dev

curl -O -L https://github.com/COMBINE-lab/salmon/archive/v0.5.0.tar.gz

tar xzf v0.5.0.tar.gz
cd salmon-0.5.0
cmake .
make

gives

In file included from /home/ubuntu/salmon-0.5.0/include/BAMQueue.hpp:146:0,
                 from /home/ubuntu/salmon-0.5.0/include/AlignmentLibrary.hpp:14,
                 from /home/ubuntu/salmon-0.5.0/src/SalmonUtils.cpp:13:
/home/ubuntu/salmon-0.5.0/include/BAMQueue.tpp: In function 'bool checkProperPairedNames_(const char*, const char*, uint32_t)':
/home/ubuntu/salmon-0.5.0/include/BAMQueue.tpp:247:33: error: 'BOOST_LIKELY' was not declared in this scope
     if (BOOST_LIKELY(nameLen > 1)) {
                                 ^

I'm not sure if this is a Boost version requirement, or what -- looks like I have boost 1.54 installed.

Suggestions welcome!
