seqan / slimm Goto Github PK

Species Level Identification of Microbes from Metagenomes

License: Other

CMake 9.40% C++ 67.08% Python 23.52%

slimm's Introduction

SLIMM - Species Level Identification of Microbes from Metagenomes

a taxonomic profiling tool that investigates which microorganisms are present in a sequenced sample. SLIMM requires a BAM/SAM alignment file as an input. One can use a read mapper of choice to map raw reads obtained from a sequencing machine to obtain the BAM/SAM file required as input for SLIMM.

slimm_build [OPTIONS] -nm names.dmp -nd nodes.dmp FASTA_DB nucl_gb.accession2taxid
slimm [OPTIONS] $SLIMM_DB_PATH $SAM_FILE_PATH
Try 'slimm --help' for more information.

VERSION

* SLIMM version: 0.3.0
* Last update: February 2018

Downloads:

Executables

Pre-built executables for Linux and Mac are made available at the releases page.

Source code

You can build SLIMM from its source. Instruction on how to build from source can be found at the [slimm wiki] (https://github.com/seqan/slimm/wiki)

Cite us

If you use SLIMM in your work-flows, don't forget to cite us.

Dadi TH, Renard BY, Wieler LH, Semmler T, Reinert K. (2017) SLIMM: species level identification of microorganisms from metagenomes. PeerJ 5:e3138 https://doi.org/10.7717/peerj.3138

slimm's People

Contributors

Stargazers

Watchers

Forkers

temehi cdiener sdwfrost

slimm's Issues

--directory option causes a segmentation fault and produce bad profiles

Running slimm on a batch BAM/FILES within a dirctory using the --directory producess a segmentation fault after processing a couple of files. The profiles generated and aother results are also corrupted except for the first file.

But running slimm on individual files one by one works fine and produces the correct result.

[Done!] File took 3 secs to process.

Reading 3 of 8 files ... (simulated_reads.sam)
=================================================================
Intializing coverages for all reference genome ... [0 secs]
Analysing alignments, reads and references ....... [0 secs]
  345 records processed.
    297 matching reads
    297 uniquily matching reads
  references with reads = 1
  expected bins coverage = 0.00547997
  bins coverage cut-off = 0.016129 (0.95 quantile)
  uniq bins coverage cut-off = 0.00446429 (0.95 quantile)

Filtering unlikely sequences ..................... [0 secs]
  0 passed the threshould coverage.
  1 ref's couldn't pass the coverage threshould.
  0 ref's couldn't pass the uniq coverage threshould.
  uniquily matching reads increased from 297 to 0

Assigning reads to Least Common Ancestor (LCA) ... [0 secs]
Writing taxnomic profile(s) ...................... /storage/mi/dadi/workspace/development/slimm/include/seqan/include/seqan/basic/basic_exception.h:363 FAILED!  (Uncaught exception of type std::out_of_range: _Map_base::at)

stack trace:
  0                      [0x449c19]
  1                      [0x44a1c6]
  2                      [0x47e326]
  3                      [0x47e371]
  4                      [0x471c73]
  5                      [0x4d0e1f]
  6                      [0x41d41c]
  7                      [0x4459b8]
  8                      [0x4230a0]
  9                      [0x404cbf]
 10                      [0x514740]
 11                      [0x4060a7]

[1]    3361 abort      slimm -v -d -o slimm_reports/ slimm_db_5K.sldb bam_files
➜

Question: How does slimm deal with discordant mapping of paired end reads?

Dear @temehi & @agakrawczyk ,

@agakrawczyk and me ran in problems when using bowtie2 and slimm:

Yara, bowtie2 and slimm indizes with Human chr1 & chr11 and C-RVDB (https://rvdb.dbi.udel.edu/) were built.
An in silico data set comprising 91% Human chr1 & chr11 reads and 9% Human virus reads of various species was generated and mapped with bowtie2 or yara.
Slimm was used for abundance estimation for bowtie2 or yara mappings.

While "yara+slimm" gave consistent results for all assayed viruses, "bowtie2+slimm" did fail for two viruses. Closer inspection on mapping files showed that bowtie2 reported many discordant mapping across various reference sequences:

MK630134.1_50_0 97      KY315545.1      7713    1       301M    KY315552.1      8791    0       CCAGGTCCAGTCAGATAATAAATATCCGATAAGGAACAAAAGGAAAAGTCAAAGTCCTGGAAAGCATCCACCCTGATTCTCTTGCCGGAATGTTTTGCCACGTAATCCTTCATAGCAGGGAGATTCCTCTGTAAAAGGATAAATTCCTCCCGGCCGCATTTAGTCTCTCCGAACCACATCTTTTCCTGCTCTGGATCTAACTCCGCTTCGGCAAAAGGCGGAAATTGTCTAAGTCCAATATTGAAGAACTGGACCAATGATTCTGCTATGATGTACACTTTTTCCGTCACCGTGTCCACGG   CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGDGGFGGGGGGGGG9FGGGFGGGGGFGGGGG<GCGGFGGFAGCGGG@GGGGEFGEFDGGGGD9FGGGGGGAGGEFFFCGFGFDAEEFF@FF@8GGECFCDFG7;@@EGFGG*AFCFG,FGE:EF@F,:@G*G?DFDF7*FGCGGCC9EFE+FDG)FC78FFG6<2GG73;+<AA2F881)1)5.)8.0:).+AF09:F9/*   AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:281C19     YT:Z:UP
MK630134.1_50_0 353     KY290183.1      7394    1       301M    KY315552.1      8791    0       CCAGGTCCAGTCAGATAATAAATATCCGATAAGGAACAAAAGGAAAAGTCAAAGTCCTGGAAAGCATCCACCCTGATTCTCTTGCCGGAATGTTTTGCCACGTAATCCTTCATAGCAGGGAGATTCCTCTGTAAAAGGATAAATTCCTCCCGGCCGCATTTAGTCTCTCCGAACCACATCTTTTCCTGCTCTGGATCTAACTCCGCTTCGGCAAAAGGCGGAAATTGTCTAAGTCCAATATTGAAGAACTGGACCAATGATTCTGCTATGATGTACACTTTTTCCGTCACCGTGTCCACGG   CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGDGGFGGGGGGGGG9FGGGFGGGGGFGGGGG<GCGGFGGFAGCGGG@GGGGEFGEFDGGGGD9FGGGGGGAGGEFFFCGFGFDAEEFF@FF@8GGECFCDFG7;@@EGFGG*AFCFG,FGE:EF@F,:@G*G?DFDF7*FGCGGCC9EFE+FDG)FC78FFG6<2GG73;+<AA2F881)1)5.)8.0:).+AF09:F9/*   AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:281C19     YT:Z:UP
MK630134.1_50_0 353     KY316048.1      11168   1       301M    KY315552.1      8791    0       CCAGGTCCAGTCAGATAATAAATATCCGATAAGGAACAAAAGGAAAAGTCAAAGTCCTGGAAAGCATCCACCCTGATTCTCTTGCCGGAATGTTTTGCCACGTAATCCTTCATAGCAGGGAGATTCCTCTGTAAAAGGATAAATTCCTCCCGGCCGCATTTAGTCTCTCCGAACCACATCTTTTCCTGCTCTGGATCTAACTCCGCTTCGGCAAAAGGCGGAAATTGTCTAAGTCCAATATTGAAGAACTGGACCAATGATTCTGCTATGATGTACACTTTTTCCGTCACCGTGTCCACGG   CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGDGGFGGGGGGGGG9FGGGFGGGGGFGGGGG<GCGGFGGFAGCGGG@GGGGEFGEFDGGGGD9FGGGGGGAGGEFFFCGFGFDAEEFF@FF@8GGECFCDFG7;@@EGFGG*AFCFG,FGE:EF@F,:@G*G?DFDF7*FGCGGCC9EFE+FDG)FC78FFG6<2GG73;+<AA2F881)1)5.)8.0:).+AF09:F9/*   AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:281C19     YT:Z:UP
MK630134.1_50_0 353     KY274508.1      7501    1       301M    KY315552.1      8791    0       CCAGGTCCAGTCAGATAATAAATATCCGATAAGGAACAAAAGGAAAAGTCAAAGTCCTGGAAAGCATCCACCCTGATTCTCTTGCCGGAATGTTTTGCCACGTAATCCTTCATAGCAGGGAGATTCCTCTGTAAAAGGATAAATTCCTCCCGGCCGCATTTAGTCTCTCCGAACCACATCTTTTCCTGCTCTGGATCTAACTCCGCTTCGGCAAAAGGCGGAAATTGTCTAAGTCCAATATTGAAGAACTGGACCAATGATTCTGCTATGATGTACACTTTTTCCGTCACCGTGTCCACGG   CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGDGGFGGGGGGGGG9FGGGFGGGGGFGGGGG<GCGGFGGFAGCGGG@GGGGEFGEFDGGGGD9FGGGGGGAGGEFFFCGFGFDAEEFF@FF@8GGECFCDFG7;@@EGFGG*AFCFG,FGE:EF@F,:@G*G?DFDF7*FGCGGCC9EFE+FDG)FC78FFG6<2GG73;+<AA2F881)1)5.)8.0:).+AF09:F9/*   AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:281C19     YT:Z:UP
MK630134.1_50_0 353     MH698400.1      15286   1       301M    KY315552.1      8791    0       CCAGGTCCAGTCAGATAATAAATATCCGATAAGGAACAAAAGGAAAAGTCAAAGTCCTGGAAAGCATCCACCCTGATTCTCTTGCCGGAATGTTTTGCCACGTAATCCTTCATAGCAGGGAGATTCCTCTGTAAAAGGATAAATTCCTCCCGGCCGCATTTAGTCTCTCCGAACCACATCTTTTCCTGCTCTGGATCTAACTCCGCTTCGGCAAAAGGCGGAAATTGTCTAAGTCCAATATTGAAGAACTGGACCAATGATTCTGCTATGATGTACACTTTTTCCGTCACCGTGTCCACGG   CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGDGGFGGGGGGGGG9FGGGFGGGGGFGGGGG<GCGGFGGFAGCGGG@GGGGEFGEFDGGGGD9FGGGGGGAGGEFFFCGFGFDAEEFF@FF@8GGECFCDFG7;@@EGFGG*AFCFG,FGE:EF@F,:@G*G?DFDF7*FGCGGCC9EFE+FDG)FC78FFG6<2GG73;+<AA2F881)1)5.)8.0:).+AF09:F9/*   AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:281C19     YT:Z:UP
MK630134.1_50_0 353     KY315552.1      7711    1       301M    =       8791    0       CCAGGTCCAGTCAGATAATAAATATCCGATAAGGAACAAAAGGAAAAGTCAAAGTCCTGGAAAGCATCCACCCTGATTCTCTTGCCGGAATGTTTTGCCACGTAATCCTTCATAGCAGGGAGATTCCTCTGTAAAAGGATAAATTCCTCCCGGCCGCATTTAGTCTCTCCGAACCACATCTTTTCCTGCTCTGGATCTAACTCCGCTTCGGCAAAAGGCGGAAATTGTCTAAGTCCAATATTGAAGAACTGGACCAATGATTCTGCTATGATGTACACTTTTTCCGTCACCGTGTCCACGG   CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGDGGFGGGGGGGGG9FGGGFGGGGGFGGGGG<GCGGFGGFAGCGGG@GGGGEFGEFDGGGGD9FGGGGGGAGGEFFFCGFGFDAEEFF@FF@8GGECFCDFG7;@@EGFGG*AFCFG,FGE:EF@F,:@G*G?DFDF7*FGCGGCC9EFE+FDG)FC78FFG6<2GG73;+<AA2F881)1)5.)8.0:).+AF09:F9/*   AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:281C19     YT:Z:UP

Yara did not report these discordant mappings.

My question: How does slimm deal with discordant mappings? My suspection is that these are discarded.

Thanks in advance!

Can SLIMM provide report on sub species?

For example, in the taxonomy file names.dmp, regarding species Acidithiobacillus ferrooxidans, there are two sub species ( Acidithiobacillus ferrooxidans ATCC 23270 and ATCC 53993), I have two questions: (1) besides Acidithiobacillus ferrooxidans, can SLIMM provide report on Acidithiobacillus ferrooxidans ATCC 23270 and Acidithiobacillus ferrooxidans ATCC 53993, separately? (2) if some reads are aligned/assigned to sub species ATCC 23270, some to sub species ATCC 53993, some to species Acidithiobacillus ferrooxidans, how the coverage for species Acidithiobacillus ferrooxidans reported in the file sorted_species_reported.tsv is calculated?

Thanks,

Qunyuan Zhang ([email protected])

Accessions not mapped to taxaid

Dear @temehi

$ slimm --version
slimm version: 0.3.4
SeqAn version: 2.4.0

Following the procedure outlined in https://github.com/seqan/slimm/wiki/Preparing-a-custom-database I would like to build a slimmdb for the following FASTA file:

>NC_024015.1
AGAATTTGCCC
>NC_001798.2
AGTCCCCGTCT
>NC_030692.1
TGTTGCGTTAA
>NC_027892.1
CAGCTCTCGCA
>NC_029642.1
TGTTGCGTTAA

I downloaded the taxdump.tar.gz and *.accession2taxid.gz as outlined.

The building of the db fails because all five accessions can not be mapped:

$ slimm_build -v -nm names.dmp -nd nodes.dmp -o test.sldb test.fasta *.accession2taxid.gz
[MSG] getting accessions numbers from fasta file ...
[MSG] mapping accessions to taxaid ...
[VERBOSE MSG] mapping file: [1/3]       iter: [1]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [2/3]       iter: [1]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [2/3]       iter: [2]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [2/3]       iter: [3]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [2/3]       iter: [4]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [1]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [2]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [3]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [4]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [5]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [6]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [7]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [8]       accessions left: [5/5]
[WARNING!] 5 accessions (NC_001798, NC_024015, NC_027892, ...) were not mapped to taxaid.
[WARNING!] Take a look at test.missed file for a complete list.
[WARNING!] Try including the more ACCESSION2TAXAID MAP FILE (e.g. dead_nucl.accession2taxid)
[MSG] loading nodes and names mappings from files ...
[MSG] getting taxonomic linages and resolving names ...

However the accessions exist in the *.accession2taxid.gz files:

$ zcat *.accession2taxid.gz | grep -e "NC_024015.1" -e "NC_001798.2" -e "NC_030692.1" -e "NC_027892.1" -e "NC_029642.1"
NC_024015       NC_024015.1        1587534 612184456
NC_001798       NC_001798.2        10310   820945149
NC_027892       NC_027892.1        1715290 931317065
NC_029642       NC_029642.1        1715293 1004345262
NC_030692       NC_030692.1        1714622 1049010306

I also tried with unzipping the accession2taxid.gz files to no avail.

Can you provide any help?

Best,

can't download viral bowtie2 index

Hi,

I found the following command not work

wget https://ftp.mi.fu-berlin.de/pub/dadi/slimm/V_genomes_indices_20180924_bowtie.zip

$ wget https://ftp.mi.fu-berlin.de/pub/dadi/slimm/V_genomes_indices_20180924_bowtie.zip
--2019-01-16 11:26:15-- https://ftp.mi.fu-berlin.de/pub/dadi/slimm/V_genomes_indices_20180924_bowtie.zip
Resolving ftp.mi.fu-berlin.de... 160.45.117.8
Connecting to ftp.mi.fu-berlin.de|160.45.117.8|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2019-01-16 11:26:16 ERROR 404: Not Found.

Could you help to look at this problem?

Many thanks

Zhuofei

various questions

Thanks a lot for your efforts developing this promising tool.

I would like to use SLIMM to characterize microbial species composition from various metagenomes.
I do not have any issue to report, I have just some questions :

Is it a good idea to try bowtie2 local alignement parameters? Is there any rationale to set the -k value such as metagenomes a priori diversity or size ?
Do you have any details/guidances regarding the -w and --cov-cut-off parameters ?
Does the _profile.tsv output report taxa abundance (i.e., genome size normalized) or is it just the percentage of reads?
Is there any easy way to merge all the _profile.tsv or generate a biom file species table which could be easily loaded in R phyloseq?
I would like to prepare a custom database using the 'curated' database created by the DTU, I have made some tests on two sequences and the header seems to be not properly formatted.

[MSG] getting accessions numbers from fasta file ...
[MSG] mapping accessions to taxaid ...
[WARNING!] 1 accessions (kraken:taxid, ...) were not mapped to taxaid.
[WARNING!] Take a look at slimm_db_custom.missed file for a complete list.
[WARNING!] Try including the more ACCESSION2TAXAID MAP FILE (e.g. dead_nucl.accession2taxid)
[MSG] loading nodes and names mappings from files ...
[MSG] getting taxonomic linages and resolving names ...
[@cc2-n23 SLIMM]$ tail slimm_db_custom.missed
kraken:taxid

kraken:taxid|556484|NC_011669.1|1 Phaeodactylum tricornutum CCAP 1055/1 chromosome 1, whole genome shotgun sequence
`
If you have an idea how to fix this?

Thanks a lot for your help and efforts developing SLIMM.

Cheers,

Flo

Simulation tests failed!

Hi,

Congratulations on your works, I just tested SLIMM and I found it was partially working as I wished.

I downloaded the 5 sequences as references to generate simulation read with art. Then these reads were subjected to AB-5K pre-formatted database as you provided. The alignment result was analysis by SLIMM.

Reference:

NC_003028.3 Streptococcus pneumoniae TIGR4, complete genome
NC_021994.1 Enterococcus faecium Aus0085, complete genome
NZ_CP015831.1 Escherichia coli O157 strain 644-PT8, complete genome
NZ_CP020463.1 Staphylococcus epidermidis strain 1457 chromosome, complete genome
NZ_CP016032.1 Serratia marcescens strain U36365, complete genome

I think the coverage should be like depth-coverage. Then the value for each species tested should be 20. But coverage for Serratia marcescens were almost 0.
Anyway the good news is what I want is reported.

Results (sorted by coverage):
3 Streptococcus pneumoniae 1313 287886 11.7744 1 19.9843
9 Enterococcus faecium 1352 399132 16.3243 1 18.4821
4 Escherichia coli 562 644243 26.3492 6 18.238
7 Staphylococcus epidermidis 1282 304714 12.4626 4 17.6741
5 Serratia marcescens 615 25625 1.04805 4 0.714683
6 Serratia sp. FS14 1327989 3657 0.149569 1 0.104488
1 Escherichia fergusonii 564 3336 0.136441 1 0.102832
2 Serratia sp. SCBI 488142 3179 0.130019 1 0.093465
8 Citrobacter rodentium 67825 1362 0.0557051 1 0.0375254

Code:
#simulation
art_illumina -ss HSXt -sam -i candidate.fasta -p -l 150 -f 20 -m 200 -s 50 -o candidate_paired
#alignment with bowtie2
bowtie2 -x /home/liyang/test_tool/slimm-0.2.2/AB_5K_indexed_ref_genomes_bowtie2/AB_5K -1 candidate_paired1.fq -2 candidate_paired2.fq -q --no-unal --mm -p 10 -k 60 2>candidate_bowtie2.log | samtools view -bS - > candidate_bowtie2.bam
#Analysis with SLIMM
slimm -m /home/liyang/test_tool/slimm-0.2.2/slimmDB_5K/ -o ./ candidate_bowtie2.bam

Could you give me some suggestions?

Cheers!

problem building from source

I ran cmake just fine, but then ran into a problem when running make:

This worked fine:

$ cmake ../slimm/src -DCMAKE_PREFIX_PATH="$HOME/bin/applications/seqan/util/cmake" -DSEQAN_INCLUDE_PATH="$HOME/bin/applications/seqan/include"

But, then I got this error in the next step:

$ make
Scanning dependencies of target slimm
make[2]: Warning: File `CMakeFiles/slimm.dir/depend.make' has modification time 40 s in the future
[ 50%] Building CXX object CMakeFiles/slimm.dir/slimm.o
<command-line>:0:19: error: too many decimal points in number
/data/home/dfulop/bin/applications/slimm/src/slimm.h:592:24: note: in expansion of macro 'SEQAN_APP_VERSION'
     setVersion(parser, SEQAN_APP_VERSION);
                        ^
make[2]: *** [CMakeFiles/slimm.dir/slimm.o] Error 1
make[1]: *** [CMakeFiles/slimm.dir/all] Error 2
make: *** [all] Error 2

It seems a bit odd that this would happen, that is that it would fail due to SeqAn's version. I installed SeqAn with Linuxbrew, but the seqan dir pointed to in the cmake flags is a downloaded source. However, it's the same version number as the Linuxbrew-installed seqan.

empty taxonomic profiles

I am getting empty taxonomic profiles out of SLIMM, despite the reference filtering step preserving 100 or more taxa (more or less depending on parameters). I am puzzled as to how I should adjust parameters to get SLIMM to work. I am adjusting the parameters that make sense to me from reading your manuscript, namely --bin-width, --min-reads, and --cov-cutoff.

Any advice in terms of troubleshooting, QC, and/or parameter settings would be greatly appreciated!

Here's one example stdout from my data:

Reading 1 of 6 files ...
SWT1_S36_microbe_mapped.bam
computing features of each reference genome ...
in 0 secs [OK!]

Analysing alignments, reads and references ...
  124865015 records processed.
    39820109 matching reads
    25279026 uniquily matching reads
in 738 secs

Number of Ref with reads = 4363
Expected Coverage = 0.374302
Coverage Cutoff = 0.00156425 (0.999 quantile)
UniqCoverage Cutoff = 0.000161812 (0.999 quantile)
Filtering unlikely sequences ...
  441 passed the threshould coverage.
  1217 ref's couldn't pass the coverage threshould.
  3905 ref's couldn't pass the uniq coverage threshould.
  Uniquily matching reads increased from 25279026 to 25455761
in 45 secs [OK!]

Writing features to a file ...
in 0 secs [OK!]

Assigning reads to Least Common Ancestor (LCA)
in 36 secs [OK!]

Writing taxnomic profile(s) ...
        species level: 0 bellow cutoff (0.001) ...
          genus level: 0 bellow cutoff (0.001) ...
         family level: 0 bellow cutoff (0.001) ...
          order level: 0 bellow cutoff (0.001) ...
          class level: 0 bellow cutoff (0.001) ...
         phylum level: 0 bellow cutoff (0.001) ...
   superkingdom level: 0 bellow cutoff (0.001) ...
in 0 secs [OK!]

Segmentation fault during assigning reads to LCA

Might be an issue with my setup, but I thought I'd ask anyway. As far as I know, I've done everything correctly in set up (downloaded binary for my system, got the bowtie2 index and the database, used bowtie2 and samtools to generate the .bam file). I get a segmentation fault during the "assigning reads to LCA" step when I try to run slimm with the .bam file.

For reference, I'm trying to use the program in a shared cluster environment (my local workstation is lacking in memory).

Any ideas? Thanks in advance!

Problem with bowtie2 example

In the bowtie2 example, the bowtie2 -k 60 switch causes a corrupt sam files to be created - it took me a long time to track this problem down. If you are able to see the same problem with the software versions I have listed below, you may want to remove this from your example.

Do you have a recommendation for how to update bowtie2 command in the absence of the -k 60 switch? Do you recommend any change since I need to run without this included?

I am following instructions posted: https://github.com/seqan/slimm/wiki/Example-Bowtie2 using the already indexed reference genomes: http://ftp.mi.fu-berlin.de/pub/dadi/slimm/

I have installed SLIMM and the AB_5K & AB_13K index edon my Macbook Pro (OS X 10.11.6) and on my university's Redhat Linux HPC (High Performance Computing) cluster. Both environments have the same problem creating the alignment bam file with either index. The problem is the sam file piped into samtools view is corrupt due to the bowtie2 -k 60 switch. It is not clear why this error occurs but it happens with both 5K & 13K precompiled index.

Samtools view error

(also seen if bowtie2 creates a sam file with -S and passes it to SLIMM)
Samtools
[W::sam_read1] parse error at line 4548
[main_samview] truncated file.

If I use bowtie2 -S to output a sam file, the same file will fail validation when checked with the Picard Tools ValidateSamFile:

Picard Tools ValidateSamFile htsjdk.samtools.SAMFormatException

Caused by: htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File test.sam; Line 13197

command line input

bowtie2 -x /Users/msioda/slimm/AB_5K_indexed_ref_genomes_bowtie2/AB_5K
-U /Applications/bowtie2-2.3.1/example/reads/longreads.fq
-q --no-unal --very-fast --mm -k 60 -p 4
2> /Users/msioda/slimm/test/lambdaErrors2.txt | samtools view -bS ->
/Users/msioda/slimm/test/lambda2.bam

This statement works fine if the -k 60 is removed. Do you know why this happens?

Software versions:

bowtie2/2.3.1
samtools/1.3.1

Also, bowtie v 2.2.0 or above is required to handle .bt2l large index files. You may want to list this as a dependancy.

build from source locally

I want to suggest that you add information for building SLIMM from source locally (i.e. without root access). Following the instructions here (http://seqan.readthedocs.io/en/master/Infrastructure/Use/FindSeqAnCMake.html#) I was able to figure out I needed to use 2 cmake options for it to build, e.g.:

cmake ../slimm/src \
-DCMAKE_PREFIX_PATH="$HOME/bin/applications/seqan/util/cmake" \
-DSEQAN_INCLUDE_PATH="$HOME/bin/applications/seqan/include"

Custom Database Questions

Thank you for updating the directions on building a custom database. I notice that your options allow for building a database with all refseq Bacterica, Archaea and Viruses, but does not include fungi or other eukaryotes. Two questions:

Would it be possible to include fungi as an option in the pre-processing scripts?
I am interested in building a custom database from NT, how would one go about doing so?

Thanks,
Matt

[feature/question] get coverage profiles

Hi, first of all congrats and thanks for SLIMM, awesome software. I was wondering if there is a way to get the the actual coverage profiles for individual species across the bins. SLIMM is pretty fast so I would like to use it to do something like in https://www.ncbi.nlm.nih.gov/pubmed/26229116. From a quick scan of the code base it looks like there is currently not an option for that, but the information is stored in the reference contig.

If I wanted the the coverage profiles for reads after they have been assigned to the most likely reference I would use the the uniq_cov2.bin_heights, correct?

Explanation of species-level classifcation filed "s_unknown_species"

Dear Temesgen,

I need some clarification on the output of slimm species-level classification "_profile.tsv". For a test sample I get the following output:

taxa_level      taxa_id linage  abundance       read_count
species 9606    k__Eukaryota|p__Chordata|c__Mammalia|o__Primates|f__Hominidae|g__Homo|s__Homo sapiens   89.1426 25602864
species 45219   k__Viruses|p__unknown_phylum|c__unknown_class|o__unknown_order|f__Arenaviridae|g__Mammarenavirus|s__Guanarito mammarenavirus    0.0178544       5128
species 1821749 k__Viruses|p__unknown_phylum|c__unknown_class|o__Picornavirales|f__Picornaviridae|g__Cardiovirus|s__Cardiovirus A       0.0136867       3931
species 0*      k__unknown_superkingdom|p__unknown_phylum|c__unknown_class|o__unknown_order|f__unknown_family|g__unknown_genus|s__unknown_species       10.8259 3109321

While most reads are classified as Human (89.14% of the reads), 11% of the reads are classified as unknown species. This is confusing because these reads are contained in the BAM file and must be mapped to a reference genome (bowtie2 --no-unal).

Does the fraction of 11% correspond to one species? Or: Could the species not be resolved because of missing taxonomic information? Or: Are these reads not discriminative for 1 species?

All the best,
Johannes