bcgsc / biobloom

Create Bloom filters for a given reference and then use them to categorize sequences

Home Page: http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools

License: GNU General Public License v3.0

C++ 64.75% C 3.97% Shell 27.96% Makefile 0.63% M4 0.39% Perl 1.47% Python 0.85%

biobloom's Issues

Update IO code

Update biobloomcategorizer, biobloommimaker, and biobloommaker to use producer-consumer I/O so that reading and writing are faster when multiple threads are used.

Add smart pair for read pairs

If the input is either a semi-sorted file (where read pairs are close together) or a file with interleaved read pairs, pair the reads automatically.

Currently, we require 2 files in the same order for paired fastq/fasta.
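
A minimal sketch of how mates could be recognized by header alone, which is the check automatic pairing would need. It assumes the common "/1" and "/2" suffix convention; `mateKey` and `areMates` are hypothetical names, not functions from BioBloomTools.

```cpp
#include <string>

// Hypothetical helper: reduce a read header to the part shared by both mates.
// Assumes the "/1" and "/2" suffix convention; other conventions are not handled.
std::string mateKey(const std::string &header) {
    // Keep only the first whitespace-delimited token.
    std::string id = header.substr(0, header.find_first_of(" \t"));
    // Drop a leading '@' or '>' if the raw header line was passed in.
    if (!id.empty() && (id[0] == '@' || id[0] == '>'))
        id.erase(0, 1);
    // Strip a trailing "/1" or "/2".
    if (id.size() >= 2 && id[id.size() - 2] == '/' &&
        (id.back() == '1' || id.back() == '2'))
        id.erase(id.size() - 2);
    return id;
}

// Two headers are treated as mates when they share the same key.
bool areMates(const std::string &a, const std::string &b) {
    return mateKey(a) == mateKey(b);
}
```

For an interleaved file, consecutive records can be tested with `areMates`. For a semi-sorted file, unmatched reads would have to be kept in a small lookup table keyed by `mateKey` until the mate appears.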

core dumps for long parameter names

$ biobloomcategorizer \
    --prefix experiment/case1/tasrkleat-results/biobloomcategorizer/cba \
    --paired_mode \
    --score 0.6 \
    --with_score \
    --inclusive \
    --filter_files 'experiment/utrtargets/bf/combined.bf' \
    --threads 4 \
    --fq experiment/case1/tasrkleat-results/extract_tarball/cba_1.fastq experiment/case1/tasrkleat-results/extract_tarball/cba_2.fastq
Segmentation fault (core dumped)

Changing --prefix to -p causes a different error

$ biobloomcategorizer \
  -p experiment/case1/tasrkleat-results/biobloomcategorizer/cba \
  --paired_mode \
  --score 0.6 \
  --with_score \
  --inclusive \
  --filter_files 'experiment/utrtargets/bf/combined.bf' \
  --threads 4 \
  --fq experiment/case1/tasrkleat-results/extract_tarball/cba_1.fastq experiment/case1/tasrkleat-results/extract_tarball/cba_2.fastq
terminate called after throwing an instance of 'std::logic_error'
what():  basic_string::_S_construct null not valid
Aborted (core dumped)

Changing --score to -s makes the error go away

$ biobloomcategorizer \
  -p experiment/case1/tasrkleat-results/biobloomcategorizer/cba \
  --paired_mode \
  -s 0.6 \
  --with_score \
  --inclusive \
  --filter_files 'experiment/utrtargets/bf/combined.bf' \
  --threads 4 \
  --fq experiment/case1/tasrkleat-results/extract_tarball/cba_1.fastq experiment/case1/tasrkleat-results/extract_tarball/cba_2.fastq
Starting to Load Filters.
Loaded Filter: combined
Filter Loading Complete.
...

Update scoring model in normal BBT

Based on revelations in #14

I need to provide additional scoring options. It would probably be better to add a scoring option that filters matches based on a minimum FPR.

Tasks:

Add option for score described in 2013 paper, with ability to switch between this one and current scoring method inside code
Add min fpr option and option for simple score based on total number of matching k-mers
(optional) option for score based computation with min all FP removed. K-mer matches are unlikely to be interleaved with error -> one match must be a false positive.

Add check for identical headers used multiple times

Prevent or handle properly cases where the same fasta header is seen when using the miBF.

Now I am having trouble running ./configure. I see some complaints that configure found the boost files but was not able to compile them. My hunch is that this is actually something failing upstream and causing downstream problems. I will attach the log as well.

Thanks!

checking boost/property_tree/ptree.hpp usability... no
checking boost/property_tree/ptree.hpp presence... yes
configure: WARNING: boost/property_tree/ptree.hpp: present but cannot be compiled
configure: WARNING: boost/property_tree/ptree.hpp:     check for missing prerequisite headers?
configure: WARNING: boost/property_tree/ptree.hpp: see the Autoconf documentation
configure: WARNING: boost/property_tree/ptree.hpp:     section "Present But Cannot Be Compiled"
configure: WARNING: boost/property_tree/ptree.hpp: proceeding with the compiler's result
configure: WARNING:     ## ------------------------------- ##
configure: WARNING:     ## Report this to [email protected] ##
configure: WARNING:     ## ------------------------------- ##
checking for boost/property_tree/ptree.hpp... no
checking boost/property_tree/ini_parser.hpp usability... no
checking boost/property_tree/ini_parser.hpp presence... yes
configure: WARNING: boost/property_tree/ini_parser.hpp: present but cannot be compiled
configure: WARNING: boost/property_tree/ini_parser.hpp:     check for missing prerequisite headers?
configure: WARNING: boost/property_tree/ini_parser.hpp: see the Autoconf documentation
configure: WARNING: boost/property_tree/ini_parser.hpp:     section "Present But Cannot Be Compiled"
configure: WARNING: boost/property_tree/ini_parser.hpp: proceeding with the compiler's result
configure: WARNING:     ## ------------------------------- ##
configure: WARNING:     ## Report this to [email protected] ##
configure: WARNING:     ## ------------------------------- ##
checking for boost/property_tree/ini_parser.hpp... no
checking google/dense_hash_map usability... no
checking google/dense_hash_map presence... yes
configure: WARNING: google/dense_hash_map: present but cannot be compiled
configure: WARNING: google/dense_hash_map:     check for missing prerequisite headers?
configure: WARNING: google/dense_hash_map: see the Autoconf documentation
configure: WARNING: google/dense_hash_map:     section "Present But Cannot Be Compiled"
configure: WARNING: google/dense_hash_map: proceeding with the compiler's result
configure: WARNING:     ## ------------------------------- ##
configure: WARNING:     ## Report this to [email protected] ##
configure: WARNING:     ## ------------------------------- ##
checking for google/dense_hash_map... no
checking btl_bloomfilter/BloomFilter.hpp usability... no
checking btl_bloomfilter/BloomFilter.hpp presence... yes
configure: WARNING: btl_bloomfilter/BloomFilter.hpp: present but cannot be compiled
configure: WARNING: btl_bloomfilter/BloomFilter.hpp:     check for missing prerequisite headers?
configure: WARNING: btl_bloomfilter/BloomFilter.hpp: see the Autoconf documentation
configure: WARNING: btl_bloomfilter/BloomFilter.hpp:     section "Present But Cannot Be Compiled"
configure: WARNING: btl_bloomfilter/BloomFilter.hpp: proceeding with the compiler's result
configure: WARNING:     ## ------------------------------- ##
configure: WARNING:     ## Report this to [email protected] ##
configure: WARNING:     ## ------------------------------- ##
checking for btl_bloomfilter/BloomFilter.hpp... no
checking zlib.h usability... yes
checking zlib.h presence... yes
checking for zlib.h... yes
configure: error: Requires the Boost C++ libraries, which may
	be downloaded from here: http://www.boost.org/users/download/
	The following commands will download and install Boost:
	cd /usr/local/bin/biobloom
	wget http://downloads.sourceforge.net/project/boost/boost/1.58.0/1_58_0.tar.bz2
	tar jxf 1_58_0.tar.bz2
	cd -

config.log

results affected by FASTA headers

I uncovered an odd bug where the results seem to be affected by the FASTA headers in the reference sequences. Here is a minimal example:

Data

ref.fa:

>ERR365834.4916141      HWI-M01013:41:000000000-A5KUT:1:2107:9861:26375/1
ACCTTAAAGCTTTTTACCCCTCTTAAGTTATATTTCATACAAAACACAACACACACAAAAAATTGCCCCTTAAAGAAAAGTCACAAGATTTAATATATATGCCAAATTATTGAACCTATCATCCTACTATATTATTTGGGAAGTACATTTTAAATGTAACCTTAATTTTTGTAAGTAATTACATAGGTATGCCA
>ERR365834.4916141      HWI-M01013:41:000000000-A5KUT:1:2107:9861:26375/2
TAACAACATCAAATTTGACCACGAGGTCATACCTGAAATGGATTGCTGTGAAACCGGTACAGACTCTTTCACAGCTATTCAGAACTGCACTCAGTTAGCTGATAAAATTTGCAAGTCTTCGTGGGGGCATGCAGATGATGTGCGTATTGACCAATATAG

ref.renamed.fa (same sequences as above, with headers renamed):

>ERR365834.4916141/1
ACCTTAAAGCTTTTTACCCCTCTTAAGTTATATTTCATACAAAACACAACACACACAAAAAATTGCCCCTTAAAGAAAAGTCACAAGATTTAATATATATGCCAAATTATTGAACCTATCATCCTACTATATTATTTGGGAAGTACATTTTAAATGTAACCTTAATTTTTGTAAGTAATTACATAGGTATGCCA
>ERR365834.4916141/2
TAACAACATCAAATTTGACCACGAGGTCATACCTGAAATGGATTGCTGTGAAACCGGTACAGACTCTTTCACAGCTATTCAGAACTGCACTCAGTTAGCTGATAAAATTTGCAAGTCTTCGTGGGGGCATGCAGATGATGTGCGTATTGACCAATATAG

query.fa:

>SRR519624.2022476
TAGTAGGATGATAGGTTCAATAATTTGGCATATATATTAAATCTTGTGACTTTTCTTTAAGGGGCAATTTTTTGTGTGTGTTGTGTTTTGTATGATATAT

Results

Results of querying ref.fa (NO HIT!):

$ samtools faidx ref.fa
$ biobloommaker -k 15 -p ref -f 0.001 -n 100000 ref.fa
$ biobloomcategorizer -d ref -f ref.bf -s 0.2 query.fa 2>&1 >/dev/null | tail -5 | column -t
filter_id   hits  misses  shared  rate_hit  rate_miss  rate_shared
ref         0     1       0       0         1          0
multiMatch  0     1       0       0         1          0
noMatch     1     0       0       1         0          0

Results of querying ref.renamed.fa (HIT (the correct answer!)):

$ samtools faidx ref.renamed.fa
$ biobloommaker -k 15 -p ref.renamed -f 0.001 -n 100000 ref.renamed.fa
$ biobloomcategorizer -d ref.renamed -f ref.renamed.bf -s 0.2 query.fa 2>&1 >/dev/null | tail -5 | column -t
filter_id    hits  misses  shared  rate_hit  rate_miss  rate_shared
ref.renamed  1     0       0       1         0          0
multiMatch   0     1       0       0         1          0
noMatch      0     1       0       0         1          0

Explanation

I suspect that the problem is due to a quirk of samtools faidx and is not BioBloomTools' fault. For example, compare the following two files:

ref.fa.fai:

ERR365834.4916141       159     333     159     160
ERR365834.4916141       159     333     159     160

ref.renamed.fa.fai:

ERR365834.4916141/1     194     21      194     195
ERR365834.4916141/2     159     237     159     160

So it is important to make sure that all of the FASTA IDs for the reference sequences are unique. I think for most users that will be the case, but in my application I am using BioBloomTools to map read pairs to read pairs and this problem crops up.

If the issue can't be fixed, I recommend putting some kind of warning in the README about making sure the FASTA names are unique.

missing source files: Common/HashManager.h, Common/BloomFilterInfo.h

Hi Justin :-)

FYI, I cloned your repo and tried to build. Got the following errors when running make:

g++44 -DHAVE_CONFIG_H -I. -I../../BioBloomMaker -I..  -I../../BioBloomMaker -I../../Common -I../../DataLayer -I..    -g -O2 -MT biobloommaker-BioBloomMaker.o -MD -MP -MF .deps/biobloommaker-BioBloomMaker.Tpo -c -o biobloommaker-BioBloomMaker.o `test -f 'BioBloomMaker.cpp' || echo '../../BioBloomMaker/'`BioBloomMaker.cpp
In file included from ../../BioBloomMaker/BioBloomMaker.cpp:11:
../../BioBloomMaker/BloomFilterGenerator.h:12:32: error: Common/HashManager.h: No such file or directory
../../BioBloomMaker/BioBloomMaker.cpp:12:36: error: Common/BloomFilterInfo.h: No such file or directory
In file included from ../../BioBloomMaker/BioBloomMaker.cpp:11:
../../BioBloomMaker/BloomFilterGenerator.h:37: error: ‘HashManager’ does not name a type
../../BioBloomMaker/BioBloomMaker.cpp: In function ‘int main(int, char**)’:
../../BioBloomMaker/BioBloomMaker.cpp:255: error: ‘BloomFilterInfo’ was not declared in this scope
../../BioBloomMaker/BioBloomMaker.cpp:255: error: expected ‘;’ before ‘info’
../../BioBloomMaker/BioBloomMaker.cpp:259: error: ‘info’ was not declared in this scope
make[2]: *** [biobloommaker-BioBloomMaker.o] Error 1

samtools faidx problem

I'm attempting to run biobloommaker on an indexed fasta file that I created with "samtools faidx ..." using a variation of the following command:

biobloommaker -p prefix indexed_fasta.fa.fai

But I'm getting the following message:

Fasta files must be indexed. Use samtools faidx.

I'm a software engineer by profession working on developing bioinformatics workflow tools for one of NCI's Cancer Genomics Cloud pilot projects, so due to having only a superficial understanding of bioinformatics tools in general, I have a feeling there may be something I've missed somewhere that's causing me to run into this issue. Any suggestions or advice about what I might be doing wrong would be much appreciated. I'd be happy to provide more information upon request.

Issue compiling release 2.3.2

Hello,

We ran into issues trying to compile the files from the 2.3.2 release obtained from github here:
https://github.com/bcgsc/biobloom/releases/tag/2.3.2
We were working off the .tar.gz archive

Here's the error we got:

Making all in BioBloomCategorizer
make[2]: Entering directory `/u/jpl/Downloads/biobloomtools-2.3.2/BioBloomCategorizer'
g++ -DHAVE_CONFIG_H -I. -I..  -I/u/jpl/Downloads/biobloomtools-2.3.2/BioBloomCategorizer -I/u/jpl/Downloads/biobloomtools-2.3.2/Common -I/u/jpl/Downloads/biobloomtools-2.3.2 -I/u/jpl/Downloads/biobloomtools-2.3.2    -std=c++11  -Wall -Wextra -Werror  -fopenmp -g -O2 -MT biobloomcategorizer-Options.o -MD -MP -MF .deps/biobloomcategorizer-Options.Tpo -c -o biobloomcategorizer-Options.o `test -f 'Options.cpp' || echo '/u/jpl/Downloads/biobloomtools-2.3.2/BioBloomCategorizer/'`Options.cpp
Options.cpp:18:1: error: ‘OutType’ does not name a type
 OutType outputType = NONE;
 ^

Compilation went smoothly with code checked out from the master branch so we are using that for now. However, others might run into similar issues and having a specific release version is always nice when the time comes to write a paper :).

Thanks for all the hard work !

Add into Readme best practices for filter creation using set of reads

Users must set -n to an estimate of cardinality.

One option is to run ntCard beforehand to estimate the cardinality.
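
To show why the estimate matters, here is the textbook Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2, for n elements at false positive rate p with an optimal number of hash functions. This is an illustration only: `bloomBits` is a hypothetical name, and the sizing that biobloommaker performs internally may differ, for example when a fixed number of hash functions is used.

```cpp
#include <cmath>
#include <cstdint>

// Bits needed for n elements at false positive rate fpr (0 < fpr < 1),
// assuming the optimal number of hash functions is used.
std::uint64_t bloomBits(std::uint64_t n, double fpr) {
    const double ln2 = std::log(2.0);
    double bits = -static_cast<double>(n) * std::log(fpr) / (ln2 * ln2);
    return static_cast<std::uint64_t>(std::ceil(bits));
}
```

Because the size grows linearly with n, underestimating the cardinality gives a filter that is too small and a false positive rate higher than the one requested. Overestimating it wastes memory.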

Problems compiling BBT on MacOS

Hey @JustinChu,

I'm trying to figure out how to compile BBT in order to update its formula to migrate it from brewsci/homebrew-science to brewsci/homebrew-bio.

I've downloaded the source code of BBT version 2.1.1, however I'm having difficulty compiling it on MacOS.

The error I get on MacOS Sierra is:

In file included from SeqEval.cpp:1:0:
../Common/SeqEval.h: In function 'bool SeqEval::evalSingle(const string&, const BloomFilter&, double, const BloomFilter*)':
../Common/SeqEval.h:59:23: error: invalid conversion from 'const uint64_t* {aka const long long unsigned int*}' to 'const size_t* {aka const long unsigned int*}' [-fpermissive]
   if (filter.contains(*itr)) {
                       ^~~~

Repeated for every instance of converting from const uint64_t to const size_t in that file.

My g++ version is:

> g++ --version
g++ (Homebrew GCC 7.3.0_1) 7.3.0

Do you know if this is a problem with my compiler? Or do you have other suggestions?

Thanks
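
The message itself points to a type mismatch rather than a compiler fault: on this platform uint64_t is `unsigned long long` while size_t is `unsigned long`. Both are 64 bits wide, but they are distinct types, so a pointer to one does not convert implicitly to a pointer to the other. On Linux x86-64 both names usually refer to `unsigned long`, which is why the same code compiles there. A sketch of one portable workaround follows; `toSizeT` is a hypothetical helper, not the fix adopted upstream.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Copy hash values into the element type the callee expects, converting each
// value explicitly instead of reinterpreting the pointer.
std::vector<std::size_t> toSizeT(const std::vector<std::uint64_t> &hashes) {
    std::vector<std::size_t> out;
    out.reserve(hashes.size());
    for (std::uint64_t h : hashes)
        out.push_back(static_cast<std::size_t>(h));
    return out;
}
```

Changing both sides of the interface to the same fixed-width type would avoid the copy. Which of the two fits the code base better is a design choice for the maintainers.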

Create indexable output

Add option to output sam or paf output. There won't be alignment but the data could then be indexed for faster access for specific subsets using existing tools.

Error thrown when k-mers were used with miBF code.

Output from 2.3.2

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::at: __n (which is 18446744073709551615) >= this->size() (which is 150)
Aborted (core dumped)

Possibly nthash related. Does not appear on all reads.
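
The index in the message, 18446744073709551615, is the largest 64-bit unsigned value, which is what an unsigned subtraction such as 0 - 1 produces on a 64-bit build (it is also the value of `std::string::npos` there). The sketch below reproduces that pattern in isolation. It is an assumption about the kind of bug involved, not a diagnosis of the BioBloomTools code, and `wrapsAndThrows` is a hypothetical name.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns true if reading seq at index (pos - 1) throws std::out_of_range.
// With pos == 0 the unsigned subtraction wraps around to the maximum value.
bool wrapsAndThrows(const std::string &seq, std::size_t pos) {
    std::size_t idx = pos - 1; // unsigned arithmetic: 0 - 1 wraps around
    try {
        (void)seq.at(idx); // at() checks bounds, unlike operator[]
    } catch (const std::out_of_range &) {
        return true;
    }
    return false;
}
```

A position that can legitimately be zero, or a search result that is not checked against `npos`, would both lead to this exception on some reads and not on others.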

Adding a license to a repository

Licenses API can't recognize your License.
Could you please add a license to your repository based on the following guide:
https://help.github.com/articles/adding-a-license-to-a-repository/

Detailed progressive bloom filter report

Because -P is mostly used for debugging and diagnosis of progressive read tagging, I should report more information than just the reads for ease of analysis.

A proposed new format for -P option in BBM:
Read header should include:

Number of k-mers included from this read
Number of k-mers already in filter
Number of k-mers of read that match repeat filter (*edit thanks @KristinaGagalova)
Number of read pairs already in filter
Read pairs number in file (index)

Note: Assumes the header has no existing comment lines

e.g. (for a single read pair)

@read/1 20 1289300 0 1244 12002
TGGTGCCCAGCAGCGTTTGTAGCGCAATGAGAATTTGCTGCGTCAGACATTCCTGCACCTGCGGACGCTTGGCAAAGAAATGCACAATGCGGTTAATTTTTGACAGACCGATCACCGAAT
CTTTCGGGATATAGGCCACCGTCGCTTTGC
+
2611023222153/43222062322.422/12303462216553442222514220112424444034412251261012142142123.4232210/0/22222231342242131021
512223302201430131050232254013
@read/2 126 1289300 5 1244 12002
TGGTGCCCAGCAGCGTTTGTAGCGCAATGAGAATTTGCTGCGTCAGACATTCCTGCACCTGCGGACGCTTGGCAAAGAAATGCACAATGCGGTTAATTTTTGACAGACCGATCACCGAAT
CTTTCGGGATATAGGCCACCGTCGCTTTGC
+
2611023222153/43222062322.422/12303462216553442222514220112424444034412251261012142142123.4232210/0/22222231342242131021
512223302201430131050232254013

Suggestions on additional information to be provided are welcome. @KristinaGagalova @sahammond

Note that the changes will be made to the https://github.com/bcgsc/biobloom/tree/ntHashBF branch first. This is the version that uses ntHash and other speed optimizations; in some tests I have run, it was an order of magnitude faster. So far the code seems stable, with no major differences in results between the old and new code.
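
A downstream script could read the proposed header with something like the sketch below. The field order follows the list above and the example "@read/1 20 1289300 0 1244 12002"; the struct and function names are hypothetical, and the format is only a proposal at this point.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// One parsed header line in the proposed -P format (names are illustrative).
struct ProgressiveTag {
    std::string name;            // read name without the leading '@'
    std::uint64_t kmersAdded;    // k-mers included from this read
    std::uint64_t kmersInFilter; // k-mers already in filter
    std::uint64_t repeatKmers;   // k-mers of read that match repeat filter
    std::uint64_t pairsInFilter; // read pairs already in filter
    std::uint64_t pairIndex;     // read pair number in file
};

// Parse "@name n1 n2 n3 n4 n5"; returns false if a field is missing or the
// line does not start with '@'. Assumes the header has no other comment.
bool parseProgressiveHeader(const std::string &line, ProgressiveTag &tag) {
    std::istringstream in(line);
    if (!(in >> tag.name >> tag.kmersAdded >> tag.kmersInFilter >>
          tag.repeatKmers >> tag.pairsInFilter >> tag.pairIndex))
        return false;
    if (tag.name[0] != '@')
        return false;
    tag.name.erase(0, 1);
    return true;
}
```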

paired end mode not working?

Hi,
I'm trying to use your software as described in the example with paired end fq reads:

biobloomcategorizer --fq -e -p test123 -f "mg1655.bf" 18072D-01-03_S71_L006_R1_001_val_1.fq.gz 18072D-01-03_S71_L006_R2_001_val_2.fq.gz -t 16

However then the program just tells me I'm doing something wrong. Without -e option it works, however then I get the matching/non-matching reads in one file, non-interleaved, which is not what I want. I assume paired end information is not used. Is there a bug somewhere that keeps biobloomcategorizer from recognizing 2 files as paired end reads when using the -e option?

Project page link does not link to latest release

Hi, at the top of the github repo, you provide this link:
http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools

Unfortunately, it links to 2.1.1 as being the current release, which is no longer true.
Please update as this is confusing to less savvy users.

Thanks !

Score computation

Hello,

I am concerned about the computation of the score in the function evalSingle (file SeqEval.h). In line 95, should
score += 1 - 1 / (2 * streak);
not be replaced by
score += 1 - 1.0 / (2.0 * streak);
?

Otherwise 1 / (2 * streak) is computed with integer division and is always 0, since streak is unsigned.

On top of that, I do not understand the presence of "2" in the formula, from the score definition you give in the article.

Yours sincerely,

Paula
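
The two expressions quoted above can be compared in isolation. The sketch assumes, as the report states, that `streak` is an unsigned integer of at least 1; the function names are hypothetical.

```cpp
// Term as quoted in the report: 1 / (2 * streak) is integer division,
// so it evaluates to 0 for every streak >= 1 and the term is always 1.
double termInteger(unsigned streak) {
    return 1 - 1 / (2 * streak);
}

// Proposed replacement: floating-point division keeps the fractional part.
double termFloating(unsigned streak) {
    return 1 - 1.0 / (2.0 * streak);
}
```

With integer division every matching k-mer contributes exactly 1 regardless of the streak length. With floating-point division the contribution is 0.5 for a streak of 1 and 0.75 for a streak of 2, and it approaches 1 as the streak grows.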

Filter creation without fai file in Biobloommaker

Either create a fai file automatically or manually parse file if it is missing.

Can't compile on ubuntu 18.04

I'm having trouble compiling biobloom on ubuntu 18.04 with g++ 7.3.0.

Here's the output of make

make  all-recursive
make[1]: Entering directory '/biobloom'
Making all in Common
make[2]: Entering directory '/biobloom/Common'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/biobloom/Common'
Making all in BioBloomMaker
make[2]: Entering directory '/biobloom/BioBloomMaker'
g++ -DHAVE_CONFIG_H -I. -I..  -I/biobloom/BioBloomMaker -I/biobloom/Common -I/biobloom/DataLayer -I/biobloom -I/biobloom   -O3 -std=c++0x  -isystem/biobloom/1_58_0 -Wall -Wextra -Werror -fopenmp -g -O2 -MT biobloommaker-BioBloomMaker.o -MD -MP -MF .deps/biobloommaker-BioBloomMaker.Tpo -c -o biobloommaker-BioBloomMaker.o `test -f 'BioBloomMaker.cpp' || echo '/biobloom/BioBloomMaker/'`BioBloomMaker.cpp
In file included from BloomMapGenerator.h:18:0,
                 from BioBloomMaker.cpp:22:
../bloomfilter/MIBloomFilter.hpp:39:34: error: 'std::vector<std::vector<unsigned int> > parseSeedString(const std::vector<std::__cxx11::basic_string<char> >&)' defined but not used [-Werror=unused-function]
 static vector<vector<unsigned> > parseSeedString(
                                  ^~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors
Makefile:405: recipe for target 'biobloommaker-BioBloomMaker.o' failed
make[2]: *** [biobloommaker-BioBloomMaker.o] Error 1
make[2]: Leaving directory '/biobloom/BioBloomMaker'
Makefile:421: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/biobloom'
Makefile:341: recipe for target 'all' failed
make: *** [all] Error 2

Thanks!
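
GCC's -Wunused-function warns about a non-inline `static` function that is defined but never called in a translation unit, and -Werror turns that warning into the failure shown above. A helper defined in a header is included by files that do not all call it, so one way to avoid the warning is to declare it `inline` instead of `static`. The sketch below uses a hypothetical stand-in, `seedPositions`, rather than the real parseSeedString.

```cpp
#include <string>
#include <vector>

// Hypothetical header helper. Declared inline so that translation units which
// include the header without calling it do not trigger -Wunused-function.
inline std::vector<unsigned> seedPositions(const std::string &seed) {
    std::vector<unsigned> kept;
    for (unsigned i = 0; i < seed.size(); ++i)
        if (seed[i] == '1') // '1' marks a position that is used
            kept.push_back(i);
    return kept;
}
```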

Provide warning regarding filters with only a small number of targets <100 when using miBFs

When only a small number of targets is used, the miBF cannot classify sequences without incurring a large penalty to the FPR. As a result, the occupancy of the miBF has to be turned down for the classification to maintain a decent -s threshold. This should be prevented somehow.

2.0.12 tarball does not include configure

The 2.0.12 tarball does not include configure. I'm okay with that, but you'll need to update the build instructions if it's intentional.

biobloommicategorizer -- terminate called after throwing an instance of 'std::out_of_range'

I'm running into a core dump error while running biobloommicategorizer (v2.3.2). I'm creating a mask like so:

biobloommimaker -p reads cosmic_transcripts.fasta -g 56 -t 8

Then running (as part of the tap pipeline):

biobloommicategorizer -t 8 -f reads.bf --fq -p ./bbt_2.3.2/sample -e sample_R1.fastq.gz sample_R2.fastq.gz

I then run into this error (I've cut some of the output):

Loading sdsl interleaved bit vector from: reads.bf.sdsl
Loading header...
Loaded header... magic: MIBLOOMF hlen: 32 size: 273716032 nhash: 56 kmer: 25
Loading data vector
Bit Vector Size: 1470772224
Popcount: 273716032
...
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::at: __n (which is 18446744073709551615) >= this->size() (which is 148)
Aborted (core dumped)

My fastq files have different read lengths, so I am wondering whether this might be related.

Revise --ordered option when paired matching

--ordered does not play nice with the paired reads option since the greedy algorithm considers only 1 read at a time.

When it ends up matching multiple reads it may assign one read to 2 different filters causing a no-match to occur. This is generally not seen if filters being used are unrelated, but can be an issue if 2 filters have a substantial k-mer overlap.

Make inclusive mode for miBF print out closest ID rather than dominating ID

It might make sense to make default BBT have an option to output single file if many filter targets are used.

Tagging and recruitment based on quality string constraint

Add option to make quality string considered when tagging reads to minimize low quality k-mers.

Possible other options: Any k-mers below a specified quality should not be used when considering matches?
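
A sketch of the second idea, under the assumption that qualities are Phred+33 encoded: only k-mers whose bases all meet a minimum quality are considered. `usableKmerStarts` is a hypothetical name, not an existing BioBloomTools function.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the start positions of all k-mers in which every base has a Phred
// quality of at least minPhred. Assumes Phred+33 encoded quality strings.
std::vector<std::size_t> usableKmerStarts(const std::string &qual,
                                          std::size_t k, int minPhred) {
    std::vector<std::size_t> starts;
    if (k == 0 || qual.size() < k)
        return starts;
    std::size_t low = 0; // low-quality bases inside the current window
    for (std::size_t i = 0; i < qual.size(); ++i) {
        if (qual[i] - 33 < minPhred)
            ++low; // base entering the window
        if (i >= k && qual[i - k] - 33 < minPhred)
            --low; // base leaving the window
        if (i + 1 >= k && low == 0)
            starts.push_back(i + 1 - k);
    }
    return starts;
}
```

The sliding count keeps the work linear in the read length. A single low-quality base removes the k k-mers that overlap it, so the threshold has to be chosen with the read length and k in mind.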

biobloommicategorizer crashes when using an input fasta file asking for an output fastq

Convergence not possible when using probability based binomial scoring

When the minimum FPR is too low for the length of a read and the false positive rate of the Bloom filter, the program aborts. Somewhat related to issue #51.

To resolve this elegantly, I could ignore reads whose lengths cannot fulfil the minimum FPR. However, for the miBF code, some IDs may still be valid at a given minimum FPR for a given read length. To resolve these cases, I could ignore only the affected IDs. However, neither behaviour is likely to be what the user intends.
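
Why a read can be too short follows from a simple bound. Under the assumption that each k-mer is an independent false positive with probability equal to the filter FPR, the smallest false-match probability a read can reach is the one where all of its k-mers match, which is FPR raised to the number of k-mers. If that is still above the requested minimum, no match threshold exists. The sketch below states this check; it is a simplification, not the scoring code used in BBT, and `canReachMinFpr` is a hypothetical name.

```cpp
#include <cmath>
#include <cstddef>

// True if a read of readLen bases can reach minFpr when every one of its
// (readLen - k + 1) k-mers matches. Assumes 0 < filterFpr < 1, 0 < minFpr < 1,
// and independent false positives per k-mer.
bool canReachMinFpr(std::size_t readLen, std::size_t k, double filterFpr,
                    double minFpr) {
    if (k == 0 || readLen < k)
        return false; // no k-mers at all
    double kmers = static_cast<double>(readLen - k + 1);
    // Compare in log space to avoid underflow for long reads.
    return kmers * std::log(filterFpr) <= std::log(minFpr);
}
```

A pre-check of this kind would let the program skip or report such reads instead of aborting.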

Spaced seed classification in Generic Bloom filter

There has been a feature request for spaced seeds to be used in normal BF.

Caveats:
If multiple spaced seeds are used, we can expect memory to increase accordingly to maintain FPR.
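
For readers unfamiliar with spaced seeds, the sketch below applies one seed to a k-mer: positions marked '1' are kept and positions marked '0' are ignored, so k-mers that differ only at ignored positions produce the same key. The seed notation and the name `applySeed` are assumptions for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Keep only the bases at positions where the seed has a '1'.
std::string applySeed(const std::string &kmer, const std::string &seed) {
    if (kmer.size() != seed.size())
        throw std::invalid_argument("k-mer and seed lengths differ");
    std::string masked;
    for (std::size_t i = 0; i < seed.size(); ++i)
        if (seed[i] == '1')
            masked.push_back(kmer[i]);
    return masked;
}
```

Each seed contributes its own set of masked keys to the filter, which is why the number of stored elements, and with it the memory needed for a fixed FPR, grows with the number of seeds.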

During miBF construction automatically adjust occupancy to ensure minimum match length

Options for minimum match length and expected read size would set the occupancy of the filter, but would be superseded by a minimum global occupancy if that parameter is set.

How to create filters for many reference genomes?

Hi,
I want to detect if there is any virus in my sample. I have hundreds of virus reference genomes or sequences. Do I have to build hundreds of filters?
I have two other questions. One is the difference between "shared" and "multiMatch" in the summary file; the other is that when I build a filter of the human reference genome, I get the following message:

The ratio between redundant k-mers and unique k-mers is approximately: 0.300735
Consider checking your files for duplicate sequences and adjusting them accordingly.
High redundancy will cause filter sizes used overestimated, potentially resulting in a larger than needed filter.
Alternatively you can set the number of elements wanted in the filter with (-n) and ignore this message

Can this information be ignored?
Thanks!

Use ntHash in Main program branch

ntHash has been used in some branches with great success in reducing run time, particularly for filter creation.

Incorporation of the new hash function will not be backward compatible with old versions of the code.

Gzip compression is very slow

The compression option has been reported as an order of magnitude slower than without. I am probably using zlib in a very inefficient way. Profiling and optimization for this option is needed.

Question for installation

Hi, Justin Chu
When I make the biobloom, it has the following error:

MIBFGen.hpp:834:46: error: use of deleted function std::basic_ofstream::basic_ofstream(const std::basic_ofstream&)...
/usr/include/c++/4.8.2/fstream:599:11: error: use of deleted function std::basic_ostream::basic_ostream(const std::basic_ostream&)....
.....

How do I solve this problem?
Thanks!
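
The truncated log does not show what line 834 of MIBFGen.hpp does, but this message typically appears when a `std::ofstream` is copied or moved. The standard library shipped with GCC 4.8 does not provide move constructors for stream classes (they arrived in GCC 5), so code that returns or stores streams by value fails there even with -std=c++11. Building with a newer compiler is the simplest way out. A sketch of a pattern that compiles on both follows, with `openOutputs` as a hypothetical helper.

```cpp
#include <fstream>
#include <memory>
#include <string>
#include <vector>

// Open one output file per path. The streams are held through unique_ptr, so
// the vector can be returned and resized without copying or moving a stream.
std::vector<std::unique_ptr<std::ofstream> >
openOutputs(const std::vector<std::string> &paths) {
    std::vector<std::unique_ptr<std::ofstream> > outs;
    for (const std::string &p : paths)
        outs.push_back(
            std::unique_ptr<std::ofstream>(new std::ofstream(p.c_str())));
    return outs;
}
```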

Fails with error: invalid variable name: `–-with-boost'

Is the --with-boost option actually implemented in configure? I will post if I find a solution

erenna:[/usr/local/bin/biobloom]$ sudo ./configure –-with-boost=/usr/lib/x86_64-linux-gnu/ --prefix=/usr/local/bin/biobloom/bin && sudo make install
configure: error: invalid variable name: `–-with-boost'
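
In the command above, the first option begins with an en dash followed by a hyphen ("–-") rather than two hyphens. An argument that does not start with a hyphen and contains "=" is read by configure as a VAR=value assignment, which would explain the "invalid variable name" error. Retyping the option as --with-boost should at least remove this particular message. The check below is a small sketch for spotting such characters in an argument; `startsWithUnicodeDash` is a hypothetical name and the input is assumed to be UTF-8.

```cpp
#include <string>

// True if the argument starts with a UTF-8 encoded en dash (U+2013) or
// em dash (U+2014), which word processors often substitute for "--".
bool startsWithUnicodeDash(const std::string &arg) {
    return arg.size() >= 3 && arg[0] == '\xE2' && arg[1] == '\x80' &&
           (arg[2] == '\x93' || arg[2] == '\x94');
}
```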

Add multiIndex Bloom Filter code to a release version

MultiIndex Bloom filters are like counting Bloom filters but store values rather than counts. Unlike Bloomier filters, they do not use perfect hashing. Depending on the k used, they are more space efficient than a hash table, but may not be an ideal data structure for one-to-many classifications because of value storage redundancy.
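
The idea of storing values in place of bits or counts can be illustrated with a toy structure. This is a deliberately simplified sketch and not the miBF implementation: the class name, the hashing scheme (`std::hash` with a per-hash suffix), the first-writer-wins collision rule and the majority vote are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy value-storing filter: each key writes its ID into several slots.
// ID 0 is reserved to mean "empty".
class ToyValueFilter {
public:
    ToyValueFilter(std::size_t slots, unsigned hashes)
        : m_ids(slots == 0 ? 1 : slots, 0), m_hashes(hashes) {}

    void insert(const std::string &kmer, std::uint32_t id) {
        for (unsigned i = 0; i < m_hashes; ++i) {
            std::uint32_t &slot = m_ids[position(kmer, i)];
            if (slot == 0)
                slot = id; // first writer wins; collisions keep the old ID
        }
    }

    // Returns 0 if any slot is empty (the key was never inserted),
    // otherwise the ID found in the most slots.
    std::uint32_t query(const std::string &kmer) const {
        std::map<std::uint32_t, unsigned> votes;
        for (unsigned i = 0; i < m_hashes; ++i) {
            std::uint32_t id = m_ids[position(kmer, i)];
            if (id == 0)
                return 0;
            ++votes[id];
        }
        std::uint32_t best = 0;
        unsigned bestVotes = 0;
        for (const auto &v : votes)
            if (v.second > bestVotes) {
                best = v.first;
                bestVotes = v.second;
            }
        return best;
    }

private:
    // Derive the i-th slot by hashing the key with a per-hash suffix.
    std::size_t position(const std::string &kmer, unsigned i) const {
        return std::hash<std::string>()(kmer + static_cast<char>('A' + i)) %
               m_ids.size();
    }

    std::vector<std::uint32_t> m_ids;
    unsigned m_hashes;
};
```

Each key stores its ID once per hash function, which is the value storage redundancy mentioned above. A key that belongs to several targets has no natural place for more than one ID, which is why one-to-many classification is awkward.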

Update Linuxbrew version to 2.1.3

When 2.1.3 is released, installation recipe should be updated in linuxbrew.

Sdsl issue ?

Hi, I'm just an undergraduate student dipping his toes in bioinformatics and I must make this run for an obligatory assignment.
However, I have an issue that I haven't seen reported here, so I am assuming that I am missing something obvious.
I run ./configure without any problems however when I run make I run into the following issue :
The program can't seem to find the sdsl files even though I have installed them properly. How can I solve this ?

Thanks in advance !

Run ntcard automatically before creating filter

This would add a dependency on ntCard, but would make filter creation more consistent across different types of input to biobloommaker.

Fails to build on Yosemite with clang

Fails to build on Yosemite with clang.

clang++ -DHAVE_CONFIG_H -I. -I..  -I../BioBloomCategorizer -I../Common -I../DataLayer -I.. -I/private/tmp/biobloomtools-6ARKcS/biobloomtools-2.0.6   -isystem/private/tmp/biobloomtools-6ARKcS/biobloomtools-2.0.6/1_55_0 -Wall -Wextra -Werror -Wno-unknown-pragmas -g -O2 -c -o biobloomcategorizer-BioBloomClassifier.o `test -f 'BioBloomClassifier.cpp' || echo './'`BioBloomClassifier.cpp
In file included from ResultsManager.cpp:8:
./ResultsManager.h:24:32: error: reference to 'shared_ptr' is ambiguous
                        const unordered_map<string, shared_ptr<MultiFilter> > &filtersRef,
                                                    ^

Add option to exclude output of certain filters

Hi,

One of our use cases involves running a bunch of filters to detect non human reads in our samples. This involves running the sample with a bunch of microbial filters and a human filter. As we expect, the vast majority end up in the human bin which is of no interest in this case.

Could we get an option to exclude the hits that hit a certain filter from the output?

Improve behavior for progressive mode repeat/subtractive filter

Due to false positives in the subtractive filter, false recruitment terminations can occur. This is not that evident when using small -r values (0.1-0.3), but should affect any large -r values and cases where -r is an integer.

Why this is bad: we will obliterate any k-mers that fall within 1/(FPR) bp. For Kollector, if we have a large genic space and use stringent population strategies (large -r values), it will cause an abrupt termination of k-mers every 1/(FPR) bases. We may compensate by bridging across these gaps using paired-end information, but I suspect these falsely terminating k-mers inhibit the filling of these gaps, promoting some off-target behavior.

A better scheme would be to:

store subtractive filter as an exact data-structure (hash table), or
change the scoring algorithm in progressive mode, or
treat repeat filter hits more leniently (not as non-matches if they are near non-repeat hits).
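
The "every 1/(FPR) bases" figure above can be made concrete with a small calculation. It assumes that each k-mer position is an independent false positive in the subtractive filter with probability equal to its FPR; the function names are hypothetical.

```cpp
#include <cmath>

// Expected number of bases between two false hits in the subtractive filter.
double expectedSpacing(double fpr) {
    return 1.0 / fpr;
}

// Probability that a stretch of the given number of bases contains at least
// one false hit.
double probFalseHitWithin(double fpr, double bases) {
    return 1.0 - std::pow(1.0 - fpr, bases);
}
```

For example, at an FPR of 0.001 a false hit is expected about every 1000 bases, and a 1000-base stretch has roughly a 63% chance of containing at least one. A target region much longer than 1/(FPR) will therefore almost always be interrupted.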

intermittent I/O errors when writing large files (patch included)

I have observed frequent I/O errors when writing large files with BBT. Upon reading the saved file, I will see errors like:

bin.seed_mp.bf does not match size given by its information file. Size: 25862192416 vs 12977290528 bytes.

It looks like some kind of bug with the C++ library implementation on our HPC cluster (CentOS 5 cluster using GPFS).

I have observed that changing the C++ I/O calls to equivalent C I/O calls solved the problem. Unfortunately github does not allow you to attach files with the .patch or .diff extensions. Instead, here is the patch pasted inline:

diff --git a/Common/BloomFilter.cpp b/Common/BloomFilter.cpp
index e459290..6e4fd4d 100644
--- a/Common/BloomFilter.cpp
+++ b/Common/BloomFilter.cpp
@@ -148,16 +148,16 @@ bool BloomFilter::contains(const unsigned char* kmer) const
  */
 void BloomFilter::storeFilter(string const &filterFilePath) const
 {
-       ofstream myFile(filterFilePath.c_str(), ios::out | ios::binary);
+       FILE *out = fopen(filterFilePath.c_str(), "wb");
+       assert(out != NULL);

        cerr << "Storing filter. Filter is " << m_sizeInBytes << "bytes." << endl;

-       assert(myFile);
-       //write out each block
-       myFile.write(reinterpret_cast<char*>(m_filter), m_sizeInBytes);
+       fwrite((const void*)m_filter, sizeof(char), m_sizeInBytes, out);
+       if (ferror(out))
+               perror("Error writing file");

-       myFile.close();
-       assert(myFile);
+       fclose(out);
 }

 unsigned BloomFilter::getHashNum() const

You may copy the above text to a file called io.patch and then apply it in the root BBT directory with:

$ patch -p1 < io.patch

Or I can just push the commit to develop if you like. (I have been using this change for quite a while without any problems.)
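
One possible refinement of the patch, sketched here as a standalone function rather than as a change to BloomFilter.cpp: `assert` is compiled out when NDEBUG is defined, and `ferror` alone does not reveal how much was written, so the sketch compares the element count returned by `fwrite` with the requested size and reports failure through the return value. `storeBytes` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Write size bytes to path using C stdio. Returns false if the file cannot
// be opened, if fewer bytes than requested were written, or if closing the
// file (which flushes buffered data) fails.
bool storeBytes(const std::string &path, const unsigned char *data,
                std::size_t size) {
    std::FILE *out = std::fopen(path.c_str(), "wb");
    if (out == NULL)
        return false;
    std::size_t written = std::fwrite(data, 1, size, out);
    bool ok = (written == size);
    if (std::fclose(out) != 0)
        ok = false;
    return ok;
}
```

Checking the result of `fclose` matters on network file systems, where a write error may only surface when buffered data is flushed.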

error: 'citycrc.h' file not found

clang++ -DHAVE_CONFIG_H -I. -I..  -I.. -I/private/tmp/biobloomtools-s1EJOa/biobloomtools-2.0.6   -isystem/private/tmp/biobloomtools-s1EJOa/biobloomtools-2.0.6/1_55_0 -Wall -Wextra -Werror -g -O2 -c -o libcommon_a-Dynamicofstream.o `test -f 'Dynamicofstream.cpp' || echo './'`Dynamicofstream.cpp
city.cc:498:10: fatal error: 'citycrc.h' file not found
#include <citycrc.h>
         ^

error: required file './COPYING' not found

==> ./autogen.sh
+ aclocal
+ autoconf
+ autoheader
+ automake -a
configure.ac:16: installing './compile'
configure.ac:6: installing './install-sh'
configure.ac:6: installing './missing'
BioBloomCategorizer/Makefile.am: installing './depcomp'
Makefile.am: installing './INSTALL'
Makefile.am: error: required file './COPYING' not found
/usr/local/Library/Homebrew/debrew.rb:11:in `raise'
BuildError: Failed executing: ./autogen.sh