I uncovered an odd bug where the results seem to be affected by the FASTA headers in the reference sequences. Here is a minimal example:
Data
ref.fa:
>ERR365834.4916141 HWI-M01013:41:000000000-A5KUT:1:2107:9861:26375/1
ACCTTAAAGCTTTTTACCCCTCTTAAGTTATATTTCATACAAAACACAACACACACAAAAAATTGCCCCTTAAAGAAAAGTCACAAGATTTAATATATATGCCAAATTATTGAACCTATCATCCTACTATATTATTTGGGAAGTACATTTTAAATGTAACCTTAATTTTTGTAAGTAATTACATAGGTATGCCA
>ERR365834.4916141 HWI-M01013:41:000000000-A5KUT:1:2107:9861:26375/2
TAACAACATCAAATTTGACCACGAGGTCATACCTGAAATGGATTGCTGTGAAACCGGTACAGACTCTTTCACAGCTATTCAGAACTGCACTCAGTTAGCTGATAAAATTTGCAAGTCTTCGTGGGGGCATGCAGATGATGTGCGTATTGACCAATATAG
ref.renamed.fa (same sequences as above, with headers renamed):
>ERR365834.4916141/1
ACCTTAAAGCTTTTTACCCCTCTTAAGTTATATTTCATACAAAACACAACACACACAAAAAATTGCCCCTTAAAGAAAAGTCACAAGATTTAATATATATGCCAAATTATTGAACCTATCATCCTACTATATTATTTGGGAAGTACATTTTAAATGTAACCTTAATTTTTGTAAGTAATTACATAGGTATGCCA
>ERR365834.4916141/2
TAACAACATCAAATTTGACCACGAGGTCATACCTGAAATGGATTGCTGTGAAACCGGTACAGACTCTTTCACAGCTATTCAGAACTGCACTCAGTTAGCTGATAAAATTTGCAAGTCTTCGTGGGGGCATGCAGATGATGTGCGTATTGACCAATATAG
query.fa:
>SRR519624.2022476
TAGTAGGATGATAGGTTCAATAATTTGGCATATATATTAAATCTTGTGACTTTTCTTTAAGGGGCAATTTTTTGTGTGTGTTGTGTTTTGTATGATATAT
Results
Results of querying ref.fa (NO HIT!):
$ samtools faidx ref.fa
$ biobloommaker -k 15 -p ref -f 0.001 -n 100000 ref.fa
$ biobloomcategorizer -d ref -f ref.bf -s 0.2 query.fa 2>&1 >/dev/null | tail -5 | column -t
filter_id hits misses shared rate_hit rate_miss rate_shared
ref 0 1 0 0 1 0
multiMatch 0 1 0 0 1 0
noMatch 1 0 0 1 0 0
Results of querying ref.renamed.fa (HIT (the correct answer!)):
$ samtools faidx ref.renamed.fa
$ biobloommaker -k 15 -p ref.renamed -f 0.001 -n 100000 ref.renamed.fa
$ biobloomcategorizer -d ref.renamed -f ref.renamed.bf -s 0.2 query.fa 2>&1 >/dev/null | tail -5 | column -t
filter_id hits misses shared rate_hit rate_miss rate_shared
ref.renamed 1 0 0 1 0 0
multiMatch 0 1 0 0 1 0
noMatch 0 1 0 0 1 0
Explanation
I suspect that the problem is due to a quirk of samtools faidx
and is not BioBloomTools' fault. For example, compare the following two files:
ref.fa.fai:
ERR365834.4916141 159 333 159 160
ERR365834.4916141 159 333 159 160
ref.renamed.fa.fai:
ERR365834.4916141/1 194 21 194 195
ERR365834.4916141/2 159 237 159 160
So it is important to make sure that all of the FASTA IDs for the reference sequences are unique. I think for most users that will be the case, but in my application I am using BioBloomTools to map read pairs to read pairs and this problem crops up.
If the issue can't be fixed, I recommend putting some kind of warning in the README about making sure the FASTA names are unique.