
Comments (55)

lczech avatar lczech commented on June 14, 2024 5

Okay, just a few tiny (but important) changes brought the run time from 16s to 2s on my machine:

  • The functions hash_insert and particularly find_variant_matches took a TwobitVector copy instead of a const&. Thus, the vectors were copied in every iteration (that is, for every microvariant!). Changing this to TwobitVector const& seq gives a speedup.
  • The iterators in generate_variants used post-increment instead of pre-increment, making yet another copy of the vector in every iteration. Changing them from its++ to ++its yields even more speedup.
  • Lastly, for good measure, my pre-increment implementation also returned an unnecessary copy instead of a reference, as it should. So, in all three iterators, changing self_type operator ++ () to self_type& operator ++ () eliminated the last copy of the vectors per microvariant.

As always, the devil is in the detail... So now, instead of 3 unneeded copies of the TwobitVector per microvariant, 0 copies are made.
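For illustration, here is a minimal, self-contained sketch of the three fixes using a toy stand-in type (the names TwobitVector, hash_words and VariantIterator here are placeholders, not swarm's actual signatures):

#include <cstdint>
#include <vector>

// Toy stand-in for TwobitVector, only to make the example self-contained.
struct TwobitVector {
    std::vector<uint64_t> words;
};

// Fix 1: take a const reference (was: a TwobitVector copy per call).
uint64_t hash_words(TwobitVector const& seq) {
    uint64_t h = 0;
    for (size_t i = 0; i < seq.words.size(); ++i) {
        h ^= seq.words[i];
    }
    return h;
}

// Fix 3: pre-increment returns a reference to *this (was: an extra copy).
struct VariantIterator {
    TwobitVector vec;
    size_t pos;

    explicit VariantIterator(TwobitVector const& v) : vec(v), pos(0) {}

    VariantIterator& operator++() {   // was: VariantIterator operator++()
        ++pos;                        // advance to the next microvariant position
        return *this;
    }
};

int main() {
    TwobitVector v;
    v.words.push_back(0x0123456789abcdefULL);

    VariantIterator it(v);
    ++it;                 // Fix 2: pre-increment avoids the temporary that it++ creates
    (void) hash_words(it.vec);
    return 0;
}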

However, I still do not understand why this was not optimized by the compiler. At least the last two optimizations should be trivial for any decent compiler. My guess is that this is an issue with Link Time Optimization. I tried the respective flag (-flto), but it didn't do much. Still, maybe the Makefile can be adapted to allow for more optimization. Not sure what needs to be done though.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024 3

As a matter of fact, I have a test checking that all possible microvariants are produced for a given nucleotide sequence. You are right, it might not cover all cases for all possible sequences though.

One advantage is that swarm's code and my tests are written by two different persons (each person assumptions will vary, which can lead to the discovery of bugs or ambiguities). Having a third person writing tests would be even better. So, if you want to contribute unit-tests, please join the fun :-)

from swarm.

torognes avatar torognes commented on June 14, 2024 3

I've made some further improvements to the code in the dev branch. With one thread, it is now almost 5x faster than the code in the master branch.

It is now using a small blocked Bloom filter also outside of Fastidious mode, in front of the hashing. It uses only 2-4 bytes per sequence, but speeds things up considerably. This improved Bloom filter variant can probably be used for fastidious mode as well for a little speedup.

I am very happy it has worked out so well and must admit that my concerns about the complexity of the bit manipulations have been shown to not be a problem. The reduced size of the data and the fast hash updates have clearly been very beneficial.

I see many hash collisions, probably due to substitutions spaced 32 nucleotides apart that generate the same hash value, but they are not a big problem. I do need to mix the bits in the hash a bit before it is used as a hash table index or Bloom bitmap index, though. I have experimented with a Zobrist hash function that uses a random 64-bit pattern for each base at each position, XOR'ed together. It generates very nice hashes with no collisions at all, but it is overall still a little bit slower because the hashes are a bit more time-consuming to compute.
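For reference, a minimal sketch of the Zobrist scheme described here, with one random 64-bit pattern per (position, base) XOR'ed together; the names and table layout are illustrative, not swarm's actual implementation:

#include <array>
#include <cstdint>
#include <random>
#include <vector>

// Zobrist hashing sketch: table[position][base] holds a fixed random pattern.
struct ZobristHasher {
    std::vector<std::array<uint64_t, 4>> table;

    explicit ZobristHasher(size_t max_length) : table(max_length) {
        std::mt19937_64 rng(42);            // fixed seed, so hashes are reproducible
        for (auto& pos : table)
            for (auto& x : pos)
                x = rng();
    }

    // seq[i] holds a 2-bit base code (0-3).
    uint64_t hash(std::vector<uint8_t> const& seq) const {
        uint64_t h = 0;
        for (size_t i = 0; i < seq.size(); ++i)
            h ^= table[i][seq[i]];
        return h;
    }

    // A substitution only flips two patterns, so the hash is updated in O(1).
    // Insertions and deletions shift positions, so they cost more to update.
    uint64_t update_substitution(uint64_t h, size_t pos,
                                 uint8_t old_base, uint8_t new_base) const {
        return h ^ table[pos][old_base] ^ table[pos][new_base];
    }
};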

from swarm.

torognes avatar torognes commented on June 14, 2024 3

Added fastidious support to the Zobrist branch and improved performance. Fastidious mode seems to be about 6 times faster than earlier while ordinary clustering is about 9 times faster than earlier. Fastidious mode still requires large amounts of memory to store the Bloom filter. The amount of memory used can be adjusted with the --bloom-bits or --ceiling options as before, but small Bloom filters will degrade speed substantially.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024 3

Awesome work @torognes and @lczech !

I am going to close this thread to reduce the list of open issues. Feel free to open it again if need be. We are getting closer to releasing swarm 3.0 and to a first draft of the publication (more on that soon).

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024 2

Some benchmark results:

# download
DATASET="BioMarKs.fasta.bz2"
[[ ! -e ${DATASET} ]] && \
    wget https://github.com/torognes/vsearch-eval/raw/master/derep/data/${DATASET}

# decompress (to avoid adding decompression time to the clustering)
[[ ! -e ${DATASET/.bz2/} ]] && \
    bunzip2 -k ${DATASET}

# install Lucas' version
(git clone https://github.com/lczech/swarm.git
 cd swarm/src/
 make -j
 cp ./swarm ../../swarm3)

# clustering
for i in {1..3} ; do
    THREADS=1
    for SWARM in ./swarm3 $(which swarm) ; do
        /usr/bin/time \
            -f "\t%e s, %M kilobytes" \
            "${SWARM}" \
            -t ${THREADS} \
            -o /dev/null \
            -l /dev/null \
            -z BioMarKs.fasta 
    done
done

with @lczech's optimizations, swarm now takes 64% less time (72 s against 200 s to cluster BioMarKs three times). That's nearly 3 times faster!

When using two threads, swarm 2 is twice as fast, but swarm 3 shows no gain.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024 2

@lczech I compared to the latest swarm release (the master branch, not the dev branch). So that's really cool. I figured that multithreading was not implemented yet. Regarding the maximum memory footprint, halving the memory is already a cool feat! I don't think a 4-fold reduction can be achieved though (compressed sequences end with padding, which should eat up a significant amount of memory in the end). If the memory gain and speedup were both 3x, that would be a cool marketing argument:

swarm 3 is 3 times faster and 3 times leaner!

from swarm.

torognes avatar torognes commented on June 14, 2024 2

I've just pushed a new branch called zobrist to the repo. I have made a number of changes there.

It uses Zobrist hashing as mentioned earlier and uses the computed hash for the Bloom filter and the hash table. This hash function results in no or extremely few collisions. When swarm now generates variants, it does not build the actual sequences; it only computes the resulting hash. The hash and the variant info (seed, variant type, position and base) are saved, but not the sequence, because it is not really needed. That makes it faster. When checking successful hits in the hash table, which are relatively few, the two amplicons are compared taking the variant info into account.
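A minimal sketch of what such a per-variant record could look like (field names and types are assumptions for illustration, not swarm's actual definitions):

#include <cstdint>

// Enough information to verify a hit against its seed later, without ever
// materialising the variant's sequence.
enum class VariantType : uint8_t { Substitution, Deletion, Insertion };

struct VariantInfo {
    uint64_t    hash;      // Zobrist hash of the (never built) variant sequence
    uint64_t    seed;      // index of the seed amplicon the variant derives from
    VariantType type;      // kind of microvariant
    uint32_t    position;  // position in the seed where the change applies
    uint8_t     base;      // substituted or inserted base (2-bit code); unused for deletions
};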

I found a better way to parallelise the code. Swarm now first generates a complete sequence network containing the links between all amplicons with a single nucleotide difference (neighbours). On average there have been about 3 neighbours per sequence in the datasets I have used. These are now very efficiently computed in parallel and stored. When the complete network is finished, the Swarm algorithm can quite quickly build the swarms from that information without parallelisation.

Some other minor improvements have been made as well, e.g. more efficient extraction of the abundance value from the headers (the regular expression matching was very slow).

Fastidious mode is not working yet.

In addition to using little memory, Swarm has now become extremely fast. I have tested it on the EMP small dataset used in the Swarm v2 paper, with 155 million sequences. The old version used 1 hour and 45 minutes with 8 threads. On the same hardware, still with 8 threads, it now uses just 16 minutes. It uses 32 GB of memory (41 GB earlier).

from swarm.

lczech avatar lczech commented on June 14, 2024 1

Nice! I thought about doing that as well, and had a look at the swarm code, but was not entirely sure where to start...

Anyway, in case you want to re-use some of the code that I already made for two-bit vectors, have a look at this class and this set of functions: https://github.com/lczech/genesis/tree/master/lib/genesis/utils/math/twobit_vector

Particularly the iterators for insertions/deletions/substitutions might be useful, as they use some bit tricks to quickly go through the different two-bit variants representing different nucleotides (the core of swarm), and efficiently update the hash. Note: As the two-bit representation is dense, I decided to go with a hash representation that is more or less just the xor-ed 64bit words of the sequence. It is awesomely quick, and has enough entropy to avoid most collisions.

There is also test code that shows how the class and the functions are used: https://github.com/lczech/genesis/blob/master/test/src/utils/math/twobit_vector.cpp

Let me know if any of this seems useful, or if you have questions. And feel free to use it as needed.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024 1

Thanks @lczech for this detailed answer, and for the code :-) An even faster and leaner swarm will be very useful for the UniEuk project. I'll let you and @torognes have a Skype call if you need to discuss implementation details.

Regarding the 2-bit vector unpacking, that should be useful only when outputting the representative sequences for each OTU (any d), or for the d > 1 pairwise sequence comparisons. So the gain may be small in the end.

For packing (converting nucleotides into a bit-vector), there does not seem to be any gain in using anything other than a lookup table for each nucleotide. I think this is what @torognes is using.
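For illustration, a minimal sketch of such lookup-table packing, assuming the A=00, C=01, G=10, T=11 encoding discussed in this thread (names and table contents are illustrative, not swarm's actual code):

#include <array>
#include <cstdint>
#include <string>
#include <vector>

// 256-entry table mapping each nucleotide character to its 2-bit code;
// anything unexpected maps to 0 (treated as 'A').
static std::array<uint8_t, 256> build_code_table() {
    std::array<uint8_t, 256> t{};
    t['C'] = t['c'] = 1;
    t['G'] = t['g'] = 2;
    t['T'] = t['t'] = 3;
    return t;
}

std::vector<uint64_t> pack_sequence(std::string const& seq) {
    static const std::array<uint8_t, 256> code = build_code_table();
    std::vector<uint64_t> words((seq.size() + 31) / 32, 0);  // zero-padded 64-bit words
    for (size_t i = 0; i < seq.size(); ++i) {
        auto const c = static_cast<unsigned char>(seq[i]);
        words[i / 32] |= static_cast<uint64_t>(code[c]) << ((i % 32) * 2);
    }
    return words;
}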

from swarm.

lczech avatar lczech commented on June 14, 2024 1

Okay, quick update. I copied the files to my fork of swarm, and changed the internals so that they fit into swarm. That is, the code is self-contained now, and license, namespaces, header guards etc are now fit for swarm. See here: https://github.com/lczech/swarm/tree/master/src/twobit_vector

Furthermore, I updated the Makefile, so that the new code gets compiled. Lastly, I wrote some examples showing how to use the classes and functions. Those are currently also compiled in the Makefile, but this can be removed later of course.

In my original code base, I also have tests. How do you do tests in swarm? I didn't see any testing environment in the repository.

EDIT: I forgot to mention: In the Makefile, I now set the standard to C++11, as this is what my code expects. The rest of swarm compiles fine with it, so I hope that this is an acceptable change.

from swarm.

torognes avatar torognes commented on June 14, 2024 1

Yes, I have seen your swarm fork. I will merge those changes directly into the main swarm repo.

That is, into the dev branch for now.

from swarm.

lczech avatar lczech commented on June 14, 2024 1

Wow, those are some promising results! So, to get this correct: the twobit version is 3x faster than the normal one? Or did you compare to @torognes' first twobit version with those unnecessary vector copies?

Also, the twobit version currently does not use multiple threads, so there is no gain to be expected. @torognes wanted to keep it simple first, I think.

The resident size is only half as much? Shouldn't it be closer to a quarter? But maybe there are some fixed memory allocations etc. that do not scale. And also, maybe this is due to the database storing the sequences in a different format before copying them to twobit vectors? I didn't look at that code, so maybe @torognes can explain this better.

from swarm.

lczech avatar lczech commented on June 14, 2024 1

Haha, that would be a nice marketing slogan!

The BioMarKs sequences are 382 nt on average. Each 64-bit word (which is the padding size) can store 32 nt. Thus, on average, BioMarKs needs 12 words to store a sequence. Padding adds at most 1 word to this, so padding should account for less than 10% of the memory on this dataset. It is of course more significant for shorter sequences.

Edit: If we want to get really fancy, we could change the way the data is stored in TwobitVector. Currently, we use an std::vector (which usually needs 24 bytes) and a size_t (8 bytes) for the bookkeeping - that is, on top of the actual data. With some careful adaptations, we could bring this down to a pointer and a size, that is, 16 bytes instead of 32. This would however require a bit more math when computing sizes and positions in the vector. So, right now the BioMarKs data needs 12*8 bytes of data + 32 bytes of bookkeeping = 128 bytes per sequence. This could get down to 112 bytes. Not sure if needed though.
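A sketch of what that leaner layout could look like (purely illustrative; not part of swarm or genesis):

#include <cstdint>
#include <memory>

// Pointer + length only: 16 bytes of bookkeeping per sequence instead of
// std::vector (24 bytes) plus a separate size_t (8 bytes).
class CompactTwobitVector {
public:
    explicit CompactTwobitVector(size_t length)
        : length_(length)
        , words_(new uint64_t[(length + 31) / 32]())  // value-initialised, i.e. zero padding
    {}

    size_t size() const { return length_; }                    // in nucleotides
    size_t word_count() const { return (length_ + 31) / 32; }  // derived, not stored

    uint8_t get(size_t i) const {
        return static_cast<uint8_t>((words_[i / 32] >> ((i % 32) * 2)) & 3);
    }

private:
    size_t                      length_;  // 8 bytes
    std::unique_ptr<uint64_t[]> words_;   // 8 bytes
};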

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024 1

Reducing the memory footprint is important. To give you an example, I am starting to deal with +100 GB fasta files (of unique sequences), and swarm 2.0 takes roughly 150 GB of memory to process that. The cluster I am working on is upgrading: most nodes will now be equipped with two Intel Xeon Gold 6126 processors (12 cores each) and 96 GB RAM. If swarm's footprint could be reduced significantly (by 50% or even 66%), it would allow me to perform most of my work using these stock nodes. However, in the end we will probably have to find a balance between speed and footprint.

I am looking forward to reading about your profiling results, @lczech!

from swarm.

torognes avatar torognes commented on June 14, 2024 1

Have tested the new code now and committed the last changes into the dev branch. It seems to work excellently! The twobit strategy really paid off!

I agree regarding splitting the insertion and deletion variants into three or four, so that each iterator could go through the whole sequence and generate variants of the specified type. We would then end up with up to 7 or 8 threads in total (1 for deletions, 3 for substitutions and 3 or 4 for insertions).

I did a simple profiling experiment with the whole BioMarKs file. 85% of the time is now spent generating variants and finding matches to them. 1/4 of that time is spent generating them and 3/4 on finding matches. I will look into how we could possibly improve this even more.

In addition to parallelization, the fastidious mode also needs updating.

from swarm.

torognes avatar torognes commented on June 14, 2024

Performance is likely to improve, at least for d=1. We can perhaps handle databases 4 times as large.

from swarm.

torognes avatar torognes commented on June 14, 2024

Hashing sequences will be considerably faster and it will require less memory. Seems like a reasonable feature to implement.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Hi @torognes , hi @lczech,

I was uncertain whether working with compressed sequences was feasible, because I could not see how to distinguish padding bits from shorter or longer sequences. Let me explain.

Let A = 00, C = 01, G = 10 and T = 11. Inside a byte, we can now store 4 nucleotides, and inside a machine word, we can now store 32 nucleotides (on a 64-bit architecture). But what if the length of the sequence we are compressing is not a multiple of 32 (or 4)? The last byte or the last word will only be partially filled, and we will have to pad it with zeros.

If we do so, how can we distinguish between ACG (00011000) and ACGA (00011000)? We can keep track of the actual length of each sequence, but does that mean we have to uncompress the sequence before passing it to the hash function? @lczech suggested also passing the sequence length to the hash function. I think that completely removes the ambiguity introduced by the padding and makes each sequence unique again.

If we do so, the sequences will be compressed when read the first time, and decompressed at the very end, only if the -z output option is used. Microvariant generation, hashing, and sequence equality checks would be computed on the shorter, compressed sequences and would likely be much faster.

from swarm.

lczech avatar lczech commented on June 14, 2024

Hi all,

the padding is no issue for storing or comparing sequences, but needs to be kept in mind for hashing them. Storage inside a class encapsulates the details, so that the outside world only sees normal vector/array methods anyway.

@frederic-mahe already mentioned that for hashing, the size of the sequence can be used. My prototype simply calculates the xor of the size with all 64-bit words. Of course, this will not give a robust hash if there are only very short sequences (<32 nt). For example, AA and T will both have the hash 10 (if the word is filled from the right).
However, comparing the sequences in order to resolve such hash collisions will be fast (word-wise comparison), and most sequences will be longer in practice. For longer sequences, the whole hash word is used, so I suspect that it will make good use of the hash space. Needs to be tested.
This is of course far from cryptographic, but for the intended use case, it might work well (and fast!).
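A minimal sketch of that prototype hash, assuming the packed-word representation described above (illustrative, not the genesis code):

#include <cstdint>
#include <vector>

// Xor the sequence length (in nucleotides) with all packed 64-bit words.
// Padding bits are zero, so they do not affect the result; the length term
// separates e.g. ACG from ACGA, which share the same padded words.
uint64_t twobit_hash(std::vector<uint64_t> const& words, size_t length) {
    uint64_t h = length;
    for (size_t i = 0; i < words.size(); ++i)
        h ^= words[i];
    return h;
}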

Furthermore, there are other optimizations that I implemented in my prototype, which allow to do certain operations in constant time:

  • Most operations work on bits directly, thus they are fast. For example, moving from one ins/sub at a position to the next one is just a shift and a plus/xor.
  • The hash does not need to be recomputed for every microvariant, but can be computed from the previous hash value with just a few more shifts and xors (a small sketch follows below).

So, memory and speed will be greatly improved. For sequences of ~400bp, I calculated the hash for all microvariants at 4,000-5,000 sequences/sec. With hash map lookup and collisions etc this will be slower of course.
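For the substitution case, the incremental update mentioned in the list above boils down to this (a sketch under the xor-of-words hash assumption; insertions and deletions also change the length term and shift the trailing words, so they need a few more operations):

#include <cstdint>

// A substitution changes exactly one 64-bit word, so the new hash is the old
// one with the old word xor-ed out and the new word xor-ed in.
uint64_t update_hash_substitution(uint64_t old_hash, uint64_t old_word, uint64_t new_word) {
    return old_hash ^ old_word ^ new_word;
}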

In summary, I'd like to see how this optimization works in practice ;-)
What would be a good way to proceed with this?

Lucas

PS: As a geeky side note, we could even think of using the nucleotides in the order A, C, T, G. The ASCII table looks like this:

A 0100 0001
C 0100 0011
G 0100 0111
T 0101 0100

So, for reading them, it would work to use v = (c >> 1) & 3, without any if-branching.
However, reading \n-terminated sequences needs branching anyway (and probably, you want some sequence sanity checks), so this would be over-engineering ;-)

from swarm.

lczech avatar lczech commented on June 14, 2024

Okay, I did some testing on the hash issue.

One thing that I did not mention above is a repetition of hash values for certain substitutions. For example, the substitution of an A by a G at position 0 and at position 32 both give the same hash, assuming that both positions hold an A: since the hash simply xors all 64-bit words together, the same 2-bit change at offset 0 of word 0 and at offset 0 of word 1 alters the hash in exactly the same way. If they are not both A, it could be G substituted by T at position 0, and A substituted by C at position 32; again, same hash. And so on. For every position in [0, 31], there are 3 substitutions of the original nucleotide, and for each there is a doppelgänger in [32, 63] and all higher words.

These collisions are systematic and pose a challenge for the hash function; they make up ~6% of all microvariants. However, I ran some tests with 15,000 random sequences each and did not get any other types of hash collisions (except for a minor variation of the above with a sequence that has an additional A at the end, but this occurs just once per sequence).

So, if we find a way to avoid this, the hash function is probably good to use. Maybe it even already is, if resolving the collisions is still fast enough.

from swarm.

torognes avatar torognes commented on June 14, 2024

Including the length when performing the hashing should separate the sequences, as @frederic-mahe mentioned, so padding the sequences is no problem. The hash function suggested by @lczech looks fast and simple. There are also some other alternatives that could be considered. I do not think that it is a problem if substituting nucleotides at positions 0 and 32 gives the same hash value. It is more important that simple mutations (e.g. small indels, SNPs) that we are likely to see in nature give different hash values.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Additional question we can address later: is sequence compression useful for d > 1? Encoding nucleotides on two-bits is beneficial for d = 1, but is it feasible and beneficial for d > 1?

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Hi @lczech , just to keep you posted: @torognes started implementing the two-bit nucleotide encoding in the dev branch.

from swarm.

torognes avatar torognes commented on June 14, 2024

Thanks! I will look more into this, but so far it looks very useful. I just started writing code for deleting single compressed nucleotides, but I see that you have already implemented this as well as insertions and more, so I might just abandon that, and instead include or link with your code. Since both projects are GPL3 I assume that will be fine.

I am sure the hash calculations/updates will be extremely fast with an xor-based hash, and collision testing as well. The potential decrease in performance due to a simpler hash function will probably be negligible.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Hi @lczech

I went through your code and my brain exploded :-)

I noticed you wrote that in twobit_vector.cpp line 279:

// (Unfortunately, we are not operation on single bits, so a simple and or or
// does not work here. Maybe there are smarter ways, but this one works for now.)

(remember I have no idea what I am talking about) I wonder if a solution could be to distribute the two bits representing each nucleotide into two vectors: one vector would contain the upper bits, and the other one the lower bits. That way, you can use single-bit operators, but you have to apply them to two aligned bit-vectors.

More seriously, I imagine that unpacking (converting bits to nucleotides) could be done with a corresponding array, with 256 cells corresponding to the 256 possible byte-long bit-vectors, each cell containing one of the 256 possible tetranucleotides. That could be done for two-byte-long vectors, but probably not higher. Sequences are then trimmed to their original length to remove padding. Is that how you would do it?
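For illustration, such byte-wise unpacking could look roughly like this (names are hypothetical, and the byte order assumes a little-endian word layout matching the packing discussed above):

#include <array>
#include <cstdint>
#include <string>
#include <vector>

// 256-entry table: each possible byte (4 packed nucleotides) maps to its
// 4-character expansion; the result is trimmed to drop the padding.
static std::array<std::array<char, 4>, 256> build_unpack_table() {
    std::array<std::array<char, 4>, 256> t{};
    const char nt[4] = { 'A', 'C', 'G', 'T' };
    for (int b = 0; b < 256; ++b)
        for (int i = 0; i < 4; ++i)
            t[b][i] = nt[(b >> (2 * i)) & 3];
    return t;
}

std::string unpack_sequence(std::vector<uint64_t> const& words, size_t length) {
    static const auto table = build_unpack_table();
    std::string seq;
    seq.reserve(words.size() * 32);
    auto const* bytes = reinterpret_cast<unsigned char const*>(words.data());
    for (size_t b = 0; b < words.size() * 8; ++b)
        seq.append(table[bytes[b]].data(), 4);
    seq.resize(length);  // trim the zero padding back to the original length
    return seq;
}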

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Hi @torognes

for the projected publication describing swarm 3.0, I was thinking about including @lczech as a co-author (especially if his code ends up being integrated). The author list could be shuffled a bit this time, with Torbjørn moving to first or last author (I'll be in the opposite position). I still need to discuss this with Micah, but I hope he won't be too annoyed to lose his last-author position :-)

from swarm.

lczech avatar lczech commented on June 14, 2024

That was quite some feedback ;-)
Let me try to answer step by step.

...and instead include or link with your code. Since both projects are GPL3 I assume that will be fine.

I wouldn't recommend linking to the complete repository; it's too much noise for swarm. Also, the code will need some minor adaptations, like namespaces, removing some logging stuff, etc. So instead, we can just copy the needed classes and functions to swarm. In fact, I wrote this code back when @frederic-mahe and I were in Roscoff and talked about swarm. I never used it in my own projects, it was purely meant for swarm ;-)


The potential decrease in performance due to a simpler hash function will probably be negligible.

You mean, a decrease in performance due to more collisions? Well, in its current form, the hash is 64 bits, which are fully used for anything longer than 32 nucleotides. So, it should have the same collision rate as any other 64-bit hash. But of course, by design it is easier to construct hash collisions on purpose (which is no issue for our use case, so this is okay). Also, we can experiment with a different hash function that maybe uses 128 bits and alternately fills them with the xor-ed words of the vector.


I noticed you wrote that in twobit_vector.cpp line 279

// (Unfortunately, we are not operation on single bits, so a simple and or or
// does not work here. Maybe there are smarter ways, but this one works for now.)

Haha yes, I remember that piece of code. It is the function that sets a 2-bit value at a certain position in the vector. Well, the "unfortunate" part of this is that we need an array lookup and two bit operations, instead of just one simple bit operation for normal 1-bit-vectors. The only alternative that I had in mind back then was to branch on the value to be set, which would probably be slower. So, it still should be quite fast, just not as fast as a 1-bit-vector set function. Maybe there is a way to get rid of the array lookup though.
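For context, this is roughly the operation in question: the 2 bits at the target position are cleared and then or-ed with the new value. The original uses an array lookup plus two bit operations; in this illustrative sketch the mask is simply computed on the fly (not the actual genesis function):

#include <cstdint>

// Set the 2-bit value (0-3) at nucleotide position i in a packed word array.
void set_value(uint64_t* words, size_t i, uint64_t value) {
    size_t const shift = (i % 32) * 2;
    uint64_t&    word  = words[i / 32];
    word &= ~(3ULL << shift);    // clear the two bits at the position
    word |=  (value << shift);   // write the new 2-bit value
}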


I wonder if a solution could be to distribute the two bits representing each nucleotides into two vectors

Splitting the data into lower and upper bits would make the code way more complex in my opinion, and potentially slower, as each operation then requires access to two memory locations. Don't know if it is worth the effort to try this.


More seriously, I imagine that unpacking (converting bits to nucleotides) could be done with a corresponding array, with 256 cells corresponding to the 256 possible byte-long bit-vectors, each cell containing one of the 256 possible tetranucleotides.

That sounds like a very good idea! The current code for turning a string into a TwobitVector simply iterates over the nucleotides, as this was the simplest way. Of course, this can and should be optimized the way you describe it! Maybe minus the trimming, instead doing a slow nt-by-nt translation for the last n % 4 nucleotides.


Okay, now there are quite some open questions. If you guys want, we can skype about this.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Hi @lczech

I wrote external tests. The idea was to consider swarm as a black box and to check whether its output and behavior match what is expected. For what it's worth, I was able to test most (if not all) of the important uses of swarm, including some weird corner cases. All that without knowing C++, which writing internal tests would have required!

from swarm.

lczech avatar lczech commented on June 14, 2024

Those tests are surely good to have! Still, I would consider internal unit-tests an important part of code validation. In the black-box setting, you never know what kind of stuff went wrong internally. For example, imagine that given a nucleotide sequence, one particular case of inserting a nucleotide is skipped in the code by accident. I doubt that this can be caught in black-box tests.

from swarm.

lczech avatar lczech commented on June 14, 2024

Re microvariants: Touché!

Re unit tests: I use Google Test, which is an open source test framework. It is not hard to integrate, at least with CMake. Will have to check how it can be done with just a Makefile. If you guys want, I can add integration on my swarm fork, and add the tests that I already have. Then, one could successively add more tests for other functions as well.
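To give an idea of what that could look like, here is a minimal Google Test sketch; the nt_to_code helper is defined inline only to keep the example self-contained, whereas real tests would include and exercise the actual twobit_vector functions:

#include <gtest/gtest.h>
#include <cstdint>

// Tiny helper, only for this example: map a nucleotide character to its 2-bit code.
static uint8_t nt_to_code(char c) {
    switch (c) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;  // 'A'
    }
}

TEST(TwobitVector, NucleotideCodes) {
    EXPECT_EQ(0, nt_to_code('A'));
    EXPECT_EQ(1, nt_to_code('C'));
    EXPECT_EQ(2, nt_to_code('G'));
    EXPECT_EQ(3, nt_to_code('T'));
}

// Built and run e.g. with: g++ -std=c++11 test.cpp -lgtest -lgtest_main -pthread && ./a.out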

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Re unit tests: I use Google Test, which is an open source test framework. It is not hard to integrate. If you guys want, I can add integration on my swarm fork, and add the tests that I already have. Then, one could successively add more tests for other functions as well.

Since I am not the one who's going to code the tests, I'll let @torognes decide if he wants to add unit-tests.

from swarm.

torognes avatar torognes commented on June 14, 2024

Thanks, @lczech, I'll start working on using your twobit_vector code in the swarm. I must say that the code looks very well written and documented.

I am also impressed with the entire genesis toolkit you have written. I have not been that good at documenting or testing (using unit tests) my own code, but I am trying to improve. @frederic-mahe has taken the initiative on the external testing.

Regarding the hash function:

The potential decrease in performance due to a simpler hash function will probably be negligible.

You mean, decrease in performance due to more collisions? Well, in its current form, the hash is 64 >bits, which are fully used for anything longer than 32 nucleotides. So, it should have the same >collision rate as any other 64bit hash. But of course, by design it is easier to construct hash >collisions on purpose (which is no issue for our use case, so this is okay). Also, we can experiment >with a different hash function that maybe uses 128 bit, and alternatingly fills them with the xor-ed >words of the vector.

I tested different 64-bit hash functions with Swarm earlier, including the Google CityHash function that we are using now and the Fowler–Noll–Vo hash function (FNV-1a). I think we also tested MetroHash. Even though they are all 64-bit hash functions, I think we saw somewhat different performance with real data when used in hash tables and Bloom filters. What I mean is that although we do not need cryptography-grade hash functions here, I think there will be some differences in how good they are at distributing the sequences evenly across the entire 64-bit range. But as I said, if the hashing is much, much faster, those minor differences are probably not important.

from swarm.

lczech avatar lczech commented on June 14, 2024

Thanks, @lczech, I'll start working on using your twobit_vector code in the swarm. I must say that the code looks very well written and documented.

Thanks ;-)

Did you see my earlier comment here about my swarm fork, where I put ready-to-use code for the implementation? I'm asking just to avoid you replicating work that I already did when adapting the code.

As for the hashes: Okay, I see. That probably needs to be tested. We can also try some other simple hash functions later.

from swarm.

torognes avatar torognes commented on June 14, 2024

There is now a commit (0123fcb) in dev where clustering seems to work for d=1 using TwobitVector. It uses only one thread. Unfortunately it is about 5 times slower than before, so I was a bit disappointed. However, this is only the first iteration and there are many things that can be improved.

Here are some notes:

  • I modified the iterators for deletions and insertions to avoid generating duplicate variants as these cause trouble for the rest of the code. Avoiding the duplicate deletions is fairly straightforward. Avoiding the duplicate insertions is a bit more complex and the code may be optimized.
  • The sequences read from the input file are now stored in a compressed form with 2 bits per nucleotide and padded with zeros to fill a 64-bit word (as in TwobitVector); however, it is not the native TwobitVector format and is converted into that format later, which mainly involves copying the data over as 64-bit words. This copying may take some time and may be one of the reasons for the performance degradation. It may be possible to use the TwobitVector format directly.
  • The sequence hashes are used as indices in a hash table that uses linear probing. It is quite sensitive to the quality of the hash function. Previously we used the CityHash function on the sequence as bytes. The index uses only the lower bits of the hash to select the bucket. E.g. with 1 million sequences, only the 21 lowest bits (out of 64) are used. Therefore we cannot use the new hash function directly, because then only nucleotides 1-11, 33-43, 65-75, ... would influence the bucket selection. I experimented a bit with this and found it to be very important. I now pass the hash value through a 64-bit mixing function before the lowest bits are used (a sketch of such a mixer follows this list). This seems to work well, but takes a little extra time and could be improved. The CityHash function is extremely fast and hard to beat. Maybe it is possible to modify the new hash function to avoid the extra step. I have thought of something like Zobrist hashing, which xors together a set of given pseudo-random values for each of the four nucleotides in each possible position, but it will probably be too slow.
  • Although the new iterators and bit-twiddling functions are effective, they probably take more time than just working on byte values directly.
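For illustration, the MurmurHash3 finaliser is one example of the kind of 64-bit mixing function meant above (shown as a sketch; not necessarily the exact mixer swarm ends up using):

#include <cstdint>

// Spread entropy from the high bits into the low bits, so that a hash table or
// Bloom filter can safely use only the lowest bits as an index.
static inline uint64_t mix64(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

// Usage: bucket = mix64(hash) & (hashtable_size - 1);  // hashtable_size a power of two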

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Thanks for the update @torognes

I modified the iterators for deletions and insertions to avoid generating duplicate variants as these cause trouble for the rest of the code. Avoiding the duplicate deletions is fairly straightforward. Avoiding the duplicate insertions is a bit more complex and the code may be optimized.

I wonder if producing all microvariants, including duplicates, wouldn't be faster than trying to avoid duplicates. For a sequence of length l there are between 6l + 5 and 7l + 4 unique microvariants, and 8l + 4 microvariants in total (including duplicates). If we do not try to avoid duplicates, that's roughly a 15% increase in the number of microvariants, on average. Maybe simpler code would run faster and compensate for these extra microvariants?

from swarm.

lczech avatar lczech commented on June 14, 2024

Thanks for trying out the implementation! Have you tried to profile where the program spends most of the time? I had a look at the code, and apart from the copying of the two-bit representation, didn't spot any obvious slow parts. Maybe the hashing and hash collisions cause an issue though. If you can point me to the dataset that you used for testing, I can try valgrind on it.

Although the new iterators and bit-twiddling functions are effective, they probably take more time than just working on byte values directly.

I don't know, it needs to be tested, I think. The advantage of 64-bit words is that they give a kind of vectorization automatically, for the shifting etc.

from swarm.

torognes avatar torognes commented on June 14, 2024

Haven't had time to profile it yet, but running valgrind will probably help a lot.

So far I have used a subset of 50 000 randomly subsampled sequences from the BioMarKs dataset for testing. These are fairly long sequences.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

I suggest using a fixed seed for subsampling, so results can be compared:

vsearch --fastx_subsample BioMarKs.fasta.bz2 --sample_size 50000 --randseed 1 --quiet --fastaout - 

from swarm.

lczech avatar lczech commented on June 14, 2024

@torognes, could you please post the command line that you use for testing with the BioMarKs dataset? I want to make sure that I profile the twobit vector in a reproducible way.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Hi @lczech (@torognes is probably on holidays). Here is how I would run things:

# download
wget https://github.com/torognes/vsearch-eval/raw/master/derep/data/BioMarKs.fasta.bz2

# subsample (take seq abundance into account)
vsearch --fastx_subsample BioMarKs.fasta.bz2 --sample_size 50000 --sizein --sizeout --randseed 1 --quiet --fastaout BioMarKs_subsample.fas

# clustering
THREADS=1
swarm -t ${THREADS} -z BioMarKs_subsample.fas > /dev/null 

After 4-5 seconds, you should get this:

Database info:     9632593 nt in 25243 sequences, longest 531 nt

Number of swarms:  10241
Largest swarm:     446
Max generations:   10

from swarm.

lczech avatar lczech commented on June 14, 2024

All right, thanks @frederic-mahe, got it to work, with exactly those numbers in the output! However, why does the vsearch subsampling yield 25243 sequences instead of the specified 50k?

I'll run valgrind callgrind to get a call graph for profiling. Will report if I find anything interesting.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Subsampling yields 25,243 sequences, but each sequence represents a variable number of reads (the ;size=INT info in the fasta headers). If you sum up the abundance values, you should get a total of 50,000.

from swarm.

lczech avatar lczech commented on June 14, 2024

All right, just pushed the changes to my fork. My editor also eliminated trailing white space in the files, so there is some noise in there. Sorry for that.

from swarm.

lczech avatar lczech commented on June 14, 2024

Okay, next update: There are two more optimizations that we might consider:

  • Store TwobitVector in the database directly, instead of copying them word-by-word (db_get_seq_as_TwobitVector).
  • Make the iterators operate on the vector instead of on a copy.

I don't know how much of a speedup those would yield, as both need to be done only once per seed, and not per microvariant. So, they are not called that often. Still, might be worth keeping in mind. But before implementing them, we should profile a bit more - and maybe optimize some per-microvariant code first...

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

The "Maximum resident set size of the process during its lifetime, in Kilobytes" (the %M value reported by /usr/bin/time) is almost halved!

from swarm.

torognes avatar torognes commented on June 14, 2024

Even though I am on vacation, I could not resist commenting on the very exciting news. I am happy to see the improvements you made, @lczech!

Regarding memory usage: The fasta headers are also stored in memory and may take considerable amounts of memory. Especially when the sequences are compressed by a factor of four, the lengths of the headers become relatively more important. Perhaps they can also be compressed using a general text compression method.

Regarding parallelisation: The new code simply ignores more than 1 thread. Previously, parallelisation was achieved by splitting the seed sequence into parts and letting each thread handle microvariants in its part of the sequence. For example, with two threads, the first thread would handle variants in the first half and the second thread would handle variants in the second half. The hits would be joined together before further processing. The reason for doing it that way was that the Swarm algorithm requires the seeds to be processed in a specific order (starting with the most abundant and proceeding with the next-most abundant and so on), so we cannot let the different threads handle different seeds. That approach worked reasonably well, at least for long sequences and for few threads. If we should do it the same way now, I think we need the iterators to be able to start and end at specific positions in the sequence. It may also be possible to devise a different approach.

from swarm.

lczech avatar lczech commented on June 14, 2024

Some more profiling results:

I ran the profiling again to see where the program spends most of its time. In terms of speed, there are no more apparent bottlenecks or inefficiencies. From now on, we can do some smaller optimizations, but they won't give overwhelmingly more speedup, I guess.

Furthermore, I ran a heap analysis to see what the memory is used for:

[Heap allocation profile over time: swarm_alloc]

The x-axis is time, the y-axis memory usage. The two green areas and the orange one are allocations due to the database reading. If I read this correctly, those are the names, the sequences, and the hashes. Not sure though. So, all of this is in code that I haven't read yet - @torognes can probably judge best where this can be optimized. Hence, my micro-optimization of a few bytes per twobit vector is probably premature - unless we decide to use those vectors for storing the whole database.

As for parallelization: It should be possible to make the iterators start and end at some position in the vector, of course. Bit more overhead, but doable. Furthermore, one could think of splitting the three loops (subs/ins/dels) per thread, and take into account how many microvariants each of them yields. So 3x subs, 4x ins, 1x dels, if I am not mistaken - and distribute those as evenly as possible across threads.

from swarm.

lczech avatar lczech commented on June 14, 2024

I thought a bit more about the 8bit vs 64bit storage of the data. As @torognes wrote:

Although the new iterators and bit-twiddling functions are effective, they probably take more time than just working on byte values directly.

Not sure whether this is true. In both cases, some bit shifting is necessary to get or set a nucleotide at some position. But for comparing whole sequences, or inserting/deleting positions, the 64-bit version is faster, as fewer operations have to be executed. I don't see yet which operations would be faster with 8-bit storage. @torognes, could you maybe explain your reasoning a bit more?

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Regarding memory usage: The fasta headers are also stored in memory and may take considerable amounts of memory. Especially when the sequences are compressed by a factor of four, the lengths of the headers become relatively more important. Perhaps they can also be compressed using a general text compression method.

I am not sure text compression should be implemented for fasta headers. A simpler alternative is to recommend using compact sequence names if memory is an issue. This is already what I do: I use sha1 hash values derived from the sequences to name the sequences, and I could resort to md5 hash values if memory is really a problem (or even to ordinal numbers 1, 2, 3, 4, etc.).

By the way, "sha1" headers represent appr. 10% of the bytes in the BioMarKs.fasta file:

wc -c BioMarKs.fasta
132259483 BioMarKs.fasta
wc -c <(grep "^>" BioMarKs.fasta)
12829175 /dev/fd/63
wc -c <(grep "^>" BioMarKs.fasta | sed -r 's/;size=[0-9]+// ; s/^>//')
10312599 /dev/fd/63

A small memory-saving possibility for headers would be to remove the abundance information from the header string (which is recognized as ;size=INT when option -z is activated). I believe the header is stored whole as of now. Both the ;size= string and INT values can be recreated when outputting results, and it would shave off at least 7 chars from all headers (8 chars if you discard the >). On the BioMarKs.fasta, it would reduce the number of header bytes by 24%.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

As for parallelization: It should be possible to make the iterators start and end at some position in the vector, of course. Bit more overhead, but doable. Furthermore, one could think of splitting the three loops (subs/ins/dels) per thread, and take into account how many microvariants each of them yields. So 3x subs, 4x ins, 1x dels, if I am not mistaken - and distribute those as evenly as possible across threads.

Naively, I would imagine that the iterator could be used to produce all the microvariants; that array would then be split into n groups, and each group could be passed to a thread to perform the computationally expensive tasks (hashing).

from swarm.

lczech avatar lczech commented on June 14, 2024

Re headers: If the sequences are compressed to 2 bits per nt, they get reduced to 25% of their original size. Then, in comparison, the uncompressed headers make up ~30% of the size for BioMarKs, which is roughly what my heap allocation profile above shows. Compressing reduces this by half (I tried bz2, gz and zip).

Alternatively, if sequence names are guaranteed to be hashes, they only consist of symbols 0-9 a-f. This can be reduced by 50% by interpreting the names as hex coding, and storing 2 symbols per byte - that is, going back to the original hash number. But this might be too strict a requirement.
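A tiny sketch of that idea, assuming lowercase hex names (a hypothetical helper, not part of swarm):

#include <cstdint>
#include <string>
#include <vector>

// Pack a lowercase hex header (e.g. a sha1 digest) into raw bytes, two hex
// symbols per byte, halving the memory needed for the name.
std::vector<uint8_t> pack_hex_name(std::string const& name) {
    std::vector<uint8_t> packed((name.size() + 1) / 2, 0);
    for (size_t i = 0; i < name.size(); ++i) {
        char const c = name[i];
        uint8_t const v = (c <= '9') ? (c - '0') : (c - 'a' + 10);
        packed[i / 2] |= v << ((i % 2) * 4);
    }
    return packed;
}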

Lastly, yes, > and ;size=INT definitely seem to be worth removing ;-)

Re parallelization: The implementation does not store any of the microvariants, as this would just waste memory. For each microvariant, the tasks like hashing and comparing against the database are immediately performed. Also, with the new twobit vectors, hashing is really cheap. The expensive part is now the comparison against the database.

from swarm.

lczech avatar lczech commented on June 14, 2024

Another thought about parallelization:

Furthermore, one could think of splitting the three loops (subs/ins/dels) per thread, and take into account how many microvariants each of them yields. So 3x subs, 4x ins, 1x dels, if I am not mistaken - and distribute those as evenly as possible across threads.

It is semi-expensive to start an iterator for indels, as in both cases the whole vector needs to be shifted (either to insert or to delete a nucleotide). Incrementing the iterator to the next ins/del, however, is really cheap. So, instead of having, say, 2 threads performing indels for the first and second half of the vector, rather have one thread do insertions and one do deletions. Or, as there are different numbers of insertions and deletions, find some good load balancing instead.

Lastly, a little bit can be saved by doing the iterations backwards. In this case, no shifting is needed. But as this only has to be done once per read, and only works for the base case (not splitting the indels among threads), it might not be worth the effort.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

@torognes @lczech Awesome work on the two-bit strategy!

The two-bit strategy should speed up the fastidious mode (faster microvariant generation), but not reduce its memory footprint. I am really looking forward to testing swarm 3.0. I am sitting on huge datasets waiting to be crunched :-)

from swarm.

lczech avatar lczech commented on June 14, 2024

Wow, this is impressive! Of course, the actual sequence is not needed when iterating microvariants - in hindsight, quite obvious. Well observed!

from swarm.
