With the new approach for d=1, the order of amplicons in each swarm may be unstable wh

Unstable order of amplicons in swarms with new approach for d=1 with >1 thread about swarm HOT 10 CLOSED

torognes commented on June 14, 2024

Unstable order of amplicons in swarms with new approach for d=1 with >1 thread

from swarm.

Comments (10)

frederic-mahe commented on June 14, 2024

A sorting of the sub-seeds by decreasing abundance is necessary for the breaking phase. It guarantees an exploration of the amplicon-space "abundant amplicons first".

Let's imagine a simple example: A, B, C, D, where abundances are a such A > B > C >D, and possible (d = 1)-edges as such: A - B, A - D, B - C, B - D, C - D. Without sorting, we could obtain the minimum spanning tree A - D - {B, C}, that goes through a valley (D) before connecting B. With sorting, we obtain the tree A - {B - C, D}, that nicely follows the abundance slope. Please, take a pen and make a doodle: it is much easier to understand graphically.

A stable sorting would be a plus, as it would guarantee swarm's repeatability (if the strict order of amplicons in output results matters).

from swarm.

torognes commented on June 14, 2024

I see. I will rewrite the code to sort the amplicons.

from swarm.

frederic-mahe commented on June 14, 2024

To be able to merge the breaking phase within the growth phase, we need amplicons to be sorted by decreasing abundance. The clustering process should start with the most abundant amplicon in the dataset, and continue with the second, third, etc. Therefore, I think we also need sorting before the beginning of the clustering.

If the user always provides swarm with datasets sorted by decreasing abundance, then maybe an ordered mapping is the data structure we need. If we do not trust the input to be sorted properly, swarm can systematically perform a sorting. However, sorting seems to be a time consuming task, especially with the very large datasets we are dealing with. We could imagine to check the sorting while parsing the fasta file:

if former_abundance <= present_abundance && sorted=TRUE ; then sorted=FALSE

If at the end of the fasta parsing, sorted=TRUE, then there is no need to sort again. That should be the most frequent case if our users follow the instructions in the documentation.

Initially, I thought we should sort amplicons by decreasing abundance and then by alphabetical order (of their nucleotidic sequences) to unsure a stable sorting. Now, I think this is not necessary, a sorting by decreasing abundance should be sufficient. Regarding the storage of the sorting results, I would guess a chained list of hash values, with memorization of the last position (to avoid reading the list from the top each time). Am I right?

from swarm.

torognes commented on June 14, 2024

I agree that we could check if the amplicons is sorted by abundance as we read the FASTA file as you have described. If not, we could sort it within SWARM. It is relatively fast, and quite simple to implement.

Perhaps we should also allow abundances to be formatted as with uclust (>id;size=123) in addition to the current format (>id_123)?

With the old algorithm (d>1), the list of matches is always sorted before proceeding to the next subseed. However, this does not guarantee that the amplicons are sorted at each generation in the output. If there is just two generations (0 and 1), they are sorted. But if there are more generations (0, 1, 2...) then the amplicons in generation 2 are not necessarily fully sorted, only partly. The amplicons in generation 2 found from the first amplicons in generation 1 will always come before the amplicons in generation 2 found from the last amplicons in generation 1. Is this a problem? Should we sort the output so that all amplicons from each generation are sorted? Or should we sort the entire swarm by abundance?

from swarm.

frederic-mahe commented on June 14, 2024

For the uclust-style, if you can implement it easily in the fasta parsing function, please go ahead.

For d > 1, the case you describe is not a problem. I think the best strategy is to compute the d > 1 cases, by doing a d = 1 clustering first, and then search for d = 2, d = 3, etc, as you suggested a few weeks ago. Depending on our breaking strategy (parametric or not), we might be able to avoid a lot of comparisons and to keep it fast, even for high d values. But let's focus on the d = 1 first, I think a good strategy will emerge naturally from it for the d > 1 cases.

from swarm.

torognes commented on June 14, 2024

Fixed. SWARM 1.2.15 sorts the input sequences in order of decreasing abundance unless they are detected to be sorted already. (uclust-style abundances were already allowed in an earlier version.)

When using the alternative algorithm (-a) for d=1 it also sorts all subseeds in order of decreasing abundance. This should solve the issues described.

Multiple sequences having the same abundance are left in the original order. It is very simple to sort them in alphabetic order if that would be advantageous.

A test using a shuffled "huge" dataset confirms that the number of clusters is identical and that the size of the largest cluster is identical to the unshuffled "huge" dataset, however the "max generations" (subseed depth or radius) varied (60 or 63). This is assumed to be due to small differences in the order.

from swarm.

torognes commented on June 14, 2024

Oops! In version 1.2.15 the abundance sort was performed in the opposite order. Fixed in 1.2.16.

from swarm.

frederic-mahe commented on June 14, 2024

Yes, the shape of the tree spanning the cluster depends on the order of the amplicons. Hence the variation in the "max generation" value you observed. Adding alphabetical sorting should not be necessary. I think we can close the issue.

from swarm.

torognes commented on June 14, 2024

Results after running swarm_breaker.py is different for d=1 depending on whether the default or the alternative swarm algorithm is used. I think this is due to different ordering of the amplicons. With the default algorithm, subseeds are always processed and output in a abundance-sorted order for each generation. All amplicons at generation x are processed before those of generation x+1, which is a kind of breadth-first search. With the alternative algorithm this is currently not always the case, as it performs a depth first search. In both cases swarm will start with the most abundant subseed. The initial swarms will be the same, but after the breaking step the results may be different.

from swarm.

torognes commented on June 14, 2024

In SWARM 1.2.20 the alternative algorithm now explores the space in a breadth-first order, generation by generation, and always in order of decreasing abundance within each generation.

from swarm.

Unstable order of amplicons in swarms with new approach for d=1 with >1 thread about swarm HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent