Giter VIP home page Giter VIP logo

Comments (10)

frederic-mahe avatar frederic-mahe commented on June 14, 2024

A sorting of the sub-seeds by decreasing abundance is necessary for the breaking phase. It guarantees an exploration of the amplicon-space "abundant amplicons first".

Let's imagine a simple example: A, B, C, D, where abundances are a such A > B > C >D, and possible (d = 1)-edges as such: A - B, A - D, B - C, B - D, C - D. Without sorting, we could obtain the minimum spanning tree A - D - {B, C}, that goes through a valley (D) before connecting B. With sorting, we obtain the tree A - {B - C, D}, that nicely follows the abundance slope. Please, take a pen and make a doodle: it is much easier to understand graphically.

A stable sorting would be a plus, as it would guarantee swarm's repeatability (if the strict order of amplicons in output results matters).

from swarm.

torognes avatar torognes commented on June 14, 2024

I see. I will rewrite the code to sort the amplicons.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

To be able to merge the breaking phase within the growth phase, we need amplicons to be sorted by decreasing abundance. The clustering process should start with the most abundant amplicon in the dataset, and continue with the second, third, etc. Therefore, I think we also need sorting before the beginning of the clustering.

If the user always provides swarm with datasets sorted by decreasing abundance, then maybe an ordered mapping is the data structure we need. If we do not trust the input to be sorted properly, swarm can systematically perform a sorting. However, sorting seems to be a time consuming task, especially with the very large datasets we are dealing with. We could imagine to check the sorting while parsing the fasta file:

if former_abundance <= present_abundance && sorted=TRUE ; then sorted=FALSE

If at the end of the fasta parsing, sorted=TRUE, then there is no need to sort again. That should be the most frequent case if our users follow the instructions in the documentation.

Initially, I thought we should sort amplicons by decreasing abundance and then by alphabetical order (of their nucleotidic sequences) to unsure a stable sorting. Now, I think this is not necessary, a sorting by decreasing abundance should be sufficient. Regarding the storage of the sorting results, I would guess a chained list of hash values, with memorization of the last position (to avoid reading the list from the top each time). Am I right?

from swarm.

torognes avatar torognes commented on June 14, 2024

I agree that we could check if the amplicons is sorted by abundance as we read the FASTA file as you have described. If not, we could sort it within SWARM. It is relatively fast, and quite simple to implement.

Perhaps we should also allow abundances to be formatted as with uclust (>id;size=123) in addition to the current format (>id_123)?

With the old algorithm (d>1), the list of matches is always sorted before proceeding to the next subseed. However, this does not guarantee that the amplicons are sorted at each generation in the output. If there is just two generations (0 and 1), they are sorted. But if there are more generations (0, 1, 2...) then the amplicons in generation 2 are not necessarily fully sorted, only partly. The amplicons in generation 2 found from the first amplicons in generation 1 will always come before the amplicons in generation 2 found from the last amplicons in generation 1. Is this a problem? Should we sort the output so that all amplicons from each generation are sorted? Or should we sort the entire swarm by abundance?

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

For the uclust-style, if you can implement it easily in the fasta parsing function, please go ahead.

For d > 1, the case you describe is not a problem. I think the best strategy is to compute the d > 1 cases, by doing a d = 1 clustering first, and then search for d = 2, d = 3, etc, as you suggested a few weeks ago. Depending on our breaking strategy (parametric or not), we might be able to avoid a lot of comparisons and to keep it fast, even for high d values. But let's focus on the d = 1 first, I think a good strategy will emerge naturally from it for the d > 1 cases.

from swarm.

torognes avatar torognes commented on June 14, 2024

Fixed. SWARM 1.2.15 sorts the input sequences in order of decreasing abundance unless they are detected to be sorted already. (uclust-style abundances were already allowed in an earlier version.)

When using the alternative algorithm (-a) for d=1 it also sorts all subseeds in order of decreasing abundance. This should solve the issues described.

Multiple sequences having the same abundance are left in the original order. It is very simple to sort them in alphabetic order if that would be advantageous.

A test using a shuffled "huge" dataset confirms that the number of clusters is identical and that the size of the largest cluster is identical to the unshuffled "huge" dataset, however the "max generations" (subseed depth or radius) varied (60 or 63). This is assumed to be due to small differences in the order.

from swarm.

torognes avatar torognes commented on June 14, 2024

Oops! In version 1.2.15 the abundance sort was performed in the opposite order. Fixed in 1.2.16.

from swarm.

frederic-mahe avatar frederic-mahe commented on June 14, 2024

Yes, the shape of the tree spanning the cluster depends on the order of the amplicons. Hence the variation in the "max generation" value you observed. Adding alphabetical sorting should not be necessary. I think we can close the issue.

from swarm.

torognes avatar torognes commented on June 14, 2024

Results after running swarm_breaker.py is different for d=1 depending on whether the default or the alternative swarm algorithm is used. I think this is due to different ordering of the amplicons. With the default algorithm, subseeds are always processed and output in a abundance-sorted order for each generation. All amplicons at generation x are processed before those of generation x+1, which is a kind of breadth-first search. With the alternative algorithm this is currently not always the case, as it performs a depth first search. In both cases swarm will start with the most abundant subseed. The initial swarms will be the same, but after the breaking step the results may be different.

from swarm.

torognes avatar torognes commented on June 14, 2024

In SWARM 1.2.20 the alternative algorithm now explores the space in a breadth-first order, generation by generation, and always in order of decreasing abundance within each generation.

from swarm.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.