Giter VIP home page Giter VIP logo

Comments (21)

benjjneb avatar benjjneb commented on September 28, 2024 3

I've just run some initial tests on the bioconda version of DADA2 1.10 built with GCC7, and it appears that the speed penalty versus natively installed DADA2 is now essentially gone.

I will do some more detailed testing, but this would be a major quality-of-life improvement for folks using DADA2 via QIIME2 (up to a 10x speedup) and strongly suggests skipping the 1.8 package version and going straight to 1.10. Is that in the realm of possibility for the next Q2 release?

from q2-dada2.

benjjneb avatar benjjneb commented on September 28, 2024 2

This won't get to release until the next Bioconductor release (BioC 3.7) which will be around April. Given that there will also be a delay in the BioC release propagating to bioconda, this would make sense for a June/July Q2 release.

from q2-dada2.

epruesse avatar epruesse commented on September 28, 2024 2

Bioconda is now at 1.10: http://bioconda.github.io/recipes/bioconductor-dada2/README.html

@benjjneb Is the speed penalty still there?

This blocks qiime2/q2-alignment#64

from q2-dada2.

benjjneb avatar benjjneb commented on September 28, 2024 2

I wonder if the recent bump to gcc7 on conda-forge and bioconda will help us out here...

Didn't realize this had happened! I will definitely re-profile the speed issue with the 1.10 bioconda builds. fingers crossed would be awesome if the performance delta narrows.

from q2-dada2.

benjjneb avatar benjjneb commented on September 28, 2024 2

@benjjneb Did you solve the segfault error or it didn't happen with your samples?

Works fine in my initial testing, but I haven't tried to reproduce the reported bioconda-specific segfault yet. Will try soon.

Yep! There's still a full month or so on the next release. The dates are also probably something we can flex if it comes down to it, the speed improvement would certainly be worth it.

OK, I am going to move forward with testing and updating the Q2 scripts based on package version 1.10 then. I'll need help from your side figuring out the Q2 conda recipes and any python updates to the plugin though.

from q2-dada2.

epruesse avatar epruesse commented on September 28, 2024 2

Ok updates - my best guess is a concurrency bug somewhere in dada2. See the issue at dada2 repo.

from q2-dada2.

benjjneb avatar benjjneb commented on September 28, 2024 1

Still waiting on the May Bioconductor release to propagate to bioconda. I think progress will show up on this issue: bioconda/bioconda-recipes#8947

Once it migrates, I'm planning to update the R scripts, and to address several other outstanding issues at the same time including #87, #93, #97, #99

And note to self: Add an extra step catching and removing very low depth sequences before learning the error rates.

from q2-dada2.

benjjneb avatar benjjneb commented on September 28, 2024 1

Bioconda is now at 1.10: http://bioconda.github.io/recipes/bioconductor-dada2/README.html

Interesting. It would actually be easier to just jump ahead to 1.10 as the edge case merging bugs were mostly worked out between 1.8 and 1.10.

@benjjneb Is the speed penalty still there?

It will be a little better than now, but as far as I coiuld tell the speed penalty is pretty hard to overcome completely, because it is a result of how bioconda compiles code (old version of gcc, low levels of optimization to handle generic hardware).

from q2-dada2.

epruesse avatar epruesse commented on September 28, 2024 1

That's what I meant. If it's not -march dependent, then it should now all be OK.

Having -march=native compiled stuff won't work for Bioconda. We've had the discussion in a few places, but just picture a user with miniconda in $HOME and doing qsub on a cluster with a few generations of compute nodes added over time. Even on Travis you'll sometimes run into different CPUs.

The reason why I can't easily build a SINA version that matches the pinned libs is that we followed Conda-Forge in the move to GCC7. Since that brought the ABI change required to support C++11, all C++ libs except for libstdc++ are incompatible with the old ones.

from q2-dada2.

epruesse avatar epruesse commented on September 28, 2024 1

Didn't realize this had happened!

It's still happening I suppose. We are rebuilding more or less on an as-needed-basis because of resource constraints. Conda-Forge had been rebuilding for a while, and this January did "the label switch". That means all the GCC7 package previously in conda-forge/label/gcc7 became available through conda-forge. We tagged our pre-gcc7 state and started rebuilding. So far no major disasters have happened fingers crossed.

from q2-dada2.

ebolyen avatar ebolyen commented on September 28, 2024 1

Is that in the realm of possibility for the next Q2 release?

Yep! There's still a full month or so on the next release. The dates are also probably something we can flex if it comes down to it, the speed improvement would certainly be worth it.

from q2-dada2.

thermokarst avatar thermokarst commented on September 28, 2024

@benjjneb, can you let us know which bioconductor release cycle you expect this to be available in, that way we can track it in the appropriate QIIME 2 release cycle. Thanks!

from q2-dada2.

ebolyen avatar ebolyen commented on September 28, 2024

@benjjneb, in the new granular pipeline scheme, would it be possible to provide a way to handle sample pooling? This came up recently in the context of breakaway and alpha diversity estimation.

breakaway in particular requires singletons to exist, if there was a way to conditionally return FeatureTable[Frequency % Properties('singletons')] we would be able to enforce that assumption (when sample pooling is true or pseudo at least).

We're going to be trying to address TypeMap and similar typing related things in the next release cycle. If I need to implement a way to dependently type the output, that should be possible (especially with a real use-case now).

from q2-dada2.

benjjneb avatar benjjneb commented on September 28, 2024

@benjjneb, in the new granular pipeline scheme, would it be possible to provide a way to handle sample pooling? This came up recently in the context of breakaway and alpha diversity estimation.

Yes. Although given the frustrating speed penalty that bioconda-dada2 has, pseudo-pooling might make more sense.

if there was a way to conditionally return FeatureTable[Frequency % Properties('singletons')] we would be able to enforce that assumption (when sample pooling is true or pseudo at least).

We're going to be trying to address TypeMap and similar typing related things in the next release cycle. If I need to implement a way to dependently type the output, that should be possible (especially with a real use-case now).

Have to admit, I don't really follow this. Basically option to a different return type if pool=TRUE (or pool=psuedo)?

from q2-dada2.

ebolyen avatar ebolyen commented on September 28, 2024

Yep! Basically we'd just figure out when that property can be added, and tools that need it can require it as an input. Everything else will continue not caring like normal.

from q2-dada2.

thermokarst avatar thermokarst commented on September 28, 2024

old version of gcc

I wonder if the recent bump to gcc7 on conda-forge and bioconda will help us out here...

from q2-dada2.

benjjneb avatar benjjneb commented on September 28, 2024

@epruesse A brief update, we have started getting segfault error reports for the dada2 package that appear to be coming from conda-installed versions of the 1.10 package, i.e. the new package version that is being built with GCC7. See for example: benjjneb/dada2#684

from q2-dada2.

apcamargo avatar apcamargo commented on September 28, 2024

@benjjneb Did you solve the segfault error or it didn't happen with your samples?

from q2-dada2.

ebolyen avatar ebolyen commented on September 28, 2024

I'll need help from your side figuring out the Q2 conda recipes and any python updates to the plugin though.

Sounds like a plan! Feel free to structure the R scripts into as many or few as you need. I can adapt the Python side to work correctly. I think we had talked about breaking up the denoise-* methods into multiple actions before (and maybe even including taxonomic assignment?). We have pipelines now, so I can re-compose the original denoise-* methods in Python if you think that's worth attempting with this.


Re segfault: Is it possible that it's just a blanket runtime conflict? e.g. glibc on CentOS 5 vs anything compiled in the last decade?

from q2-dada2.

benjjneb avatar benjjneb commented on September 28, 2024

Just wanted to ping this thread with the DADA2 issue that is open on segfaults with bioconda-dada2-1.10: benjjneb/dada2#684

In testing on my local machine, the bioconda install of 1.10 seems fine, but clearly there is something going on here because there are multiple people with the same report.

from q2-dada2.

epruesse avatar epruesse commented on September 28, 2024

It might be not just 1.10.

(segfault in 1.8) bioconda/bioconda-recipes#13847
(just references) bioconda/bioconda-recipes#13776

In 1.8, it appears to have been a libc++ issue? I can't tell. We need someone proficient in R for this.

from q2-dada2.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.