
Comments (3)

daveaglick commented on May 22, 2024

So once again, I'm back to wondering if I should just go ahead and implement parallelism for each pipeline. This was considered in #42 and has now come up again, less than a week later. The thought process is that if we're going to support depth-first traversal, we're already going to confuse the tracing output. And it probably makes sense to just do depth-first always, for consistency and to minimize confusion. At that point, there's really nothing holding us back from making the whole thing async.

Currently investigating the TPL Dataflow library for this:
https://msdn.microsoft.com/en-us/library/hh228603(v=vs.110).aspx
http://www.michaelfcollins3.me/blog/2013/07/18/introduction-to-the-tpl-dataflow-framework.html

One big change (regardless of whether we go async or sync depth-first) would be that modules can no longer access the set of documents from their own pipeline. For example, if a blog post accesses other blog posts to display the next and previous ones, that wouldn't work because the first post would get all the way to the end of the pipeline before the second post is processed. The mitigation would be to read enough metadata for all posts in one pipeline, then continue processing in a second pipeline for layout. The switch in pipelines would act as a chokepoint, letting all posts get processed for metadata before continuing to layout and output.
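As a rough sketch of that mitigation (the exact module names and signatures here are illustrative guesses, not a committed API), the metadata pass and the layout pass would live in separate pipelines, with the pipeline boundary acting as the chokepoint:

```csharp
// Sketch only: "Posts" runs to completion before "PostLayout" starts,
// so every post's metadata exists by the time layout begins.
Pipelines.Add("Posts",
    ReadFiles("posts/*.md"),   // read every post
    FrontMatter(Yaml())        // extract metadata for every post
);

Pipelines.Add("PostLayout",
    Documents("Posts"),        // pull the already-processed posts back in
    Razor(),                   // layouts can now see next/previous posts
    WriteFiles(".html")
);
```

Even if individual pipelines become async or depth-first internally, running pipelines sequentially relative to each other preserves this guarantee.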

from statiq.web.

daveaglick commented on May 22, 2024

Another problem: how to deal with modules that operate on all input documents as a unit? For example, a (hypothetical) Aggregate module that combines all input documents into a single output document (come to think of it, I should probably make this module)?

In Dataflow, there's no multiple-input, multiple-output block - each block operates on a single input, regardless of the number of outputs. It's also not clear how to make iteration lazy when a block returns multiple outputs - that is, if a module returns several outputs (such as ReadFiles), the corresponding Dataflow block would materialize all the file documents at once and only then send them, one at a time, to the next module block.
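For reference, the closest Dataflow shape is TransformManyBlock, which maps one input to many outputs; this minimal sketch (using plain file paths rather than documents) shows how each input is still handled as a single unit, with no built-in many-inputs-to-one-output counterpart:

```csharp
using System;
using System.IO;
using System.Threading.Tasks.Dataflow;

class DataflowSketch
{
    static void Main()
    {
        // One input (a glob pattern) fans out to many outputs (file paths),
        // but the block itself still consumes exactly one message at a time.
        var readFiles = new TransformManyBlock<string, string>(pattern =>
            Directory.EnumerateFiles(".", pattern));

        var next = new ActionBlock<string>(path => Console.WriteLine(path));

        readFiles.LinkTo(next, new DataflowLinkOptions { PropagateCompletion = true });
        readFiles.Post("*.md");
        readFiles.Complete();
        next.Completion.Wait();
    }
}
```

An Aggregate-style module would need the inverse (many inputs, one output), which in Dataflow has to be simulated with something like a BatchBlock plus manual completion tracking rather than expressed directly.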

Current thinking is to implement a custom asynchronous pipeline. An internal class will wrap each module and provide BlockingCollection&lt;IDocument&gt; collections for both input and output. Add a new IAsyncModule interface that has an async Execute(...) method. The wrapper should lazily evaluate the enumerable returned from the module and add items to the output BlockingCollection as they become available (at which point the next module will pick them up and go). To satisfy the aggregate use case above, modules should be able to either block until the input collection's IsAddingCompleted is set or signal in some other way that they need all the documents to be available before executing (perhaps with another interface?). Likewise, the wrapper should be sure to call CompleteAdding() on the output collection when enumeration of the results from the module is done.
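A minimal sketch of that wrapper (the interface and class names here are guesses at the design, not actual types, and IDocument stands in for the real document abstraction):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public interface IAsyncModule
{
    // Lazily yields output documents as input documents become available.
    IEnumerable<IDocument> Execute(IEnumerable<IDocument> inputs);
}

internal class ModuleRunner
{
    public BlockingCollection<IDocument> Input { get; } = new BlockingCollection<IDocument>();
    public BlockingCollection<IDocument> Output { get; } = new BlockingCollection<IDocument>();

    public Task RunAsync(IAsyncModule module) => Task.Run(() =>
    {
        // GetConsumingEnumerable() blocks until items arrive, so this module
        // can start work before the upstream module has finished producing.
        foreach (IDocument result in module.Execute(Input.GetConsumingEnumerable()))
        {
            Output.Add(result); // the next module picks this up immediately
        }
        // Signal downstream that enumeration of this module's results is done.
        Output.CompleteAdding();
    });
}
```

An aggregate module under this scheme would simply enumerate its entire input before yielding anything, since GetConsumingEnumerable() completes once CompleteAdding() has been called upstream.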


daveaglick commented on May 22, 2024

After attempting to implement both asynchronous pipeline processing and then synchronous depth-first (by relying on lazy iteration), there are just too many compromises in both cases. In addition to the loss of easily understood sequential trace output, there are complications with ensuring full iteration, synchronizing metadata access (in the case of asynchronous processing), dealing with aggregate modules (as described above), dealing with modules like Branch and If that potentially require multiple iteration, etc.

Instead, I'd like to continue using the breadth-first synchronous processing model that was originally designed. That said, there certainly is a need to process documents one at a time for use cases like processing many large images. This will hopefully be the exception, so the support can be opt-in. I've created a new module, ForEach, that should work in this situation by essentially running its child module sequence multiple times, once for each input document (instead of the normal process of feeding all input documents to the next module at once). In the case of image processing, it would be used like this:

Pipelines.Add("ImageProcessing",
    // ReadFiles will create N new documents with a Stream (but nothing will be read into memory yet)
    ReadFiles("*")
        .Where(x => new[] { ".png", ".jpg", ".jpeg", ".gif" }.Contains(Path.GetExtension(x))),
    // Each document in N will be individually sent through the sequence of ForEach child pipelines
    ForEach(
        // This will load the *current* document into a MemoryStream (?)
        ImageProcessor()
            .Resize(100,100).ApplyFilters(ImageFilter.GreyScale, ImageFilter.Comic)
            .And.Resize(60, null).Brighten(30)
            .And.Resize(0, 600).Darken(88)
            .And.Constrain(100,100),
        // and this will save the stream to disk, replacing it with a file stream,
        // thus freeing up the memory for the next file
        WriteFiles()
    )
);

