Comments (3)
So once again, I'm back to wondering if I should just go ahead and implement parallelism for each pipeline. This was considered in #42 and has now come up again, less than a week later. The thought process is that if we're going to support depth-first traversal, we're already going to confuse the tracing output. It probably makes sense to just do depth-first always, for consistency and to minimize confusion. At that point, there's really nothing holding us back from making the whole thing async.
Currently investigating the TPL Dataflow library for this:
https://msdn.microsoft.com/en-us/library/hh228603(v=vs.110).aspx
http://www.michaelfcollins3.me/blog/2013/07/18/introduction-to-the-tpl-dataflow-framework.html
One big change (regardless of whether we go async or sync depth-first) would be that modules can no longer access the set of documents from their own pipeline. For example, if a blog post accesses other blog posts to display the next and previous ones, that wouldn't work because the first post would get all the way to the end of the pipeline before the second post is processed. A mitigation would be to read enough metadata for all posts in one pipeline, then continue processing in a second pipeline for layout. The switch in pipelines would act as a chokepoint, letting all posts get processed for metadata before continuing to layout and output.
from statiq.web.
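The two-pipeline chokepoint might look something like this in config form. This is only a sketch: the exact module names (`FrontMatter`, `Yaml`, `Documents`, `Razor`) are assumed to behave as their Wyam counterparts do, and the pipeline names are made up.

```csharp
// Pipeline 1: read every post and extract front matter metadata only.
// By the time this pipeline finishes, all posts have their metadata set.
Pipelines.Add("PostMetadata",
    ReadFiles("posts/*.md"),
    FrontMatter(Yaml())
);

// Pipeline 2: render layouts. Because pipelines run in order, every
// document here can safely look up next/previous post metadata.
Pipelines.Add("PostLayout",
    Documents("PostMetadata"),  // chokepoint: pulls all docs from pipeline 1
    Razor(),
    WriteFiles(".html")
);
```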
Another problem: how to deal with modules that operate on all input documents as a unit? I.e., a (hypothetical) Aggregate module that combines all input documents into a single output document (come to think of it, we should probably make this module). In Dataflow, there's no multiple-input, multiple-output block - each block operates on a single input, regardless of the number of outputs. It's also not clear how to make iteration lazy when a block returns multiple outputs - that is, if a module returns several outputs (such as ReadFiles), the corresponding Dataflow block would produce all the file documents at once and only then send them, one at a time, to the next module block.
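For reference, the closest Dataflow fit for the one-input/many-output case is `TransformManyBlock`. A minimal sketch (the `DataflowShape`/`Expand` names are mine, not part of any real pipeline; requires the `System.Threading.Tasks.Dataflow` package):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks.Dataflow;

static class DataflowShape
{
    // One input message fans out to many output messages, like ReadFiles
    // turning one trigger into N documents. Note each block still consumes
    // a single message at a time; there is no built-in many-in/many-out block.
    public static List<int> Expand(int n)
    {
        var results = new List<int>();

        var expand = new TransformManyBlock<int, int>(x => Enumerable.Range(0, x));
        var collect = new ActionBlock<int>(i => results.Add(i));

        expand.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

        expand.Post(n);      // fans out to n output messages
        expand.Complete();
        collect.Completion.Wait();

        return results;
    }
}
```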
Current thinking now is to implement a custom asynchronous pipeline. An internal class will wrap the module and provide BlockingCollection&lt;IDocument&gt; collections for both input and output. Add a new IAsyncModule interface that has an async Execute(...) method. The wrapper should lazily evaluate the enumerable returned from the module and add items to the BlockingCollection as they're available (at which point the next module will pick up and go). To satisfy the aggregate use case above, modules should be able to either block waiting for BlockingCollection.IsCompleted or signal in some other way that they need all the documents to be available before executing (perhaps with another interface?). Likewise, the wrapper should be sure to call CompleteAdding() when enumeration of the results from the module is done.
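A minimal sketch of that wrapper, using strings in place of IDocument to stay self-contained; the IModule/ModuleRunner names here are placeholders, not the real Wyam types:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder for the real module interface.
interface IModule
{
    IEnumerable<string> Execute(IEnumerable<string> inputs);
}

// Wraps a module with blocking input/output collections so that results
// flow downstream as soon as they are produced.
class ModuleRunner
{
    public BlockingCollection<string> Input { get; } = new BlockingCollection<string>();
    public BlockingCollection<string> Output { get; } = new BlockingCollection<string>();

    public Task Run(IModule module) => Task.Run(() =>
    {
        // GetConsumingEnumerable() blocks until items arrive and ends when
        // the upstream runner calls CompleteAdding() on our input.
        foreach (string result in module.Execute(Input.GetConsumingEnumerable()))
        {
            // Lazily enumerate the module's results, handing each one to
            // the next module as soon as it is available.
            Output.Add(result);
        }

        // Signal downstream that no more documents are coming.
        Output.CompleteAdding();
    });
}
```

An aggregate module would simply enumerate its entire input inside Execute before yielding anything, which blocks it until the upstream CompleteAdding() call.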
After attempting to implement both asynchronous pipeline processing and then synchronous depth-first processing (by relying on lazy iteration), there are just too many compromises in both cases. In addition to the loss of easily understood sequential trace output, there are complications with ensuring full iteration, synchronizing metadata access (in the case of asynchronous processing), dealing with aggregate modules (as described above), dealing with modules like Branch and If that potentially require multiple iteration, etc.
Instead, I'd like to continue using the breadth-first synchronous processing model that was originally designed. That said, there certainly is a need to process documents one at a time for use cases like processing multiple large images. This will hopefully be the exception, so the support can be opt-in. I've created a new module, ForEach, that should work in this situation by essentially running its child module sequence multiple times, once for each input document (instead of the normal process of feeding all input documents to the next module at once). In the case of image processing, it should be used like this:
Pipelines.Add("ImageProcessing",
    // ReadFiles will create N new documents with a Stream (but nothing will be read into memory yet)
    ReadFiles("*")
        .Where(x => new[] { ".png", ".jpg", ".jpeg", ".gif" }.Contains(Path.GetExtension(x))),
    // Each document in N will be individually sent through the sequence of ForEach child modules
    ForEach(
        // This will load the *current* document into a MemoryStream (?)
        ImageProcessor()
            .Resize(100, 100).ApplyFilters(ImageFilter.GreyScale, ImageFilter.Comic)
            .And.Resize(60, null).Brighten(30)
            .And.Resize(0, 600).Darken(88)
            .And.Constrain(100, 100),
        // and this will save the stream to disk, replacing it with a file stream,
        // thus freeing up the memory for the next file
        WriteFiles()
    )
);
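Internally, the ForEach idea reduces to running the child pipeline once per input. A sketch using plain strings in place of IDocument (the ForEachSketch name and delegate shape are illustrative, not the real module's implementation):

```csharp
using System;
using System.Collections.Generic;

static class ForEachSketch
{
    // Runs the child pipeline once per input document rather than once for
    // the whole input set, so only one document's data is live at a time.
    public static IEnumerable<string> Execute(
        IEnumerable<string> inputs,
        Func<IEnumerable<string>, IEnumerable<string>> childPipeline)
    {
        foreach (string input in inputs)
        {
            // Feed a single-document collection through the entire
            // child module sequence.
            foreach (string result in childPipeline(new[] { input }))
            {
                yield return result;
            }
        }
    }
}
```

For the image case above, this is what lets WriteFiles flush one image's memory before the next document's stream is loaded.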