
porcupine's Introduction



Porcupine is a tool aimed at people who want to express general data manipulation and analysis tasks in Haskell,

  1. In a way that is agnostic to the source of the input data and to the destination of the end results,
  2. So that a pipeline can be re-executed in a different environment and on different data without recompiling, just by a change in its configuration,
  3. While facilitating code reusability (any task can always be reused as part of a bigger pipeline).

Porcupine specifically targets teams whose skills range from those of data scientists to those of data/software engineers.


Porcupine's development

Porcupine's development happens mainly inside NovaDiscovery's internal codebase, where a fork of porcupine resides. We regularly synchronise this internal repo with porcupine's GitHub repo, which is why commits tend to appear in batches on GitHub.

Lately, a lot of effort has been invested in developing Kernmantle, which should provide the new task representation (see Future plans below).

Participating in porcupine's development

Issues and MRs are welcome :)

Future plans

These features are being developed and should land soon:

  • porcupine-servant: a servant app can directly serve porcupine's pipelines as routes, and expose a single configuration for the whole server
  • enhancement of the API to run tasks: runPipelineTask would remain in place but become a tiny wrapper over a slightly lower-level API. This would make it easier to run pipelines in different contexts (like that of porcupine-servant)
  • common configuration representation: for now porcupine can only handle config via a yaml/json file + CLI. Some applications may require other configuration sources (GraphQL, several config files that override one another, etc.). We want a common tree format that every configuration source gets translated to, and then just merge all these trees, so that each config source is fully decoupled from the others and can be activated at will

The following are things we'd like to start working on:

  • switch to cas-store: porcupine's dependency on funflow is mainly for the purpose of caching. Now that cas-store is a separate project, porcupine can directly depend on it. This will simplify the implementation of PTask and make it easier to integrate PTasks with other libraries.
  • implement PTask over a Kernmantle Rope: this is the main reason we started the work on Kernmantle, so it could become a uniform pipeline API, independent of the effects the pipeline performs (caching, collecting options or required resources, etc.). Both porcupine and funflow would become collections of Kernmantle effects and handlers, and would therefore be seamlessly interoperable. Developers would also be able to add their own custom effects to a pipeline. This would probably mean the death of reader-soup, as the LocationAccessors could directly be embedded as Kernmantle effects.
  • package porcupine's VirtualTree as a separate package: all the code that is not strictly speaking related to tasks would be usable separately (for instance to be used in Kernmantle effects handlers).

F.A.Q.

How are Porcupine and Funflow related?

Porcupine uses Funflow internally to provide caching. Funflow's API is centered around the ArrowFlow class. PTask (porcupine's main computation unit) also implements ArrowFlow, so the usual funflow operations are usable on PTasks as well.
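
For illustration, here is a minimal sketch using only the Arrow interface (the step names and bodies are hypothetical; real PTasks would come from porcupine's and funflow's APIs). Since PTask is an Arrow via ArrowFlow, the same code shape applies to PTasks:

import Control.Arrow (Arrow, arr, (>>>))

-- Compose two steps with a pure transformation in between.
-- This works for any Arrow, hence also for PTask.
composeSteps :: Arrow p => p Int Int -> p Int String -> p Int String
composeSteps step1 step2 = step1 >>> arr (* 2) >>> step2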

Aside from that, funflow and porcupine don't operate at the same level of abstraction: funflow is for software devs building applications the way they want, while porcupine is higher-level and more featureful, targeting software devs as well as modelers and data analysts. However, porcupine doesn't impose any choice of computation or visualization libraries; that part is still up to the user.

The main goal of Porcupine is to be a tool to structure your app: a backbone that helps you kickstart e.g. a data pipeline/analytics application while keeping the boilerplate (config, I/O) to a minimum, and a common framework when you have code (tasks, serialization functions) to share between several applications of that type. And since the arrow and caching APIs are the same in both Funflow and Porcupine, as a software dev you can start by using porcupine, and if you realize you don't actually need the high-level features (config, rebinding of inputs, logging, etc.) you can drop the dependency and transition to Funflow's level.

Can the tasks run in a distributed fashion?

Funflow provides a worker daemon that the main pipeline can distribute docker-containerized tasks to. For pure Haskell functions, there is funflow-jobs, but it's experimental.

So porcupine could be used with funflow-jobs, but for now it has only ever been used for parallel execution of tasks. We recently started thinking about how the funflow/porcupine model could be adapted to run a pipeline on a cluster in a decentralized fashion, and we have some promising ideas, so that feature may appear in the future.

Another solution (which is the one used by our client) is to use an external job queue (like celery) which starts porcupine pipeline instances. This is made easy by the fact that all the configuration of a pipeline instance is exposed by porcupine, and therefore can be set by the program that puts the jobs in the queue (as one JSON file).

I like the idea of tasks that automatically maintain and merge their requirements when they compose, but I want to deal with configuration, CLI and everything myself. Can I do that?

Of course! That means you would replace the call to runPipelineTask with custom code. Have a look at the splitTask lens: it separates a task into its two components, its VirtualTree of requirements (which you can process however you please, the goal being to turn it into a DataAccessTree) and a RunnableTask, which you can pass to execRunnableTask once you have built that DataAccessTree. Note, though, that this part of the API might change a bit in future versions.

Is Porcupine related to Hedgehog?

We can see where that comes from ^^, but nope, not all R.O.U.S.s are related. (Also, hedgehogs aren't rodents.)

Although we do have a few tests using Hedgehog (and will possibly add more).

porcupine's People

Contributors

curiousleo, eltix, facundominguez, gitter-badger, guibou, hanstolpo, lpil, mboes, mmesch, philderbeast, pierrebeucher, saeedhk, thufschmitt, unkindpartition, ypares


porcupine's Issues

Replace "repeatable" by "variable" (in VirtualFiles and maybe tasks)

VirtualFiles are called "repeatable" when their paths are to contain variables to be spliced in. It occurred to me that this is actually more generic than passing around file indices.

For instance this could be used to determine dynamically the entire path of a file from the content of another file:

locations:
  /inputs/stuff: "{stuffPath}"

So "repeatable" is too specific. "Variable" would encompass better the uses.

Port tests / add new tests

For now our full test suite cannot be open-sourced, as it uses data that is the property of our client. The test suite in this repository is currently very light and should be expanded.

Improve composability of serials

For now, the repetition of VirtualFiles is done at the task level. This is a problem for two reasons:

  1. It is not possible to take e.g. a PureDeserial A and generalize it to a PureDeserial (Stream (Of (Key, A)) m ()) (to read several A from the same source, which is necessary or at least very convenient if we want to use frames or cassava).
  2. It is not possible to abstract out the way the As end up laid out in files. For instance, if each A is in a different JSON file, the code of the pipeline won't be exactly the same as if all the As are one-line JSON documents in the same file. Porcupine was initially envisioned to support that use case, and ideally only the pipeline config file should have to change between the two cases.

Point 1. is difficult to address because currently the monad m type is hidden inside the serialization functions, and using a serial of Stream means exposing it. The problem is that serialization functions are supposed to be oblivious of the LocationMonad they're running in (they shouldn't care whether they read local or remote files).

Point 2. requires some thinking as to how we should handle the repetition keys. In the case of "each bit of data is in a different file", it's simple: the repetition key suffixes the file name, et voilà, we don't care in which order the keys are read. It's not the same if every bit of data is just a line from a file, because now (unless we want to hold the whole file in memory) we have to make some assumptions about the order in which the keys appear in the file, or whether the keys are present at all. Plus, the way keys are laid out is serialization-dependent (for JSON one-liners, for instance, each line can contain the key as a field).

For now, the "simplest" way I can think of is to intermingle serials and tasks even more. For now serialization functions are plain functions. What if they were PTasks? This way, any serialization function could internally use any tool already available at the task level, for instance reusing sub VirtualFiles. This way, a SerialsFor would just work at the level of a subtree of the resource tree, and could access anything it wants in there. The problem is that this way, we'll have a somewhat "non-fixed" resource tree in the end, because the subtrees would depend on which serialization functions are actually chosen in the end, and that would make pipeline parameterization more complicated for the end user.

Generalize disambiguation of CLI arguments

( This might be unnecessary if #47 is tackled )

For now, the flag names for the CLI are generated by inspecting each DocRec independently of the others. That means that if two DocRecs of options at two different points in the pipeline both expose a field with the same name, only the first one will be exposed via the CLI (the second is still settable in the config file, though).

There is a disambiguation method, but currently it works only at the level of one DocRec (when two fields, named with their two "paths" p1 and p2, are such that last p1 == last p2). This feature is rarely used because few tasks make use of the hierarchical nature of DocRec names (most use only one-level namings). We should extend that disambiguation process to the whole pipeline.

Adding extra " " to string indices in file names

Up to now, we have used numbers as our indices, and we use the show function to splice them into file names. The issue is that if we use strings as indices, we should not apply the show function, otherwise we get extra quotation marks (" ") in the file names.

Investigate other exchange formats for config files

For now, our config file can be in JSON/YAML, which the pipeline can automatically generate.
That is nice, but porcupine (due to its inclusive philosophy) could attract other people by supporting other formats. I'm thinking about:

  • TOML (would be easy to add, and nice for pipelines with light configuration, although it wouldn't fit workflows where we want to embed arbitrary input data in the config file (see #47 for some details) because TOML is too flat, so it would require some thinking)
  • Avro/Thrift/Protobuf (would require more work, but given that the virtual tree contains all the type information it is already possible, and it would really enhance a pipeline's capacity to be called from an external tool)
  • Apache Arrow (related to #9, this could be useful for pipelines using a lot of data which could actually be packed in one big parquet/arrow dataset)

Auto-document the config file

Porcupine-based executables are able to generate a default config template, but this template can't include any comments, which would be very nice for making pipelines self-documenting.

Cannot build reader-soup on GHC 8.10

Describe the bug

I cannot build reader-soup on GHC 8.10.

To Reproduce

cabal build reader-soup on GHC 8.10.

Expected behavior

It builds

Additional context

The failure is:

[1 of 4] Compiling Control.Monad.ReaderSoup ( src/Control/Monad/ReaderSoup.hs, dist/build/Control/Monad/ReaderSoup.o, dist/build/Control/Monad/ReaderSoup.dyn_o )

src/Control/Monad/ReaderSoup.hs:127:17: error:
    • Expected kind ‘(k -> *) -> [k] -> *’,
        but ‘r’ has kind ‘((Symbol, *) -> *) -> [(Symbol, *)] -> *’
    • In the first argument of ‘RecElemFCtx’, namely ‘r’
      In the type ‘(HasField r l ctxs ctxs (ContextFromName l) (ContextFromName l),
                    RecElemFCtx r ElField)’
      In the type declaration for ‘IsInSoup_’
    |
127 |   , RecElemFCtx r ElField )

Perhaps this is a problem with vinyl, which does at least build on 8.10, but may have things incorrectly typed?

Improve the error message in case of invalid locations

If my yaml config for a pipeline (running with the http backend) contains something like

locations: 
  /Foo: http://foo:thisIsntAValidPortNumber/bar

then porcupine will fail with a message of the form

Location(s) "http://foo:thisIsntAValidPortNumber/bar" cannot be used by the location accessors in place.

While true, this isn't a really cool message to get because it doesn't tell me why the http backend couldn't recognize this url (and my first intuition as a user will be to think that the dev of the pipeline forgot to include the http backend).

I see two solutions to avoid this:

  • Display the error messages from all the parsers

    This is probably the simplest (though not the nicest) solution. It would look something like

    Location(s) "http://foo:thisIsntAValidPortNumber/bar" cannot be used by the location accessors in place:
      backend "local" failed with: expected "/" but got "h"
      backend "http" failed with: expected digit but got "t"
      backend "s3" failed with: expected digit but got "t"
    

    (I just made up the actual error messages; these would probably be different and depend on the actual parser used by the backend)

  • Have a way to register a certain backend depending on some part of the location (like the protocol field in a URL or a specific field in a JSON object). For example, every URL with http as the protocol would be reserved for the http backend. In this case the final error message would look like

    Location(s) "http://foo:thisIsntAValidPortNumber/bar" look like an http location but the http parser failed with: expected digit but got "t"
    

    The advantage of this solution is that we would have a really different error message, like

    no registered location accessor for http urls
    

    Obviously this gives less flexibility to the location accessors, so we should evaluate whether the trade-off is worth it

Repeated VirtualFiles and embedded data don't play well together

For now, if an optionsVirtualFile is repeated (e.g. if it is accessed in a task repeated via FoldA or parMapTask), the options will appear only once in the config file. What we would like instead is for that field in the config file to be a JSON object in which the record of options is also repeated, indexed by the TRIndex value.

This should also work for any embeddable VirtualFile (those that support serialization/deserialization to/from JSON). For instance, in example1 this would allow the users data to be embedded directly in the config file.
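
A sketch of the desired config shape (the option name "threshold" and the "Options" path are hypothetical, used only for illustration):

data:
  Options:
    "0": { threshold: 0.5 }   # one record per TRIndex value
    "1": { threshold: 0.7 }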

Add a README to docrecords

Data.DocRecords suggests that the library exists to add documentation to fields of extensible records. The package.yaml file says instead that it adds hierarchical field names.

In either case, there's no hint as to what the purpose of these features is, nor what a hierarchical field name is.

A fresh clone doesn't build with latest stack, kqueue needed.

Describe the bug
A fresh clone doesn't build with latest stack.

To Reproduce

> git clone https://github.com/tweag/porcupine.git
> cd porcupine
> stack build
Error: While constructing the build plan, the following exceptions were encountered:

In the dependencies for funflow-1.5.0:
    kqueue needed, but the stack configuration has no specified version  (latest matching version is 0.2)
needed due to porcupine-core-0.1.0.1 -> funflow-1.5.0

Some different approaches to resolving this:

  * Recommended action: try adding the following to your extra-deps in /Users/.../porcupine/stack.yaml:

- kqueue-0.2@sha256:47d6c1083e60d55a64511c944629a507bda70906e5317bb6f906d64b2cb78a9f,1369
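
For reference, following stack's recommendation amounts to adding this to stack.yaml (the hash is copied verbatim from the message above):

extra-deps:
  - kqueue-0.2@sha256:47d6c1083e60d55a64511c944629a507bda70906e5317bb6f906d64b2cb78a9f,1369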

Environment

  • OS name + version:
> sw_vers
ProductName:	Mac OS X
ProductVersion:	10.13.6
BuildVersion:	17G13035
  • Version of the code:
    2eb9c4f
  • Version of stack:
> stack --version
Version 2.3.1, Git revision de2a7b694f07de7e6cf17f8c92338c16286b2878 (8103 commits) x86_64 hpack-0.33.0

Profunctor combinators

These are lens-like combinators that could be used for manipulating arrows in funflow and porcupine. They are based on the code in the profunctor-optics library.

The definition of an Optic in this library is:

type Optic p s t a b = p a b -> p s t

Since these optics are functions, they can be composed with the (.) operator.

Lens

The definition of a Lens in the library is:

type Lens s t a b = forall p. Strong p => Optic p s t a b 

which means that the two functions that make up the Strong class can be considered as lenses:

first' :: forall p. Strong p => p a b -> p (a, c) (b, c)
first' :: Lens (a,c) (b,c) a b

second' :: forall p. Strong p => p a b -> p (c, a) (c, b) 
second' :: Lens (c,a) (c,b) a b

Other lenses can be constructed using the general lens function.
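
As a quick illustration (not from the issue itself): with the Strong instance for plain functions, first' already behaves as a lens focusing on the first component of a pair.

import Data.Profunctor (Strong (first'))

bumpFst :: (Int, String) -> (Int, String)
bumpFst = first' (+ 1)      -- bumpFst (3, "x") == (4, "x")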

Prism

The definition of a Prism in the library is:

type Prism s t a b = forall p. Choice p => Optic p s t a b

which means that the two functions that make up the Choice class can be considered as prisms:

left' :: forall p. Choice p => p a b -> p (Either a c) (Either b c)
left' :: Prism (Either a c) (Either b c) a b

right' :: forall p. Choice p => p a b -> p (Either c a) (Either c b) 
right' :: Prism (Either c a) (Either c b) a b

Other prisms can be constructed using the general prism function.
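
Similarly (illustration only): with the Choice instance for plain functions, left' acts on the Left side and leaves Right values untouched.

import Data.Profunctor (Choice (left'))

bumpLeft :: Either Int String -> Either Int String
bumpLeft = left' (+ 1)      -- bumpLeft (Left 3) == Left 4; bumpLeft (Right "x") == Right "x"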

Traversal

The definition of a Traversal in the library is different from what I would expect, for a good reason I imagine. But if we keep on the same track as with the other optics, we can take inspiration from the Traversing class and define for ourselves the following:

type Traversal s t a b = forall p. Traversing p => Optic p s t a b

which would mean that the base function from the Traversing class defines a traversal:

traverse' :: (Traversing p, Traversable f) => p a b -> p (f a) (f b)
traverse' :: Traversable f => Traversal (f a) (f b) a b

Other traversals can be constructed using the wander function, that belongs to the same class:

wander :: (forall f. Applicative f => (a -> f b) -> s -> f t) -> Traversal s t a b 
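
For illustration, with the Traversing instance for plain functions, traverse' behaves like fmap over any Traversable container.

import Data.Profunctor.Traversing (Traversing (traverse'))

bumpAll :: [Int] -> [Int]
bumpAll = traverse' (+ 1)   -- bumpAll [1, 2, 3] == [2, 3, 4]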

Uniformize the resource tree and the record of options types

(This is a more general and cleaner, but longer to implement, way to solve #6, so we should tackle one or the other.)

Currently, we have a discrepancy between the resource tree and the records of options.
We use a simple recursive-hashmap-based datatype for the resource tree, where some nodes contain a (possibly recursive) record of options which itself is statically typed through vinyl, but has to be existentially quantified to be stored in the resource tree. So we don't really benefit from static typing here. The only requirements we have for records of options are:

  • They should be convertible to/from JSON
  • Each field (node) should be able to hold arbitrary default values (as long as they are also convertible to/from json)
  • Each field (node) should be documented, so --help (and possible future self-documentation features) can work
  • Each field (node) should contain a "source" field (where it comes from, like Default, YAML, CLI, etc.), at least at parsing time, so we know exactly which fields have been updated from the CLI and therefore which fields we should override from the defaults and from the config

Basically all these requirements are addressed now by the fact that vinyl records are an HKD (i.e. every field can be wrapped in a functor or a composition of functors, and you can "unwrap" these functor layers one after another, or transform them into others).

But as we said, we don't need to statically encode the types of each field (as we won't benefit from that type safety anyway). Only the functors that wrap them would be enough.

So let's come back to the first requirement, convertibility to/from JSON. There's one type acting as the greatest common denominator of all types with that property: aeson's Value. So what if, instead of Value, we had an HKD version of it, which could directly exhibit all the properties we want without having to have a separate record type (and which would be convertible to/from plain old Value, to ensure interop)?

We'd therefore have this type:

data HKValue f
  = Object (f (HashMap Text (HKValue f)))
  | Array (f (Vector (HKValue f)))
  | String (f Text)
  | Number (f Scientific)
  | Bool (f Bool)
  | External (f ())

Note we replaced Null by another constructor (External, for the sake of having a name), which allows having just the functor layer without a real value in it. (If f == Identity, it's just isomorphic to Null.)
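
As a sanity check of that interop claim, here is a minimal sketch (not part of porcupine) of the conversion for f = Identity, given the HKValue definition above and assuming an aeson version where Object wraps a HashMap Text Value:

import Data.Functor.Identity (Identity (..))
import qualified Data.Aeson as A

-- With f = Identity, HKValue is essentially aeson's Value, External playing the role of Null.
toValue :: HKValue Identity -> A.Value
toValue v = case v of
  Object (Identity o) -> A.Object (fmap toValue o)
  Array  (Identity a) -> A.Array  (fmap toValue a)
  String (Identity s) -> A.String s
  Number (Identity n) -> A.Number n
  Bool   (Identity b) -> A.Bool b
  External _          -> A.Null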

And once we have it, we notice something: we can perfectly use that type to represent our resource tree itself. That would add another requirement to be addressed by another functorial layer in f, though:

  • Each node should be bindable to external resource(s) (files, etc)

This could allow transforming much of the specific code we have right now (mappings, creation of the resource trees, etc.) into calls to hmap-, sequence- or zip-like generic functions, if we adopt HKValue as a common framework throughout the code. This code could and should even be part of an external library, because it wouldn't be porcupine-specific.

Improve overriding of options

For now, the overriding of options goes like this:

Yaml file data section << Command line params << Physical files (both declared in locations and added with --loc)

This might not be intuitive. Let's say we have a record with a field called "fieldA", located under "/Settings" in the resource tree. If a settings.json file is mapped to /Settings and the user sets --fieldA stuff on the CLI, then this value (which would look like the definitive one) will be overridden by the value read from settings.json, even though intuitively the CLI options should override everything else.

Also if I say --loc /Settings+=settings.json on the command line, then I want whatever option read from settings.json to override all the rest, even if the final option is obtained from a file (and not directly from the CLI).

This means the exact source of each field should be tracked, and that the merge of the docrecords of options should be more intelligent.

docrecords-0.1.0.0 fails to build with aeson-2

src/Data/DocRecord.hs:322:67: error:
    • Couldn't match type: Data.Aeson.KeyMap.KeyMap Value
                     with: HM.HashMap T.Text Value
      Expected: HM.HashMap T.Text Value
        Actual: Object
    • In the second argument of ‘HM.lookup’, namely ‘o’
      In the second argument of ‘($)’, namely ‘HM.lookup p o’
      In the expression: jsonAtPath ps f $ HM.lookup p o
    |
322 |       Just (Object o) -> (o,        jsonAtPath ps f $ HM.lookup p o)
    |                                                                   ^

src/Data/DocRecord.hs:323:27: error:
    • Couldn't match type: HM.HashMap k0 v0
                     with: Data.Aeson.KeyMap.KeyMap Value
      Expected: Object
        Actual: HM.HashMap k0 v0
    • In the expression: HM.empty
      In the expression:
        (HM.empty, jsonAtPath ps f $ Just $ Object HM.empty)
      In a case alternative:
          _ -> (HM.empty, jsonAtPath ps f $ Just $ Object HM.empty)
    |
323 |       _               -> (HM.empty, jsonAtPath ps f $ Just $ Object HM.empty)
    |                           ^^^^^^^^

src/Data/DocRecord.hs:323:69: error:
    • Couldn't match type: HM.HashMap k1 v1
                     with: Data.Aeson.KeyMap.KeyMap Value
      Expected: Object
        Actual: HM.HashMap k1 v1
    • In the first argument of ‘Object’, namely ‘HM.empty’
      In the second argument of ‘($)’, namely ‘Object HM.empty’
      In the second argument of ‘($)’, namely ‘Just $ Object HM.empty’
    |
323 |       _               -> (HM.empty, jsonAtPath ps f $ Just $ Object HM.empty)
    |                                                                     ^^^^^^^^

src/Data/DocRecord.hs:324:40: error:
    • Couldn't match type: HM.HashMap T.Text v2
                     with: Data.Aeson.KeyMap.KeyMap Value
      Expected: Object
        Actual: HM.HashMap T.Text v2
    • In the second argument of ‘($)’, namely ‘HM.delete p obj’
      In the second argument of ‘($)’, namely ‘Object $ HM.delete p obj’
      In the expression: Just $ Object $ HM.delete p obj
    |
324 |     rebuild Nothing  = Just $ Object $ HM.delete p obj
    |                                        ^^^^^^^^^^^^^^^

src/Data/DocRecord.hs:324:52: error:
    • Couldn't match type: Data.Aeson.KeyMap.KeyMap Value
                     with: HM.HashMap T.Text v2
      Expected: HM.HashMap T.Text v2
        Actual: Object
    • In the second argument of ‘HM.delete’, namely ‘obj’
      In the second argument of ‘($)’, namely ‘HM.delete p obj’
      In the second argument of ‘($)’, namely ‘Object $ HM.delete p obj’
    |
324 |     rebuild Nothing  = Just $ Object $ HM.delete p obj
    |                                                    ^^^

src/Data/DocRecord.hs:325:40: error:
    • Couldn't match type: HM.HashMap T.Text Value
                     with: Data.Aeson.KeyMap.KeyMap Value
      Expected: Object
        Actual: HM.HashMap T.Text Value
    • In the second argument of ‘($)’, namely ‘HM.insert p v obj’
      In the second argument of ‘($)’, namely
        ‘Object $ HM.insert p v obj’
      In the expression: Just $ Object $ HM.insert p v obj
    |
325 |     rebuild (Just v) = Just $ Object $ HM.insert p v obj
    |                                        ^^^^^^^^^^^^^^^^^

src/Data/DocRecord.hs:325:54: error:
    • Couldn't match type: Data.Aeson.KeyMap.KeyMap Value
                     with: HM.HashMap T.Text Value
      Expected: HM.HashMap T.Text Value
        Actual: Object
    • In the third argument of ‘HM.insert’, namely ‘obj’
      In the second argument of ‘($)’, namely ‘HM.insert p v obj’
      In the second argument of ‘($)’, namely
        ‘Object $ HM.insert p v obj’
    |
325 |     rebuild (Just v) = Just $ Object $ HM.insert p v obj
    |                                                      ^^^
cabal: Failed to build docrecords-0.1.0.0.
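
A possible direction (a sketch only, not a fix that was merged upstream): aeson 2 changed Object to wrap a KeyMap instead of a HashMap Text, so the HM.lookup/HM.insert/HM.delete calls above need a conversion at the boundary, e.g. via a small CPP-guarded shim:

{-# LANGUAGE CPP #-}
import Data.Aeson (Value)
import qualified Data.HashMap.Strict as HM
import qualified Data.Text as T

#if MIN_VERSION_aeson(2,0,0)
import qualified Data.Aeson.KeyMap as KM

-- Convert between aeson-2's KeyMap and the HashMap the existing code expects.
objectToHashMap :: KM.KeyMap Value -> HM.HashMap T.Text Value
objectToHashMap = KM.toHashMapText

hashMapToObject :: HM.HashMap T.Text Value -> KM.KeyMap Value
hashMapToObject = KM.fromHashMapText
#else
objectToHashMap, hashMapToObject :: HM.HashMap T.Text Value -> HM.HashMap T.Text Value
objectToHashMap = id
hashMapToObject = id
#endif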

Github pages: 404

The GitHub Pages documentation linked in the README.md returns a 404.

(Also, it's been 4 years since the last update. Is this project still in active development? Maintenance mode?)

Improve radon example

As mentioned in #64, example-radon could be improved:

  • generate more descriptive visualisations of estimated parameters and uncertainty (e.g. with KDE plots, like in twiecki's blog posts)
  • add the actual hierarchical model (cc @MMesch ) to compare it to flat linear reg
  • add comments
  • expose other possible sampling options

Improve integration with Haskell's and other ecosystems

For now, porcupine only provides connectors for very simple data formats (CSV and JSON). I think that in order to improve integration with other tools we should aim at the following goals:

  • Providing one-function read of hmatrix data (via CSV)
  • Reading and writing frames (https://hackage.haskell.org/package/Frames)
  • Thinking about how Apache Arrow could be integrated in that ecosystem. Haskell support for Arrow is basically nonexistent, but, given that one of the goals of porcupine down the road is to provide easy external tasks (ideally, writing a Python/R script in place as a task), having a standard in-memory representation that could be passed between different runtimes would really enhance interoperability with the Python/R ecosystems.

Arrow notation plays badly with GADTs ("the impossible happened" on GHC 8.6.5)

This issue is due to a GHC bug, most likely something similar to https://gitlab.haskell.org/ghc/ghc/issues/16887.

It seems that on some occasions, if you try to pattern match inside arrow notation on a record (with FV for instance) you get:

ghc: panic! (the 'impossible' happened)
  (GHC version 8.6.5 for x86_64-unknown-linux):
	StgCmmEnv: variable not found
  $dShow_aiZk
  local binds for:
  $trModule
  $trModule3
  $trModule1
  ...

This is similar to what example0 does, except that in example0's case it doesn't cause any problem.

I'll try to cook up a minimal repro that doesn't use porcupine.

Support boundless streams of inputs/outputs

For now, VirtualFiles can either be unique or repeated (and indexed, in which case we read them as a stream), but we cannot really manipulate unbounded streams of data, where the concept of index has no meaning because you have no control over the order in which data arrives.

To summarize my thinking, I think external data can exist in three repetition modes:

  • Statically-indexed: the number of occurrences and their paths are known in advance. In porcupine you would handle that with a VirtualFile which would appear several times in your VirtualTree with a different virtual path (e.g. with ptaskInSubtree), or with layers if the data read from these files is a Semigroup.
  • Dynamically-indexed: the number of occurrences and their paths are known when the program executes (e.g. because we compute these paths from a list of indices obtained either from CLI options or from another file). In porcupine you would handle that either with layers (if the data is a Semigroup), which doesn't put constraints on these files' paths, or with repeated virtual files (loadDataStream, parMapTask or FoldA), which doesn't put any constraint on your data but puts one on the files' paths (which now have to be the same up to some index).
  • Unindexed: the number of occurrences or their indices cannot be known at all, possibly because they don't even have an index, e.g. if we read an unbounded stream of data (and therefore might need to generate an unbounded stream of outputs as a result). We need to read the data until we have no more, and we can't know in advance when that will be. Currently you cannot handle that case in porcupine.

For the communication specifics, we could rely on existing standards, like (no surprise)... Apache Arrow! See https://arrow.apache.org/docs/format/Flight.html (based on gRPC).
But ideally we'd like to support various backends (start an HTTP server to receive the stream, thrift/avro streams, etc). So possibly that'd mean adding a StreamAccessor next to LocationAccessor.

Can't put an empty path in a http location

In a pipeline with the http location accessor, setting

locations:
  /Foo: http://foo

fails with Variable(s) [] in '' haven't been given a value

Otoh,

locations:
  /Foo: http://foo/bar

works fine

Idea: work together on an example for hasktorch

Hi @YPares, @austinvhuang mentioned tweag and your name on the https://github.com/hasktorch slack channel recently, and I got curious and had a look at tweag's open source haskell libraries and promptly found porcupine. It appears to me that hasktorch and porcupine could be a good fit when it comes to building an end-to-end machine learning pipeline, starting from preprocessing raw inputs to tensors, via training or running a model, to postprocessing tensors into the desired output format.
In order to test this hypothesis, I'd like to propose a small collaboration in which we together build a small but complete example of such an application. An ideal candidate would be solving MNIST, see, for instance, https://github.com/goldsborough/examples/tree/cpp/cpp/mnist. We've already started working on a toy model implementation, see hasktorch/hasktorch#196, but it would be great if we could pair this with an example of a principled data pipeline.
Let us know if you are interested,
Cheers,
Torsten

`writeBSS` for http locations returns success when it shouldn't

writeBSS for an http location (from porcupine-http) will (at least in some cases) return successfully regardless of the answer of the server.

Steps to reproduce

  • Start a test server like python3 -m http.server
  • Run a pipeline which will try to write at http://localhost:8000/foo/bar

The server will reply with a 501 error code, but porcupine won't notice the failure and will keep going as if the write had succeeded.
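
A sketch of the kind of check that seems to be missing (an assumption: it supposes the accessor is built on http-client; the actual porcupine-http code may differ): inspect the response status and fail on non-2xx answers instead of ignoring it.

import Control.Monad (unless)
import Network.HTTP.Client (Response, responseStatus)
import Network.HTTP.Types.Status (statusIsSuccessful)

-- Fail loudly when the server did not accept the write.
checkWriteResponse :: Response body -> IO ()
checkWriteResponse resp =
  unless (statusIsSuccessful (responseStatus resp)) $
    fail ("HTTP write failed with status " ++ show (responseStatus resp))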
