jacg / liquidata

EDSL for data pipelines in Python
Home Page: https://jacg.github.io/liquidata/
License: MIT License
The name 'dataflow' comes up with lots of hits on PyPI. Before uploading to PyPI we need to come up with a new, distinctive and not too ugly name for this package. Take inspiration from ideas such as
This is just a reminder.
We need to specify all the dependencies for each way.
Currently we have to say
df.push(pipe = df.pipe(A, B, C), ...)
It should be possible to say
df.push(pipe = (A, B, C), ...)
instead.
In other words, the implicit pipe mechanism which already exists within `fork` and `branch` should also be usable here.
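The normalisation rule could be sketched as follows. The helper names (`make_pipe`, `as_pipe`) are hypothetical, not liquidata internals: anything that is a tuple gets promoted to a pipe, anything else is assumed to be a pipe already.

```python
# Hypothetical sketch of the implicit-conversion rule, not liquidata's code.
# A "pipe" here is just a function applying its stages left to right.
def make_pipe(*stages):
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

def as_pipe(spec):
    # Accept either a ready-made pipe (a callable) or a tuple of stages,
    # mirroring how fork and branch already promote tuples to pipes.
    if isinstance(spec, tuple):
        return make_pipe(*spec)
    return spec

double = lambda x: x * 2
inc    = lambda x: x + 1

assert as_pipe((double, inc))(3) == 7           # tuple promoted to a pipe
assert as_pipe(make_pipe(double, inc))(3) == 7  # explicit pipe passed through
```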
The current implementation uses coroutines to push data through the pipeline.
This was motivated by the context in which the ancestor of liquidata
was written, where a single input stream was required to be split into multiple independent output streams fed into separate sinks.
Let's take a step back and ask some questions about this choice: Do we need to push? Do we need coroutines? Why? When? What are the consequences? What alternatives are there?
Let's consider a very simple pipeline, consisting of a linear sequence of maps:
pipe(f, g, h)
This could be implemented using any of function composition, generators or coroutines.
Function composition is the simplest, so why bother with the others?
Let's throw in a filter or a join:
pipe(f, { p }, g, h)
pipe(f, join, g, h)
Function composition no longer cuts the mustard, because there is no longer a 1-to-1 correspondence between items in the input and output streams: something is needed to shrink or expand the stream.
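In generator style, a filter and a join are straightforward because each stage transforms the whole stream rather than single items. A sketch, with hypothetical helper names `keep` and `join`:

```python
# Generator stages can change the stream's length, which plain composition
# of per-item functions cannot:
def keep(pred, stream):            # filter: may shrink the stream
    return (x for x in stream if pred(x))

def join(stream):                  # flatten: may expand the stream
    return (x for inner in stream for x in inner)

evens = list(keep(lambda n: n % 2 == 0, range(6)))
flat  = list(join([[1, 2], [], [3]]))

assert evens == [0, 2, 4]
assert flat == [1, 2, 3]
```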
A different complication, branches:
pipe(f, [x, y, z], g, h)
It's difficult to see how function composition and generators could deal with this.
That last pipe describes a graph that looks something like this:
             x -- y -- z
            /
source -- f -- g -- h
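This is where the push/coroutine approach shines: a stage can forward each incoming item to several consumers. A minimal sketch (not liquidata's code):

```python
def co_sink(store):
    # terminal coroutine: accumulate received items in a list
    def gen():
        while True:
            store.append((yield))
    c = gen()
    next(c)                    # prime the coroutine
    return c

def broadcast(*targets):
    # push each incoming item to every branch
    def gen():
        while True:
            x = yield
            for t in targets:
                t.send(x)
    c = gen()
    next(c)
    return c

main, side = [], []
head = broadcast(co_sink(side), co_sink(main))
for item in (1, 2, 3):
    head.send(item)

assert main == [1, 2, 3] and side == [1, 2, 3]   # the stream was split
```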
How about
sourceA -- a
            \
             g -- h
            /
sourceB -- b
Generators can deal with this easily, merging the sources with something like itertools.chain:

map(h, map(g, chain(map(a, sourceA),
                    map(b, sourceB))))
but it's not obvious how this would work for function composition or coroutines.
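One runnable realisation of the generator version merges the two mapped sources with `itertools.chain` (which exhausts `sourceA` before `sourceB`; interleaving would need a different merge). The stage functions here are placeholders:

```python
from itertools import chain

sourceA = [1, 2]
sourceB = [10, 20]
a = lambda x: x + 1          # placeholder stages
b = lambda x: x - 1
g = lambda x: x * 2
h = str

merged = map(h, map(g, chain(map(a, sourceA),
                             map(b, sourceB))))
assert list(merged) == ['4', '6', '18', '38']
```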
liquidata syntax for multiple sources

What would the liquidata syntax for this look like?
Ideally we'd have another kind of bracket, but we've exhausted the possibilities that Python offers: `()`, `[]`, `{}`. Let's imagine that `<>` were valid syntactic brackets:
pipe(b, <sourceA, a>, g, h) # join two sources
pipe(b, [a, out.A], g, h) # branch out into two sinks
Working with syntax available in Python, how about:
pipe(b, source(sourceA, a), g, h)
Recall that the following are already synonymous in liquidata
pipe(f)(data)
pipe(source << data, f)
pipe(data >> source, f)
pipe(source(data), f)
so the following could work
pipe(source << sourceB, b, (source << sourceA , a), g, h)
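The synonymy of these spellings comes down to operator overloading. A toy illustration (this is not liquidata's implementation; it only shows why all three operator forms can produce the same object):

```python
class Source:
    # Toy stand-in: all spellings wrap the same data.
    def __init__(self, data=None):
        self.data = data
    def __call__(self, data):          # source(data)
        return Source(data)
    def __lshift__(self, data):        # source << data
        return Source(data)
    def __rrshift__(self, data):       # data >> source (lists lack __rshift__)
        return Source(data)

source = Source()
d = [1, 2, 3]
assert (source << d).data == source(d).data == (d >> source).data == d
```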
liquidata used to have an input-variable syntax (called slots) in an earlier prototype. If something like it were resurrected, we could write something along the lines of
fn = pipe(source << slot.B, b, (source << slot.A, a), g, h)
fn(A=sourceA, B=sourceB)
In liquidata, `[]`-branches are called independent, because there is absolutely no synchronization between them (in contrast with named branches, which are based on namespaces flowing through the pipe and managed with `name`, `get` and `put`). Once the stream has split, the branches can change the length of the stream without the others knowing or caring about it.
We would need to think about the synchronization consequences for multiple independent input streams. I guess that the answer is: it's up to the user to make sure something sensible happens.
`close_all`

Consider the following in current liquidata:
pipe(source << itertools.count(), take(3))
pipe(source << itertools.count(), take(3, close_all=True))
Because of the push architecture, the first never returns. In a pull architecture the issue doesn't arise.
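For comparison, the pull-style equivalent in plain Python terminates immediately, because the consumer decides how much to draw from the infinite source:

```python
from itertools import count, islice

# Pull: the consumer draws only what it needs from the infinite source,
# so the expression terminates without any close_all machinery.
first_three = list(islice(count(), 3))
assert first_three == [0, 1, 2]
```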
I would like to remove the universal reliance on coroutines, with two main goals:

- Enabling things that were impossible before.
- Simplifying portions of the code which don't need coroutines.
Currently, name clashes in the output namespace are resolved by creating a tuple of all the results that clash. This is not The Right Thing. Think about what should be done instead.
#18 might be part of the solution, or it might not.
It is a major goal of the project to issue excellent error messages.
No work has been done on this yet.
The way forward is probably to split the compilation of the network into coroutines into distinct phases, including
Some optimization phases should be added eventually too.
This will tie in with other work on alternative architectures (push vs pull, pure composition, etc.).
Currently we have
pipe([out.Y], out.X) -> Namespace(Y=<branch output>, X=<main output>)
which is great, but then
pipe([out.X], out.X) -> Namespace(X=(<branch output>, <main output>))
which is not so good.
You might say "so don't use the same output name twice in the same network!" but that's easier said than done:
abstraction = [out.X]
pipe(abstraction, out.X)
Might these issues be mitigated by enabling hierarchical naming for branches?
Maybe something like
pipe(out.B([out.X]), out.X) -> Namespace(X=<main out>, B=Namespace(X=<branch out>))
This might form part of a solution to #17.
There are a bunch of TODOs left over from the first version of the code, related to the testing suite, that shall be addressed in upcoming PRs:

- `fork`: `fork` accepts pipes as arguments, but tuples can be interpreted as pipes and converted implicitly. This feature is not tested.
- `count_filter`
- `spy_count`
- `pipe`
- `string_to_pick` and its usage
- `slice` with `close_all = True`: `slice` takes an optional argument `close_all` to decide whether to stop the full pipeline when it runs out of values. The functionality for `close_all = True` is not tested.

Currently `out` can be used in the following ways:
- `out` (shorthand for `out(into(list))`)
- `out(<binary-fold-function>)`
- `out(<binary-fold-function>, <initial-value>)`
- `out(into(<consumer-of-iterables>))`
- `.<name>` appended to `out`
- with no explicit `out` at the end of a pipe, it defaults to `out(into(list))`
If we separate `into` from `out`, the situation becomes:

- `out` (shorthand for `into(list)`)
- `out(<binary-fold-function>)`
- `out(<binary-fold-function>, <initial-value>)`
- `into(<consumer-of-iterables>)`
- `.<name>` appended to `out` or `into`
- with no explicit `out` or `into` at the end of a pipe, it defaults to `into(list)`
This removes the need to wrap `into` in an `out`, by shortening `out(into(thing))` -> `into(thing)`. Previously, all return values (except implicit ones at the end, or ones hidden inside abstractions) were marked with an `out`. Now they would be marked by `out` or `into`.

More generally, the universal nature of `out` is lost. If we ever invent more outputs (e.g. #18) then the lack of a universal `out` might require more ad-hoc implementation, and avoiding this is a core design principle of liquidata.
| | fold | consume iterable |
|---|---|---|
| Current situation | `out` | `out(into ...)` |
| Proposal 1 | `out` | `into` |
| Option A | `fold` | `into` |
| Option B | `fold` | `out` |
The error messages that appear when something goes wrong in a pipeline are often difficult to understand.
We should try to detect as many problems as possible at the time graphs are constructed, rather than waiting for things to go pear-shaped when data are sent through the pipe.
Let's collect examples of unhelpful messages in the comments here.
Currently, the order is `{ predicate : key }`. Would it make more sense to have `{ key : predicate }`?
The current implementation of `into` is a temporary hack; consequently it is O(N) rather than O(1) in space.
This needs fixing pretty urgently. It is complicated by the push architecture. In a pull architecture it would be trivial.
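A sketch of why pull would make it trivial: `into(consumer)` could hand the consumer the lazy stream directly, so no intermediate buffer is accumulated by `into` itself (the names here are invented for illustration):

```python
# Sketch: in a pull architecture, into(consumer) passes the stream straight
# to the consumer (e.g. sum, min, set), so space usage is that of the
# consumer, not an O(N) buffer built by into.
def into(consumer):
    def run(stream):
        return consumer(stream)
    return run

stream = (x * x for x in range(1000))    # lazy: items produced on demand
assert into(sum)(stream) == 332833500    # sum of squares 0..999
```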
At present, the signature of map
is
def map(op=None, *, args=None, out=None, item=None):
The parameters `args`, `out` and `item` are there for the purpose of working with streams which contain dictionaries, in order to select which arguments are passed to the operation, and what to do with its output. `map` is not the only utility which has such parameters, and it is likely that new utilities might need similar parameters in the future.
Currently, such parameters are interpreted, usually with repetitive, and sometimes convoluted, logic in each such utility itself. Consequently, the same problem is solved repeatedly, and needs to be tested repeatedly for each component that supports item selection.
This is horribly wrong! We should provide utilities whose one and only job is to provide item selection or item setting capability to any pipe, and ensure that they can be applied to utilities such as `map` (which should be blissfully ignorant of the item-fiddling) in order to achieve the desired effect. In other words, we should apply the philosophy of writing small, testable, reusable, composable units, to this particular feature!
What would it look like?
How about something along the following lines?
def xxx(pipe, in_=None, out=None, on=None, args=False, kwds=False):
`in_`:

- `None`: `xxx` forwards incoming data to `pipe` unchanged
- a single name: `xxx` picks one item off the dict in the stream and forwards it as sole input to `pipe`
- multiple names: `xxx` picks those items out of the dict and forwards them to `pipe`

`out`:

- `None`: send the output of `pipe` downstream
- a single name: attach the output of `pipe`, under that name, on to the dictionary that `xxx` received
- multiple names: `pipe` should return one value per name; attach the returned values to the dict that `xxx` received, under the names in the corresponding positions

`on=yyy`: shorthand for `in_=yyy, out=yyy`

`args`: make `pipe` apply `*args` expansion to whatever it receives

`kwds`: make `pipe` apply `**kwds` expansion to whatever it receives
Alternatively, perhaps these can be implemented as 5 separate utilities.
In any case, this idea of 'pipe decorators' will lead to much more composable and reusable code.
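As a rough illustration of the decorator idea for plain functions over dict streams (the implementation is invented; only the parameter names come from the proposal above):

```python
def xxx(fn, in_=None, out=None):
    # Hypothetical item-selection wrapper: fn stays blissfully ignorant
    # of the dicts flowing through the stream.
    def stage(item):
        if in_ is None:
            result = fn(item)                      # forward item unchanged
        elif isinstance(in_, str):
            result = fn(item[in_])                 # pick one entry
        else:
            result = fn(*(item[k] for k in in_))   # pick several entries
        if out is None:
            return result                          # send result downstream
        new = dict(item)
        if isinstance(out, str):
            new[out] = result                      # attach under one name
        else:
            for k, v in zip(out, result):          # attach positionally
                new[k] = v
        return new
    return stage

add = xxx(lambda a, b: a + b, in_=('a', 'b'), out='sum')
assert add({'a': 2, 'b': 3}) == {'a': 2, 'b': 3, 'sum': 5}
```

Because the wrapping is a separate, composable unit, it can be tested once and applied to `map` or any future utility, rather than re-implemented in each.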
fn = pipe(open, join, sink(print))
is the equivalent of
def fn(names):
    for name in names:
        for line in open(name):
            print(line)
But how about
def fn(names):
    for name in names:
        with open(name) as file:  # this line was added
            for line in file:
                print(line)
A `with`-equivalent is needed.
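For reference, pull-style generators express this naturally: each file is closed as soon as it is exhausted, without any extra machinery. A sketch:

```python
import os
import tempfile

def read_lines(names):
    # pull-style join over files; each file is closed when exhausted,
    # thanks to the with statement inside the generator
    for name in names:
        with open(name) as file:
            yield from file

# demo on a throwaway file
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'w') as f:
    f.write("one\ntwo\n")
assert list(read_lines([path])) == ["one\n", "two\n"]
os.remove(path)
```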
Perhaps, more importantly than side-effects, we need to consider things like
fn = pipe(<with-syntax> open, join, process, { filter }, out(into(collector)))
fn = pipe(<with-syntax> (open, join, process, { filter }), more, stuff, out(into(collector)))
Note that `with` needs to be aware of where the subpipe to which it applies ends.