
liquidata's People

Contributors

gonzaponte, jacg, jahernando, mmkekic

liquidata's Issues

Pick a name for the project

The name 'dataflow' comes up with lots of hits on PyPI. Before uploading to PyPI we need to come up with a new, distinguishing and not too ugly name for this package. Take inspiration from ideas such as

  • easy composition of small orthogonal functions
  • data flow / streams / pipes / networks / graphs
  • discouraging Physicists' typical 2000-line functions
  • improvement of signal-to-noise ratio and clarity of code
  • 'liquid' for historical IC reasons
  • DSL for visual programming
  • alternative way of implementing stuff that might be written in (nested) loops

The pipe parameter of push should accept implicit pipes

Currently we have to say

df.push(pipe = df.pipe(A, B, C), ...)

It should be possible to say

df.push(pipe = (A, B, C), ...)

instead.

In other words, the implicit pipe mechanism which already exists within fork and branch should also be usable here.
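A sketch of what the implicit-pipe normalization could look like. `ensure_pipe` is a hypothetical helper name, and the stand-in `pipe` below merely composes functions left to right; liquidata's real `pipe` is richer.

```python
def ensure_pipe(spec, pipe):
    """Return `spec` unchanged if it is already a pipe; treat a tuple
    as an implicit pipe and build one from its components."""
    if isinstance(spec, tuple):
        return pipe(*spec)
    return spec

# A stand-in `pipe` for demonstration only: compose functions left to right.
def pipe(*fns):
    def run(x):
        for f in fns:
            x = f(x)
        return x
    return run

A = lambda x: x + 1
B = lambda x: x * 2
assert ensure_pipe((A, B), pipe)(3) == 8        # tuple becomes an implicit pipe
explicit = pipe(A, B)
assert ensure_pipe(explicit, pipe) is explicit  # explicit pipe passes through
```

With such a helper, `push` could normalize its `pipe` argument exactly as `fork` and `branch` do.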

Address TODOs from original commits

  • sum
  • dispatch
  • merge
  • eliminate finally-boilerplate from RESULT (with contextlib.contextmanager?)
  • graph structure DSL (mostly done: pipe, fork, branch (dispatch))
  • network visualization

Alternative architectures

Why do we have the current architecture?

The current implementation uses coroutines to push data through the pipeline.

This was motivated by the context in which the ancestor of liquidata was written, where a single input stream was required to be split into multiple independent output streams fed into separate sinks.

Let's take a step back and ask some questions about this choice: Do we need to push? Do we need coroutines? Why? When? What are the consequences? What alternatives are there?

Function composition

Let's consider a very simple pipeline, consisting of a linear sequence of maps:

pipe(f, g, h)

This could be implemented using any of

  • coroutines
  • generators
  • asyncio
  • function composition

Function composition is the simplest, so why bother with the others?
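To make the point concrete, here is a minimal sketch of `pipe(f, g, h)` as plain function composition (the `compose` name is illustrative, not liquidata API):

```python
from functools import reduce

def compose(*fns):
    """Left-to-right function composition: compose(f, g, h)(x) == h(g(f(x)))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), fns, x)

f = lambda x: x + 1
g = lambda x: x * 10
h = str

assert compose(f, g, h)(2) == "30"   # h(g(f(2)))
```

For a purely linear sequence of 1-to-1 maps, nothing more is needed.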

Changing stream length

Let's throw in a filter or a join:

pipe(f, { p }, g, h)
pipe(f, join, g, h)

Function composition no longer cuts the mustard, because there is no longer a 1-to-1 correspondence between items in the input and output streams: something is needed to shrink or expand the stream.
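Generators handle this naturally, because a generator stage is free to emit fewer or more items than it receives. A sketch of `pipe(f, { p }, join, g)` in plain generator expressions (all names here are illustrative):

```python
f = lambda x: list(range(x))   # f expands each item into a list
p = lambda xs: len(xs) > 1     # { p }: keep only lists with more than 1 element
g = lambda x: x * 10

def run(source):
    stage = map(f, source)                    # map: 1-to-1
    stage = (xs for xs in stage if p(xs))     # filter: the stream shrinks
    stage = (x for xs in stage for x in xs)   # join: the stream expands
    return [g(x) for x in stage]

assert run([1, 2, 3]) == [0, 10, 0, 10, 20]
```

The filter drops `[0]` and the join flattens the surviving lists, so the output length bears no fixed relation to the input length.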

Stream bifurcation

A different complication, branches:

pipe(f, [x, y, z], g, h)

It's difficult to see how function composition and generators could deal with this.
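Push-style coroutines, by contrast, make branching trivial: an upstream stage can simply send each item to several downstream targets. A minimal sketch (the `sink` and `broadcast` names are illustrative, not liquidata's):

```python
def sink(store):
    """A primed coroutine that appends everything it receives to `store`."""
    def coro():
        while True:
            store.append((yield))
    c = coro()
    next(c)   # prime the coroutine so it is ready to receive
    return c

def broadcast(*targets):
    """A primed coroutine that forwards each item to every target."""
    def coro():
        while True:
            item = yield
            for t in targets:
                t.send(item)
    c = coro()
    next(c)
    return c

left, right = [], []
pipeline = broadcast(sink(left), sink(right))
for item in (1, 2, 3):
    pipeline.send(item)
assert left == [1, 2, 3] and right == [1, 2, 3]
```

This is precisely why the push architecture was a natural fit for the single-source, multiple-sink case.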

Joining streams

That last pipe describes a graph that looks something like this:

             x -- y -- z
           /
source -- f -- g -- h

How about

sourceA -- a 
            \
             g --- h
            /
sourceB -- b

Generators can deal with this easily, with the help of itertools.chain to merge the two streams:

map(h, map(g, chain(map(a, sourceA),
                    map(b, sourceB))))

but it's not obvious how this would work for function composition or coroutines.
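A runnable version of that generator approach, with toy sources and functions standing in for `sourceA`, `sourceB`, `a`, `b`, `g` and `h`:

```python
from itertools import chain

sourceA = [1, 2]
sourceB = [10, 20]
a = lambda x: x + 1
b = lambda x: x - 1
g = lambda x: x * 2
h = str

# Each source is mapped independently, the two streams are merged with
# chain, and the merged stream then flows through g and h.
out = list(map(h, map(g, chain(map(a, sourceA),
                               map(b, sourceB)))))
assert out == ['4', '6', '18', '38']
```

Note that chain concatenates the streams rather than interleaving them; any other merge discipline would need its own merging generator.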

liquidata syntax for multiple sources

What would the liquidata syntax for this look like?

Ideally we'd have another kind of bracket (we've exhausted the possibilities that Python offers: (), [], {}). Let's imagine that <> are valid syntactic brackets:

pipe(b, <sourceA, a>, g, h)  # join two sources
pipe(b, [a, out.A],   g, h)  # branch out into two sinks

Working with syntax available in Python, how about:

pipe(b, source(sourceA, a), g, h)

Recall that the following are already synonymous in liquidata

pipe(f)(data)
pipe(source << data, f)
pipe(data >> source, f)
pipe(source(data), f)

so the following could work

pipe(source << sourceB, b, (source << sourceA , a), g, h)

liquidata's earlier prototype had an input-variable syntax (called slots). If something like it were resurrected, we could write something along the lines of

fn = pipe(source << slot.B, b, (source << slot.A, a), g, h)
fn(A=sourceA, B=sourceB)

Synchronization

In liquidata []-branches are called independent, because there is absolutely no synchronization between them (in contrast with named branches which are based on namespaces flowing through the pipe and managed with name, get and put). Once the stream has split, the branches can change the length of the stream without the others knowing or caring about it.

We would need to think about the synchronization consequences for multiple independent input streams. I guess that the answer is: it's up to the user to make sure something sensible happens.

close_all

Consider the following in current liquidata

pipe(source << itertools.count(), take(3))
pipe(source << itertools.count(), take(3, close_all=True))

Because of the push architecture, the first never returns. In a pull architecture the issue doesn't arise.
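To see why, here is a pull-style `take` sketched with itertools.islice: the consumer simply stops pulling after n items, so an infinite source poses no problem and no close_all flag is needed.

```python
from itertools import count, islice

def take(n, source):
    """Pull at most n items from `source` and return them as a list."""
    return list(islice(source, n))

assert take(3, count()) == [0, 1, 2]   # terminates despite the infinite source
```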

So what?

I would like to remove the universal reliance on coroutines, with two main goals

  • Enabling things that were impossible before.

  • Simplifying portions of the code which don't need coroutines.

Name clashes in output namespace

Currently, name clashes in the output namespace are resolved by creating a tuple of all the results that clash. This is not The Right Thing. Think about what should be done instead.

#18 might be part of the solution, or it might not.

Improve error messages

It is a major goal of the project to issue excellent error messages.

No work has been done on this yet.

The way forward is probably to split the compilation of the network into coroutines into distinct phases, including

  1. type/error checking
  2. coroutine generation

Some optimization phases should be added eventually too.

This will tie in with other work on alternative architectures (push vs pull, pure composition, etc.).

Hierarchical outputs from branches

Currently we have

pipe([out.Y], out.X) -> Namespace(Y=<branch output>, X=<main output>)

which is great, but then

pipe([out.X], out.X) -> Namespace(X=(<branch output>, <main output>))

which is not so good.

You might say "so don't use the same output name twice in the same network!" but that's easier said than done:

abstraction = [out.X]
pipe(abstraction, out.X)

Might these issues be mitigated by enabling hierarchical naming for branches?

Maybe something like

pipe(out.B([out.X]), out.X) -> Namespace(X=<main out>, B=Namespace(X=<branch out>))

This might form part of a solution to #17.
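Built by hand with types.SimpleNamespace (standing in for whatever namespace type liquidata uses), the proposed hierarchical result would look like this, and the two uses of X no longer clash:

```python
from types import SimpleNamespace as Namespace

# The branch's outputs live under their own sub-namespace B.
result = Namespace(X='<main out>', B=Namespace(X='<branch out>'))
assert result.X == '<main out>'
assert result.B.X == '<branch out>'
```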

Improve test suite

There are a bunch of TODOs left over from the first version of the code relating to the test suite; they shall be addressed in upcoming PRs

  • Test implicit pipes in fork
    fork accepts pipes as arguments, but tuples can be interpreted as pipes and converted implicitly. This feature is not tested.
  • Test count_filter
    This component is not tested.
  • Test spy_count
    This component is not tested.
  • (Test polymorphic result of pipe)
    Not sure this is still relevant. Are we allowing sinkless pipes?
  • Add test for failure to close sideways in branch
    Branches produce two downstreams in the data flow and both must be closed when data stops flowing. A test should ensure this behavior.
  • Test string_to_pick and its usage
    If the first element in a pipe is a string the data flowing through that pipe is reduced to a specific item. This functionality needs to be tested.
  • Add test for slice with close_all = True
    slice takes an optional argument close_all to decide whether to stop the full pipeline when it runs out of values. The functionality for close_all = True is not tested.

Reorganize out and into

Current situation

Currently out can be used in the following ways

  • out (shorthand for out(into(list)))
  • out(<binary-fold-function>)
  • out(<binary-fold-function>, <initial-value>)
  • out(into(<consumer-of-iterables>))
  • All of the above with .<name> appended to out.
  • In the absence of out at the end of a pipe, it defaults to out(into(list))

Proposal 1: extract into from out

the situation becomes

  • out (shorthand for into(list))
  • out(<binary-fold-function>)
  • out(<binary-fold-function>, <initial-value>)
  • into(<consumer-of-iterables>)
  • All of the above with .<name> appended to out or into.
  • In the absence of out or into at the end of a pipe, it defaults to into(list)

Pros

  • This removes the overhead of having to wrap every use of into in an out, by shortening out(into(thing)) -> into(thing).

Cons

  • Previously, every return value (except implicit ones at the end, or ones hidden inside abstractions) was marked with an out. Now they would be marked by either out or into.

  • More generally, the universal nature of out is lost. If we ever invent more outputs (e.g. #18) then the lack of a universal out might require more ad-hoc implementation, and avoiding this is a core design principle of liquidata.

Proposal 2: juggle the names

                    fold   consume iterable
Current situation   out    out(into ...)
Proposal 1          out    into
Option A            fold   into
Option B            fold   out

Improve error messages

The error messages that appear when something goes wrong in a pipeline are often difficult to understand.

We should try to detect as many problems as possible at the time graphs are constructed, rather than waiting for things to go pear-shaped when data are sent through the pipe.

Let's collect examples of unhelpful messages in the comments here.

Constant-space into

The current implementation of into is a temporary hack; consequently it is O(N) rather than O(1) in space.

This needs fixing pretty urgently. It is complicated by the push architecture. In a pull architecture it would be trivial.
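To illustrate why it is trivial in a pull architecture: into(consumer) can hand the consumer a lazy iterator directly, so no intermediate list is ever materialized and space is O(1) in the stream length (assuming the consumer itself streams). A sketch, with an illustrative `into`:

```python
def into(consumer):
    """Wrap a consumer-of-iterables as a sink for a lazy stream."""
    def sink(stream):
        return consumer(stream)   # the consumer pulls items one at a time
    return sink

# A million items flow through without ever being collected into a list.
assert into(sum)(iter(range(1_000_000))) == 499999500000
```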

Change interface for selection of items in dict-streams: pipe decorators

At present, the signature of map is

def map(op=None, *, args=None, out=None, item=None):

The parameters args, out and item are there for the purpose of working with streams which contain dictionaries, in order to select which arguments are passed to the operation, and what to do with its output. map is not the only utility which has such parameters, and it is likely that new utilities might need similar parameters in the future.

Currently, such parameters are interpreted with repetitive, and sometimes convoluted, logic in each utility itself. Consequently, the same problem is solved repeatedly, and needs to be tested repeatedly for each component that supports item selection.

This is horribly wrong! We should provide utilities whose one and only job is to provide item selection or item setting capability to any pipe, and ensure that they can be applied to utilities such as map (which should be blissfully ignorant of the item-fiddling) in order to achieve the desired effect. In other words, we should apply the philosophy of writing small, testable, reusable, composable units, to this particular feature!

What would it look like?

How about something along the following lines?

def xxx(pipe, in_=None, out=None, on=None, args=False, kwds=False):
  • in_

    • None: xxx forwards incoming data to pipe
    • a single string: xxx picks one item off the dict in the stream and forwards it as sole input to pipe
    • a tuple of strings: xxx picks those items out of the dict and forwards them to pipe
  • out

    • None: send the output of pipe downstream
    • a single string: attach the output of pipe on to the dictionary that xxx received
    • a tuple of strings: this must match the length of the iterable returned by pipe; attach the returned values to the dict that xxx received, under the names in the corresponding positions
  • on=yyy: shorthand for in_=yyy, out=yyy

  • args: make pipe apply *args expansion to whatever it receives

  • kwds: make pipe apply **kwds expansion to whatever it receives

Alternatively, perhaps these can be implemented as 5 separate utilities.

In any case, this idea of 'pipe decorators' will lead to much more composable and reusable code.
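A simplified sketch of such a decorator, covering only the single-string forms of in_ and out on streams of dicts; the `select` name and the wrapped plain function `fn` are illustrative stand-ins for the proposal's `xxx` and `pipe`:

```python
def select(fn, in_=None, out=None, on=None):
    """Pick item `in_` off each incoming dict, apply fn, and (if `out`
    is given) attach the result back onto the dict under that name."""
    if on is not None:
        in_ = out = on          # on=yyy is shorthand for in_=yyy, out=yyy
    def wrapped(d):
        result = fn(d[in_] if in_ is not None else d)
        if out is None:
            return result       # send the bare result downstream
        return {**d, out: result}
    return wrapped

double = lambda x: x * 2
step = select(double, on='x')
assert step({'x': 3, 'y': 9}) == {'x': 6, 'y': 9}
```

Note that `double` itself remains blissfully ignorant of the dict plumbing, which is exactly the point.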

Context managers

fn = pipe(open, join, sink(print))

is the equivalent of

def fn(names):
    for name in names:
        for line in open(name):
            print(line)

But how about

def fn(names):
    for name in names:
        with open(name) as file:   # this line was added
            for line in file:
                print(line)

A with-equivalent is needed.

Perhaps, more importantly than side-effects, we need to consider things like

fn = pipe(<with-syntax> open, join, process, { filter }, out(into(collector)))
fn = pipe(<with-syntax> (open, join, process, { filter }), more, stuff, out(into(collector)))

Note that with needs to be aware of where the subpipe to which it applies ends.
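One possible shape for a with-aware stage, sketched as a generator that opens each item on entry and closes it as soon as its sub-stream has been consumed. The `managed` name is illustrative, not liquidata API, and `fake_open` stands in for a real context-manager factory such as open:

```python
import io

def managed(open_cm):
    """Wrap a context-manager factory as a pipeline stage: each resource
    is opened on arrival and closed when iteration moves past it."""
    def stage(names):
        for name in names:
            with open_cm(name) as resource:
                yield resource   # closed as soon as the consumer moves on
    return stage

# A fake `open` so the example needs no files on disk.
fake_open = lambda name: io.StringIO(f"contents of {name}\n")

lines = []
for file in managed(fake_open)(['a.txt', 'b.txt']):
    for line in file:
        lines.append(line.strip())
assert lines == ['contents of a.txt', 'contents of b.txt']
```

Because the with block lives inside the generator, the resource is released exactly when the consumer advances to the next item, which is the behaviour the hand-written loop above achieves.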
