brewery's Introduction

Hi there 👋

Open-Source Projects

Current

  • Open Poiesis – collection of tools and libraries for developing applications for systems thinking, dynamical systems modelling and simulation.

Past

  • Cubes – lightweight Python OLAP library and server (not maintained)
  • Expressions – arithmetic expression parsing and conversion (not maintained)
  • Bubbles – Python ETL library (not maintained)
  • StepTalk – Smalltalk scripting for Objective-C and GNUstep

Experiments

  • Sepro18 – experimental toolkit for biochemistry-inspired graph-based simulation


brewery's Issues

Add context management to data streams

It should be possible to do:

with CSVSourceStream("foo.csv") as stream:
    # ... do something useful

initialize() and finalize() would then happen automatically. Fields should be passed at object initialization.
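
A minimal sketch of what the context-manager protocol could look like on a stream base class, reusing the initialize()/finalize() names from above (the class itself is hypothetical):

class DataStream(object):
    """Sketch of a context-managed stream; method names follow the issue text."""

    def initialize(self):
        pass  # subclasses open files, connections, ...

    def finalize(self):
        pass  # subclasses release resources

    def __enter__(self):
        self.initialize()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.finalize()
        return False  # propagate any exception raised inside the with-block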

Class name "Stream" causes confusion, should be distinctive

Problem: there are two kinds of "streams" by name: data source/target streams, and a stream as a network of nodes. This causes confusion.

The Stream class should be renamed to Flow or something with a similar meaning but a distinctive name.

Considered a usability bug.

Data source/target json schema definition

Create a JSON schema definition for describing a data source/target. Something like:

{
    "type": "csv",
    "url": "http://some.domain.com/file.csv",
    "format": "excel"
}

Also create a JSON schema for a collection of such data sources/targets.
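
A rough sketch of such a schema written as a Python dict and validated with the third-party jsonschema package; everything beyond the three keys shown above is an assumption:

import jsonschema  # third-party: pip install jsonschema

# Only "type" is marked required here, which is an assumption.
DATA_STORE_SCHEMA = {
    "type": "object",
    "properties": {
        "type":   {"type": "string"},
        "url":    {"type": "string"},
        "format": {"type": "string"},
    },
    "required": ["type"],
}

source = {"type": "csv", "url": "http://some.domain.com/file.csv", "format": "excel"}
jsonschema.validate(instance=source, schema=DATA_STORE_SCHEMA)  # raises on mismatch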

Do not allow python-only nodes and attributes in brewery runner

Current state: users can use any node provided by brewery in their runner. This might cause crashes and otherwise undesired, unpleasant behaviour of the tool.

Solution: add metadata to each node's node_info stating whether the node can be used in the runner/outside of the Python environment. Do the same with node arguments.
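
One possible shape for such metadata; the runner_safe key and the helper below are assumptions, not existing brewery API:

# Hypothetical node_info entry with a flag stating runner usability.
node_info = {
    "label": "derive field",
    "description": "compute a new field using a Python callable",
    "runner_safe": False,  # needs Python callables, unusable outside Python
}

def assert_runner_usable(node):
    info = getattr(node, "node_info", {})
    if not info.get("runner_safe", True):
        raise ValueError("node %r cannot be used in the runner"
                         % info.get("label", type(node).__name__))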

Namespace should be more flat

Problem: rethink the module structure; it might be confusing. Currently, part of the module namespace exists purely for taxonomy reasons.

Requirement: importing just brewery should give the user everything except backend-related functionality.

That does not mean the modules will be removed, only that users will NOT be encouraged to use them directly. Rules:

  • no example should show use of sub-modules, except when backends are involved
  • the documentation should contain a reference for the 'un-moduled' objects
  • the documentation should note that a user who has a reason to import only a single Brewery module may still do so

Considered a usability change.

Users should not have to ask: "What package is this class/function in?"
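
A common way to achieve this is re-exporting the public names from the package's __init__.py; a sketch (the exact symbols and module names are illustrative):

# brewery/__init__.py -- illustrative sketch, not the actual module layout
from brewery.metadata import Field, FieldList
from brewery.streams import Stream
# backend-related modules (e.g. brewery.ds) are deliberately not re-exported

__all__ = ["Field", "FieldList", "Stream"]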

Add node identifiers to node reference

The node reference documentation should list node identifiers rather than class names. Node identifiers are used in "HOM"-based stream construction and in dictionary-based stream construction, both of which are more common than class-based node creation.

XLRD Issue?

I have some code based on the brewery example "merge 2 XLS files". It works on Mac OS in Python 2.7.2. When I move it to a fresh Windows 7 install (Python 2.7.3; brewery and python-excel installed via pip today), it fails with:

Traceback (most recent call last):
  File "C:\Users\kseefried\workspace\Preparis User Data\src\merge-ng.py", line 46, in <module>
    src.initialize()
  File "C:\Python27\lib\site-packages\brewery-0.8.0-py2.7.egg\brewery\ds\xls_streams.py", line 47, in initialize
    encoding_override=self.encoding)
  File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 443, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Python27\lib\site-packages\xlrd\book.py", line 94, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "C:\Python27\lib\site-packages\xlrd\book.py", line 1264, in getbof
    bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
  File "C:\Python27\lib\site-packages\xlrd\book.py", line 1258, in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xd0\xcf\x11\xe0\xa1\xb1'

I have:

  • Verified that the files are in fact valid XLS files.
  • Saved them as Excel 95 files.
  • Created new spreadsheets outside of Excel using xlrd and cut-and-pasted the data into the new spreadsheets.

I'm at a loss as to how to move forward.
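
For what it is worth, the bytes in the error message match the start of the OLE2 compound-document signature (D0 CF 11 E0 A1 B1 1A E1) that genuine binary .xls files begin with, so inspecting the raw header of the file on the Windows machine may help narrow things down:

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # compound-document signature

with open("file.xls", "rb") as f:  # binary mode matters on Windows
    header = f.read(8)

if header == OLE2_MAGIC:
    print("looks like a genuine binary .xls (OLE2) file")
elif header.startswith(b"PK"):
    print("a zip archive - probably an .xlsx renamed to .xls")
else:
    print("unexpected header: %r" % header)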

Check documentation examples

Documentation examples should be checked to verify that they work; if they don't, they should be fixed or replaced.
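
One low-effort option is Python's built-in doctest module, which executes the examples embedded in docstrings and fails when they drift out of date; a sketch with a made-up function:

def slug(text):
    """Illustrative documented function with a testable example.

    >>> slug("Data Brewery")
    'data-brewery'
    """
    return text.lower().replace(" ", "-")

if __name__ == "__main__":
    import doctest
    doctest.testmod()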

Coalesce values in node configuration

Description: when running a stream from the command line (runner run), it is not possible to pass non-string values to the node configuration.

Proposal: add a data type to node attributes and an option to the Node.configure() method to coalesce values to the attribute's expected type. Consider using metadata.coalesce_value().
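
A sketch of the coalescing idea; since the signature of metadata.coalesce_value() is not shown here, a stand-in converter is used:

# Stand-in for metadata.coalesce_value(); the mapping of type names to
# converters is an assumption.
_CONVERTERS = {
    "string": str,
    "integer": int,
    "float": float,
    "boolean": lambda v: str(v).lower() in ("1", "true", "yes"),
}

def coalesce_value(value, storage_type):
    """Coerce a command-line string into the attribute's expected type."""
    try:
        converter = _CONVERTERS[storage_type]
    except KeyError:
        raise ValueError("unknown storage type: %s" % storage_type)
    return converter(value)

# A node attribute declared with type "integer" would then turn the
# command-line string "100" into the integer 100:
assert coalesce_value("100", "integer") == 100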

Web Interface

Hello, I have been reading about your project and I would like to know whether it has a web interface. If not, I can start a project based on OpenERP to write one.

Awaiting your comments.

Best regards,

Introduce brewery.backends package

Introduce a new package, brewery.backends, to hold backend-related functionality. ds.* should be moved there, and the old names perhaps linked back.

Backends should provide: data source(s), data target(s), backend-specific probes, ...

Reason: generic streaming is nice, but there are cases where backend-specific processing is desirable, for example for performance reasons. Doing a distinct on an SQL table might be slow in Python but is much faster in native SQL.

This package might serve for the future plans, like:

Backend-based nodes and related data transfer between backend nodes – for example, two SQL nodes might pass data through a database table instead of the built-in data pipe, and two numpy/scipy-based nodes might use a numpy/scipy structure to pass data and avoid unnecessary streaming. Not very soon, but in the foreseeable future.
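
To make the performance argument concrete, a toy comparison of a generic Python-side distinct against a backend-native one, using SQLAlchemy with an in-memory SQLite stand-in (table and data are made up):

from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # in-memory stand-in engine

with engine.connect() as conn:
    conn.execute(text("CREATE TABLE items (name TEXT)"))
    for name in ("a", "a", "b"):
        conn.execute(text("INSERT INTO items VALUES (:n)"), {"n": name})

    # Generic approach: every row travels through Python, deduplicated in a set.
    seen, distinct_py = set(), []
    for (name,) in conn.execute(text("SELECT name FROM items")):
        if name not in seen:
            seen.add(name)
            distinct_py.append(name)

    # Backend-native approach: the database deduplicates before streaming.
    distinct_sql = [n for (n,) in conn.execute(text("SELECT DISTINCT name FROM items"))]

assert sorted(distinct_py) == sorted(distinct_sql) == ["a", "b"]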

brewery pipe command

Create a brewery pipe command with the following arguments (see the sketch below):

  • From source to target: source_json_string target_json_string
  • From source to stdout: --source source_json_string
  • From stdin to target: --target target_json_string
  • if the source/target is recognized as a filename or URL, then CSV is implied unless specified otherwise
  • --source-type - explicit source format type
  • --target-type - explicit target format type

Future possibility:

  • --feed or something like that for continuous feeding of new entries (Q: how are new entries defined?)
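
A minimal argparse sketch of the proposed interface; the option names follow the lists above, everything else (dispatch, CSV defaulting) is assumed:

import argparse

parser = argparse.ArgumentParser(prog="brewery-pipe")
parser.add_argument("source", nargs="?", help="source JSON string")
parser.add_argument("target", nargs="?", help="target JSON string")
parser.add_argument("--source", dest="source_opt",
                    help="source JSON string; output goes to stdout")
parser.add_argument("--target", dest="target_opt",
                    help="target JSON string; input is read from stdin")
parser.add_argument("--source-type", help="explicit source format type")
parser.add_argument("--target-type", help="explicit target format type")

args = parser.parse_args()
# If a source/target looks like a filename or URL, CSV would be implied
# here unless --source-type/--target-type says otherwise.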

Remove metadata.fieldlist function

Current state: there is a FieldList class and a fieldlist function that does nothing but return a FieldList with the given fields. The function is redundant, and its original purpose (or intended future) is no longer known.

Remove getter/setter pattern for fields from streams/nodes

Current state: my background was Objective-C, and I unnecessarily introduced a getter/setter pattern for data streams. It over-complicates things.

Proposed replacement:

  • how fields are implemented is up to the subclass
  • subclasses should provide a method read_fields() that reads the fields when they can be read (CSV headers, table structure, guessing from a list of dictionaries, ...)
  • subclasses should provide an attribute reads_fields indicating whether they can/should read the fields

Filed as a bug because it is a misused-pattern issue.
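
A sketch of the proposed shape, using a hypothetical CSV source; the names read_fields and reads_fields come from the list above:

import csv

class CSVSource(object):
    """Illustrative stream following the proposal; not brewery's actual class."""

    reads_fields = True  # this source can discover fields from the CSV header

    def __init__(self, path):
        self.path = path
        self.fields = None  # plain attribute, no getter/setter pair

    def read_fields(self):
        with open(self.path) as f:
            self.fields = next(csv.reader(f))  # header row
        return self.fields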

Non-threaded execution and store-based pipes

There should be a backend (or backends) for non-threaded execution of nodes. The idea:

  • nodes are executed in sorted order

  • data pipes are represented by tables: a node reads from one table-like structure and appends to another

    node --append()--> table --rows()--> node

Optimisation:

  • multiple nodes can reuse a single table if they just add another column and maintain the number of records

    node --append()--> shared_table --rows()--> node --...
    ... --append()--> shared_table + new_fields --rows()--> node

  • virtual views can be used to reuse a table even when the nodes do not work on the same number of records, as long as they do not change its structure

  • a persistent workspace should be introduced so that nodes can be cached; execution up to a cached node is then skipped

Possibility to explore: PyTables
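
A toy sketch of the execution model, with plain lists standing in for the table-like structures and made-up node functions:

def run_serial(nodes, input_rows):
    table = list(input_rows)
    for node in nodes:             # assumed to be topologically sorted already
        out = []
        for row in table:          # node reads rows() from the previous table
            out.extend(node(row))  # ... and appends to the next one
        table = out
    return table

# Example: an "uppercase" node followed by a filter node.
upper = lambda row: [row.upper()]
keep_a = lambda row: [row] if row.startswith("A") else []
assert run_serial([upper, keep_a], ["apple", "banana"]) == ["APPLE"]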

Allow aggregation node to work by window

Add the possibility for the aggregation node to output aggregated records based on a window, where a window might be:

  • number of records
  • change of value in a field (field set): 1 1 1 2 2 2 2 2 3 3 3 3 4 5 5 5 5 5 --> group by 1, 2, 3, 4, 5
  • value in a field is true: 0 0 0 0 1 0 1 0 0 0 1 1 --> 5 records, 2 records, 4 records, 1 record
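
The change-of-value case maps directly onto Python's itertools.groupby; a sketch reproducing the first grouping example above:

from itertools import groupby

values = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5]

# Window = run of equal values; emit one aggregated record (a count) per window.
for key, group in groupby(values):
    print("value %s: %d records" % (key, len(list(group))))
# value 1: 3 records, value 2: 5 records, value 3: 4 records, ...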

VARCHAR requires a length on dialect mysql

SQLTableDataTarget requires a fix for the error: "VARCHAR requires a length on dialect mysql".

Possible solutions:

  • add "size" to the type metadata
  • if there is no size, use some default value (this should be backend dependent - which is not very nice/clean)
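
A sketch of the fallback idea in SQLAlchemy terms; the default of 255 is an arbitrary assumption:

from sqlalchemy import Column, String

DEFAULT_STRING_SIZE = 255  # arbitrary fallback; ideally backend-dependent

def string_column(name, size=None):
    # MySQL rejects VARCHAR without a length, so always pass one.
    return Column(name, String(size or DEFAULT_STRING_SIZE))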

AttributeError: 'SQLDataSource' object has no attribute 'field_names'

Given this code, run against brewery's latest master:

import brewery.ds as ds

from sqlalchemy import create_engine, MetaData

engine = create_engine("postgresql://postgres@localhost/abc")

# Note: the original script did `import brewery.metadata as metadata` and
# then shadowed it with the line below, which would also break the
# commented-out FieldList call.
metadata = MetaData(engine)
# fields = metadata.FieldList(["articulo", "titulo"])

stream = ds.SQLDataSource(connection=engine, table="articulos", schema="abc")
print stream.fields

target = ds.MongoDBDataTarget(collection="articulos",
                              database="test_database")

# Note: neither stream nor target is ever initialize()d here; other brewery
# examples call initialize() before reading, which may be related to the
# missing field_names attribute below.
for row in stream.records():
    print "Appending row " + str(row)
    target.append(row)

I get the following error:

AttributeError: 'SQLDataSource' object has no attribute 'field_names'
