brewery's Introduction

Hi there 👋

Open-Source Projects

Current

  • Open Poiesis – collection of tools and libraries for developing applications for systems thinking, dynamical systems modelling and simulation.

Past

  • Cubes – lightweight Python OLAP library and server (not maintained)
  • Expressions – arithmetic expression parsing and conversion (not maintained)
  • Bubbles – Python ETL library (not maintained)
  • StepTalk – Smalltalk scripting for Objective-C and GNUstep

Experiments

  • Sepro18 – experimental toolkit for biochemistry-inspired graph-based simulation


brewery's Issues

Add context management to data streams

It should be possible to do:

with CSVSourceStream("foo.csv") as stream:
    # ... do something useful

initialize() and finalize() would then happen automatically. Fields should be passed at object initialization.
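
A minimal sketch of what the context-manager protocol could look like on a stream base class, reusing the initialize()/finalize() names from above (the class itself is hypothetical):

class DataStream(object):
    """Sketch of a context-managed stream; method names follow the issue text."""

    def initialize(self):
        pass  # subclasses open files, connections, ...

    def finalize(self):
        pass  # subclasses release resources

    def __enter__(self):
        self.initialize()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.finalize()
        return False  # propagate any exception raised inside the with-block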

Class name "Stream" causes confusion, should be distinctive

Problem: there are two kinds of "streams" by name: data source/target streams, and a stream as a network of nodes. This causes confusion.

The Stream class should be renamed to Flow or something with a similar meaning but a distinctive name.

Considered a usability bug.

Data source/target json schema definition

Create a JSON schema definition for describing a data source/target. Something like:

{
    "type": "csv",
    "url": "http://some.domain.com/file.csv",
    "format": "excel"
}

Also create a JSON schema for a collection of such data sources/targets.
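
A rough sketch of such a schema written as a Python dict and validated with the third-party jsonschema package; everything beyond the three keys shown above is an assumption:

import jsonschema  # third-party: pip install jsonschema

# Only "type" is marked required here, which is an assumption.
DATA_STORE_SCHEMA = {
    "type": "object",
    "properties": {
        "type":   {"type": "string"},
        "url":    {"type": "string"},
        "format": {"type": "string"},
    },
    "required": ["type"],
}

source = {"type": "csv", "url": "http://some.domain.com/file.csv", "format": "excel"}
jsonschema.validate(instance=source, schema=DATA_STORE_SCHEMA)  # raises on mismatch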

Do not allow python-only nodes and attributes in brewery runner

Current state: users can use any node provided by brewery in their runner. This might cause crashes and otherwise undesired, unpleasant behaviour of the tool.

Solution: add metadata to each node's node_info stating whether the node can be used in the runner/outside of the Python environment. Do the same with node arguments.
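
One possible shape for such metadata; the runner_safe key and the helper below are assumptions, not existing brewery API:

# Hypothetical node_info entry with a flag stating runner usability.
node_info = {
    "label": "derive field",
    "description": "compute a new field using a Python callable",
    "runner_safe": False,  # needs Python callables, unusable outside Python
}

def assert_runner_usable(node):
    info = getattr(node, "node_info", {})
    if not info.get("runner_safe", True):
        raise ValueError("node %r cannot be used in the runner"
                         % info.get("label", type(node).__name__))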

Namespace should be more flat

Problem: rethink the module structure; it might be confusing. Currently, part of the module namespace exists purely for taxonomy reasons.

Requirement: importing just brewery should give the user everything except backend-related functionality.

That does not mean the modules will be removed, only that users will NOT be encouraged to use them directly. Rules:

  • no example should show use of sub-modules, except when backends are involved
  • the documentation should contain a reference for the 'un-moduled' objects
  • the documentation should note that a user who has a reason to import only a single Brewery module may still do so

Considered a usability change.

Users should not have to ask: "What package is this class/function in?"
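
A common way to achieve this is re-exporting the public names from the package's __init__.py; a sketch (the exact symbols and module names are illustrative):

# brewery/__init__.py -- illustrative sketch, not the actual module layout
from brewery.metadata import Field, FieldList
from brewery.streams import Stream
# backend-related modules (e.g. brewery.ds) are deliberately not re-exported

__all__ = ["Field", "FieldList", "Stream"]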

Add node identifiers to node reference

The node reference documentation should list node identifiers rather than class names. Node identifiers are used in "HOM"-based stream construction and in dictionary-based stream construction, both of which are more common than class-based node creation.

XLRD Issue?

I have some code based on the brewery example "merge 2 XLS files". It works on Mac OS in Python 2.7.2. When I move it to a fresh Windows 7 install (Python 2.7.3; brewery and python-excel installed via pip today), it fails with:

Traceback (most recent call last):
  File "C:\Users\kseefried\workspace\Preparis User Data\src\merge-ng.py", line 46, in <module>
    src.initialize()
  File "C:\Python27\lib\site-packages\brewery-0.8.0-py2.7.egg\brewery\ds\xls_streams.py", line 47, in initialize
    encoding_override=self.encoding)
  File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 443, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Python27\lib\site-packages\xlrd\book.py", line 94, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "C:\Python27\lib\site-packages\xlrd\book.py", line 1264, in getbof
    bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
  File "C:\Python27\lib\site-packages\xlrd\book.py", line 1258, in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xd0\xcf\x11\xe0\xa1\xb1'

I have:

  • Verified that the files are in fact valid XLS files.
  • Saved them as Excel 95 files.
  • Created new spreadsheets outside of Excel using xlrd and cut-and-pasted the data into the new spreadsheets.

I'm at a loss as to how to move forward.
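
For what it is worth, the bytes in the error message match the start of the OLE2 compound-document signature (D0 CF 11 E0 A1 B1 1A E1) that genuine binary .xls files begin with, so inspecting the raw header of the file on the Windows machine may help narrow things down:

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # compound-document signature

with open("file.xls", "rb") as f:  # binary mode matters on Windows
    header = f.read(8)

if header == OLE2_MAGIC:
    print("looks like a genuine binary .xls (OLE2) file")
elif header.startswith(b"PK"):
    print("a zip archive - probably an .xlsx renamed to .xls")
else:
    print("unexpected header: %r" % header)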

Check documentation examples

Documentation examples should be checked to verify that they work; if they don't, they should be fixed or replaced.
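
One low-effort option is Python's built-in doctest module, which executes the examples embedded in docstrings and fails when they drift out of date; a sketch with a made-up function:

def slug(text):
    """Illustrative documented function with a testable example.

    >>> slug("Data Brewery")
    'data-brewery'
    """
    return text.lower().replace(" ", "-")

if __name__ == "__main__":
    import doctest
    doctest.testmod()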

Coalesce values in node configuration

Description: when running a stream from the command line (runner run), it is not possible to pass non-string values to the node configuration.

Proposal: add a data type to node attributes and an option to the Node.configure() method to coalesce values to the attribute's expected type. Consider using metadata.coalesce_value().
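
A sketch of the coalescing idea; since the signature of metadata.coalesce_value() is not shown here, a stand-in converter is used:

# Stand-in for metadata.coalesce_value(); the mapping of type names to
# converters is an assumption.
_CONVERTERS = {
    "string": str,
    "integer": int,
    "float": float,
    "boolean": lambda v: str(v).lower() in ("1", "true", "yes"),
}

def coalesce_value(value, storage_type):
    """Coerce a command-line string into the attribute's expected type."""
    try:
        converter = _CONVERTERS[storage_type]
    except KeyError:
        raise ValueError("unknown storage type: %s" % storage_type)
    return converter(value)

# A node attribute declared with type "integer" would then turn the
# command-line string "100" into the integer 100:
assert coalesce_value("100", "integer") == 100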

Web Interface

Hello, I have been reading about your project and I would like to know whether it has a web interface. If not, I can start a project based on OpenERP to write one.

Awaiting your comments.

Best regards,

Introduce brewery.backends package

Introduce a new package, brewery.backends, to hold backend-related functionality. ds.* should be moved there, and the old names perhaps linked back.

Backends should provide: data source(s), data target(s), backend-specific probes, ...

Reason: generic streaming is nice, but there are cases where backend-specific processing is desirable, for example for performance reasons. Doing a distinct on an SQL table might be slow in Python but is much faster in native SQL.

This package might serve for the future plans, like:

Backend-based nodes and related data transfer between backend nodes – for example, two SQL nodes might pass data through a database table instead of the built-in data pipe, and two numpy/scipy-based nodes might use a numpy/scipy structure to pass data and avoid unnecessary streaming. Not very soon, but in the foreseeable future.
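
To make the performance argument concrete, a toy comparison of a generic Python-side distinct against a backend-native one, using SQLAlchemy with an in-memory SQLite stand-in (table and data are made up):

from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # in-memory stand-in engine

with engine.connect() as conn:
    conn.execute(text("CREATE TABLE items (name TEXT)"))
    for name in ("a", "a", "b"):
        conn.execute(text("INSERT INTO items VALUES (:n)"), {"n": name})

    # Generic approach: every row travels through Python, deduplicated in a set.
    seen, distinct_py = set(), []
    for (name,) in conn.execute(text("SELECT name FROM items")):
        if name not in seen:
            seen.add(name)
            distinct_py.append(name)

    # Backend-native approach: the database deduplicates before streaming.
    distinct_sql = [n for (n,) in conn.execute(text("SELECT DISTINCT name FROM items"))]

assert sorted(distinct_py) == sorted(distinct_sql) == ["a", "b"]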

brewery pipe command

Create a brewery pipe command with the following arguments (see the sketch below):

  • From source to target: source_json_string target_json_string
  • From source to stdout: --source source_json_string
  • From stdin to target: --target target_json_string
  • if the source/target is recognized as a filename or URL, then CSV is implied unless specified otherwise
  • --source-type - explicit source format type
  • --target-type - explicit target format type

Future possibility:

  • --feed or something like that for continuous feeding of new entries (Q: how are new entries defined?)
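
A minimal argparse sketch of the proposed interface; the option names follow the lists above, everything else (dispatch, CSV defaulting) is assumed:

import argparse

parser = argparse.ArgumentParser(prog="brewery-pipe")
parser.add_argument("source", nargs="?", help="source JSON string")
parser.add_argument("target", nargs="?", help="target JSON string")
parser.add_argument("--source", dest="source_opt",
                    help="source JSON string; output goes to stdout")
parser.add_argument("--target", dest="target_opt",
                    help="target JSON string; input is read from stdin")
parser.add_argument("--source-type", help="explicit source format type")
parser.add_argument("--target-type", help="explicit target format type")

args = parser.parse_args()
# If a source/target looks like a filename or URL, CSV would be implied
# here unless --source-type/--target-type says otherwise.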

Remove metadata.fieldlist function

Current state: there is a FieldList class and a fieldlist function that does nothing but return a FieldList with the given fields. The function is redundant, and its original purpose (or intended future) is no longer known.

Remove getter/setter pattern for fields from streams/nodes

Current state: my background was Objective-C, and I unnecessarily introduced a getter/setter pattern for data streams. It over-complicates things.

Proposed replacement:

  • how fields are implemented is up to the subclass
  • subclasses should provide a method read_fields() that reads the fields when they can be read (CSV headers, table structure, guessing from a list of dictionaries, ...)
  • subclasses should provide an attribute reads_fields indicating whether they can/should read the fields

Filed as a bug because it is a misused-pattern issue.
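
A sketch of the proposed shape, using a hypothetical CSV source; the names read_fields and reads_fields come from the list above:

import csv

class CSVSource(object):
    """Illustrative stream following the proposal; not brewery's actual class."""

    reads_fields = True  # this source can discover fields from the CSV header

    def __init__(self, path):
        self.path = path
        self.fields = None  # plain attribute, no getter/setter pair

    def read_fields(self):
        with open(self.path) as f:
            self.fields = next(csv.reader(f))  # header row
        return self.fields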

Non-threaded execution and store-based pipes

There should be a backend (or backends) for non-threaded execution of nodes. The idea:

  • nodes are executed in sorted order

  • data pipes are represented by tables: a node reads from one table-like structure and appends to another

    node --append()--> table --rows()--> node

Optimisation:

  • multiple nodes can reuse a single table if they just add another column and maintain the number of records

    node --append()--> shared_table --rows()--> node --...
    ... --append()--> shared_table + new_fields --rows()--> node

  • virtual views can be used to reuse a table even when the nodes do not work on the same number of records, as long as they do not change its structure

  • a persistent workspace should be introduced so that nodes can be cached; execution up to a cached node is then skipped

Possibility to explore: PyTables
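
A toy sketch of the execution model, with plain lists standing in for the table-like structures and made-up node functions:

def run_serial(nodes, input_rows):
    table = list(input_rows)
    for node in nodes:             # assumed to be topologically sorted already
        out = []
        for row in table:          # node reads rows() from the previous table
            out.extend(node(row))  # ... and appends to the next one
        table = out
    return table

# Example: an "uppercase" node followed by a filter node.
upper = lambda row: [row.upper()]
keep_a = lambda row: [row] if row.startswith("A") else []
assert run_serial([upper, keep_a], ["apple", "banana"]) == ["APPLE"]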

Allow aggregation node to work by window

Add the possibility for the aggregation node to output aggregated records based on a window, where a window might be:

  • number of records
  • change of value in a field (field set): 1 1 1 2 2 2 2 2 3 3 3 3 4 5 5 5 5 5 --> group by 1, 2, 3, 4, 5
  • value in a field is true: 0 0 0 0 1 0 1 0 0 0 1 1 --> 5 records, 2 records, 4 records, 1 record
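
The change-of-value case maps directly onto Python's itertools.groupby; a sketch reproducing the first grouping example above:

from itertools import groupby

values = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5]

# Window = run of equal values; emit one aggregated record (a count) per window.
for key, group in groupby(values):
    print("value %s: %d records" % (key, len(list(group))))
# value 1: 3 records, value 2: 5 records, value 3: 4 records, ...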

VARCHAR requires a length on dialect mysql

SQLTableDataTarget requires a fix for the error: "VARCHAR requires a length on dialect mysql".

Possible solutions:

  • add "size" to the type metadata
  • if there is no size, use some default value (this should be backend dependent - which is not very nice/clean)
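
A sketch of the fallback idea in SQLAlchemy terms; the default of 255 is an arbitrary assumption:

from sqlalchemy import Column, String

DEFAULT_STRING_SIZE = 255  # arbitrary fallback; ideally backend-dependent

def string_column(name, size=None):
    # MySQL rejects VARCHAR without a length, so always pass one.
    return Column(name, String(size or DEFAULT_STRING_SIZE))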

AttributeError: 'SQLDataSource' object has no attribute 'field_names'

Given this code, run against brewery's latest master:

import brewery.ds as ds

from sqlalchemy import create_engine, MetaData

engine = create_engine("postgresql://postgres@localhost/abc")

# Note: the original script did `import brewery.metadata as metadata` and
# then shadowed it with the line below, which would also break the
# commented-out FieldList call.
metadata = MetaData(engine)
# fields = metadata.FieldList(["articulo", "titulo"])

stream = ds.SQLDataSource(connection=engine, table="articulos", schema="abc")
print stream.fields

target = ds.MongoDBDataTarget(collection="articulos",
                              database="test_database")

# Note: neither stream nor target is ever initialize()d here; other brewery
# examples call initialize() before reading, which may be related to the
# missing field_names attribute below.
for row in stream.records():
    print "Appending row " + str(row)
    target.append(row)

I get the following error:

AttributeError: 'SQLDataSource' object has no attribute 'field_names'
