
bubbles's Introduction

Bubbles

Bubbles is a Python ETL framework and set of tools. It can be used for processing, auditing and inspecting data. The focus is on understandability and transparency of the process.

Project page: http://bubbles.databrewery.org

Blog: http://blog.databrewery.org

About

Bubbles is a Python framework for:

  • ETL (extraction, transformation and loading)
  • preparation of data for further analysis
  • data probing – analysing properties of data, mostly categorical in nature
  • data quality monitoring
  • virtual data objects – abstraction of table-like structured datasets. Datasets are treated the same, no matter whether the source is a text file or a database table.
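
The last point can be illustrated with a minimal sketch in plain Python. It does not use the Bubbles API: `CSVRows`, `SQLRows` and `distinct_values` are hypothetical names, chosen only to show how a text source and a database table can expose the same `fields` and `rows()` interface so that downstream code does not care where the data lives.

```python
import csv
import io
import sqlite3


class CSVRows:
    """Hypothetical table-like object backed by CSV text."""

    def __init__(self, text):
        reader = csv.reader(io.StringIO(text))
        self.fields = next(reader)  # the first line holds the field names
        self._rows = [tuple(row) for row in reader]

    def rows(self):
        return iter(self._rows)


class SQLRows:
    """Hypothetical table-like object backed by a database table."""

    def __init__(self, connection, table):
        self._connection = connection
        self._table = table
        cursor = connection.execute('SELECT * FROM "%s" LIMIT 0' % table)
        self.fields = [column[0] for column in cursor.description]

    def rows(self):
        return iter(self._connection.execute('SELECT * FROM "%s"' % self._table))


def distinct_values(obj, field):
    """Works on any object that offers `fields` and `rows()`."""
    index = obj.fields.index(field)
    return sorted({row[index] for row in obj.rows()})


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (name TEXT, city TEXT)")
connection.executemany("INSERT INTO people VALUES (?, ?)",
                       [("Ann", "Oslo"), ("Bob", "Rome")])

from_csv = CSVRows("name,city\nAnn,Oslo\nBob,Rome\n")
from_sql = SQLRows(connection, "people")
```

Calling `distinct_values(from_csv, "city")` and `distinct_values(from_sql, "city")` goes through exactly the same code path for both sources.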

Installation

Requires at least Python 3.3.

To install the Bubbles framework, type:

pip install bubbles

To install Bubbles from source, get it from GitHub:

https://github.com/Stiivi/bubbles

Documentation

Introduction to bubbles (Slideshare presentation)

Operations (Scribd document)

Documentation can be found at: http://packages.python.org/bubbles

Sources

The project source repository is hosted on GitHub: https://github.com/Stiivi/bubbles

git clone git://github.com/Stiivi/bubbles.git

Support

If you have questions, problems or suggestions, you can send a message to the Google group or write to the author.

Author

Stefan Urbanek

License

Bubbles is licensed under the MIT license with the following addition:

If your version of the Software supports interaction with it remotely 
through a computer network, the above copyright notice and this permission 
notice shall be accessible to all users.

Simply put, if you use it as part of software as a service (SaaS), you have to provide the copyright notice in an about, legal info, credits or similar kind of page or info box.

For full license see the LICENSE file.

bubbles's People

Contributors

erchn, ericgazoni, steinbro, stiivi

bubbles's Issues

Consumable retention policy

Define a consumable retention policy. Currently, retention is expected to be provided by the object itself, which in most cases is sub-optimal, for example consuming all data into a list of Python objects. This implementation is also unaware of the context in which the node is being executed. Suggestions:

  1. ExecutionEngine subclasses might insert retention/caching nodes after consumables. Advantage: simple implementation, aware of broader processing context. Disadvantage: might not be backend aware.
  2. Retention operations: retain(object, times). Advantage: backend aware. Disadvantage: not aware of the broader processing context.
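
The second suggestion could look roughly like the sketch below. This is not Bubbles code: the issue only names `retain(object, times)`, so the exact meaning of `times` is an assumption here (the number of independent readers to hand out), and `itertools.tee` stands in for whatever a backend-aware implementation would do.

```python
import itertools


def retain(obj, times):
    """Return `times` independent iterators over a one-shot iterable.

    Hypothetical helper. `itertools.tee` buffers items that some readers
    have not consumed yet, so memory use grows if one reader runs far
    ahead of the others.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    return itertools.tee(obj, times)


# A generator is a typical consumable: it can be read only once.
consumable = (number * 10 for number in range(3))
first, second = retain(consumable, 2)
```

Each returned iterator yields the full sequence, so two downstream nodes can read `first` and `second` without interfering with each other.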

Import in example code fails

From bubbles / doc / operations.rst

The following example code fails:

from bubbles import default_context as c
from bubbles import get_object

source = get_object("csv", "data.csv")
duplicates = c.op.duplicates(source)

$ python operations.py
Traceback (most recent call last):
  File "operations.py", line 2, in <module>
    from bubbles import get_object
ImportError: cannot import name get_object

Error trying to convert from PostgreSQL to Sqlite

I have data in a Postgres DB that I'd like to inject into Sqlite using a Pipeline.

source = open_store("sql", 'postgres://....')
#target = open_store("csv", "./data")
target = open_store("sql", 'sqlite:///data/data.sql')

stores = {
  'source': source,
  'target': target,
}
p = Pipeline(stores=stores)
p.source('source', 'xxx')
p.create('target', 'xxx')
p.run()

I get the following error:

sqlalchemy.exc.CompileError: (in table 'xxx', column 'yyy'): Compiler <sqlalchemy.dialects.sqlite.base.SQLiteTypeCompiler object at 0x102c10910> can't render element of type <class 'sqlalchemy.dialects.postgresql.base.DOUBLE_PRECISION'>
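
The message says that the column carries a PostgreSQL-specific type (`DOUBLE_PRECISION`) which the SQLite type compiler cannot render. One possible workaround, sketched here in plain Python and not with the Bubbles or SQLAlchemy API, is to translate dialect-specific type names into portable ones before creating the target table. The mapping and both helper names are assumptions made for illustration.

```python
import sqlite3

# Assumed mapping for illustration only; extend it for the types in your schema.
GENERIC_TYPES = {
    "DOUBLE_PRECISION": "FLOAT",
    "BYTEA": "BLOB",
    "UUID": "TEXT",
}


def generic_type(type_name):
    """Translate a PostgreSQL-specific type name into a portable one."""
    key = type_name.upper()
    return GENERIC_TYPES.get(key, key)


def create_table_statement(table, columns):
    """Build CREATE TABLE from (name, type_name) pairs using portable types."""
    definitions = ", ".join(
        '"%s" %s' % (name, generic_type(type_name))
        for name, type_name in columns
    )
    return 'CREATE TABLE "%s" (%s)' % (table, definitions)


statement = create_table_statement(
    "xxx", [("id", "INTEGER"), ("yyy", "DOUBLE_PRECISION")]
)

target = sqlite3.connect(":memory:")
target.execute(statement)
target.execute("INSERT INTO xxx VALUES (1, 1.5)")
```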

Date to/from string conversions should use SQL format

Operations such as string_to_date should use the format syntax that SQL databases use (see PostgreSQL for an example).

Reason: it is more human-readable than the strptime() format with its % directives.

Note that the SQL format is a bit richer and might have different first element indexes (1 vs 0) in some cases.
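
A minimal sketch of the idea, assuming a small subset of PostgreSQL-style template patterns. The mapping table and the `sql_to_strptime` helper are hypothetical, and the `string_to_date` defined here is a stand-in, not the Bubbles operation. Patterns that have no direct `strptime()` counterpart, or that differ in indexing, are left out.

```python
import re
from datetime import datetime

# Partial, assumed mapping of SQL template patterns to strptime directives.
SQL_TO_STRPTIME = {
    "YYYY": "%Y",
    "MM": "%m",
    "DD": "%d",
    "HH24": "%H",
    "MI": "%M",
    "SS": "%S",
}

# Longer patterns come first so that "HH24" is matched as a whole.
_PATTERN = re.compile(
    "|".join(sorted(SQL_TO_STRPTIME, key=len, reverse=True))
)


def sql_to_strptime(sql_format):
    """Convert an SQL-style format such as 'YYYY-MM-DD' to strptime syntax."""
    escaped = sql_format.replace("%", "%%")  # keep literal percent signs
    return _PATTERN.sub(lambda match: SQL_TO_STRPTIME[match.group(0)], escaped)


def string_to_date(value, sql_format):
    """Parse `value` using an SQL-style format description."""
    return datetime.strptime(value, sql_to_strptime(sql_format)).date()
```

With this in place, a call such as `string_to_date("2013-07-04", "YYYY-MM-DD")` replaces the less readable `"%Y-%m-%d"`.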

Examples not working

I have downloaded the latest bubbles source code & installed.

My development machine's environment:

Python 3.3.4
Linux 3.10.32-2-MANJARO x86_64 GNU/Linux

1. Running bubbles / examples / hello.py produces the following error:

Traceback (most recent call last):
  File "hello.py", line 6, in <module>
    d = bubbles.data_object("csv_source", resource=URL, infer_fields=True)
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/objects.py", line 37, in data_object
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/extensions.py", line 94, in __call__
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/extensions.py", line 107, in create
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/backends/text/objects.py", line 232, in __init__
TypeError: 'infer_fields' is an invalid keyword argument for this function

2. Running bubbles / examples / hello_sql.py produces the following error:

Traceback (most recent call last):
  File "hello_sql.py", line 23, in <module>
    p.run()
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/execution/pipeline.py", line 230, in run
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/execution/engine.py", line 165, in run
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/execution/engine.py", line 24, in evaluate
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/execution/graph.py", line 61, in evaluate
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/objects.py", line 37, in data_object
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/extensions.py", line 94, in __call__
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/extensions.py", line 107, in create
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/backends/text/objects.py", line 232, in __init__
TypeError: 'infer_fields' is an invalid keyword argument for this function

3. Running bubbles / examples / aggregate_over_window.py produces the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/extensions.py", line 123, in get
KeyError: 'iterable'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "aggregate_over_window.py", line 46, in <module>
    p.run()
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/execution/pipeline.py", line 230, in run
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/execution/engine.py", line 165, in run
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/execution/engine.py", line 24, in evaluate
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/execution/graph.py", line 61, in evaluate
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/objects.py", line 37, in data_object
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/extensions.py", line 94, in __call__
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/extensions.py", line 98, in create
  File "/usr/lib/python3.3/site-packages/bubbles-0.2-py3.3.egg/bubbles/extensions.py", line 126, in get
bubbles.errors.InternalError: Unknown extension 'iterable' of type object

Create Quick Reference manual

Create a quick reference manual (PDF preferably) with:

  • list of operations
  • list of stores
  • list of object types and their representations

Retry in nested operation should not replace parent

Current behavior:

Example situation: an operation PARENT is composed of other CHILD operations.

  1. Operation PARENT is called
  2. PARENT calls CHILD
  3. CHILD raises RetryOperation
  4. Context catches the RetryOperation and retries CHILD
  5. Context returns after finishing the CHILD instead of PARENT

PARENT never completes.

Expected behavior:

Have a context stack for operation calls.

  1. Operation PARENT is called within PARENT context.
  2. PARENT calls CHILD
  3. New context for CHILD is created
  4. CHILD raises RetryOperation
  5. CHILD context catches the RetryOperation and retries CHILD
  6. CHILD context returns to PARENT
  7. PARENT continues and returns to PARENT context
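
The expected behavior can be sketched as follows. This is an illustration, not the Bubbles implementation: `Context`, its `call` method and `max_retries` are assumptions, and the sketch simplifies the proposal by pushing one frame per call onto a stack instead of creating a separate context object. `RetryOperation` is defined locally for the example.

```python
class RetryOperation(Exception):
    """Raised by an operation that wants to be called again."""


class Context:
    """Hypothetical context that handles retries at the level of each call."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.stack = []  # names of the operations currently running

    def call(self, operation, *args):
        self.stack.append(operation.__name__)
        try:
            for _ in range(self.max_retries + 1):
                try:
                    return operation(self, *args)
                except RetryOperation:
                    continue  # retry only the operation of this frame
            raise RuntimeError("too many retries: %s" % operation.__name__)
        finally:
            self.stack.pop()


attempts = {"child": 0}
log = []


def child(context):
    attempts["child"] += 1
    if attempts["child"] == 1:
        raise RetryOperation()  # the first attempt asks for a retry
    log.append("child done")
    return 21


def parent(context):
    value = context.call(child)  # the retry is handled inside this call
    log.append("parent done")
    return value * 2


result = Context().call(parent)
```

Because the retry is caught in the frame that belongs to CHILD, control returns to PARENT afterwards and PARENT runs to completion.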

Example code not working

Running bubbles / examples / hello.py fails because:

1) p.source(bubbles.data_object("csv_source", URL, infer_fields=True)) is missing a name parameter. It should be:

p.source(bubbles.data_object("csv_source", URL, infer_fields=True), "foo-name")

Furthermore, the example doesn't seem to produce any output. What is supposed to happen when p.pretty_print() executes?

Darn

This seemed to be the most complete Python ETL package. Why don't Python nerds need to ETL?

Pip install bubbles fails

OS X Mountain Lion, Python 2.7 with Virtualenvwrapper: Installation of bubbles fails due to syntax errors:


pip install bubbles
Downloading/unpacking bubbles
Downloading bubbles-0.1.tar.gz (40kB): 40kB downloaded
Running setup.py egg_info for package bubbles

Installing collected packages: bubbles
Running setup.py install for bubbles
SyntaxError: ('invalid syntax', ('/Users/peder/Envs/cubes/lib/python2.7/site-packages/bubbles/core.py', 222, 30, 'def operation(*signature, name=None):\n'))

Successfully installed bubbles
Cleaning up...
(cubes)peder@garros:/source/crs-cubes$ pip freeze local
Cython==0.19.1
Flask==0.9
Jinja2==2.6
MarkupSafe==0.18
SQLAlchemy==0.7.9
Werkzeug==0.8.3
argparse==1.2.1
bubbles==0.1
chardet==2.1.1
csvkit==0.6.1
cubes==0.10.2post1
dbf==0.95.004
itsdangerous==0.23
json-table-schema==0.1
line-profiler==1.0b3
lxml==3.2.3
messytables==0.12.0
openpyxl==1.6.2
python-dateutil==1.5
python-magic==0.4.3
six==1.4.1
wsgiref==0.1.2
xlrd==0.9.2
(cubes)peder@garros:/source/crs-cubes$ python
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import bubbles
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/peder/Envs/cubes/lib/python2.7/site-packages/bubbles/__init__.py", line 4, in <module>
    from .core import *
  File "/Users/peder/Envs/cubes/lib/python2.7/site-packages/bubbles/core.py", line 222
    def operation(*signature, name=None):
                                 ^
SyntaxError: invalid syntax
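
The `SyntaxError` points at `def operation(*signature, name=None):`, which declares a keyword-only argument after `*signature`. That syntax exists only in Python 3, which matches the requirement of at least Python 3.3 stated in the Installation section; a Python 2.7 interpreter cannot compile the module. The sketch below shows the construct together with a version guard. The function body, `MINIMUM_PYTHON` and `check_python` are hypothetical.

```python
import sys

MINIMUM_PYTHON = (3, 3)


def check_python(version_info=sys.version_info):
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


def operation(*signature, name=None):
    """Same signature as in the traceback; the body is only a stand-in.

    `name` can be passed by keyword only, which is Python 3 syntax.
    """
    return {"signature": signature, "name": name}


if not check_python():
    raise SystemExit("Python %d.%d or newer is required" % MINIMUM_PYTHON)
```

Note that such a guard only helps in a file that Python 2 can still parse, for example a setup script; a module that itself contains keyword-only arguments fails before any of its code runs.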

Composed operations have no way of dealing with consumables

Operations that are composed of other operations have no mechanism for dealing with consumable objects the way Pipeline and ExecutionEngine do. If an object is to be consumed multiple times, the operation just eats it and then provides a faulty result.

Suggestion: move handling of consumables into context and use context.retain(obj, retain_times=1)
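
The failure mode is easy to reproduce in plain Python. In this sketch the three functions are hypothetical stand-ins for operations, and a generator plays the role of a consumable object.

```python
def count(rows):
    """Child operation: counts rows and exhausts a one-shot iterable."""
    return sum(1 for _ in rows)


def total(rows):
    """Child operation: sums rows and also needs to read all of them."""
    return sum(rows)


def summary(rows):
    """Composed operation: reads `rows` twice, once per child."""
    return {"count": count(rows), "total": total(rows)}


consumable = (value for value in [2, 4, 6])

faulty = summary(consumable)   # total() sees an already exhausted generator
correct = summary([2, 4, 6])   # a list can be read any number of times
```

No error is raised for the generator: `faulty` simply reports a total of 0, which is why the problem is easy to miss.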

Unable to find operation source_object

@Stiivi I've been trying to run the SQL example, but every time I run the program it shows this message: bubbles.errors.OperationError: Unable to find operation source_object. My bubbles version is 0.1 and SQLAlchemy is 1.0.14.

Can you help me or point out the error in my code?

I want to use bubbles to migrate a legacy database (MSSQL) to a new database (MySQL) at my company.

import bubbles


URL = "https://raw.github.com/Stiivi/cubes/master/examples/hello_world/data.csv"

stores = {
    "target": bubbles.open_store("sql", "sqlite:///")
}

p = bubbles.Pipeline(stores=stores)
p.source_object("csv_source", resource=URL, encoding="utf8")
p.retype({"Amount (US$, Millions)": "integer"})

p.create("target", "data")

p.aggregate("Category", "Amount (US$, Millions)")
p.pretty_print()
p.run()

Thanks.

Simplify handling of tuple list (ordering, aggregations)

This:

p.sort([["firstname", "asc"]])

looks unintuitive, despite being correct.

Allow:

p.sort("firstname")

This would be nice to have, but should not be allowed as it is ambiguous:

p.sort(["firstname", "asc"])

Does it mean sort by firstname in ascending order, or sort by the fields firstname and asc in the default (ascending) order?
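
One way to implement the proposal is to normalize the argument before sorting. The helper below is a hypothetical sketch, not part of Bubbles: it accepts a single field name or a list of (field, direction) pairs and rejects a flat list of strings because of the ambiguity described above.

```python
def normalize_order(order):
    """Return a list of (field, direction) tuples.

    Hypothetical helper. Accepted forms:
      "firstname"              -> [("firstname", "asc")]
      [["firstname", "asc"]]   -> [("firstname", "asc")]
    A flat list of strings such as ["firstname", "asc"] is rejected.
    """
    if isinstance(order, str):
        return [(order, "asc")]

    result = []
    for item in order:
        if isinstance(item, str):
            raise ValueError(
                "ambiguous order %r: use a field name or a list of "
                "[field, direction] pairs" % (order,)
            )
        field, direction = item
        if direction not in ("asc", "desc"):
            raise ValueError("unknown direction: %r" % (direction,))
        result.append((field, direction))
    return result
```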

import bubbles fails

Python 2.7 - 64bit - Windows

In [1]: import bubbles
File "C:\Anaconda\lib\site-packages\bubbles\core.py", line 222
def operation(*signature, name=None):
^
SyntaxError: invalid syntax

Proposing a PR to fix a few small typos

Issue Type

[x] Bug (Typo)

Steps to Replicate and Expected Behaviour

  • Examine bubbles/execution/context.py and observe sucessfully, however expect to see successfully.
  • Examine bubbles/execution/pipeline.py and observe successfuly, however expect to see successfully.
  • Examine bubbles/execution/pipeline.py and observe sucessful, however expect to see successful.
  • Examine bubbles/backends/sql/ops.py and observe sequentialy, however expect to see sequentially.
  • Examine bubbles/execution/context.py and observe regardles, however expect to see regardless.
  • Examine bubbles/execution/pipeline.py and observe refereced, however expect to see referenced.
  • Examine bubbles/expression.py and observe properely, however expect to see properly.
  • Examine bubbles/execution/pipeline.py and observe prerequisities, however expect to see prerequisites.
  • Examine bubbles/backends/sql/objects.py and observe preferrence, however expect to see preference.
  • Examine bubbles/objects.py and observe oustanding, however expect to see outstanding.
  • Examine bubbles/dev.py and observe objcects, however expect to see objects.
  • Examine bubbles/operation.py and observe multiople, however expect to see multiple.
  • Examine doc/introduction.rst and observe heterogenous, however expect to see heterogeneous.
  • Examine bubbles/metadata.py and observe consutrcuted, however expect to see constructed.
  • Examine bubbles/metadata.py and observe arithmentic, however expect to see arithmetic.
  • Examine bubbles/ops/rows.py and observe agains, however expect to see against.
  • Examine bubbles/objects.py and observe actualy, however expect to see actually.
  • Examine bubbles/objects.py and observe acessed, however expect to see accessed.

Notes

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources a branch with the fix has been
prepared but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below, feel free to create it or
request @timgates42 create the PR. Alternatively if the fix is undesired please
close the issue with a small comment about the reasoning.

https://github.com/timgates42/bubbles/pull/new/bugfix_typos

Thanks.

Field part reference for compound field types

Allow use of parts of compound/indexable field types such as dates and arrays in operations. Example:

p.filter_by_value(FieldPart("event_date", "year"), 2013)

Advantages:

  • fewer steps, no need for explicit extraction
  • better readability

Disadvantages:

  • more requirements for implementing backend operations
  • operations might implement this selectively, depending on argument, which might cause inconsistencies

Requirements:

  • Field.is_composed() - True for date, array, record
  • DataObject.concrete_field(field_or_part)

Affected methods:

  • prepare_aggregation_measures()
  • prepare_key()
  • many operations

Recommendation: implement this in OperationContext once argument annotations or operation prototype metadata are implemented.
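
A rough sketch of how a field part reference might behave, written in plain Python over dictionaries instead of Bubbles data objects. `FieldPart` follows the name used in the example above; `concrete_value` and this version of `filter_by_value` are assumptions, and reading the part with `getattr` only covers values such as dates that expose the part as an attribute.

```python
from collections import namedtuple
from datetime import date

# Reference to a part of a compound field, for example the year of a date.
FieldPart = namedtuple("FieldPart", ["field", "part"])


def concrete_value(row, field_or_part):
    """Return the value of a plain field or of a part of a compound field."""
    if isinstance(field_or_part, FieldPart):
        return getattr(row[field_or_part.field], field_or_part.part)
    return row[field_or_part]


def filter_by_value(rows, field_or_part, value):
    """Keep rows whose field, or field part, equals `value`."""
    return [row for row in rows if concrete_value(row, field_or_part) == value]


events = [
    {"name": "launch", "event_date": date(2013, 5, 1)},
    {"name": "review", "event_date": date(2014, 1, 15)},
]

in_2013 = filter_by_value(events, FieldPart("event_date", "year"), 2013)
```

The same function still accepts a plain field name, so existing calls such as `filter_by_value(events, "name", "review")` keep working.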
