yhat / db.py
db.py is an easier way to interact with your databases
License: BSD 2-Clause "Simplified" License
As an enhancement, consider silencing the logging messages, e.g., "Indexing schema. This will take a second..."
I tried installing db.py and was alarmed to find it trying to upgrade my pandas version. Given how much of a pain pandas can be to install and how surprising things can break between versions, are you sure you need 0.15.0?
Having had a quick prod, it seems to work fine with 0.14.1
In query you call through to pd.io.sql.read_sql() but don't provide a way of passing in the index_col you want to use. Can this be added to the method signature?
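A hedged sketch of the wished-for signature: `query` below is a standalone helper (assuming pandas is installed), not db.py's actual method, and the `limit` wrapping is illustrative only. It shows `index_col` being passed straight through to pandas.

```python
import sqlite3
import pandas as pd

def query(con, q, index_col=None, limit=None):
    # Illustrative limit handling: wrap the query in a subselect.
    if limit is not None:
        q = "SELECT * FROM (%s) sub LIMIT %d" % (q.rstrip().rstrip(";"), limit)
    # Pass index_col straight through to pandas.
    return pd.read_sql(q, con, index_col=index_col)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, x TEXT)")
con.execute("INSERT INTO t VALUES (1, 'a')")
df = query(con, "SELECT * FROM t", index_col="id")
```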
Hello,
First of all this is an amazing project. Will try to contribute as I want to contribute to database related projects in Python :)
I wanted to know why in tests.py you declared db = DemoDB() as a global object. You could have done the following in setUp:
def setUp(self):
    self.db = DemoDB()
setUp() is called to prepare the test fixture and provides initialization for the tests. Since the tests.py code is still not that big, you can replace all db variables with self.db. Let me know what you think; I can do this too if you think this is a viable contribution. Also adding:
def tearDown(self):
    pass
since that is for cleaning up after each test is done.
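The suggested fixture pattern might look like this minimal sketch (the stand-in DemoDB class and test names are illustrative, not db.py's actual test suite):

```python
import unittest

class DemoDB(object):
    """Stand-in for db.DemoDB, used only to make the sketch runnable."""
    def tables(self):
        return ["Artist", "Album"]

class TestDB(unittest.TestCase):
    def setUp(self):
        # A fresh fixture per test instead of one module-level db object.
        self.db = DemoDB()

    def tearDown(self):
        # Nothing to clean up yet; placeholder for future teardown logic.
        pass

    def test_tables(self):
        self.assertIn("Artist", self.db.tables())
```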
Is there a way to specify SSL over the connection using the guide here: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.SSL
Help on method query in module db.db:
query(self, q, limit=None) method of db.db.DB instance
Query your database with a raw string.
Parameters
----------
q: str
Query string to execute
limit: int
Number of records to return
Have you considered allowing a params tuple to be passed in that would use the SQL library's escaping capabilities? (db.query('SELECT * FROM table WHERE foo="foo"') does work, but isn't very safe, even for a data exploration library.)
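For comparison, a minimal sketch of DB-API parameter binding with the stdlib's sqlite3. A params argument on db.query is the wish here; this only shows the escaping the underlying drivers already provide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (foo TEXT)")
con.execute("INSERT INTO t VALUES ('foo')")

# The driver escapes the bound value, so no manual quoting is needed.
rows = con.execute("SELECT * FROM t WHERE foo = ?", ("foo",)).fetchall()
```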
There is an error in file __init__.py, at line 69: print "var", var. Is it necessary? I have changed it to print("var", var).
I am using Python 3.5.2.
Python 2.7.5 (default, Oct 14 2013, 11:50:38)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from db import DB
>>> db = DB(username='my id', password='my password', hostname='internal ip', dbname='db name', dbtype="mysql")
Refreshing schema. Please wait...Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data1/program/anaconda/lib/python2.7/site-packages/db/db.py", line 794, in __init__
self.refresh_schema(exclude_system_tables)
File "/data1/program/anaconda/lib/python2.7/site-packages/db/db.py", line 1216, in refresh_schema
self.tables = TableSet([Table(self.con, self._query_templates, t, tables[t]) for t in sorted(tables.keys())])
File "/data1/program/anaconda/lib/python2.7/site-packages/db/db.py", line 261, in __init__
foreign_key = Column(con, queries_templates, foreign_table, foreign_column, col.type)
AttributeError: 'str' object has no attribute 'type'
Any ideas, please?
I get a UnicodeEncodeError with Unicode query parameter with the present 'pip' version.
>>> db.query(u"select * from words where form == '{{ word }}' limit 10", data={'word': u'læge'})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/db/db.py", line 1307, in query
q = self._apply_handlebars(q, data, union)
File "/usr/local/lib/python2.7/dist-packages/db/db.py", line 1144, in _apply_handlebars
query = str(query)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 36: ordinal not in range(128)
In the present GitHub version the error occurs in line 1232.
I have no problem if I interpolate before calling the query method, i.e., avoid self._apply_handlebars.
>>> db.query(u"select * from words where form == 'læge' limit 10")
index id form pos
0 9275 11031550 læge Noun
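One hedged sketch of a fix: coerce the query to text without ever calling str() on unicode input. to_text is a hypothetical helper, not db.py's code; on Python 2 this avoids the implicit ASCII encode that str(query) performs.

```python
def to_text(q):
    # Decode byte strings; pass text through untouched.
    if isinstance(q, bytes):
        return q.decode("utf-8")
    return q
```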
Example / wish:
from db import DB
db = DB(dbtype="sqlite", filename=":memory:")
db.query("""DROP TABLE IF EXISTS writer;
CREATE TABLE writer (first_name, last_name, year_of_death);
INSERT INTO writer VALUES ('William', 'Shakespeare', 1616);
INSERT INTO writer VALUES ('Bertold', 'Brecht', 1956);
SELECT *, sqlite_version() as sqlite_version from writer order by year_of_death;""")
It would be great to add Oracle DB connectivity, given their massive presence in business.
boto is heavy, so having it optional would be kind
Currently TableSets and ColumnSets are represented as "tablified" PrettyTables. However, when working with a database with lots of tables, or a table with lots of columns (>100), it would be nice to be able to return only a slice of the table.
An easy way to do this would be to have the option to return the TableSets or ColumnSets as pandas DataFrames.
Obviously the pandas DataFrame representation in the IPython terminal is not as pretty, but at least in IPython notebook the HTML representation of DataFrames is quite nice.
Something like:
rows = []
for col in self._columns:
    rows.append({"Column": col.name, "Type": col.type,
                 "Foreign Keys": col._str_foreign_keys(),
                 "Reference Keys": col._str_ref_keys()})
df = pd.DataFrame(rows)
Perhaps in a _repr_df_ function.
db.py will identify all tables across all schemas, but can't query them because it only tries to access the unqualified table name, without the schema prepended.
For example, if I have a schema s and a table t, calling db.all() tries to execute SELECT * FROM t instead of SELECT * FROM s.t. Unless users have configured a postgres search_path (and have unambiguous table names), this query will fail.
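A minimal sketch of the fix: build schema-qualified names before querying. The quoting below is simplified and assumes plain identifiers; qualified_name is a hypothetical helper, not db.py's code.

```python
def qualified_name(schema, table):
    # Double-quote each part so mixed-case names survive in postgres.
    return '"%s"."%s"' % (schema, table)

query = "SELECT * FROM %s" % qualified_name("s", "t")
```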
Hello,
First, thanks for this amazing library.
I am trying to template complex SQL queries and would benefit from being able to register partials for use in the handlebar queries.
Is there any way to accomplish that in db.py without any string manipulation shenanigans?
I'm getting a strange set of errors when trying to establish a database connection. Initially, it gave me:
273 attr = "_" + col.name
274 #print col
--> 275 setattr(self, attr, col)
276
277 self._cur.execute(self._query_templates['system']['foreign_keys_for_table'].format(table=self.name))
AttributeError: can't set attribute
Since this is a pretty complicated database, I thought that it might help to exclude system tables, so I set that parameter in the DB() call to True. Then, when I ran that same call successive times, it began to alternate between the above error and the following error (I never seemed to get the same one two times in a row, so it does appear to be strict alternation):
275 setattr(self, attr, col)
276
--> 277 self._cur.execute(self._query_templates['system']['foreign_keys_for_table'].format(table=self.name))
278 for (column_name, foreign_table, foreign_column) in self._cur:
279 col = getattr(self, column_name)
DatabaseError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Request in queries/mssql.py is missing table_schema
One issue I run into is having databases with large numbers of tables that are user-specific yet necessary architecturally. A good solution for me has been using PyCharm/JetBrains products, as their database connector allows you to set a regex on which tables to include in the schema. This lets me filter out all of those needless tables and only access the core that I need.
Hi,
I'm just curious. Why do you use the template engine pybars instead of a more mainstream package such as jinja2? Are there features you use in pybars which are not implemented elsewhere? Or did the author(s) already know pybars and simply use it?
Thanks. Great job by the way, very (very) useful.
Damien
Are you planning on supporting Presto in the future?
That would be awesome.
http://prestodb.io/docs/current/
Whenever I initialize a DB object with my database (Postgres 9.3), and then try to conduct an operation on it (say, show_tables), I get
DatabaseError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
I know that it's some setting particular to my database configuration interacting with db.py, because I can use db.py on other databases, and I can use vanilla psycopg2 on this database without trouble, but I haven't been able to track down what the issue is.
Error appears to be due to the "_" prepended to the column name if the column name is either "name" or "con". In this specific case dim_cell.name gets prepended to form "_name".
+------------------+------+---------+--------------+----------------+
| Table            | Name | Type    | Foreign Keys | Reference Keys |
+------------------+------+---------+--------------+----------------+
| dim_cell         | name | varchar |              |                |
+------------------+------+---------+--------------+----------------+
----> db = DB(username="user", password="pw", hostname="hostname",port="5432",dbname="dbname", dbtype="postgres")
C:\Users\user1\AppData\Local\Continuum\Anaconda\lib\site-packages\db\db.py in __init__(self, username, password, hostname, port, filename, dbname, dbtype, schemas, profile, exclude_system_tables, limit, keys_per_column)
833
834 self.tables = TableSet([])
--> 835 self.refresh_schema(exclude_system_tables)
836 self.handlebars = pybars.Compiler()
837
C:\Users\user1\AppData\Local\Continuum\Anaconda\lib\site-packages\db\db.py in refresh_schema(self, exclude_system_tables)
1339 tables[table_name].append(Column(self.con, self._query_templates, table_name, column_name, data_type, self.keys_per_column))
1340
-> 1341 self.tables = TableSet([Table(self.con, self._query_templates, t, tables[t], keys_per_column=self.keys_per_column) for t in sorted(tables.keys())])
1342 sys.stderr.write("done!\n")
1343
C:\Users\user1\AppData\Local\Continuum\Anaconda\lib\site-packages\db\db.py in __init__(self, con, query_templates, name, cols, keys_per_column)
272 if attr in ("name", "con"):
273 attr = "_" + col.name
--> 274 setattr(self, attr, col)
275
276 self._cur.execute(self._query_templates['system']['foreign_keys_for_table'].format(table=self.name))
AttributeError: can't set attribute
Since we're getting a few requests of the form "Can you support database X?", it'd be really helpful to have a developer checklist of features to implement when adding support for a new database.
I can't seem to get the SQL Server driver working on OSX. I've got pyodbc working (which seems to be the most accepted way to query SQL Server from Python).
Any suggestions?
Hi there, first of all thanks for writing this awesome project.
What are you thoughts on adding a way to get db/table size to this library? I was thinking something along the lines of:
db = DB(profile='default')
db.size # Returns db size
db.tables.my_table.size # Returns table size
db.tables.my_table.rows # Returns number of rows
The ability to run EXPLAIN statements would also be really useful.
Hi,
I would like to get the number of rows in a specific table. Is it possible to add a count or rows method/property to the Table class? You can also add the __len__ special method.
Moreover, this number of rows could then be added to the TableSet repr.
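The requested API could look like this minimal sketch (the Table class below is a toy stand-in for db.py's Table, with a sqlite3 connection for demonstration; none of this is db.py's actual implementation):

```python
import sqlite3

class Table(object):
    """Toy stand-in for db.py's Table, just to sketch the requested API."""
    def __init__(self, con, name):
        self._con = con
        self.name = name

    @property
    def count(self):
        # Number of rows in the table.
        cur = self._con.execute("SELECT COUNT(*) FROM %s" % self.name)
        return cur.fetchone()[0]

    def __len__(self):
        return self.count

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE writer (name TEXT)")
con.execute("INSERT INTO writer VALUES ('Brecht')")
t = Table(con, "writer")
```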
Cheers,
Damien
Hi,
Thanks for the very cool implementation.
When connected to an Amazon Redshift database, my attempt to TRUNCATE a table
from db import DB
db = DB()
db.query("TRUNCATE table_name;")
raised an error:
TypeError: 'NoneType' object is not iterable
Does db.py support the TRUNCATE command?
Many analysts work in enterprises where there are some hard and fast compliance rules against storing passwords in cleartext (yes, base64 == cleartext). Adding an option to let folks use a master "encryption key" (yes, one that they would have to type in) to encode/decode the credentials would add a bit of security to this feature and prevent an audit nightmare.
It seems that at start-up this module runs a high number of queries against the information schema, making init time exceedingly high.
A possible solution might be temporarily disabling the calculation of index statistics using the following:
SET innodb_stats_on_metadata=0;
Thanks
In a dataframe resulting from e.g. db.tables.table.all(), utf-8 values from postgresql were double-encoded (encoded as utf-8 twice).
When I later had to save my dataframe to an Excel sheet or a .csv file, after some troubleshooting I found I had to call .decode('utf-8') on all values in the dataframe for the export to work.
When trying to load a saved profile, I'm getting this AttributeError:
AttributeError Traceback (most recent call last)
<ipython-input-1-3981467c7daa> in <module>()
2 import pandas as pd
3
----> 4 NARDO = DB(profile='NARDO')
5 #NARDO = DB(hostname='[REDACTED]', dbtype='mssql')
c:\Anaconda\lib\site-packages\db\db.pyc in __init__(self, username, password, hostname, port, filename, dbname, dbtype, schemas, profile, exclude_system_tables, limit, keys_per_column, driver)
833 if HAS_ODBC:
834 base_con = "Driver={driver};Server={server};Database={database};".format(
--> 835 driver=self.driver or "SQL Server",
836 server=self.hostname or "localhost",
837 database=self.dbname or ''
AttributeError: 'DB' object has no attribute 'driver'
The odd part is that I can connect to the database no problem if I specify the hostname as if it's a new connection: NARDO = DB(hostname='[REDACTED]', dbtype='mssql'). Not sure if this is also a problem with other DB types; I have only tested it on MSSQL.
Calls to query do not pass arguments the right way and duplicate the _apply_handlebars function.
The db.to_redshift feature is great, but it does not follow DB's pattern for database credentials. It would be great if you could put in the S3 credentials once, save them, and load them on the fly. Example of what I'd expect from the package:
from db import S3
s3 = S3(credentials)
s3.save_credentials('s3_for_redshift')
s3 = S3(profile='s3_for_redshift')
...
db.to_redshift(df, s3_cred = s3)
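The wished-for persistence could be sketched roughly like this. The function names, JSON layout, and explicit path argument are all assumptions; db.py stores database profiles differently and has no S3 profile support today.

```python
import json
import os

def save_credentials(profile, credentials, path):
    # Merge this profile into whatever profiles already exist on disk.
    profiles = {}
    if os.path.exists(path):
        with open(path) as f:
            profiles = json.load(f)
    profiles[profile] = credentials  # e.g. {"access_key": ..., "secret_key": ...}
    with open(path, "w") as f:
        json.dump(profiles, f)

def load_credentials(profile, path):
    with open(path) as f:
        return json.load(f)[profile]
```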
Hi,
Suppose, for demo or test purposes, I have an in-memory SQLite database created with SQLAlchemy, such as:
engine = create_engine('sqlite://')
Can I read it from db.DB? I tested:
from db import DB
db = DB(filename="sqlite://", dbtype='sqlite')
without success. Is there a plan to support it?
Thanks!
Cheers,
Damien G.
This likely happens on any SET queries that do not return a result.
Rather than type in all the tables I'd like to exclude from loading in DB(), would it be possible to just select the few tables I'd like to load the schema for? I feel like this would be a ton faster for me.
http://blog.yhathq.com/posts/dbpy-0.4-handlebars-meets-sql.html
(Trying on Python 3.4, Windows.)
D:\WinPython\basedir34\build\winpython-3.4.3.amd64\python-3.4.3.amd64\lib\site-packages\db\db.py in _apply_handlebars(self, q, data, union)
1034
1035 def _apply_handlebars(self, q, data, union=True):
-> 1036 q = unicode(q)
1037 template = self.handlebars.compile(q)
1038 if isinstance(data, list):
NameError: name 'unicode' is not defined
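A common compatibility shim for this NameError (a sketch, not db.py's actual patch): alias a text type once instead of calling unicode directly.

```python
try:
    text_type = unicode   # Python 2: real unicode type
except NameError:
    text_type = str       # Python 3: str is already unicode

q = text_type("select sqlite_version();")
```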
Raises an exception that db type can't be found.
I downloaded the package today. I would appreciate it if anyone can help with this error.
Python 2.7.3 (default, Jun 22 2015, 19:33:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
from db import DemoDB
db = DemoDB()
Indexing schema. This will take a second...finished!
db.tables
Refreshing schema. Please wait...Traceback (most recent call last):
File "", line 1, in
File "/home/magnolia/python/lib/python2.7/site-packages/db.py-0.4.4-py2.7.egg/db/db.py", line 959, in tables
self.refresh_schema(self._exclude_system_tables, self._use_cache)
File "/home/magnolia/python/lib/python2.7/site-packages/db.py-0.4.4-py2.7.egg/db/db.py", line 1510, in refresh_schema
tables = self._gen_tables_from_col_tuples(col_meta)
File "/home/magnolia/python/lib/python2.7/site-packages/db.py-0.4.4-py2.7.egg/db/db.py", line 1586, in _gen_tables_from_col_tuples
for (table_schema, table_name, column_name, data_type) in cols:
ValueError: need more than 3 values to unpack
You mentioned Oracle in the list of supported databases, but I can't quite figure it out. I tried dbtype="oracle" and "Oracle". I do have the Oracle package cx_oracle installed.
/Users/gwulfs/anaconda/lib/python2.7/site-packages/db/db.pyc in init(self, username, password, hostname, port, filename, dbname, dbtype, profile, exclude_system_tables, limit)
670
671 if self.dbtype=="postgres" or self.dbtype=="redshift":
--> 672 self.con = pg.connect(user=self.username, password=self.password,
673 host=self.hostname, port=self.port, dbname=self.dbname)
674 self.cur = self.con.cursor()
NameError: global name 'pg' is not defined
Example.
This is OK:
from db import DB
db = DB(dbtype="sqlite", filename=":memory:")
db.query("select sqlite_version() as sqlite_version;")
This is not OK:
db.query("DROP TABLE IF EXISTS writer;")
Also not OK:
db.query("CREATE TABLE writer (first_name, last_name, year_of_death);")
Also not OK:
db.query("INSERT INTO writer VALUES ('William', 'Shakespeare', 1616);")
At line 17 of the db.py file, shouldn't we also import ColumnSet? I am getting the error I pasted in the title section.
The schema of the database I'm trying to connect to includes a table with column name 'count'. When I try to connect to it, db.py creates a Table object for that table and calls setattr(self, attr, col) for each column in the table. Since Table has a property called 'count', when it gets to that column, it raises an AttributeError. Obviously a similar problem exists with the other method and field names in Table. Other people have mentioned this problem in issue #33
I'm not sure of the best way to handle this. Table.__init__ already checks to make sure we aren't creating a column attribute with name 'name' or 'con'. If so, it puts an underscore in front of the attribute name (this also seems to be a bug, as _con is the name of the field). This behavior isn't necessarily bad, but it does make it hard to find the column names, since underscore attributes are kind of hidden in Python. For example, in IPython if I have a Table named Foo with a column attribute named _bar, then typing Foo. does not give me _bar as a suggestion. Apart from looking at attribute names on the class, I don't see any way to determine the columns in a table other than db.find_table("Foo").
Perhaps a combination of putting underscores in front of conflicting column name attributes and adding a 'columns' property that would return a list of column names? That still isn't quite ideal, but it would be better than the current situation which completely prevents me from being able to use db.py, which is a shame because it seems pretty cool.
I'd be happy to submit a pull request, but as I said, I'm not sure of the best resolution.
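The clash can be reproduced minimally (the class below is illustrative, not db.py's Table): setattr on an instance fails when the class defines a read-only property of the same name.

```python
class Table(object):
    @property
    def count(self):
        return 0   # read-only property, like Table.count in db.py

t = Table()
try:
    setattr(t, "count", object())   # simulating a column named 'count'
    conflicted = False
except AttributeError:               # "can't set attribute"
    conflicted = True
```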
Need some docs...
It has a problem reading properties of a SpatialIndex (VirtualSpatialIndex) virtual table, even though I can run the same SQL in the CLI.
In ipython:
In [1]: from db import DB
In [2]: db = DB(username=None, password=None, hostname='localhost', filename='/home/bernardo/Dropbox/workspace/orache/db.sqlite3', dbtype='sqlite')
Indexing schema. This will take a second...---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-2-a38e0de61324> in <module>()
----> 1 db = DB(username=None, password=None, hostname='localhost', filename='/home/bernardo/Dropbox/workspace/orache/db.sqlite3', dbtype='sqlite')
/home/bernardo/envs/devpy/local/lib/python2.7/site-packages/db.py-0.3.1-py2.7.egg/db/db.py in __init__(self, username, password, hostname, port, filename, dbname, dbtype, schemas, profile, exclude_system_tables, limit)
757 self.con = sqlite.connect(self.filename)
758 self.cur = self.con.cursor()
--> 759 self._create_sqlite_metatable()
760 elif self.dbtype=="mysql":
761 if not HAS_MYSQL:
/home/bernardo/envs/devpy/local/lib/python2.7/site-packages/db.py-0.3.1-py2.7.egg/db/db.py in _create_sqlite_metatable(self)
1170 tables = [row[0] for row in self.cur.execute("select name from sqlite_master where type='table';")]
1171 for table in tables:
-> 1172 for row in self.cur.execute("pragma table_info(%s)" % table):
1173 rows_to_insert.append((table, row[1], row[2]))
1174 # find for table and column names
OperationalError: no such module: VirtualSpatialIndex