
pandasql's Introduction

pandasql

pandasql allows you to query pandas DataFrames using SQL syntax. It works similarly to sqldf in R. pandasql seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or pandas.

Installation

$ pip install -U pandasql

Basics

The main function in pandasql is sqldf. sqldf accepts two parameters:

  • a SQL query string
  • a set of session/environment variables (locals() or globals())

Specifying locals() or globals() can get tedious. You can define a short helper function to fix this.

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Querying

pandasql uses SQLite syntax. Any pandas dataframes will be automatically detected by pandasql. You can query them as you would any regular SQL table.

$ python
>>> from pandasql import sqldf, load_meat, load_births
>>> pysqldf = lambda q: sqldf(q, globals())
>>> meat = load_meat()
>>> births = load_births()
>>> print(pysqldf("SELECT * FROM meat LIMIT 10;").head())
                  date  beef  veal  pork  lamb_and_mutton broilers other_chicken turkey
0  1944-01-01 00:00:00   751    85  1280               89     None          None   None
1  1944-02-01 00:00:00   713    77  1169               72     None          None   None
2  1944-03-01 00:00:00   741    90  1128               75     None          None   None
3  1944-04-01 00:00:00   650    89   978               66     None          None   None
4  1944-05-01 00:00:00   681   106  1029               78     None          None   None

Joins and aggregations are also supported:

>>> q = """SELECT
        m.date, m.beef, b.births
     FROM
        meats m
     INNER JOIN
        births b
           ON m.date = b.date;"""
>>> joined = pyqldf(q)
>>> print joined.head()
                    date    beef  births
403  2012-07-01 00:00:00  2200.8  368450
404  2012-08-01 00:00:00  2367.5  359554
405  2012-09-01 00:00:00  2016.0  361922
406  2012-10-01 00:00:00  2343.7  347625
407  2012-11-01 00:00:00  2206.6  320195

>>> q = "select
           strftime('%Y', date) as year
           , SUM(beef) as beef_total
           FROM
              meat
           GROUP BY
              year;"
>>> print pysqldf(q).head()
   year  beef_total
0  1944        8801
1  1945        9936
2  1946        9010
3  1947       10096
4  1948        8766

More information and code samples available in the examples folder or on our blog.


pandasql's People

Contributors

allenpc, aplavin, baderbuddy, brentpayne, colindickson, glamp, hernamesbarbara, shantanuo, stonebig, tomaskazemekas, xmnlab


pandasql's Issues

Example code does not work...

Perhaps it's my config, but I get errors on the pysqldf line.

from pandasql import sqldf, load_meat, load_births
pysqldf = lambda q: sqldf(q, globals())
meat = load_meat()
births = load_births()
print pysqldf("SELECT * FROM meat LIMIT 10;").head()

"InterfaceError: Error binding parameter 1 - probably unsupported type."

I'm using pandas 0.14. Any thoughts?

Support for this project

From the title of one of the "recent" (ordinal, but not calendar-wise) commits, 15511b9:

"this is my project. there are many like it, but this one is fucking mine"

Well - then are you planning to do anything with it? There has been little activity since that post and none in the past ten months - and the issues are piling up.

When pandasql works - it is very useful. But it does have many bugs - do you have a suggestion on a path forward here?

LEFT OUTER JOIN resolves to LEFT JOIN

Left outer join should provide only the data that does not have a match to the right table. Instead this seems to resolve to a left join.

import pandas as pd
from pandasql import PandaSQL

df_a = pd.DataFrame({'test': pd.Series([1., 2., 3., 4.])})
df_b = pd.DataFrame({'test': pd.Series([1., 2., 3.])})

pdsql = PandaSQL()
joined_df = pdsql("SELECT a.* FROM df_a a LEFT OUTER JOIN df_b b on a.test = b.test")
print(joined_df)

Outputs:

   test
0   1.0
1   2.0
2   3.0
3   4.0

Expected Output:

   test
3   4.0
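
A note on expectations here: in standard SQL (SQLite included), LEFT OUTER JOIN and LEFT JOIN are synonyms, so one resolving to the other is expected behavior. The output described above as expected corresponds to an anti-join, which can be written by filtering on the NULLs the left join produces. A minimal sketch, reusing the setup from the report:

import pandas as pd
from pandasql import PandaSQL

df_a = pd.DataFrame({'test': pd.Series([1., 2., 3., 4.])})
df_b = pd.DataFrame({'test': pd.Series([1., 2., 3.])})

pdsql = PandaSQL()
# Anti-join: keep only rows of df_a that have no match in df_b.
joined_df = pdsql("""
    SELECT a.*
    FROM df_a a
    LEFT JOIN df_b b ON a.test = b.test
    WHERE b.test IS NULL
""")
print(joined_df)  # only the row with 4.0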

pandasql test failures

I am trying to build pandasql from source on ubuntu16.04-ppc64le. The build was successful, but nosetests and py.test give errors.
The command py.test gives:
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/engine/default.py:393:


dsn = 'host=localhost user=postgres', connection_factory = None
cursor_factory = None, kwargs = {'host': 'localhost', 'user': 'postgres'}
kwasync = {}

def connect(dsn=None, connection_factory=None, cursor_factory=None, **kwargs):
    """
    Create a new database connection.

    The connection parameters can be specified as a string:

        conn = psycopg2.connect("dbname=test user=postgres password=secret")

    or using a set of keyword arguments:

        conn = psycopg2.connect(database="test", user="postgres", password="secret")

    Or as a mix of both. The basic connection parameters are:

    - *dbname*: the database name
    - *database*: the database name (only as keyword argument)
    - *user*: user name used to authenticate
    - *password*: password used to authenticate
    - *host*: database host address (defaults to UNIX socket if not provided)
    - *port*: connection port number (defaults to 5432 if not provided)

    Using the *connection_factory* parameter a different class or connections
    factory can be specified. It should be a callable object taking a dsn
    argument.

    Using the *cursor_factory* parameter, a new default cursor factory will be
    used by cursor().

    Using *async*=True an asynchronous connection will be created. *async_* is
    a valid alias (for Python versions where ``async`` is a keyword).

    Any other keyword parameter will be passed to the underlying client
    library: the list of supported parameters depends on the library version.

    """
    kwasync = {}
    if 'async' in kwargs:
        kwasync['async'] = kwargs.pop('async')
    if 'async_' in kwargs:
        kwasync['async_'] = kwargs.pop('async_')

    if dsn is None and not kwargs:
        raise TypeError('missing dsn and no parameters')

    dsn = _ext.make_dsn(dsn, **kwargs)
  conn = _connect(dsn, connection_factory=connection_factory, **kwasync)

E psycopg2.OperationalError: fe_sendauth: no password supplied

/root/env3/lib/python3.5/site-packages/psycopg2/init.py:130: OperationalError

The above exception was the direct cause of the following exception:

pdsql = <pandasql.sqldf.PandaSQL object at 0x3fffabb045c0>

@pytest.mark.parametrize('pdsql', [False], indirect=True)
def test_no_sideeffect_leak(pdsql):
  pdsql("CREATE TABLE tbl (col INTEGER)")

pandasql/tests/test_pandasql.py:245:


pandasql/sqldf.py:52: in __call__
with self.conn as conn:
/usr/lib/python3.5/contextlib.py:59: in __enter__
return next(self.gen)
pandasql/sqldf.py:82: in conn
conn = self.engine.connect()
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/engine/base.py:2083: in connect
return self._connection_cls(self, **kwargs)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/engine/base.py:90: in __init__
if connection is not None else engine.raw_connection()
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/engine/base.py:2169: in raw_connection
self.pool.unique_connection, _connection)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/engine/base.py:2143: in _wrap_pool_connect
e, dialect, self)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/engine/base.py:1457: in _handle_dbapi_exception_noconnection
exc_info
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/util/compat.py:203: in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/util/compat.py:186: in reraise
raise value.with_traceback(tb)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/engine/base.py:2139: in _wrap_pool_connect
return fn()
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/pool.py:328: in unique_connection
return _ConnectionFairy._checkout(self)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/pool.py:766: in _checkout
fairy = _ConnectionRecord.checkout(pool)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/pool.py:516: in checkout
rec = pool._do_get()
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/pool.py:1229: in _do_get
return self._create_connection()
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/pool.py:333: in _create_connection
return _ConnectionRecord(self)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/pool.py:461: in __init__
self.__connect(first_connect_check=True)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/pool.py:651: in __connect
connection = pool._invoke_creator(self)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/engine/strategies.py:105: in connect
return dialect.connect(*cargs, **cparams)
/root/env3/lib/python3.5/site-packages/SQLAlchemy-1.1.7-py3.5-linux-ppc64le.egg/sqlalchemy/engine/default.py:393: in connect
return self.dbapi.connect(*cargs, **cparams)


E sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) fe_sendauth: no password supplied

Please help me configure the PostgreSQL server for running pandasql so that the above error goes away.
py.test pandasql.txt
nosetests pandasql.txt
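
One way to sanity-check connectivity before re-running the suite is to point pandasql at the server with explicit credentials. A minimal sketch, assuming the db_uri parameter visible in the tracebacks above (sqldf(query, env, db_uri)), with hypothetical credentials you would replace with your own:

import pandas as pd
from pandasql import PandaSQL

# Hypothetical URI: adjust user, password, host, and database for your server.
pdsql = PandaSQL('postgresql+psycopg2://postgres:secret@localhost:5432/postgres')

df = pd.DataFrame({'x': [1, 2, 3]})
print(pdsql("SELECT COUNT(*) AS n FROM df", locals()))

The fe_sendauth error itself means the server demands a password while the tests connect as user postgres without one; either give the postgres role a password as above, or relax local authentication in pg_hba.conf.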

Installation throws a PermissionError on Windows

Hi there!

I noticed that running the indicated installation command:

> pip install -U git+https://github.com/aplavin/pandasql.git

...resulted in a PermissionError:

PermissionError: [WinError 5] Access is denied: 'C:\Users\aburnett\AppData\Local\Temp\pip-k5bywsvx-uninstall\python\python35\lib\site-packages\pandas\algos.cp35-win_amd64.pyd'

Actually I received the permission error the first time, reran the same command a second time, received the error a second time, ran the command a third time, received a similar error, and reran the command a fourth time before installation completed successfully. It seemed that it was throwing PermissionErrors for various components in the /lib directory.

Support for dictionary of dataframes

I have created a dictionary of dataframes (dynamically), and I'm trying to read it this way:

SELECT * FROM dfs["ChrUn"] WHERE AND ABS(sstart - 234588) <= 5 AND ABS(send - 234786) <= 5

res = ps.sqldf(sql, locals())

But I get:


AttributeError Traceback (most recent call last)
in <module>()
5 sql += ' AND ABS(sstart - %i) <= %i AND ABS(send - %i) <= %i ' % (row.sstart, nt_sep, row.send, nt_sep)
6 print(sql)
----> 7 res = ps.sqldf(sql, locals())
8 if len(res.index) > 1:
9 lengths = {}

~/TEs_papa/venv/lib/python3.5/site-packages/pandasql/sqldf.py in sqldf(query, env, db_uri)
154 >>> sqldf("select avg(x) from df;", locals())
155 """
--> 156 return PandaSQL(db_uri)(query, env)

~/TEs_papa/venv/lib/python3.5/site-packages/pandasql/sqldf.py in __call__(self, query, env)
56 continue
57 self.loaded_tables.add(table_name)
---> 58 write_table(env[table_name], table_name, conn)
59
60 try:

~/TEs_papa/venv/lib/python3.5/site-packages/pandasql/sqldf.py in write_table(df, tablename, conn)
119 message='The provided table name \'%s\' is not found exactly as such in the database' % tablename)
120 to_sql(df, name=tablename, con=conn,
--> 121 index=not any(name is None for name in df.index.names)) # load index into db if all levels are named
122
123

AttributeError: 'dict' object has no attribute 'index'

Any ideas how can I get this working?

pandasql==0.7.3
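
sqldf looks tables up by bare name in the environment mapping you pass, so a subscripted expression like dfs["ChrUn"] can never resolve. A minimal workaround sketch, binding the dict entry to a plain variable first (dfs is assumed to be the dict from the report):

import pandasql as ps

chr_un = dfs["ChrUn"]  # a plain name that sqldf can find in locals()

sql = """
    SELECT * FROM chr_un
    WHERE ABS(sstart - 234588) <= 5 AND ABS(send - 234786) <= 5
"""
res = ps.sqldf(sql, locals())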

Support of 'where exists' operations

Hi, I have a query I am trying to execute in pandasql, but my Jupyter environment freezes, so I am assuming this is not supported. The query I'm trying to execute:

q1="""
SELECT
DATE(Timestamp) AS date,
COUNT(DISTINCT User) AS user_count
FROM
df pc
WHERE EXISTS (
SELECT *
FROM df pc1
WHERE
pc1.User = pc.User AND
DATE(pc1.Timestamp) < DATE(pc.Timestamp)
)
GROUP BY pc.Timestamp
"""

Patch ideas submission

Hi Greg,

Over my patch #15 concerning "bind variables" (or SQL parameters), I have several ideas for follow-up patches:

  1. "sql normalization" (in advance)
    optional sql parameter notation (":" prefix, ":a_dataframe") for tables to upload to sql,
  2. "we can query native sql tables"
    sql tables can have been created by previous sql requests,
    so keep the "table not found" message only if a table name has the ":" prefix notation,
  3. "accept several sql orders"
    the result is fed from the last sql instruction that gave a query result,
  4. "uploaded tables don't overwrite native sql tables"
    tables uploaded from Python space are created with a dedicated prefix, like "tmp",
    tables uploaded from Python space are removed at the end of the sql orders.

Are you in favor of any of those 4 (+ #15) ideas?

pandasql can also work with lists

Hello,

Would it be compatible with pandasql's current usage to also accept simple lists as input tables?
A list would be treated as a dataframe with 'automatic' column names c0, c1, ..., cN.

==> It would make pandasql usable without knowing about pandas, for the input part.
a list: alist = [(1616, 'glouglou'), (1956, 'tagada')]
a dataframe: df =

  first_name    last_name  year
0    William  Shakespeare  1616
1    Bertold       Brecht  1956

a valid (after the patch) pandasql query:

sqldf("select * from df inner join alist on df.year = alist.c0", locals())

Left Join functioning same as Inner Join

When running a Left join, rows in the left table without a match in the right table are excluded from results table. This emulates the expected results of an inner join, whereas in a left join, one would expect these excluded rows to remain with nulls for the values of the joined fields.

I've set up the following dummy tables as pandas data frames.

R = pd.DataFrame({'jid':[1,3,1,2,3], 'date':['2000-02-04','2000-01-05','2000-01-30'
,'2000-03-10','2000-04-28'], 'amount':[1,2,3,4,5]})
L = pd.DataFrame({'jid':[1,1,2,3], 'start': ['2000-01-01', '2000-01-02', '2000-03-01'
, '2000-05-01'], 'end': ['2000-01-31', '2000-02-28', '2000-03-31', '2000-05-31'] })

R.date = pd.to_datetime(R.date)
L.start = pd.to_datetime(L.start)
L.end = pd.to_datetime(L.end)

Table R

jid  date        amount
1    2000-02-04  1
3    2000-01-05  2
1    2000-01-30  3
2    2000-03-10  4
3    2000-04-28  5

Table L

jid  start       end
1    2000-01-01  2000-01-31
1    2000-02-01  2000-02-28
2    2000-03-01  2000-03-31
3    2000-05-01  2000-05-31

sqlLeft = """ SELECT L., R. FROM L LEFT JOIN R ON L.jid = R.jid WHERE R.date BETWEEN L.start AND L.end """
sqlin = """ SELECT L., R. FROM L INNER JOIN R ON L.jid = R.jid WHERE R.date BETWEEN L.start AND L.end """

resultsL = pandasql.sqldf( sqlLeft, locals() )
resultsin = pandasql.sqldf( sqlin, locals() )

Expected Results of Left Join

jid  start       end         jid  date        amount
1    2000-01-01  2000-01-31  1    2000-01-30  3
1    2000-02-01  2000-02-28  1    2000-02-04  1
2    2000-03-01  2000-03-31  2    2000-03-10  4
3    2000-05-01  2000-05-31  NaN  NaN         NaN

Actual Results of Left Join

jid  start       end         jid  date        amount
1    2000-01-01  2000-01-31  1    2000-01-30  3
1    2000-02-01  2000-02-28  1    2000-02-04  1
2    2000-03-01  2000-03-31  2    2000-03-10  4

Results of Inner Join

jid  start       end         jid  date        amount
1    2000-01-01  2000-01-31  1    2000-01-30  3
1    2000-02-01  2000-02-28  1    2000-02-04  1
2    2000-03-01  2000-03-31  2    2000-03-10  4
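
This is standard SQL behavior rather than a pandasql bug: the WHERE clause is applied after the join, and R.date BETWEEN L.start AND L.end is never true when R's columns are NULL, so the unmatched row from L is filtered out. Moving the predicate into the ON clause preserves left-join semantics; a minimal sketch using the frames above:

import pandasql

sqlLeftOuter = """
    SELECT L.*, R.*
    FROM L
    LEFT JOIN R
      ON L.jid = R.jid
     AND R.date BETWEEN L.start AND L.end
"""
resultsL = pandasql.sqldf(sqlLeftOuter, locals())
# The jid=3 row of L now survives, with NULLs for R's columns.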

DataFrame column names that aren't valid SQLite3 column names?

I'm trying to query dataframes that contain various "natural" column names (even as simple as names containing spaces) from found CSV and Excel files, e.g.

pd.read_csv('http://www.iwight.com/documentlibrary/download/november-2013-transparency-data-csv')

and keep hitting naming convention errors:

q='''
SELECT * FROM df LIMIT 5;
'''
pysqldf(q)

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-23-6f329a9debd3> in <module>()
      2 SELECT * FROM df LIMIT 5;
      3 '''
----> 4 pysqldf(q)

<ipython-input-5-6245773871e5> in <lambda>(q)
----> 1 pysqldf = lambda q: sqldf(q, globals())

/usr/local/lib/python3.4/dist-packages/pandasql/sqldf.py in sqldf(q, env, inmemory)
    113         df = env[table]
    114         df = _ensure_data_frame(df, table)
--> 115         _write_table(table, df, conn)
    116 
    117     try:

/usr/local/lib/python3.4/dist-packages/pandasql/sqldf.py in _write_table(tablename, df, conn)
     66             msg = "please follow SQLite column naming conventions: "
     67             msg += "http://www.sqlite.org/lang_keywords.html"
---> 68             raise Exception(msg)
     69 
     70     write_frame(df, name=tablename, con=conn, flavor='sqlite')

Exception: please follow SQLite column naming conventions: http://www.sqlite.org/lang_keywords.html

Is there a switch to mitigate against this?

One (probably overcomplicated) way I thought of coping with this would be to:

  1. generate a mapping from an original column name to e.g. a hashed colname that is guaranteed well formed for SQLite3
  2. use the hashed colname in the SQLite3 table
  3. whenever a query is run, silently convert the 'raw' column name to the hashed column name

(A simpler alternative is sketched below.)
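
A minimal sketch that instead sanitizes the column names in pandas before the frame is ever loaded into SQLite (the regex policy is an assumption, not pandasql API):

import re
import pandas as pd
from pandasql import sqldf

df = pd.read_csv('http://www.iwight.com/documentlibrary/download/november-2013-transparency-data-csv')

# Collapse anything that is not a letter, digit, or underscore, so every
# name is a plain SQLite identifier.
df.columns = [re.sub(r'\W+', '_', str(c)).strip('_') for c in df.columns]

print(sqldf("SELECT * FROM df LIMIT 5;", locals()))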

insert into or update

We can use the sqldf package in R to conduct an update or insert into like this:

tablename <-sqldf(c("UPDATE tablename SET column1=value1,column2=value2,... WHERE some_column=some_value", "select * from tablename "))

tablename <-sqldf(c("insert into tablename (column1, column2,โ€ฆ.) values (value1, value2,..)", "select * from tablename "))

Can we perform the same update or insert into the DataFrame using pandasql?
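
As far as I can tell, each sqldf call loads the frames into a throwaway database and returns only the result of a single SELECT, so the R-style chaining of an UPDATE plus a follow-up SELECT has no direct equivalent. A minimal pandas-native sketch of the same two operations (column names are placeholders from the R example):

import pandas as pd

df = pd.DataFrame({'some_column': [1, 2], 'column1': ['a', 'b'], 'column2': [10, 20]})

# UPDATE ... SET column1=..., column2=... WHERE some_column = 1
mask = df['some_column'] == 1
df.loc[mask, ['column1', 'column2']] = ['z', 99]

# INSERT INTO ... VALUES (...)
df.loc[len(df)] = [3, 'c', 30]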

support of math functions?

I am trying to use the function 'exp' in pandasql, but this error was thrown:

PandaSQLException: (sqlite3.OperationalError) no such function: exp [SQL: 'select *, 1/(1+exp(-(tree0+tree1+tree2+tree3+tree4+tree5+tree6+tree7+tree8+tree9))) as score from res_df;']
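
SQLite ships without exp() unless it was compiled with the optional math functions, which is the usual reason for "no such function". One workaround is to let SQL do the selection and compute the score in numpy afterwards; a sketch assuming the tree0..tree9 columns from the failing query:

import numpy as np
from pandasql import sqldf

res = sqldf("select * from res_df;", globals())

tree_cols = ['tree%d' % i for i in range(10)]
res['score'] = 1.0 / (1.0 + np.exp(-res[tree_cols].sum(axis=1)))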

dask.dataframe support?

Hi,

When I try to use pandasql with dask.dataframe, I get the error below. Do you have any ideas for using pandasql with dask?

s3fp="s3://bucket/.../xxx00103_1_prt_/xxx00103_1_prt_.csv.gz"

dfp = dd.read_csv(s3fp,
compression='gzip',
header=None,
sep=',',
quotechar='"',
blocksize=None,
dtype=object,
names=['A' + str(i) for i in range(60)]
)
ps.sqldf('select * from dfp limit 10;',globals())


AttributeError Traceback (most recent call last)
in <module>()
----> 1 ps.sqldf('select * from dfp limit 10;',globals())

C:\Miniconda2\envs\py27\lib\site-packages\pandasql\sqldf.pyc in sqldf(query, env, db_uri)
154 >>> sqldf("select avg(x) from df;", locals())
155 """
--> 156 return PandaSQL(db_uri)(query, env)

C:\Miniconda2\envs\py27\lib\site-packages\pandasql\sqldf.pyc in __call__(self, query, env)
56 continue
57 self.loaded_tables.add(table_name)
---> 58 write_table(env[table_name], table_name, conn)
59
60 try:

C:\Miniconda2\envs\py27\lib\site-packages\pandasql\sqldf.pyc in write_table(df, tablename, conn)
119 message='The provided table name \'%s\' is not found exactly as such in the database' % tablename)
120 to_sql(df, name=tablename, con=conn,
--> 121 index=not any(name is None for name in df.index.names)) # load index into db if all levels are named
122
123

C:\Miniconda2\envs\py27\lib\site-packages\dask\dataframe\core.pyc in __getattr__(self, key)
2246 elif key in self._dt_attributes:
2247 return getattr(self.dt, key)
-> 2248 raise AttributeError("'Index' object has no attribute %r" % key)
2249
2250 def __dir__(self):

AttributeError: 'Index' object has no attribute 'names'
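
pandasql hands the object to pandas' to_sql, which expects a real pandas DataFrame, hence the failure on a dask Index. A workaround sketch, materializing the dask frame first (only viable if the data fits in memory; dfp is the frame from the snippet above):

import pandasql as ps

pdf = dfp.compute()  # collapse the dask graph into a pandas DataFrame
print(ps.sqldf('select * from pdf limit 10;', locals()))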

selecting on large dataframes

I'm getting an OperationalError when selecting from a large dataframe (full error message at bottom). The error seems to be related to the size of the dataframe not the size of the selection.

import numpy as np
import pandas as pd
from pandasql import sqldf

# works
df = pd.DataFrame(np.random.rand(45000,5))
select = sqldf('SELECT * FROM df')

# fails
df = pd.DataFrame(np.random.rand(60000,5))
select = sqldf('SELECT * FROM df')

# works
df = pd.DataFrame(np.random.rand(60000,2))
select = sqldf('SELECT * FROM df')

#works
df = pd.DataFrame(np.random.rand(45000,5))
select = sqldf('SELECT "1", "2" FROM df')

# fails
df = pd.DataFrame(np.random.rand(60000,5))
select = sqldf('SELECT "1", "2" FROM df')
Error message

---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1192                         parameters,
-> 1193                         context)
   1194         except BaseException as e:

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
    506     def do_execute(self, cursor, statement, parameters, context=None):
--> 507         cursor.execute(statement, parameters)
    508 

OperationalError: too many SQL variables

The above exception was the direct cause of the following exception:

OperationalError                          Traceback (most recent call last)
<ipython-input-14-b81da7348892> in <module>()
      1 df = pd.DataFrame(np.random.randint(0,10,size=(60000,5)))
----> 2 select = sqldf('SELECT "1", "2" FROM df')

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/pandasql/sqldf.py in sqldf(query, env, db_uri)
    154     >>> sqldf("select avg(x) from df;", locals())
    155     """
--> 156     return PandaSQL(db_uri)(query, env)

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/pandasql/sqldf.py in __call__(self, query, env)
     56                     continue
     57                 self.loaded_tables.add(table_name)
---> 58                 write_table(env[table_name], table_name, conn)
     59 
     60             try:

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/pandasql/sqldf.py in write_table(df, tablename, conn)
    119                        message='The provided table name \'%s\' is not found exactly as such in the database' % tablename)
    120         to_sql(df, name=tablename, con=conn,
--> 121                index=not any(name is None for name in df.index.names))  # load index into db if all levels are named
    122 
    123 

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/pandas/io/sql.py in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype)
    448     pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
    449                       index_label=index_label, schema=schema,
--> 450                       chunksize=chunksize, dtype=dtype)
    451 
    452 

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype)
   1147                          schema=schema, dtype=dtype)
   1148         table.create()
-> 1149         table.insert(chunksize)
   1150         if (not name.isdigit() and not name.islower()):
   1151             # check for potentially case sensitivity issues (GH7815)

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/pandas/io/sql.py in insert(self, chunksize)
    661 
    662                 chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
--> 663                 self._execute_insert(conn, keys, chunk_iter)
    664 
    665     def _query_iterator(self, result, chunksize, columns, coerce_float=True,

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/pandas/io/sql.py in _execute_insert(self, conn, keys, data_iter)
    636         """Insert data into this table with database connection"""
    637         data = [{k: v for k, v in zip(keys, row)} for row in data_iter]
--> 638         conn.execute(*self.insert_statement(data, conn))
    639 
    640     def insert(self, chunksize=None):

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/engine/base.py in execute(self, object, *multiparams, **params)
    946             raise exc.ObjectNotExecutableError(object)
    947         else:
--> 948             return meth(self, multiparams, params)
    949 
    950     def _execute_function(self, func, multiparams, params):

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/sql/elements.py in _execute_on_connection(self, connection, multiparams, params)
    267     def _execute_on_connection(self, connection, multiparams, params):
    268         if self.supports_execution:
--> 269             return connection._execute_clauseelement(self, multiparams, params)
    270         else:
    271             raise exc.ObjectNotExecutableError(self)

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_clauseelement(self, elem, multiparams, params)
   1058             compiled_sql,
   1059             distilled_params,
-> 1060             compiled_sql, distilled_params
   1061         )
   1062         if self._has_events or self.engine._has_events:

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1198                 parameters,
   1199                 cursor,
-> 1200                 context)
   1201 
   1202         if self._has_events or self.engine._has_events:

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   1411                 util.raise_from_cause(
   1412                     sqlalchemy_exception,
-> 1413                     exc_info
   1414                 )
   1415             else:

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/util/compat.py in raise_from_cause(exception, exc_info)
    201     exc_type, exc_value, exc_tb = exc_info
    202     cause = exc_value if exc_value is not exception else None
--> 203     reraise(type(exception), exception, tb=exc_tb, cause=cause)
    204 
    205 if py3k:

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
    184             value.__cause__ = cause
    185         if value.__traceback__ is not tb:
--> 186             raise value.with_traceback(tb)
    187         raise value
    188 

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1191                         statement,
   1192                         parameters,
-> 1193                         context)
   1194         except BaseException as e:
   1195             self._handle_dbapi_exception(

~/anaconda2/envs/urban_prac4/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
    505 
    506     def do_execute(self, cursor, statement, parameters, context=None):
--> 507         cursor.execute(statement, parameters)
    508 
    509     def do_execute_no_params(self, cursor, statement, context=None):

OperationalError: (sqlite3.OperationalError) too many SQL variables [SQL: 'INSERT INTO df ("0", "1", "2", "3", "4") VALUES (?, ?, ?, ?, ?
...
...
(Background on this error at: http://sqlalche.me/e/e3q8)
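
The multi-row INSERT that pandasql issues binds one parameter per cell, and 60,000 rows x 5 columns blows past SQLite's default bound-parameter limit (SQLITE_MAX_VARIABLE_NUMBER, historically 999), which is why the failure tracks the total frame size rather than the selection. A workaround sketch that bypasses pandasql and loads in chunks via pandas directly:

import numpy as np
import pandas as pd
import sqlite3

df = pd.DataFrame(np.random.rand(60000, 5))
df.columns = [str(c) for c in df.columns]

conn = sqlite3.connect(':memory:')
# chunksize keeps each INSERT statement under the parameter limit.
df.to_sql('df', conn, index=False, chunksize=100)
print(pd.read_sql_query('SELECT "1", "2" FROM df', conn))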

Cannot handle quotes around table name

Even though the following is valid sql, pandasql gives an error:

import pandasql as pdsql
df1 = pd.DataFrame({
    'name': ['Joe', 'Bill', 'Bob'],
    'weight': [70, 85, 75]
})

q = 'select * from "df1"'
print(pdsql.sqldf(q, locals()))
PandaSQLException: (sqlite3.OperationalError) no such table: df1 [SQL: 'select * from "df1"']

Removing the quotes around the table name fixes this. This makes me slightly nervous about pandasql's handling of quotes and backticks :|

pip install error

Trying to install pandasql, I'm getting the following error:

Downloading/unpacking pandasql

  Running setup.py egg_info for package pandasql

    running egg_info
    writing requirements to pip-egg-info/pandasql.egg-info/requires.txt
    writing pip-egg-info/pandasql.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/pandasql.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/pandasql.egg-info/dependency_links.txt
    warning: manifest_maker: standard file '-c' not found


    reading manifest file 'pip-egg-info/pandasql.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'pip-egg-info/pandasql.egg-info/SOURCES.txt'
Downloading/unpacking sqlparse==0.1.6 (from pandasql)

  Running setup.py egg_info for package sqlparse

    running egg_info
    writing pip-egg-info/sqlparse.egg-info/PKG-INFO
    writing top-level names to pip-egg-info/sqlparse.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/sqlparse.egg-info/dependency_links.txt
    warning: manifest_maker: standard file '-c' not found


    reading manifest file 'pip-egg-info/sqlparse.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'pip-egg-info/sqlparse.egg-info/SOURCES.txt'
  Source in ./venv/build/sqlparse has the version 0.1.11, which does not match the requirement sqlparse==0.1.6 (from pandasql)

Source in ./venv/build/sqlparse has version 0.1.11 that conflicts with sqlparse==0.1.6 (from pandasql)

Exception information:
Traceback (most recent call last):
  File "/home/mihi/tug/venv/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/basecommand.py", line 104, in main
    status = self.run(options, args)
  File "/home/mihi/tug/venv/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/commands/install.py", line 245, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/home/mihi/tug/venv/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1014, in prepare_files
    req_to_install.assert_source_matches_version()
  File "/home/mihi/tug/venv/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 366, in assert_source_matches_version
    % (display_path(self.source_dir), version, self))
InstallationError: Source in ./venv/build/sqlparse has version 0.1.11 that conflicts with sqlparse==0.1.6 (from pandasql)

error with large panda dataframe

Hi All,

Code:

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
q = """SELECT t.bidder_id,t.payment_account,t.address,t.outcome,count(t.bid_id) AS bid_ct,count(t.auction) AS auction_ct,COUNT(t.merchandise) AS merchandise_ct,COUNT(t.device) AS device_ct,COUNT(t.country) AS country_ct,COUNT(t.ip) AS ip_ct,COUNT(t.url) AS url_ct,COUNT(DISTINCT t.bid_id) AS bid_ct_uq,COUNT(DISTINCT t.auction) AS auction_ct_uq,COUNT(DISTINCT t.merchandise) AS merchandise_ct_uq,COUNT(DISTINCT t.device) AS device_ct_uq,COUNT(DISTINCT t.country) AS country_ct_uq,COUNT(DISTINCT t.ip) AS ip_ct_uq,COUNT(DISTINCT t.url) AS url_ct_uq FROM train t GROUP BY t.bidder_id,t.payment_account,t.address,t.outcome ;"""
tt=pysqldf(q)
tt.shape

Comments

Here train is a pandas dataframe with 3,071,253 rows and 12 columns.
The above code errors out with "'NoneType' object has no attribute 'shape'", meaning that the object tt is None.
The same code runs successfully when the train dataframe holds less data, say 10,000 rows and 12 columns.
Can somebody point out what is possibly wrong here? The code works fine and returns a dataframe when the data is small, but with more data it errors out.

data frames being written every time?

Correct me if I'm wrong, but this code seems to write the data frames I want to query to memory (or a temp db called .pandasql.db) every time I run a query. That may not be optimal when dealing with big data frames.

Any plans to save the data frame(s) once and query many times?
(Would you accept contributions to the repository if I were to implement that?)
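
For what it's worth, later pandasql releases appear to expose exactly this on the PandaSQL class; a sketch assuming a persist flag (check your installed version before relying on it):

from pandasql import PandaSQL
import pandas as pd

df = pd.DataFrame({'x': range(5)})

# persist=True (assumed API) reuses one connection and skips re-writing
# tables that were already loaded by an earlier query.
pdsql = PandaSQL(persist=True)
print(pdsql("SELECT COUNT(*) AS n FROM df", locals()))
print(pdsql("SELECT MAX(x) AS mx FROM df", locals()))  # df not re-written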

New Feature: include HDFrame in SQL

The ability to run SQL on a pandas HDFStore (disk store) would be very useful.
HDFStore has some ability to filter data, so pandas filtering can be used here.

We could then do out-of-RAM SQL queries.
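
In the meantime this can be approximated by letting HDFStore do the coarse disk-side filtering and pandasql the SQL part. A sketch with a hypothetical file and key (requires the store to use the queryable table format):

import pandas as pd
from pandasql import sqldf

store = pd.HDFStore('data.h5')                                # hypothetical store
df = store.select('measurements', where='repetitions > 120')  # filtered on disk
store.close()

print(sqldf("SELECT COUNT(*) AS n FROM df", locals()))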

Query list of dictionaries

Here's what it should do:

data = [{"a": 1, "b": 2}, {"a": 5, "b": 10}, {"a": 20, "b": 30}]
sqldf("select a, b from data", locals())

SELECT from WHERE ... IN ( SELECT ..) doesn't work?

The following seems to work fine (apart from the extra index column raised as a separated issue):

import pandas as pd
#Set up a dummy dataframe
courseData= { 'courseCode': ['TM351','TU100','M269'],
              'points':[30,60,30],
              'level':['3','1','2']
             }
course_df=pd.DataFrame( courseData )

q='''
    SELECT * FROM course_df WHERE courseCode IN ( 'TM351', 'TU100' ) ;
  '''
pysqldf(q)

   index courseCode level  points
0      0      TM351     3      30
1      1      TU100     1      60

And I can nest a query into the same dataframe:

q='''
    SELECT * FROM course_df WHERE courseCode IN ( SELECT DISTINCT courseCode FROM course_df WHERE level >=2 ) ;
  '''
pysqldf(q)

   index courseCode level  points
0      0      TM351     3      30
1      2       M269     2      30

but if I try to select into a different dataframe, even returning a similar column name, it fails?

programData= { 'courseCode': ['TM351','TM351','TM351','TU100','TU100','TU100','M269','M269','M269'],
              'programCode':['AB1','AB2','AB3','AB1','AB3','AB4','AB3','AB4','AB5']
             }
program_df=pd.DataFrame( programData )

q='''
    SELECT DISTINCT courseCode FROM program_df WHERE programCode='AB1' OR programCode='AB2';
  '''
pysqldf(q)

  courseCode
0      TM351
1      TU100


q='''
    SELECT * FROM course_df WHERE courseCode IN ( SELECT DISTINCT courseCode FROM program_df ) ;
  '''
pysqldf(q)

#returns nothing?

Or have I made a mistake in my SQL?

Request: use of params to prevent SQL-injection

If using pandas with a sqlite3 backend, I'm able to mitigate SQL-injection (I believe) through the use of params, something like:

import pandas as pd
df = pd.read_sql_query("SELECT y FROM tbl WHERE x LIKE ?", dbcon, params=("A%",))

Is there a way to do this directly in pandasql? It may be as simple as passing it directly. Thoughts?

Something like:

import pandas as pd
import pandasql
df = pd.DataFrame({"x":range(10), "y":range(10)})
pandasql.sqldf("SELECT y FROM df WHERE x = ?", params=(4,), globals())

I've also asked on stackoverflow, in case I missed docs or other ideas.

Thanks!
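
sqldf's signature elsewhere on this page is sqldf(query, env, db_uri), so there is currently no params to pass through. The same protection is available by round-tripping through sqlite3 yourself; a sketch:

import sqlite3
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": range(10)})

conn = sqlite3.connect(":memory:")
df.to_sql("df", conn, index=False)

# Binding happens in the driver, so the value is never spliced into the SQL.
out = pd.read_sql_query("SELECT y FROM df WHERE x = ?", conn, params=(4,))
print(out)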

Operational Error: Too many SQL Variables

I get the error "OperationalError: (sqlite3.OperationalError) too many SQL variables" even if I use a single column in SQL like sqldf("Select colA from df1"). I am unable to comprehend. Please help. THanks.

Support for multiple backends

Hi,

much like with R's sqldf, it would be nice to have support for backends other than sqlite.
Would you be interested in adding this capability?

A use case would be working with data that does not fit in memory, or queries that produce (intermediate) result sets that do not fit in memory. It is also a way to enable out of core computation by pushing analytical operations to the database engine. A workflow for this would be:

  1. load data into a Pandas DataFrame
  2. run Postgresql UDFs (eg. http://madlib.net) on the DataFrame, that generate an aggregate result set
  3. The post processed result set is available as a DataFrame in Pandas.

This can be done relatively easily with Pandas to_sql/read_sql, but I think pandasql could be a good glue
to ease this type of scenario.

I have a proof of concept at https://github.com/gmodena/pandasql/tree/sqlalchemy that uses SQLAlchemy to extend support to other databases, while retaining compatibility with the current API (dbapi+sqlite).

If you're interested in reviewing this code, I'd be more than happy to discuss further and eventually open a pull request (modulo fixes and adjustments).

Handling exceptions in Python 3

Installing from github on Python 3 I get:

Installing collected packages: pandasql
  Running setup.py install for pandasql
      File "/usr/local/lib/python3.4/dist-packages/pandasql/sqldf.py", line 115
        except Exception, e:
                        ^
    SyntaxError: invalid syntax

Is this a py2to3 syntax thing?
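
Yes: except Exception, e: is Python 2-only syntax, rejected by the Python 3 parser outright. The cross-version spelling is except ... as ...; a minimal illustration:

try:
    1 / 0  # stand-in for the call guarded in sqldf.py
except Exception as e:  # valid on Python 2.6+ and Python 3
    print(e)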

Issues importing pandasql - Python 3.3 in Windows 7

Just wish to bring this to your attention.

I am using Python 3.3 in Windows 7, and have installed pandasql using "pip install -U pandasql".

I had troubles importing pandasql and the error message was "no module named sqldf".

At the end, I managed to resolve it by copying sqldf.py from C:\Python33\Lib\site-packages\pandasql-0.3.1-py3.3.egg\pandasql into C:\Python33\Lib in order to make pandasql work.

How to set text_factory to handle non ascii byte codes

For an input dataset with some international data, I am seeing very slow performance, and then the following error was emitted:

(sqlite3.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str) It is highly recommended that you instead just switch your application to Unicode strings. [SQL: u'INSERT INTO pkgdftail ...

Is there a configuration and/or workaround for this?
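
There is no exposed hook for sqlite3's text_factory in pandasql, so a workaround sketch is to decode the byte-string columns to unicode in pandas before querying (Python 2 semantics assumed, to match the error; pkgdftail is the frame named in the message):

for col in pkgdftail.columns:
    if pkgdftail[col].dtype == object:
        pkgdftail[col] = pkgdftail[col].str.decode('utf-8')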

Bug in the extract table names function.

Consider the query "SELECT idd FROM df where Subject = "Foreign Exchange from a US entity";".
This will raise the exception "a not found". The extract-table-names module treats "a" as a table because it appears after the word "from" inside the string literal.

PandaSQL won't run inside main()

Hi there,

I noticed that when I attempt to call pandasql inside main(), the script throws an error, no matter if I use globals() or locals(). I also tried to define qdf() outside main(), to no avail.

Code

# encoding=utf-8

# pandasql_test.py

# description:
# test for pandasql functionality
# inside a main() function

# imports
import pandas as pd
import pandasql as psql

# main function
def main():

    # abbreviate pandasql functionality
    qdf = lambda q: psql.sqldf(q, locals())

    # create data
    df = pd.DataFrame({
        'name': ['Joe', 'Bill', 'Bob'],
        'weight': [70, 85, 75]
    })

    q = 'select * from df'
    print qdf(q)

if __name__ == '__main__':
    main()

Error
pandasql.sqldf.PandaSQLException: (sqlite3.OperationalError) no such table: df [SQL: 'select * from df']

Thanks in advance for any advice!
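
The cause is the namespace, not pandasql: inside the lambda body, locals() is the lambda's own (empty) scope and globals() is the module scope, so df, a local of main(), is invisible either way. Passing the environment from the call site works; a sketch:

import pandas as pd
import pandasql as psql

def main():
    # Take the environment as an argument instead of capturing it too late.
    qdf = lambda q, env: psql.sqldf(q, env)

    df = pd.DataFrame({'name': ['Joe', 'Bill', 'Bob'],
                       'weight': [70, 85, 75]})
    print(qdf('select * from df', locals()))

if __name__ == '__main__':
    main()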

Ubuntu 14.04

I'm unable to import pandasql

Traceback (most recent call last):
File "/home/lucia/proyecto NO2/funcion.py", line 1, in
from pandasql import sqldf, load_meat, load_births
File "/usr/local/lib/python2.7/dist-packages/pandasql-0.6.2-py2.7.egg/pandasql/init.py", line 1, in
from .sqldf import sqldf
File "/usr/local/lib/python2.7/dist-packages/pandasql-0.6.2-py2.7.egg/pandasql/sqldf.py", line 4, in
from pandas.io.sql import to_sql, read_sql
ImportError: cannot import name to_sql

"Where" in query doesn't work in pandasql

Your code here

aadhaar_data = pd.read_csv('/Users/philips/Downloads/aadhaar_data.csv')
aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True) #rename column
query = "SELECT pin_code from aadhaar_data where state = 'Gujarat' "
aadhar_queried_data = sqldf(query.lower(), locals())
print(aadhar_queried_data)

Problem description

When I use a normal query it returns a dataframe, but when I put a condition in the query, like WHERE, it returns an empty result:

Empty DataFrame
Columns: [pin_code]
Index: []

Data set # https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/aadhaar_data.csv
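
One likely culprit is visible in the snippet: query.lower() also lowercases the string literal, turning 'Gujarat' into 'gujarat', and SQLite string comparison is case-sensitive, so the WHERE matches nothing. A sketch of the fix (aadhaar_data as loaded above):

from pandasql import sqldf

# Pass the query as written; lowercasing it mangles the 'Gujarat' literal.
query = "SELECT pin_code FROM aadhaar_data WHERE state = 'Gujarat'"
print(sqldf(query, locals()))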

except Exception, e:

When I try to import pandasql the system prints this error:

from pandasql import sqldf
  File "/opt/python34/lib/python3.4/site-packages/pandasql/__init__.py", line 1, in <module>
    from .sqldf import sqldf
  File "/opt/python34/lib/python3.4/site-packages/pandasql/sqldf.py", line 117
    except Exception, e:
                    ^
SyntaxError: invalid syntax

If I remove the ', e', the library works well.

Version INSTALLED: 0.6.1 (latest)

giving the possibility to use sqlite ":memory:" as pandasql back-end

Hello,

Pandasql is a really nice piece of code.
If I understood well, it uses a temporary sqlite database "on disk".

Would it be possible to add the option to use sqlite in ":memory:" rather than "on disk"?
Is there a hidden problem with ":memory:" that prevented you from defaulting to it?
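
Newer pandasql versions accept a database URI (the sqldf(query, env, db_uri) signature appears in tracebacks elsewhere on this page), which makes in-memory SQLite straightforward. A sketch assuming that parameter:

import pandas as pd
from pandasql import PandaSQL

# 'sqlite:///:memory:' keeps the scratch database entirely in RAM.
pdsql = PandaSQL('sqlite:///:memory:')

df = pd.DataFrame({'x': [1, 2, 3]})
print(pdsql("SELECT SUM(x) AS total FROM df", locals()))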

Table 'meat' already exists

Hello,

I am quite interested in this library, since I use pandas in my workflow every day. However, I come across the following error when running the basic tutorial code on the front page:

PS Y:\> ipython
Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from pandasql import sqldf, load_meat, load_births

In [2]: pysqldf = lambda q: sqldf(q, globals())

In [3]: meat = load_meat()

In [4]: births = load_births()

In [5]: print pysqldf("SELECT * FROM meat LIMIT 10;").head()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-0c7b33a2459f> in <module>()
----> 1 print pysqldf("SELECT * FROM meat LIMIT 10;").head()

<ipython-input-2-6245773871e5> in <lambda>(q)
----> 1 pysqldf = lambda q: sqldf(q, globals())

C:\Python27\lib\site-packages\pandasql\sqldf.pyc in sqldf(q, env)
     60                 raise Exception("%s not found" % table)
     61             df = env[table]
---> 62             _write_table(table, df, conn)
     63         try:
     64             result = frame_query(q, conn)

C:\Python27\lib\site-packages\pandasql\sqldf.pyc in _write_table(tablename, df, conn)
     32             raise Exception(msg)
     33
---> 34     write_frame(df, name=tablename, con=conn, flavor='sqlite')
     35
     36

C:\Python27\lib\site-packages\pandas\io\sql.pyc in write_frame(frame, name, con, flavor, if_exists, **kwargs)
    185     exists = table_exists(name, con, flavor)
    186     if if_exists == 'fail' and exists:
--> 187         raise ValueError, "Table '%s' already exists." % name
    188
    189     #create or drop-recreate if necessary

ValueError: Table 'meat' already exists.

In [6]:

I am running Python 2.7 64 bit on Windows 7. My pandas version is 0.10.1 and all the dependencies are up to date:

PS Y:\> pip install -U pandas
Requirement already up-to-date: pandas in c:\python27\lib\site-packages
Requirement already up-to-date: python-dateutil in c:\python27\lib\site-packages (from pandas)
Requirement already up-to-date: pytz in c:\python27\lib\site-packages (from pandas)
Requirement already up-to-date: numpy>=1.6.1 in c:\python27\lib\site-packages (from pandas)
Requirement already up-to-date: six in c:\python27\lib\site-packages (from python-dateutil->pandas)
Cleaning up...

Any ideas of what I can try doing differently?

Nested SELECT doesn't work?

Nested SELECTs don't appear to be recognised. I assume the bracketed term is not being passed through as a literal subquery, but rather the pandasql parser is trying to look it up as a variable?

directorates=['Community Wellbeing & Social Care',
              'Childrens Services',
              'Economy & Environment',
              'Resources',
              'Corporate']
expenseTypes=['Accommodation Costs',
              'Payment to Private Contractors',
              'Operational Equipment',
              'Professional Services']
import itertools
a=list(itertools.product(directorates,expenseTypes))
unzipa=[t for t in zip(*a)]
df_long=pd.DataFrame({'directorates':unzipa[0],
                      'expense_types':unzipa[1],
                      'total':np.random.randint(0,20000,len(directorates)*len(expenseTypes))
                      })

import pandas as pd
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

q='''
SELECT directorates FROM (SELECT * FROM df_long LIMIT 3) tbl;
'''
pysqldf(q)

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-15-7c9285cfa654> in <module>()
      2 SELECT directorates FROM (SELECT * FROM df_long LIMIT 3) tbl;
      3 '''
----> 4 pysqldf(q)

<ipython-input-5-6a58ed7de60b> in <lambda>(q)
      1 import pandas as pd
      2 from pandasql import sqldf
----> 3 pysqldf = lambda q: sqldf(q, globals())

/usr/local/lib/python3.4/dist-packages/pandasql/sqldf.py in sqldf(q, env, inmemory)
    116             if not inmemory :
    117                 os.remove(dbname)
--> 118             raise Exception("%s not found" % table)
    119         df = env[table]
    120         df = _ensure_data_frame(df, table)

Exception: (SELECT * FROM df_long LIMIT 3) not found

New Feature Request: support Redshift syntax

Hey,
Great job with this! have been searching for something exactly like this...
would have been perfect if pandasql would support redshift query syntax. Any chance this is on the roadmap?

thanks,
Dan

SELECT FROM multiple tables

As far as I understand (pardon my beginner knowledge), sqlite3 supports select from multiple tables. That means that I should be able to simplify this pandasql search:

q = """
SELECT * FROM measurements
WHERE
measurements.repetitions > 120 AND
measurements.sampleID IN
(SELECT samples.sampleID FROM samples
WHERE
samples.localResponsible = "brian"
);
"""

to:

q = """
SELECT * FROM measurements, samples
WHERE
measurements.repetitions > 120 AND
measurements.sampleID = samples.sampleID AND
samples.localResponsible = "brian";
"""

(see, for example: http://sql.learncodethehardway.org/book/ex8.html )
However, the latter does not work; it appears I cannot use multiple tables in a "FROM" clause in pandasql. Is this a bug or a missing feature in pandasql?
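
A plausible explanation is that pandasql's table-name extraction only recognizes a single name after FROM, so the comma-separated form never gets both frames loaded. The same query written as an explicit join should work, since joins are exercised elsewhere on this page; a sketch:

from pandasql import sqldf

q = """
SELECT measurements.*
FROM measurements
INNER JOIN samples
   ON measurements.sampleID = samples.sampleID
WHERE measurements.repetitions > 120
  AND samples.localResponsible = 'brian';
"""
result = sqldf(q, locals())  # assumes the measurements/samples frames exist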

Error querying dataframes with a DatetimeIndex

There appears to be an issue when attempting to query dataframes containing a pandas.DatetimeIndex either as a (row) index or as a separate column. However, querying dataframes with regular pandas.TimeSeries columns is fine.

Running the snippet:

import pandas as pd
import pandasql

ts = pd.TimeSeries(["2014-01-01", "2014-01-02", "2014-01-03"])
df = pd.DataFrame({ "date" : ts, "x" : range(len(ts)) })
pandasql.sqldf("SELECT * FROM df", globals()) # OK

# Convert the time series to a DatetimeIndex for row indexing.
df = pd.DataFrame({ "x" : range(len(ts)) })
df.index = pd.DatetimeIndex(ts)
pandasql.sqldf("SELECT * FROM df", globals()) # Error

# Use the same DatetimeIndex as a separate column.
df = pd.DataFrame({ "date" : pd.DatetimeIndex(ts), "x" : range(len(ts)) })
pandasql.sqldf("SELECT * FROM df", globals()) # Error                                         

produces the error:

Traceback (most recent call last):
  File "./pandasql_bug.py", line 14, in <module>
    pandasql.sqldf("SELECT * FROM df", globals()) # Error
  File "/home/kian/.local/lib/python2.7/site-packages/pandasql-0.5.1-py2.7.egg/pandasql/sqldf.py", line 126, in sqldf
    _write_table(table, df, conn)
  File "/home/kian/.local/lib/python2.7/site-packages/pandasql-0.5.1-py2.7.egg/pandasql/sqldf.py", line 76, in _write_table
    to_sql(df, name=tablename, con=conn, flavor='sqlite')
  File "/home/kian/.local/lib/python2.7/site-packages/pandas/io/sql.py", line 440, in to_sql
    index_label=index_label)
  File "/home/kian/.local/lib/python2.7/site-packages/pandas/io/sql.py", line 1061, in to_sql
    table.insert()
  File "/home/kian/.local/lib/python2.7/site-packages/pandas/io/sql.py", line 929, in insert
    cur.executemany(ins, data_list)
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.

Similar errors were reported in issues #26 and #30 with the meat dataset (probably because of the meat["date"] column data type?).

Thanks.
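
A workaround sketch for the binding error: sqlite3 cannot bind the datetime64 values a DatetimeIndex produces, so moving the index into a plain string column before querying sidesteps it:

import pandas as pd
import pandasql

ts = pd.Series(["2014-01-01", "2014-01-02", "2014-01-03"])
df = pd.DataFrame({"x": range(len(ts))})
df.index = pd.DatetimeIndex(ts)

# Demote the index to an ordinary column and render it as text.
df = df.reset_index().rename(columns={"index": "date"})
df["date"] = df["date"].astype(str)
print(pandasql.sqldf("SELECT * FROM df", globals()))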

pandasql does not work in a function

Hi,
I made a function like the one below to query a pandas dataframe, but it tells me
'Exception: UNECE_csv not found'.
I cannot understand where the problem is, since if I run the same script outside the function it works perfectly.

def import_csv_and_query():

    filepath = 'C:\\DATA\\Work\\.....csv'
    import pandas as pd
    from pandasql import sqldf
    UNECE_csv = pd.read_csv(filepath, dtype=object, na_filter=False)
    q = """

    MY LONG QUERY
    """

    UNECE_clean = sqldf(q, globals())

    UNECE_colums = UNECE_clean.columns.tolist()
    UNECE_array = UNECE_clean.values.tolist()
    UNECE = [UNECE_colums] + UNECE_array
    return UNECE
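
Same namespace gotcha as the main() issue above: UNECE_csv is a local of the function, so it is not in globals(); passing locals() from inside the function fixes the lookup. A trimmed sketch:

import pandas as pd
from pandasql import sqldf

def import_csv_and_query(filepath):
    UNECE_csv = pd.read_csv(filepath, dtype=object, na_filter=False)
    # Look tables up in this function's own scope, not the module's.
    return sqldf("SELECT * FROM UNECE_csv LIMIT 5", locals())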

Error with Pandasql Using Anaconda on OSX

I apologize in advance if I'm not following GitHub protocol. I'm relatively new to GitHub.

I tried following your example on (http://blog.yhathq.com/posts/pandasql-sql-for-pandas-dataframes.html) and got the error below from iPython Notebook.

I'm using the Anaconda distribution of Python/iPython Notebook on a Mac running OSX. I tried searching Stack Overflow and couldn't find a solution to the problem.

Thank you so much for the help. I'm very keen on using pandasql


SCRIPT FROM EXAMPLE:

import matplotlib.pyplot as plt
from pandasql import *
import pandas as pd

pysqldf = lambda q: sqldf(q, globals())

q = """
SELECT
m.date
, m.beef
, b.births
FROM
meat m
LEFT JOIN
births b
ON m.date = b.date
WHERE
m.date > '1974-12-31';
"""

meat = load_meat()
births = load_births()

df = pysqldf(q)
df.births = df.births.fillna(method='backfill')

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(pd.rolling_mean(df['beef'], 12), color='b')
ax1.set_xlabel('months since 1975')
ax1.set_ylabel('cattle slaughtered', color='b')

ax2 = ax1.twinx()
ax2.plot(pd.rolling_mean(df['births'], 12), color='r')
ax2.set_ylabel('babies born', color='r')
plt.title("Beef Consumption and the Birth Rate")
plt.show()


ERROR MESSAGE:

InterfaceError Traceback (most recent call last)
in <module>()
13 m.date > '1974-12-31';
14 """
---> 15 df = pysqldf(q)
16 df.births = df.births.fillna(method='backfill')
17

in <lambda>(q)
----> 1 pysqldf = lambda q: sqldf(q, globals())
2 q = """
3 SELECT
4 m.date
5 , m.beef

/usr/local/lib/python2.7/site-packages/pandasql/sqldf.pyc in sqldf(q, env, inmemory)
113 df = env[table]
114 df = _ensure_data_frame(df, table)
--> 115 _write_table(table, df, conn)
116
117 try:

/usr/local/lib/python2.7/site-packages/pandasql/sqldf.pyc in _write_table(tablename, df, conn)
68 raise Exception(msg)
69
---> 70 write_frame(df, name=tablename, con=conn, flavor='sqlite')
71
72

/Users/danielsack/anaconda/lib/python2.7/site-packages/pandas/io/sql.pyc in write_frame(frame, name, con, flavor, if_exists, **kwargs)
236 if func is None:
237 raise NotImplementedError
--> 238 func(frame, name, safe_names, cur)
239 cur.close()
240 con.commit()

/Users/danielsack/anaconda/lib/python2.7/site-packages/pandas/io/sql.pyc in _write_sqlite(frame, table, names, cur)
252 else:
253 data = [tuple(x) for x in frame.values.tolist()]
--> 254 cur.executemany(insert_query, data)
255
256

InterfaceError: Error binding parameter 0 - probably unsupported type.

Getting error with demo code.

Running the demo code I get this
In [2]: from pandasql import sqldf, load_meat, load_births

pysqldf = lambda q: sqldf(q, globals())
meat = load_meat()
births = load_births()
print pysqldf("SELECT * FROM meat LIMIT 10;").head()

In [3]: >>> pysqldf = lambda q: sqldf(q, globals())

In [4]: >>> meat = load_meat()

In [5]: >>> births = load_births()

In [6]: >>> print pysqldf("SELECT * FROM meat LIMIT 10;").head()

InterfaceError Traceback (most recent call last)
in <module>()
----> 1 print pysqldf("SELECT * FROM meat LIMIT 10;").head()

in <lambda>(q)
----> 1 pysqldf = lambda q: sqldf(q, globals())

/Users/tobie/anaconda/lib/python2.7/site-packages/pandasql/sqldf.pyc in sqldf(q, env, inmemory)
124 df = env[table]
125 df = _ensure_data_frame(df, table)
--> 126 _write_table(table, df, conn)
127
128 try:

/Users/tobie/anaconda/lib/python2.7/site-packages/pandasql/sqldf.pyc in _write_table(tablename, df, conn)
74 raise Exception(msg)
75
---> 76 to_sql(df, name=tablename, con=conn, flavor='sqlite')
77
78

/Users/tobie/anaconda/lib/python2.7/site-packages/pandas/io/sql.pyc in to_sql(frame, name, con, flavor, if_exists, index, index_label)
438
439 pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
--> 440 index_label=index_label)
441
442

/Users/tobie/anaconda/lib/python2.7/site-packages/pandas/io/sql.pyc in to_sql(self, frame, name, if_exists, index, index_label)
1059 name, self, frame=frame, index=index, if_exists=if_exists,
1060 index_label=index_label)
-> 1061 table.insert()
1062
1063 def has_table(self, name):

/Users/tobie/anaconda/lib/python2.7/site-packages/pandas/io/sql.pyc in insert(self)
927
928 cur = self.pd_sql.con.cursor()
--> 929 cur.executemany(ins, data_list)
930 cur.close()
931 self.pd_sql.con.commit()

InterfaceError: Error binding parameter 1 - probably unsupported type.

sqldf returning None if SELECT clause is not *

Hi Greg,

I cloned the most recent versions of pandas and pandasql today from GitHub and am noticing some fishy behavior in the output of my sqldf select statements: When I use anything other than '*' as my SELECT clause, sqldf is returning None.

# sqldf_select_issue.py
import pandas as pd
from pandasql import sqldf as _sqldf

def sqldf(query):
    return _sqldf(query,globals())

msg_data = pd.read_csv('sample_message_text.tsv',index_col=None)[['response','init_gender','thread_count']]

tests = []

tests.append("""
select *
from msg_data
limit 10""")

tests.append("""
select init_gender
from msg_data
limit 10""")

tests.append("""
select *
from msg_data
where init_gender = 'male'
limit 10""")

tests.append("""
select sum(thread_count) as total_threads
from msg_data""")

for query in tests:
    print("~"*30)
    output = sqldf(query)
    print(output)
    print(type(output))    

Here is the output I get when running the above code:

Ravdek@RAVDEK-WIN7 ~/python_files/message_text_stats
$ python sqldf_select_issue.py
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   response init_gender  thread_count
0         0      female         74022
1         0      female            24
2         0      female             6
3         0      female             1
4         0      female             2
5         0      female             2
6         0      female             1
7         0      female             2
8         0      female            18
9         0        male        774894
<class 'pandas.core.frame.DataFrame'>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
None
<type 'NoneType'>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   response init_gender  thread_count
0         0        male        774894
1         0        male           237
2         0        male             1
3         0        male           192
4         0        male            10
5         0        male             4
6         0        male            59
7         0        male           198
8         0        male           435
9         0        male            47
<class 'pandas.core.frame.DataFrame'>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
None
<type 'NoneType'>

If it's at all relevant, I'm working with Cygwin on Windows.
