
impyla

Python client for HiveServer2 implementations (e.g., Impala, Hive) for distributed query engines.

For higher-level Impala functionality, including a Pandas-like interface over distributed data sets, see the Ibis project.

Features

  • HiveServer2 compliant; works with Impala and Hive, including nested data

  • Fully DB API 2.0 (PEP 249)-compliant Python client (similar to sqlite or MySQL clients) supporting Python 2.7+ and Python 3.5+

  • Works with Kerberos, LDAP, SSL

  • SQLAlchemy connector

  • Converter to pandas DataFrame, allowing easy integration into the Python data stack (including scikit-learn and matplotlib); but see the Ibis project for a richer experience

Dependencies

Required:

  • Python 2.7+ or 3.5+

  • six, bitarray

  • thrift==0.16.0

  • thrift_sasl==0.4.3

Optional:

  • kerberos>=1.3.0 for Kerberos over HTTP support. This also requires Kerberos libraries to be installed on your system - see System Kerberos

  • pandas for conversion to DataFrame objects; but see the Ibis project instead

  • sqlalchemy for the SQLAlchemy engine

  • pytest for running tests; unittest2 for testing on Python 2.6

System Kerberos

Different systems require different packages to be installed to enable Kerberos support in Impyla. Some examples of how to install the packages on different distributions follow.

Ubuntu:

apt-get install libkrb5-dev krb5-user

RHEL/CentOS:

yum install krb5-libs krb5-devel krb5-server krb5-workstation

Installation

Install the latest release with pip:

pip install impyla

For the latest (dev) version, install directly from the repo:

pip install git+https://github.com/cloudera/impyla.git

or clone the repo:

git clone https://github.com/cloudera/impyla.git
cd impyla
python setup.py install

Running the tests

impyla uses the pytest toolchain, and depends on the following environment variables:

export IMPYLA_TEST_HOST=your.impalad.com
export IMPYLA_TEST_PORT=21050
export IMPYLA_TEST_AUTH_MECH=NOSASL

To run the maximal set of tests, run

cd path/to/impyla
py.test --connect impala

Leave out the --connect option to skip tests for DB API compliance.

To test impyla with different Python versions, tox can be used. The commands below will run all impyla tests with all supported and installed Python versions:

cd path/to/impyla
tox

To filter environments/tests, use -e and pass pytest arguments after --:

tox -e py310 -- -k test_utf8_strings

Usage

Impyla implements the Python DB API v2.0 (PEP 249) database interface (refer to it for API details):

from impala.dbapi import connect
conn = connect(host='my.host.com', port=21050)  # auth_mechanism='PLAIN' for unsecured Hive connections; see the function docs
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()

The Cursor object also exposes the iterator interface, which is buffered (controlled by cursor.arraysize):

cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    print(row)
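The buffering semantics follow PEP 249, so they can be demonstrated with any DB API driver. A minimal sketch using the stdlib sqlite3 module as a stand-in for an Impala connection, showing how arraysize interacts with fetchmany (impyla's Cursor uses the same interface):

```python
import sqlite3

# arraysize controls how many rows fetchmany() returns by default;
# impyla's Cursor follows the same PEP 249 semantics for its buffer.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE t (x INTEGER)')
cur.executemany('INSERT INTO t VALUES (?)', [(i,) for i in range(10)])

cur.arraysize = 4
cur.execute('SELECT x FROM t ORDER BY x')
first_batch = cur.fetchmany()  # returns up to cur.arraysize rows
print(len(first_batch))
```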

The Cursor object also provides metadata about the columns returned by a query, which is useful when exporting data to a CSV file.

import csv

cursor.execute('SELECT * FROM mytable LIMIT 100')
columns = [datum[0] for datum in cursor.description]
targetfile = '/tmp/foo.csv'

with open(targetfile, 'w', newline='') as outcsv:
    writer = csv.writer(outcsv, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerow(columns)
    for row in cursor:
        writer.writerow(row)

You can also convert the results to a pandas DataFrame:

from impala.util import as_pandas
df = as_pandas(cursor)
# carry df through scikit-learn, for example

How do I contribute code?

You need to first sign and return an ICLA and CCLA before we can accept and redistribute your contribution. Once these are submitted you are free to start contributing to impyla. Submit these to [email protected].

Find

We use GitHub issues to track bugs for this project. Find an issue that you would like to work on (or file one if you have discovered a new issue!). If no one is working on it, assign it to yourself, but only if you intend to work on it shortly.

It's a good idea to discuss your intended approach on the issue. You are much more likely to have your patch reviewed and committed if you've already got buy-in from the impyla community before you start.

Fix

Now start coding! As you are writing your patch, please keep the following things in mind:

First, please include tests with your patch. If your patch adds a feature or fixes a bug and does not include tests, it will generally not be accepted. If you are unsure how to write tests for a particular component, please ask on the issue for guidance.

Second, please keep your patch narrowly targeted to the problem described by the issue. It's better for everyone if we maintain discipline about the scope of each patch. In general, if you find a bug while working on a specific feature, file an issue for the bug, assign it to yourself if you can, and fix it independently of the feature. This helps us differentiate between bug fixes and features and allows us to build stable maintenance releases.

Finally, please write a good, clear commit message, with a short, descriptive title and a message that is exactly long enough to explain what the problem was, and how it was fixed.

Please create a pull request on GitHub with your patch.


impyla's Issues

Raise errors on NOT NULL and TEXT schemas

If I naively create a SQLAlchemy table using TEXT types or types that are constrained to NOT NULL then I get an error downstream from Impala itself. Perhaps these could be raised earlier in the Python layer with cleaner error messages. Presumably there are other such restrictions for which intuitive errors would aid usability by non-experts.

In some cases it might make sense just to do a conversion (e.g. TEXT to STRING), possibly raising a warning.
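Such a conversion layer could be as simple as a lookup table consulted before DDL is emitted. A hypothetical sketch (the helper name and mapping are illustrative, not impyla API):

```python
import warnings

# Hypothetical mapping of type names Impala rejects onto supported
# equivalents; a real implementation would live in the SQLAlchemy layer.
IMPALA_TYPE_SUBSTITUTIONS = {
    'TEXT': 'STRING',     # Impala has no TEXT type
    'VARCHAR': 'STRING',  # VARCHAR without a length also maps to STRING
}

def to_impala_type(type_name):
    """Return an Impala-compatible type name, warning on substitution."""
    upper = type_name.upper()
    if upper in IMPALA_TYPE_SUBSTITUTIONS:
        replacement = IMPALA_TYPE_SUBSTITUTIONS[upper]
        warnings.warn('%s is not supported by Impala; using %s instead'
                      % (upper, replacement))
        return replacement
    return upper
```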

SHOW PARTITIONS queries return None for partitions.

My team has noticed a bug in processing SHOW PARTITIONS queries.

When executing the query in impala-shell, these queries work as expected:

[our.impala.server.url:21000] > show partitions dcollins.table_name;
Query: show partitions dcollins.table_name
+-------+-------+-----+-------+--------+--------+--------------+---------+
| year  | month | day | #Rows | #Files | Size   | Bytes Cached | Format  |
+-------+-------+-----+-------+--------+--------+--------------+---------+
| 2014  | 9     | 25  | -1    | 1      | 3.94MB | NOT CACHED   | PARQUET |
| Total |       |     | -1    | 1      | 3.94MB | 0B           |         |
+-------+-------+-----+-------+--------+--------+--------------+---------+

However, executing the same query through the impyla API, the partition columns (year, month, day) all come back as None.

import impala.dbapi

conn = impala.dbapi.connect(host='our.impala.server.url', port=21050)
cursor = conn.cursor()
cursor.execute('show partitions dcollins.table_name')
for row in cursor:
    print(row)

# prints:
# (None, None, None, -1, 1, '3.94MB', 'NOT CACHED', 'PARQUET')
# (None, None, None, -1, 1, '3.94MB', '0B', '')
# should print:
# (2014, 9, 25, -1, 1, '3.94MB', 'NOT CACHED', 'PARQUET')
# ('Total', None, None, -1, 1, '3.94MB', '0B', '')
# or something like that.

From digging through the code, it appears that the data is already broken when coming back in the Thrift response so I have been unable to fix this issue myself. This is a problem for our team because we have not been able to figure out how to reliably determine what partitions have been added to Impala.

We see the same behavior running SHOW PARTITIONS on any of our tables, but for reference the Impala table in this example looks like this (sensitive names removed):

+----------------------------------------------------------------------------+
| result                                                                     |
+----------------------------------------------------------------------------+
| CREATE TABLE dcollins.table_name (                                         |
|   column_1 TIMESTAMP,                                                      |
|   column_2 STRING,                                                         |
|   column_3 STRING,                                                         |
|   column_4 STRING                                                          |
| )                                                                          |
| PARTITIONED BY (                                                           |
|   year INT,                                                                |
|   month INT,                                                               |
|   day INT                                                                  |
| )                                                                          |
| STORED AS PARQUET                                                          |
| LOCATION 'hdfs://our.namenode.url:8020/warehouse/dcollins.db/table_name'   |
| TBLPROPERTIES ('transient_lastDdlTime'='1412106677')                       |
+----------------------------------------------------------------------------+

EDIT: I have tried both the release 0.8.1 version and the latest git master.

Compile Impala types when creating SQLAlchemy table

After some encouragement from @mrocklin at PyCon, I gave Impyla's SQLAlchemy support a try.
It seems the Impala types may not be registered with the SQLAlchemy type compiler.

Simple demo:

import sqlalchemy
from impala.sqlalchemy import STRING, INT, FLOAT, TIMESTAMP
from sqlalchemy import Table, Column
from sqlalchemy.schema import MetaData

engine = sqlalchemy.engine.create_engine('impala://localhost')

metadata = MetaData(engine)

mytable = Table("mytable", metadata,
      Column('mytable_id', sqlalchemy.INT),
      Column('value', STRING)
)
mytable.create()

CompileError: (in table 'mytable', column 'value'): Compiler <sqlalchemy.sql.compiler.GenericTypeCompiler object at 0x1080d1810> can't render element of type <class 'impala.sqlalchemy.STRING'>
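One possible direction is SQLAlchemy's custom-compilation hook, which lets a type be taught how to render itself. A sketch (registered against the generic compiler purely for illustration; a real fix would register Impala's types with the impala dialect's type compiler):

```python
from sqlalchemy import Column, Integer, MetaData, Table, Text
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.schema import CreateTable

# Teach the compiler to render Text as Impala's STRING type.
@compiles(Text)
def render_text_as_string(element, compiler, **kw):
    return 'STRING'

metadata = MetaData()
mytable = Table('mytable', metadata,
                Column('mytable_id', Integer),
                Column('value', Text))

ddl = str(CreateTable(mytable))
print(ddl)
```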

Impala keywords not quoted in SQLAlchemy

I have a table with a location column. Sadly, because location is a reserved keyword, the following SQL isn't valid in the Impala SQL dialect:

SELECT location
from table_name

I suspect that there is some mechanism to specify keywords and have SQLAlchemy quote them appropriately.
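SQLAlchemy can indeed force quoting on a per-column basis. A sketch using the Column quote flag (shown with plain SQLAlchemy types and the default dialect, since the exact impyla dialect behaviour is the open question here):

```python
from sqlalchemy import Column, Integer, MetaData, Table
from sqlalchemy.schema import CreateTable

metadata = MetaData()
t = Table('table_name', metadata,
          # quote=True forces the identifier to be quoted in emitted SQL,
          # which is how reserved words like "location" are escaped
          Column('location', Integer, quote=True))

ddl = str(CreateTable(t))
print(ddl)
```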

math trigonometry functions not supported in UDFs?

I've managed to successfully ship the int_promotion UDF that is somewhere in the tests to Impala, but now I'm trying a more complex function and it does not work. This is the code I use to ship the UDF:

from impala.udf import udf, ship_udf
from impala.udf.types import (FunctionContext, DoubleVal)
from impala.context import ImpalaContext
import math

@udf(DoubleVal(FunctionContext, DoubleVal, DoubleVal, DoubleVal, DoubleVal))
def distance(context, lat1, long1, lat2, long2):

    R = 6378.1 # KMs

    # Convert latitude and longitude to
    # spherical coordinates in radians.
    degrees_to_radians = math.pi/180.0

    # phi = 90 - latitude
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - lat2)*degrees_to_radians

    # theta = longitude
    theta1 = long1*degrees_to_radians
    theta2 = long2*degrees_to_radians

    # Compute spherical distance from spherical coordinates.

    cos = (math.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) +
           math.cos(phi1)*math.cos(phi2))
    arc = math.acos( cos )

    # Multiply arc by the radius of the earth
    # in your favorite set of units to get length.
    return arc * R

ctx = ImpalaContext(nn_host='ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com', host='ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com', hdfs_user='hdfs', webhdfs_port=50070)

ship_udf(ctx, distance, overwrite=True, hdfs_path='/tmp/impala_functions/distance.ll', database='default')

ctx.close()

And this is the output I get:

Traceback (most recent call last):
  File "install_distance.py", line 6, in <module>
    @udf(DoubleVal(FunctionContext, DoubleVal, DoubleVal, DoubleVal, DoubleVal))
  File "/Users/daan/.virtualenvs/hermes/lib/python2.7/site-packages/impala/udf/__init__.py", line 34, in wrapper
    udfobj = UDF(pyfunc, signature)
  File "/Users/daan/.virtualenvs/hermes/lib/python2.7/site-packages/impala/udf/__init__.py", line 57, in __init__
    flags=flags, locals={})
  File "/Users/daan/.virtualenvs/hermes/lib/python2.7/site-packages/numba/compiler.py", line 551, in compile_extra
    return pipeline.compile_extra(func)
  File "/Users/daan/.virtualenvs/hermes/lib/python2.7/site-packages/numba/compiler.py", line 261, in compile_extra
    return self.compile_bytecode(res.result, func_attr=self.func_attr)
  File "/Users/daan/.virtualenvs/hermes/lib/python2.7/site-packages/numba/compiler.py", line 275, in compile_bytecode
    return self._compile_bytecode()
  File "/Users/daan/.virtualenvs/hermes/lib/python2.7/site-packages/numba/compiler.py", line 501, in _compile_bytecode
    return self._run_pipeline(pipelines)
  File "/Users/daan/.virtualenvs/hermes/lib/python2.7/site-packages/numba/compiler.py", line 526, in _run_pipeline
    raise _raise_error(msg, res.exception)
numba.lowering.LoweringError: Failed at nopython mode backend
Internal error:
Exception: No definition for lowering <built-in function sin>(float64,) -> float64
File "install_distance.py", line 31

I assume this is a Numba problem, but I can't find any documentation on python math support in Numba. Has anyone else experienced this issue? Or am I just doing something wrong? :)

from_pandas error 'HiveServer2Connection' object has no attribute '_temp_dir'

Hello,

I'm trying to use impala.bdf.from_pandas function, I've got the following code :

# -*- coding: utf8
import pandas as pd
from impala.dbapi import connect
from impala.bdf import from_pandas

if __name__ == '__main__':
    csv = pd.read_csv("mydata.csv")
    ci = connect(host="localhost",
                 port=21050)
    from_pandas(ci, csv, method='in_query')

I've got the following error message:

Traceback (most recent call last):
  File "panda_df_impala_issue.py", line 13, in <module>
    from_pandas(ci, csv, method='in_query')
  File "/home/cloudera/Public/base_big_data/virtenv_py26/lib/python2.6/site-packages/impala/bdf.py", line 117, in from_pandas
    table = "%s.%s" % (ic._temp_db, temp_table)
AttributeError: 'HiveServer2Connection' object has no attribute '_temp_db'

I tried to fill in the table parameter of the from_pandas function, but a string doesn't fit. What is the expected type of this parameter?

Any other ideas to help me?

Impyla and Python UDFs

I'm very eager to try out Python UDFs on Impala. I just don't really know how to get going. A couple of questions to make stuff more clear:

  • I saw this merge commit: 157ef1e. Am I correct that the Python UDF code now lives in the master branch?
  • In the udf package I see imports like this:
import llvm.core as lc
from numba import sigutils
from numba.compiler import compile_extra, Flags

When I look at the setup.py, however, I don't see any extra packages that are to be installed. Should I install numba (and any other packages) manually for UDFs to work?

Maybe the README should be updated with some instructions for trying out Python UDFs :)

Failed to import pandas message makes Pandas kind of required

Impyla tries to import Pandas, and when the import fails it prints "Failed to import pandas". I'm using Impyla as part of a command-line tool, so this message always pops up when Pandas is not installed, even though I have no need of the as_pandas() method and don't want to force users to install Pandas.
If Impyla made use of Python's logging facilities, I could at least "swallow" that message in my app by configuring logging. Any ETA on this?

Maybe even better would be to do the Pandas import inside the as_pandas() method. Like this:

def as_pandas(cursor):
    try:
        import pandas as pd
        names = [metadata[0] for metadata in cursor.description]
        return pd.DataFrame.from_records(cursor.fetchall(), columns=names)
    except ImportError:
        print("Failed to import pandas")
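A variant of that idea using the logging module instead of print, so applications can filter or silence the message. Logger name and message text here are illustrative, not impyla's actual behaviour:

```python
import logging

log = logging.getLogger('impala.util')

try:
    import pandas as pd
except ImportError:
    pd = None
    # a warning routed through logging can be configured away by the app
    log.warning('pandas is not available; as_pandas() will raise if called')

def as_pandas(cursor):
    """Convert a DB API cursor's result set to a pandas DataFrame."""
    if pd is None:
        raise ImportError('as_pandas() requires pandas to be installed')
    names = [metadata[0] for metadata in cursor.description]
    return pd.DataFrame.from_records(cursor.fetchall(), columns=names)
```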

Cursor methods get_tables() and get_databases() crash

Cursor methods get_tables() and get_databases() produce the following error:
AttributeError: 'TGetSchemasResp' object has no attribute 'operation_handle'

The reason is that operation_handle in rpc.py is misspelled: it should be operationHandle.

Invalid method name 'OpenSession' when I try to connect

When I run impala.dbapi.connect(port=21000).cursor(), I get this error:

/home/julia/clones/impyla/impala/dbapi.py in cursor(self, session_handle, user, configuration)
     71             user = getpass.getuser()
     72         if session_handle is None:
---> 73             session_handle = impala.rpc.open_session(self.service, user, configuration)
     74         return Cursor(self.service, session_handle)
     75 

/home/julia/clones/impyla/impala/rpc.py in wrapper(*args, **kwargs)
    116                 if not transport.isOpen():
    117                     transport.open()
--> 118                 return func(*args, **kwargs)
    119             except socket.error as e:
    120                 pass

/home/julia/clones/impyla/impala/rpc.py in open_session(service, user, configuration)
    186 def open_session(service, user, configuration=None):
    187     req = TOpenSessionReq(username=user, configuration=configuration)
--> 188     resp = service.OpenSession(req)
    189     err_if_rpc_not_ok(resp)
    190     return resp.sessionHandle

/home/julia/clones/impyla/impala/cli_service/TCLIService.py in OpenSession(self, req)
    152     """
    153     self.send_OpenSession(req)
--> 154     return self.recv_OpenSession()
    155 
    156   def send_OpenSession(self, req):

/home/julia/clones/impyla/impala/cli_service/TCLIService.py in recv_OpenSession(self)
    168       x.read(self._iprot)
    169       self._iprot.readMessageEnd()
--> 170       raise x
    171     result = OpenSession_result()
    172     result.read(self._iprot)

TApplicationException: Invalid method name: 'OpenSession'

What does this error mean? I'm not sure how to debug this.

Kerberos support in HDFS client

I've just installed impyla on our cluster and have had success using it for query execution; however, it seems it's using the pywebhdfs module, which doesn't support Kerberized clusters, so none of the HDFS-related functionality works. Are there plans to move to a module that supports Kerberos auth?
For instance, https://pypi.python.org/pypi/hdfs/0.5.0.

Connection error when shipping UDF

Hi!

I finally had some time to try out the Python UDF goodness, but I ran into some problems. I'm running a 3-node CDH5.1.2 cluster on Amazon AWS. I tried one of the simple UDFs that is used in the Impyla tests, like this:

from impala.udf import udf, ship_udf
from impala.udf.types import (FunctionContext, BooleanVal, SmallIntVal, IntVal,
                              BigIntVal, StringVal)
from impala.context import ImpalaContext

@udf(BigIntVal(FunctionContext, IntVal))
def int_promotion(context, x):
    return x + 1

ctx = ImpalaContext(nn_host='ec2-54-171-167-66.eu-west-1.compute.amazonaws.com', host='ec2-54-171-176-150.eu-west-1.compute.amazonaws.com', hdfs_user='hdfs', webhdfs_port=50070)

ship_udf(ctx, int_promotion, overwrite=True)

which gives me the following error:

ConnectionError: HTTPConnectionPool(host='ip-172-31-16-101.eu-west-1.compute.internal', port=50075): Max retries exceeded with url: /webhdfs/v1/tmp/impyla-MLXIPYFS/int_promotion.ll?op=CREATE&user.name=hdfs&namenoderpcaddress=ip-172-31-16-100.eu-west-1.compute.internal:8020&overwrite=true (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)

I'm sure it's something simple like opening the right port on either the NN or the Data Nodes, but I'm not seeing it :)

Query options passed in cursor.execute(configuration={dict of query options}) don't work

Say I'm trying to limit the memory of a query:
In [2]: conn = connect(host='localhost', port=21050)
In [3]: cur = conn.cursor()
In [4]: cur.execute("SELECT COUNT(*) FROM tpch.lineitem", configuration={"MEM_LIMIT":"1"})
In [5]: cur.fetchall()
Out[5]: [(6001215,)]

This should fail because I'm limiting the memory down to 1 byte.

If I set the configuration at cursor creation, it does fail as expected:
In [11]: limited_cur = conn.cursor(configuration={"MEM_LIMIT":"1"})
In [12]: limited_cur.execute("SELECT COUNT(*) FROM tpch.lineitem")

In [13]: limited_cur.fetchall()
~~~error stuff~~~
HiveServer2Error: Memory limit exceeded

CREATE TABLE query with a lot of fields doesn't work

Hi,

I'm working on the Cloudera single-node virtual machine. I'm trying to create a table with a lot of fields through the code below:

# -*- coding: utf8 

from impala.dbapi import connect

conn = connect(host='localhost',
               port=21050)

cur = conn.cursor()

sql = '''CREATE TABLE row_table (a  smallint,
b   smallint,
c   smallint,
d   BIGINT,
e   SMALLINT,
f    INT,
g    INT,
h   SMALLINT,
i    INT,
j    INT,
k    INT,
l   SMALLINT,
m   smallint,
n    INT,
o   smallint,
p    INT,
q    BIGINT,
r    BIGINT,
s    BIGINT,
t    BIGINT,
u    BIGINT,
v    BIGINT,
w    BIGINT,
x    BIGINT,
y    BIGINT,
z    BIGINT,
aa   INT,
ab   INT,
ac   INT,
ad   INT,
ae   INT,
af   INT,
ag   INT,
ah   INT,
ai   INT,
aj   INT,
ak   INT,
al   INT,
am   INT,
an   INT,
ao   INT,
ap   INT,
aq   INT,
ar   INT,
as   INT,
at   INT,
au   INT,
av   INT,
aw   INT,
ax   INT,
ay  smallint,
az   BIGINT,
ba   BIGINT,
bb  smallint,
bc   BIGINT,
bd   BIGINT,
be  smallint,
bf  smallint,
bg   BIGINT,
bh   BIGINT,
bj  smallint,
bk   BIGINT,
bl   BIGINT,
bm  STRING,
bn  STRING,
bo  STRING,
bp  STRING,
bq  STRING,
br  STRING,
bs  STRING,
bt  STRING,
bu  STRING,
bv  STRING,
bw   INT,
bx  STRING,
by  STRING,
bz  STRING,
ca   BIGINT,
cb   BIGINT,
cc  smallint,
cd  smallint,
ce  STRING,
cf  smallint,
cg   INT,
ch  smallint,
cj   INT,
ck  smallint,
cl  smallint,
cm  smallint,
cn  smallint,
co  smallint,
cp  smallint,
cq  smallint,
cr   INT,
cs   INT,
ct   INT,
cu   INT,
cv  STRING,
cw  STRING,
cx  STRING,
cy  STRING,
cz   BIGINT,
da   BIGINT,
db   BIGINT,
dc   BIGINT,
dd   BIGINT,
de   BIGINT,
df   BIGINT,
dg   BIGINT,
dh   BIGINT,
dj   BIGINT,
dk   INT,
dl   INT,
dm   INT,
dn   INT,
do   INT,
dp   INT,
dq   BIGINT,
dr   BIGINT,
ds   BIGINT,
dt   BIGINT,
du   BIGINT,
dv   BIGINT,
dw   BIGINT,
dx   BIGINT,
dy   BIGINT,
dz   BIGINT,
ea   BIGINT,
eb   BIGINT,
ec   BIGINT,
ed   BIGINT,
ee   BIGINT,
ef   INT,
eg   BIGINT,
eh   BIGINT,
ej   BIGINT,
ek   BIGINT,
el   BIGINT,
em   BIGINT,
en   BIGINT,
eo   BIGINT,
ep   BIGINT,
eq   BIGINT,
er   BIGINT,
es   BIGINT,
et   BIGINT,
eu   BIGINT,
ev   BIGINT,
ew   BIGINT,
ex   BIGINT,
ey   BIGINT,
ez   BIGINT,
fa   BIGINT,
fb   BIGINT,
fc   BIGINT,
fd   BIGINT,
fe   BIGINT,
ff   BIGINT,
fg   BIGINT,
fh   BIGINT,
fj   BIGINT,
fk   BIGINT,
fl   BIGINT,
fm   INT,
fn   BIGINT,
fo   BIGINT,
fp   BIGINT,
fq   BIGINT,
fr   BIGINT,
fs   BIGINT,
ft   BIGINT,
fu   BIGINT,
fv   BIGINT,
fw   BIGINT,
fx   BIGINT,
fy   BIGINT,
fz   BIGINT,
gca  BIGINT,
gcb  BIGINT,
gcc  BIGINT,
gcd  BIGINT,
gce  BIGINT,
gcf  BIGINT,
gcg  BIGINT,
gch  BIGINT,
gcj  BIGINT,
gck  BIGINT,
gcl  BIGINT,
gcm  BIGINT,
gcn STRING,
gco STRING,
gcp  INT,
gcq  BIGINT,
gcr  BIGINT,
gcs  BIGINT,
gct  BIGINT,
gcu  BIGINT,
gcv  INT,
gcw  INT,
gcx  INT,
gcy  INT,
gcz  INT,
ger  INT,
get  INT,
gey  INT,
geu  INT
) STORED AS PARQUET;'''
cur.execute(sql)

It doesn't work; it generates the following message:

Traceback (most recent call last):
  File "/home/cloudera/Public/base_big_data/base/test_big_query_impyla.py", line 219, in <module>
    cur.execute(sql)
  File "/home/cloudera/Public/base_big_data/virtenv_py26/lib/python2.6/site-packages/impala/dbapi.py", line 156, in execute
    self._execute_sync(op)
  File "/home/cloudera/Public/base_big_data/virtenv_py26/lib/python2.6/site-packages/impala/dbapi.py", line 162, in _execute_sync
    operation_fn()
  File "/home/cloudera/Public/base_big_data/virtenv_py26/lib/python2.6/site-packages/impala/dbapi.py", line 155, in op
    self.service, self.session_handle, self._last_operation_string)
  File "/home/cloudera/Public/base_big_data/virtenv_py26/lib/python2.6/site-packages/impala/rpc.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "/home/cloudera/Public/base_big_data/virtenv_py26/lib/python2.6/site-packages/impala/rpc.py", line 203, in execute_statement
    err_if_rpc_not_ok(resp)
  File "/home/cloudera/Public/base_big_data/virtenv_py26/lib/python2.6/site-packages/impala/error.py", line 57, in err_if_rpc_not_ok
    raise RPCError("RPC status error: %s: %s" % (resp.__class__.__name__, str(resp.status)))
impala.error.RPCError: RPC status error: TExecuteStatementResp: TStatus(errorCode=None, errorMessage='AnalysisException: Syntax error in line 45:\nas\t INT,\n^\nEncountered: AS\nExpected: IDENTIFIER\n\nCAUSED BY: Exception: Syntax error', sqlState='HY000', infoMessages=None, statusCode=3)

If I copy and paste the CREATE TABLE query into a file and execute it through "impala-shell -f query.txt", it works perfectly. So I suspect this is a problem in the impyla library.

What do you think about this?

Error in running python scripts to initiate subprocess

I have two python scripts which I am using to initiate subprocess. Following is the structure of my scripts:

MAIN.py

This script does nothing except initiate the whole process and call another Python script through subprocess:

import subprocess
import sys

paramet = ['parameter']
result = subprocess.Popen([sys.executable, "./sub_process.py"] + paramet)
result.wait()
sub_process.py

This is the script that first executes a bunch of SQL statements and then initiates another subprocess by calling itself again. The number of subprocesses spawned depends on the parameter passed with each subprocess call. The SQL statements are mostly INSERT statements.

import subprocess
import sys
from impala.dbapi import connect

conn = connect(host=host_ip, port=21050, timeout=3600)
cursor = conn.cursor()

sql = "SQL statement"
cursor.execute(sql)
result_set = cursor.fetchall()
for col1, col2 in result_set:
    sql = "SQL statement 2"
    cursor.execute(sql)

    sql = "SQL statement 3"
    cursor.execute(sql)

    if paramet == 'para':
        result = subprocess.Popen([sys.executable, "./sub_process.py"] + paramet)
result.wait()

Now this setup works fine on my local machine but when I am trying to run the same setup on server it throws following error:

File "/usr/local/lib/python2.7/site-packages/impala/dbapi/hiveserver2.py", line 151, in execute
self._execute_sync(op)
File "/usr/local/lib/python2.7/site-packages/impala/dbapi/hiveserver2.py", line 159, in _execute_sync
self._wait_to_finish() # make execute synchronous
File "/usr/local/lib/python2.7/site-packages/impala/dbapi/hiveserver2.py", line 181, in _wait_to_finish
raise OperationalError("Operation is in ERROR_STATE")
OperationalError: Operation is in ERROR_STATE

Even if I execute only one SQL statement instead of a bunch, the same error occurs on the server but not on my local machine. I tried to look for reasons for this but couldn't find anything that explains the error.

I have CentOS 6.6 on my server

How can I find the reason for this and resolve it? Also, what could be a workaround to get the same behavior another way if this is not resolved? What I basically want is that once a process reaches a certain point in its execution, it starts another process and continues executing simultaneously. Once it is done, it should wait for the subprocess to finish before exiting.

Executing statements in multiple processes simultaneously produces an error

Example case:

from impala.dbapi import connect
from multiprocessing import Process

conn = connect("127.0.0.1", 21050)
cursor = conn.cursor()

def f():
    cursor.execute("SHOW TABLES")
    print(cursor.fetchall())

def g():
    Process(target=f).start()

[g() for i in range(10)]

Error:

Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "<stdin>", line 2, in f
  File "/usr/lib/python2.6/site-packages/impala/dbapi.py", line 156, in execute
    self._execute_sync(op)
  File "/usr/lib/python2.6/site-packages/impala/dbapi.py", line 164, in _execute_sync
    self._wait_to_finish() # make execute synchronous
  File "/usr/lib/python2.6/site-packages/impala/dbapi.py", line 185, in _wait_to_finish
    self._last_operation_handle)
  File "/usr/lib/python2.6/site-packages/impala/rpc.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/impala/rpc.py", line 329, in get_operation_status
    return TOperationState._VALUES_TO_NAMES[resp.operationState]
KeyError: None

This doesn't happen if conn.cursor() is called in f so that each process gets a new cursor. However, I don't think this is the error that should be produced in this scenario.
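A pattern that sidesteps the shared-transport problem entirely is to open a fresh connection inside each worker process. A sketch under that assumption (host and port are placeholders; nothing here connects at import time):

```python
from multiprocessing import Process

def show_tables(host='127.0.0.1', port=21050):
    # Each process opens its own connection: the underlying Thrift
    # transport is a socket and is not safe to share across fork().
    from impala.dbapi import connect
    conn = connect(host, port)
    cursor = conn.cursor()
    cursor.execute('SHOW TABLES')
    print(cursor.fetchall())
    conn.close()

def run_parallel(n=10):
    procs = [Process(target=show_tables) for _ in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```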

UDF clang settings?

Using the latest master 80936d4

I'm trying to submit a simple udf to impala, but it seems to fail with

ERROR: AnalysisException: Could not load binary: hdfs://impala/user/impala_test/udf/test.ll
Could not parse module /tmp/test.3723.6.ll: Malformed block record

The udf itself is just taken from the examples in the tests.

I'm using the version of llvm included with numpy and conda, and a locally prefixed version of boost.

Is there something specific I'm missing?

impalad version 1.4.0-cdh4

$ clang --version
clang version 3.3 (tags/RELEASE_33/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix

numba version: 0.14
numpy version: 1.9
python version: 2.7.8 

Cursor and Connection classes should implement __exit__ method

Cursor and Connection classes should both implement the __exit__ method so that they can be used in a with statement. Currently the following code produces an error:

with connect("127.0.0.1", 21050) as conn:
with conn.cursor() as cursor:
cursor.execute("show tables")
print cursor.fetchall()

Traceback (most recent call last):
File "", line 1, in
AttributeError: 'Connection' object has no attribute '__exit__'
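A minimal sketch of the requested behavior, using hypothetical stand-in classes; in impyla itself, __enter__/__exit__ would go on the real Connection and Cursor classes, whose close() methods tear down the underlying Thrift session and operation handles:

```python
# Sketch (hypothetical minimal classes): adding __enter__/__exit__ so
# the with-statement from the report works. Returning None from
# __exit__ lets any exception propagate after cleanup.
class Connection(object):
    def __init__(self):
        self.closed = False

    def cursor(self):
        return Cursor()

    def close(self):
        self.closed = True  # real code: close the Thrift session

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


class Cursor(object):
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True  # real code: close the operation handle

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

With this in place, both resources are closed automatically when their with blocks exit, even on error.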

Better error messages

Pass through prettier error messages when a query is malformed etc. Currently, there is always an RPCError. But the actual Impala error should be extracted and formatted.

Support Python 3

I'm happy to help with this if testing can be made convenient.

invalid query handle

I was wondering if anyone has experienced this error and knows what may be causing it? We seem to suffer from it periodically. Thanks.

HS2Error when running as_pandas

I'm running a smallish query (the result is 8MB of data), and getting an HS2Error when I try to read the data. as_pandas is working on smaller queries. Any idea what could be going on here?

Here's what I'm running:

import impala.dbapi
from impala.util import as_pandas
c = impala.dbapi.connect(port=21050).cursor() # works fine
c.execute("[my query]") # works fine
df = as_pandas(c) # oh no!

and the error:

---------------------------------------------------------------------------
HS2Error                                  Traceback (most recent call last)
<ipython-input-5-bee92ca13acd> in <module>()
----> 1 df = as_pandas(c)

/Users/jocelyn/anaconda/lib/python2.7/site-packages/impyla-0.9.0_dev-py2.7.egg/impala/util.pyc in as_pandas(cursor)
     21     def as_pandas(cursor):
     22         names = [metadata[0] for metadata in cursor.description]
---> 23         return pd.DataFrame([dict(zip(names, row)) for row in cursor], columns=names)
     24 except ImportError:
     25     print "Failed to import pandas"

/Users/jocelyn/anaconda/lib/python2.7/site-packages/impyla-0.9.0_dev-py2.7.egg/impala/dbapi.pyc in next(self)
    246             rows = impala.rpc.fetch_results(self.service,
    247                     self._last_operation_handle, self.description,
--> 248                     self.buffersize)
    249             self._buffer.extend(rows)
    250             if len(self._buffer) == 0:

/Users/jocelyn/anaconda/lib/python2.7/site-packages/impyla-0.9.0_dev-py2.7.egg/impala/rpc.pyc in wrapper(*args, **kwargs)
    116                 if not transport.isOpen():
    117                     transport.open()
--> 118                 return func(*args, **kwargs)
    119             except socket.error as e:
    120                 pass

/Users/jocelyn/anaconda/lib/python2.7/site-packages/impyla-0.9.0_dev-py2.7.egg/impala/rpc.pyc in fetch_results(service, operation_handle, schema, max_rows, orientation)
    235                            maxRows=max_rows)
    236     resp = service.FetchResults(req)
--> 237     err_if_rpc_not_ok(resp)
    238
    239     rows = []

/Users/jocelyn/anaconda/lib/python2.7/site-packages/impyla-0.9.0_dev-py2.7.egg/impala/error.pyc in err_if_rpc_not_ok(resp)
     55     if (resp.status.statusCode != TStatusCode._NAMES_TO_VALUES['SUCCESS_STATUS'] and
     56             resp.status.statusCode != TStatusCode._NAMES_TO_VALUES['SUCCESS_WITH_INFO_STATUS']):
---> 57         raise HS2Error(resp.status.errorMessage)

HS2Error: Invalid session id

UDF - where is numba.ext.impala?

I am probably missing something obvious here, so bear with me...

I have installed and am currently using impyla to connect to and query impala. I have also installed the dependencies for UDFs (including numba, llvm, llvmpy, boost), but importing numba.ext.impala does not work. Where do I need to get this?

Thrift error connecting to cdh 5.4

Using impyla 0.9.1 against CDH 5.4 (HiveServer2), I am getting a socket timeout.

Getting the following thrift error:
File "/Users/jlord/cops/exphdfs.py", line 30, in
cursor = conn.cursor()
File "/Users/jlord/anaconda/lib/python2.7/site-packages/impyla-0.9.1-py2.7.egg/impala/dbapi/hiveserver2.py", line 55, in cursor
rpc.open_session(self.service, user, configuration))
File "/Users/jlord/anaconda/lib/python2.7/site-packages/impyla-0.9.1-py2.7.egg/impala/_rpc/hiveserver2.py", line 132, in wrapper
return func(*args, **kwargs)
File "/Users/jlord/anaconda/lib/python2.7/site-packages/impyla-0.9.1-py2.7.egg/impala/_rpc/hiveserver2.py", line 214, in open_session
resp = service.OpenSession(req)
File "/Users/jlord/anaconda/lib/python2.7/site-packages/impyla-0.9.1-py2.7.egg/impala/_thrift_gen/TCLIService/TCLIService.py", line 175, in OpenSession
return self.recv_OpenSession()
File "/Users/jlord/anaconda/lib/python2.7/site-packages/impyla-0.9.1-py2.7.egg/impala/_thrift_gen/TCLIService/TCLIService.py", line 186, in recv_OpenSession
(fname, mtype, rseqid) = self._iprot.readMessageBegin()
File "build/bdist.macosx-10.5-x86_64/egg/thrift/protocol/TBinaryProtocol.py", line 140, in readMessageBegin
File "build/bdist.macosx-10.5-x86_64/egg/thrift/transport/TTransport.py", line 58, in readAll
File "build/bdist.macosx-10.5-x86_64/egg/thrift/transport/TTransport.py", line 159, in read
File "build/bdist.macosx-10.5-x86_64/egg/thrift/transport/TSocket.py", line 105, in read
socket.timeout: timed out

Then, from the HiveServer2 log:

7:21:32.041 PM ERROR org.apache.thrift.server.TThreadPoolServer
Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Invalid status -128
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Invalid status -128
at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:184)
at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
... 4 more

Example:

conn = connect(host='lannister-002.edh.cloudera.com',port=10000)
cursor = conn.cursor()
cursor.execute('SELECT * FROM test_customer_logs LIMIT 100')
print cursor.description # prints the result set's schema
results = cursor.fetchall()
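As an assumption (not confirmed in this report): the server-side "Invalid status -128" from TSaslServerTransport usually means the client opened a plain, non-SASL transport against a SASL-enabled HiveServer2. For impyla versions that expose an auth_mechanism argument on connect(), the corrected call might be sketched as below; the hostname and port are the reporter's own, and 'PLAIN' requires thrift_sasl to be installed:

```python
# Hedged sketch: negotiate SASL/PLAIN instead of a raw socket against a
# SASL-enabled HiveServer2. auth_mechanism is an assumption about the
# client version in use; older impyla releases used a use_ldap/use_kerberos
# style of configuration instead.
conn_kwargs = {
    'host': 'lannister-002.edh.cloudera.com',
    'port': 10000,
    'auth_mechanism': 'PLAIN',  # requires thrift_sasl
}
# from impala.dbapi import connect
# conn = connect(**conn_kwargs)
```

If the server really has SASL disabled, the mismatch can also go the other way; the transport settings on client and server simply have to agree.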

as_pandas() doesn't work on empty result set

If you do something like so:
cursor.execute("""select * from philip_data where 0=1"""); as_pandas(cursor)
You'll get the following exception. Presumably there's a notion of an empty DataFrame, and it should be supported.

I'm doing this all in an IPython notebook, for what it's worth.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in ()
----> 1 cursor.execute("""select * from philip_data where 0=1"""); as_pandas(cursor)

/home/philip/work/env/lib/python2.7/site-packages/impyla-0.8.1-py2.7.egg/impala/util.pyc in as_pandas(cursor)
     21     def as_pandas(cursor):
     22         names = [metadata[0] for metadata in cursor.description]
---> 23         return pd.DataFrame([dict(zip(names, row)) for row in cursor], columns=names)
     24 except ImportError:
     25     print "Failed to import pandas"

/home/philip/work/env/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    253             else:
    254                 mgr = self._init_ndarray(data, index, columns, dtype=dtype,
--> 255                                          copy=copy)
    256         elif isinstance(data, collections.Iterator):
    257             raise TypeError("data argument can't be an iterator")

/home/philip/work/env/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _init_ndarray(self, values, index, columns, dtype, copy)
    365             columns = _ensure_index(columns)
    366 
--> 367         return create_block_manager_from_blocks([values.T], [columns, index])
    368 
    369     @property

/home/philip/work/env/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in create_block_manager_from_blocks(blocks, axes)
   3228         blocks = [getattr(b, 'values', b) for b in blocks]
   3229         tot_items = sum(b.shape[0] for b in blocks)
-> 3230         construction_error(tot_items, blocks[0].shape[1:], axes, e)
   3231 
   3232 

/home/philip/work/env/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in construction_error(tot_items, block_shape, axes, e)
   3209         raise e
   3210     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 3211         passed,implied))
   3212 
   3213 

ValueError: Shape of passed values is (0, 0), indices imply (12, 0)
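A hedged sketch of an empty-safe variant (as_pandas_safe is a hypothetical name; the real fix would live inside impala.util.as_pandas): when the cursor yields no rows, build the DataFrame from the column names alone so the schema survives:

```python
import pandas as pd

# Sketch: construct an empty DataFrame with the right columns instead of
# passing a zero-row array whose shape conflicts with the column index.
def as_pandas_safe(cursor):
    names = [metadata[0] for metadata in cursor.description]
    rows = list(cursor)
    if not rows:
        return pd.DataFrame(columns=names)  # empty frame, schema intact
    return pd.DataFrame(rows, columns=names)
```

The guard avoids the shape mismatch above because pandas accepts a columns-only constructor for the zero-row case.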
