
pyhive's Introduction

Project is currently unsupported


PyHive

PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Presto, Hive, and Trino.

Usage

DB-API

from pyhive import presto  # or: from pyhive import hive / from pyhive import trino
cursor = presto.connect('localhost').cursor()  # or use hive.connect or trino.connect
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10')
print(cursor.fetchone())
print(cursor.fetchall())

DB-API (asynchronous)

from pyhive import hive
from TCLIService.ttypes import TOperationState
cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10', async_=True)

status = cursor.poll().operationState
while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
    logs = cursor.fetch_logs()
    for message in logs:
        print(message)

    # If needed, an asynchronous query can be cancelled at any time with:
    # cursor.cancel()

    status = cursor.poll().operationState

print(cursor.fetchall())

In Python 3.7 async became a reserved keyword, so the example above uses the async_ spelling, which PyHive also accepts. On Python versions before 3.7 the parameter can still be spelled async:

cursor.execute('SELECT * FROM my_awesome_data LIMIT 10', async=True)

SQLAlchemy

First install this package to register it with SQLAlchemy (see entry_points in setup.py).

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
# Presto
engine = create_engine('presto://localhost:8080/hive/default')
# Trino
engine = create_engine('trino+pyhive://localhost:8080/hive/default')
# Hive
engine = create_engine('hive://localhost:10000/default')

# SQLAlchemy < 2.0
logs = Table('my_awesome_data', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=logs).scalar())

# Hive + HTTPS + LDAP or basic Auth
engine = create_engine('hive+https://username:password@localhost:10000/')
logs = Table('my_awesome_data', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=logs).scalar())

# SQLAlchemy >= 2.0
metadata_obj = MetaData()
books = Table(
    "books",
    metadata_obj,
    Column("id", Integer),
    Column("title", String),
    Column("primary_author", String),
)
metadata_obj.create_all(engine)
inspector = inspect(engine)
inspector.get_columns('books')

with engine.connect() as con:
    data = [{ "id": 1, "title": "The Hobbit", "primary_author": "Tolkien" }, 
            { "id": 2, "title": "The Silmarillion", "primary_author": "Tolkien" }]
    con.execute(books.insert(), data)  # insert all rows
    result = con.execute(text("select * from books"))
    print(result.fetchall())

Note: query generation functionality is not exhaustive or fully tested, but there should be no problem with raw SQL.

Passing session configuration

# DB-API
hive.connect('localhost', configuration={'hive.exec.reducers.max': '123'})
presto.connect('localhost', session_props={'query_max_run_time': '1234m'})
trino.connect('localhost',  session_props={'query_max_run_time': '1234m'})
# SQLAlchemy
create_engine(
    'presto://user@host:443/hive',
    connect_args={'protocol': 'https',
                  'session_props': {'query_max_run_time': '1234m'}}
)
create_engine(
    'trino+pyhive://user@host:443/hive',
    connect_args={'protocol': 'https',
                  'session_props': {'query_max_run_time': '1234m'}}
)
create_engine(
    'hive://user@host:10000/database',
    connect_args={'configuration': {'hive.exec.reducers.max': '123'}},
)
# SQLAlchemy with LDAP
create_engine(
    'hive://user:password@host:10000/database',
    connect_args={'auth': 'LDAP'},
)

Requirements

Install using

  • pip install 'pyhive[hive]' or pip install 'pyhive[hive_pure_sasl]' for the Hive interface
  • pip install 'pyhive[presto]' for the Presto interface
  • pip install 'pyhive[trino]' for the Trino interface

Note: the 'pyhive[hive]' extra uses the sasl package, which does not support Python 3.11 (see the GitHub issue). PyHive therefore also supports pure-sasl via the additional extra 'pyhive[hive_pure_sasl]', which does support Python 3.11.

PyHive works with

Changelog

See https://github.com/dropbox/PyHive/releases.

Contributing

  • Please fill out the Dropbox Contributor License Agreement at https://opensource.dropbox.com/cla/ and note this in your pull request.
  • Changes must come with tests, with the exception of trivial things like fixing comments. See .travis.yml for the test environment setup.
  • Notes on project scope:
    • This project is intended to be a minimal Hive/Presto client that does that one thing and nothing else. Features that can be implemented on top of PyHive, such as integration with your favorite data analysis library, are likely out of scope.
    • We prefer having a small number of generic features over a large number of specialized, inflexible features. For example, the Presto code takes an arbitrary requests_session argument for customizing HTTP calls, as opposed to having a separate parameter/branch for each requests option.

Tips for test environment setup

You can set up a test environment by following .travis.yml in this repository. It uses Cloudera's CDH 5, which requires a username and password to download; it may not be feasible for everyone to get those credentials. Below are alternative instructions for setting up a test environment.

You can clone this repository, which has a Docker Compose setup for Presto and Hive. Add the lines below to its docker-compose.yaml to start Trino in the same environment:

trino:
    image: trinodb/trino:351
    ports:
        - "18080:18080"
    volumes:
        - ./trino:/etc/trino

Note: the ./trino volume defined above points at the Trino config directory in the PyHive repository.

Then run:

docker-compose up -d

Testing


Run the following in an environment with Hive/Presto:

./scripts/make_test_tables.sh
virtualenv --no-site-packages env
source env/bin/activate
pip install -e .
pip install -r dev_requirements.txt
py.test

WARNING: This drops/creates tables named one_row, one_row_complex, and many_rows, plus a database called pyhive_test_database.

Updating TCLIService

The TCLIService module is autogenerated from a TCLIService.thrift file. To update it, run python generate.py <TCLIServiceURL>. If the URL is omitted, the version for Hive 2.3 is downloaded.

pyhive's People

Contributors

betodealmeida, bkyryliuk, cpcloud, devinstevenson, dpgaspar, elukey, fokko, jingw, klaussfreire, lordshinjo, mariusvniekerk, matthewwardrop, mb-m, mdeshmu, mrocklin, nicholas-miles, petobens, ptallada, quasiben, ralnoc, serenajiang, shashwatarghode, shkr, tarekrached, timifasubaa, usiel, wgzhao, wxiang7, xd-deng, znd4


pyhive's Issues

Release 0.1.6?

The commits that include tcliservice and thrift_sasl as a dependency make this library a whole lot easier to install and consume. Can we get a release?

`select * from table sort by rand() limit 100` does not work

Trying the query

select * from table sort by rand() limit 100

gives the following error:

DatabaseError: Execution failed on sql: select * from table sort by rand() limit 100
TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState='08S01', infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:29:28', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:314', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:146', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:173', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:256', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:376', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:357', 'sun.reflect.GeneratedMethodAccessor112:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:606', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:79', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:37', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:64', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:415', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1628', 'org.apache.hadoop.hive.shims.HadoopShimsSecure:doAs:HadoopShimsSecure.java:536', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:60', 'com.sun.proxy.$Proxy23:executeStatement::-1', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:234', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:401', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:206', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1145', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:615', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)
unable to rollback

It works in beeline and the Hive CLI after opening a Tez session.

pyhive "create table as select"

Hello,

I managed to create an empty table and to read data from HiveServer2 without any problem. However, when I try to perform a "create table as select", the engine fails to launch it on HiveServer.

Here is the error :
TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState='08S01', infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:28:27', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:315', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:156', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:183', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:257', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:419', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:400', 'sun.reflect.GeneratedMethodAccessor88:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:497', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1657', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy20:executeStatement::-1', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:261', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:486', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1317', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1302', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:285', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)

Thanks a lot for your help 💃

Pre-fetch method

When working with large amounts of data, it'd be nice to have fetch continue to pull records in another thread. For example, cursor.prefetchmany(100000) would return 100k rows on the first call, then spawn a new thread to fetch the next 100k rows.
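Such a prefetch API does not exist in PyHive, but it can be approximated on top of any DB-API cursor. A minimal sketch, assuming a standard cursor; PrefetchingCursor and its prefetchmany are hypothetical helpers, not part of the library:

import threading

class PrefetchingCursor(object):
    """Hypothetical wrapper: fetch the next batch in a background
    thread while the caller processes the current one."""

    def __init__(self, cursor, batch_size=100000):
        self._cursor = cursor
        self._batch_size = batch_size
        self._thread = None
        self._next_batch = None

    def _fetch(self):
        self._next_batch = self._cursor.fetchmany(self._batch_size)

    def prefetchmany(self):
        if self._thread is None:
            self._fetch()          # first call: fetch synchronously
        else:
            self._thread.join()    # wait for the background fetch
        batch = self._next_batch
        if batch:                  # kick off the next fetch in the background
            self._thread = threading.Thread(target=self._fetch)
            self._thread.start()
        return batch

# usage: each batch is fetched while the previous one is being processed
# pc = PrefetchingCursor(cursor)
# while True:
#     batch = pc.prefetchmany()
#     if not batch:
#         break
#     process(batch)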

Hive connections not working on Windows

Using Anaconda2 with sasl from: http://www.lfd.uci.edu/~gohlke/pythonlibs/ allows the package to install and load, but establishing a connection fails:

In [1]: from pyhive import hive

In [2]: connection = hive.connect(xxxxx)
---------------------------------------------------------------------------
TTransportException                       Traceback (most recent call last)
<ipython-input-2-6036f792e6bb> in <module>()
----> 1 connection = hive.connect(xxxxx)

C:\Anaconda2\lib\site-packages\pyhive\hive.pyc in connect(*args, **kwargs)
     59     :returns: a :py:class:`Connection` object.
     60     """
---> 61     return Connection(*args, **kwargs)
     62
     63

C:\Anaconda2\lib\site-packages\pyhive\hive.pyc in __init__(self, host, port, username, database, configuration)
     84
     85         try:
---> 86             self._transport.open()
     87             open_session_req = ttypes.TOpenSessionReq(
     88                 client_protocol=ttypes.TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V1,

C:\Anaconda2\lib\site-packages\thrift_sasl\__init__.pyc in open(self)
     70     if not ret:
     71       raise TTransportException(type=TTransportException.NOT_OPEN,
---> 72         message=("Could not start SASL: %s" % self.sasl.getError()))
     73
     74     # Send initial response

TTransportException: Could not start SASL: Error in sasl_client_start (-4) SASL(-4): no mechanism available: Unable to find a callback: 2

This is probably an issue with the SASL libraries not being readily available on Windows, but if anyone has managed to get PyHive working on Windows, I'd appreciate a pointer.

Support high availability for Presto DB-API

Currently you can only pass one host to connect(). Is there any interest in supporting multiple hosts? If one of the hosts doesn't respond, we could try another...

If yes, I can work on it and make a PR.
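PyHive has no built-in failover, but a thin wrapper can approximate it until then. A sketch, assuming the pyhive.presto DB-API shown above; the host names and the liveness check are illustrative:

from pyhive import presto

def connect_first_available(hosts, **kwargs):
    """Try each Presto coordinator in turn and return the first
    connection that answers a trivial query."""
    last_error = None
    for host in hosts:
        try:
            conn = presto.connect(host, **kwargs)
            cursor = conn.cursor()
            cursor.execute('SELECT 1')  # cheap liveness check
            cursor.fetchall()
            return conn
        except Exception as error:  # network errors, OperationalError, ...
            last_error = error
    raise last_error

conn = connect_first_available(['presto-a.example.com', 'presto-b.example.com'])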

Track presto / hive query progress

I am wondering if there is a way to track query progress within PyHive?

I have found that the library polls the stats on a regular basis:

:param poll_interval: int -- how often to ask the Presto REST interface for a progress

However, I couldn't find a way to retrieve the data needed to calculate the progress.
Any ideas on how it could be implemented?

For context, I am working on adding a SQL editor to Caravel, and that feature would be great for the users.
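The Presto cursor's poll() returns the raw status JSON from the REST API, which includes a stats section; the split counts there can drive a progress estimate. A hedged sketch, since the exact stats keys depend on your Presto version:

from pyhive import presto
import time

cursor = presto.connect('localhost').cursor()
cursor.execute('SELECT COUNT(*) FROM my_awesome_data')

status = cursor.poll()  # returns None once the query is done
while status is not None:
    stats = status.get('stats', {})
    completed = stats.get('completedSplits', 0)
    total = stats.get('totalSplits', 0)
    if total:
        print('progress: %.1f%% (%d/%d splits)' % (100.0 * completed / total, completed, total))
    time.sleep(1)
    status = cursor.poll()

print(cursor.fetchall())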

allow user to specify other authMechanism

I want to use PyHive integrated with SQLAlchemy to operate Hive and Presto.

Presto works well, but for Hive the authMechanism is fixed to PLAIN.
https://github.com/dropbox/PyHive/search?utf8=%E2%9C%93&q=PLAIN
So when the required mechanism is not PLAIN, it will complain:

thrift.transport.TTransport.TTransportException: TSocket read 0 bytes

Is there a way to support authMechanism like pyhs2 does?
https://github.com/BradRuderman/pyhs2/search?utf8=%E2%9C%93&q=authMechanism&type=Code

Allow inputting passwords to hive connection

Hello all!

I was wondering why there are no fields or parameters for passing a password when connecting to the Hive server? I think it's something PyHive could easily support. Maybe I missed it, but I could not figure it out.

If there is a workaround, please tell me how. Otherwise, I can work on adding that feature.
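For reference, later PyHive versions do accept a password on the Hive connection when auth is LDAP or CUSTOM. A sketch with placeholder credentials:

from pyhive import hive

# password is only honored together with auth='LDAP' or auth='CUSTOM'
connection = hive.connect(
    host='localhost',
    port=10000,
    username='user',    # placeholder
    password='secret',  # placeholder
    auth='LDAP',
)
cursor = connection.cursor()
cursor.execute('SELECT 1')
print(cursor.fetchall())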

ImportError: No module named builtins

The package won't import even though I upgraded with pip and made sure I'm using pyhive-0.2.1. I also ran sudo pip install future, and I am still getting this error:

from pyhive import hive
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pyhive/hive.py", line 13, in <module>
    from pyhive import common
  File "/Library/Python/2.7/site-packages/pyhive/common.py", line 8, in <module>
    from builtins import bytes
ImportError: No module named builtins

Array/Map data type support?

First of all, thank you for open sourcing such an awesome library! I'm just curious, but do you have any plan to integrate Array/Map datatype into PyHive?

Cannot find a way to create a Hive engine with SQLAlchemy

I'm trying similar code with the correct URI, and I get the error below.

create_engine(
    'hive://user@host:10000/database',
    connect_args={'configuration': {'hive.exec.reducers.max': '123'}},
)

NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:hive

Output Hive `INFO` and `WARN` messages

When running a query that spawns a Tez or MapReduce job, the beeline and Hive CLIs print messages about what Hive is doing. Is there a way to see these messages when a query is run with PyHive? At least the application_id when running under YARN?
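The Hive cursor's fetch_logs() (used in the asynchronous example near the top of this page) returns these server-side log lines, which normally include the YARN application id; note that HiveServer2 must have operation logging enabled. A sketch:

from pyhive import hive
from TCLIService.ttypes import TOperationState

cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT COUNT(*) FROM my_awesome_data', async_=True)

while cursor.poll().operationState in (
        TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
    for line in cursor.fetch_logs():  # INFO/WARN lines from HiveServer2
        print(line)

print(cursor.fetchall())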

Question: How can I get result types for a Hive query?

.description gives:

[(u'my_schema.col_one', u'BIGINT_TYPE', None, None, None, None, True),
 (u'my_schema.col_two', u'DATE_TYPE', None, None, None, None, True),
 (u'my_schema.col_three',
  u'BIGINT_TYPE',
  None,
  None,
  None,
  None,
  True)]

So the second field gives the types, but I would really like to be able to get at column types I can use in DDL statements. In this case, BIGINT, DATE, and BIGINT.

What's the best way to do this?

I see two options:

  1. String manipulation. Strip _TYPE off.
  2. Add it to DBAPITypeObject by looking it up in TCLIService.constants.TYPE_NAMES:
for type_id in constants.PRIMITIVE_TYPES:
    name = ttypes.TTypeId._VALUES_TO_NAMES[type_id]
    setattr(sys.modules[__name__], name, DBAPITypeObject([name], constants.TYPE_NAMES[type_id]))

Then add __str__ to DBAPITypeObject which returns a valid column name for use in DDL statements.
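For option 1, a minimal sketch; ddl_type below is a hypothetical helper that just strips the suffix, and it ignores types that need parameters (e.g. DECIMAL precision):

def ddl_type(description_entry):
    """Map a cursor.description type like 'BIGINT_TYPE' to a DDL name."""
    type_name = description_entry[1]
    if type_name.endswith('_TYPE'):
        return type_name[:-len('_TYPE')]
    return type_name

columns = []
for entry in cursor.description:
    name = entry[0].split('.')[-1]  # drop the 'my_schema.' prefix
    columns.append('%s %s' % (name, ddl_type(entry)))
print('CREATE TABLE my_copy (%s)' % ', '.join(columns))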

Error doing `SELECT` operations on Hive, `show databases` works fine

I have a Hive server using Tez that I can connect correctly:

import pyhive
from pyhive import hive

cursor = hive.connect(connect_host, port=10000, configuration={'job.queue.name': 'myqueue'}).cursor()
cursor.execute('use db_20160111') # works fine
cursor.execute('show databases') # works fine
cursor.execute('SELECT cgi FROM db_20160111.useful_result') # crashes

with the following error:

OperationalError: TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState='08S01', infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:29:28', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:314', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:146', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:173', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:256', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:376', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:357', 'sun.reflect.GeneratedMethodAccessor308:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:606', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:79', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:37', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:64', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:415', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1628', 'org.apache.hadoop.hive.shims.HadoopShimsSecure:doAs:HadoopShimsSecure.java:536', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:60', 'com.sun.proxy.$Proxy23:executeStatement::-1', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:234', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:401', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:206', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1145', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:615', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)

Thanks!

TypeError: execute() got an unexpected keyword argument 'async'

Getting this error when I test your code:

cursor = hive.connect(host="xx.xx.xx", port=10000, username="hdfs").cursor()
cursor.execute('SELECT * FROM investor.api LIMIT 10', async=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: execute() got an unexpected keyword argument 'async'

Better to be compatible with the older Python Hive interface

Hi there, my team recently encountered a problem while upgrading from HiveServer1 to HiveServer2: all methods in our Python code had to be changed from the older style (from ThriftHive) to the new PyHive style, such as fetchAll to fetchall, fetchN to fetchmany, etc. Also, when executing a query which doesn't return any results (like INSERT), the new PyHive raises an error while the older interface returned an empty list.

So I think it would be better if PyHive could provide the same method names for compatibility with older ThriftHive code, so that programmers can easily migrate their work to PyHive.

Thanks.
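A thin adapter on the caller's side can paper over the renames without any change to PyHive. A sketch; LegacyCursor is a hypothetical shim, not part of the library:

from pyhive import exc

class LegacyCursor(object):
    """Expose old ThriftHive-style method names on a PyHive cursor."""

    def __init__(self, cursor):
        self._cursor = cursor

    def fetchAll(self):
        try:
            return self._cursor.fetchall()
        except exc.ProgrammingError:  # e.g. "No result set" after INSERT
            return []

    def fetchN(self, n):
        return self._cursor.fetchmany(n)

    def __getattr__(self, name):  # delegate everything else unchanged
        return getattr(self._cursor, name)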

Installation problem: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 42

Hello,
My laptop is running Ubuntu 14.04 server, and python2.7-dev is installed. When installing PyHive, the errors below were printed:

Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/sasl/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-hRzjMT-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/sasl
Traceback (most recent call last):
  File "/usr/bin/pip", line 9, in <module>
    load_entry_point('pip==1.5.4', 'console_scripts', 'pip')()
  File "/usr/lib/python2.7/dist-packages/pip/__init__.py", line 235, in main
    return command.main(cmd_args)
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 161, in main
    text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 42: ordinal not in range(128)

Syntax error when importing the hive module after installing PyHive

Hi. I installed PyHive on my server, and it printed a success log:

Installing collected packages: thrift-sasl, pyhive
  Running setup.py install for thrift-sasl ... done
  Running setup.py install for pyhive ... done
Successfully installed pyhive-0.2.1 thrift-sasl-0.2.1

But when I try to import it, I get SyntaxError: invalid syntax. Details below:

Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyhive import hive
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/pyhive/hive.py", line 13, in <module>
    from pyhive import common
  File "/usr/lib/python2.6/site-packages/pyhive/common.py", line 219
    return {k: self.escape_item(v) for k, v in parameters.items()}
                                     ^

So did I do something wrong in my steps? Thanks.

New conda pkg

This isn't really an issue so much as an announcement: I built a conda package with all the dependencies:

  • sasl
  • thrift_sasl
  • TCLIService

Using conda:

conda install -c https://conda.binstar.org/blaze pyhive

This should work for linux-64 and osx.

pyhive.exc.OperationalError: TExecuteStatementResp

When connecting to Hive and selecting data into pandas, I get an exception.

my code:

# -*- coding: utf-8 -*-
from pyhive import hive
from impala.util import as_pandas
from string import Template

config = {
    'host': '127.0.0.1',
    'database': 'default'
}

def get_conn(conf):
    conn = hive.connect(**conf)
    return conn

def execute_hql(hql, params=None):
    conn = get_conn(config)
    cursor = conn.cursor()
    hql = Template(hql).substitute(params)
    cursor.execute(hql)
    df = as_pandas(cursor)
    return df

test.py

# -*- coding: utf-8 -*-
from pyhive import hive
from impala.util import as_pandas
import DB.hive_engines

hql = """
    SELECT
        keywords,
        count(keywords)
    FROM
        table
    WHERE
        eventname = 'xxx' AND
        cdate >= '$start_date' AND
        cdate <= '$end_date'
    GROUP BY
        keywords
"""

if __name__ == '__main__':
    params = {'start_date': '2016-04-01', 'end_date': '2016-04-03'}
    df = DB.hive_engines.execute_hql(hql, params)
    print(df)

error message

pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask', sqlState='08S01', infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask:28:27', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:326', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:146', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:173', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:268', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:410', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:391', 'sun.reflect.GeneratedMethodAccessor31:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:606', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:415', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1671', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy27:executeStatement::-1', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:245', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:509', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:285', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1145', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:615', 'java.lang.Thread:run:Thread.java:745'], statusCode=3), operationHandle=None)

Thanks!

Usage in Caravel

I use the PyHive package to connect Caravel to Hive. This works well, but for some diagrams the query fails. The problem is that the alias created in the SELECT is not used for ordering. I am not sure whether this is a PyHive or a Caravel problem. An example is below.

Wrong query:

SELECT source AS source, target AS target, SUM(value) AS mysum FROM sankey GROUP BY source, target ORDER BY SUM(value) DESC LIMIT 50000

This should have been generated:

SELECT source AS source, target AS target, SUM(value) AS mysum FROM sankey GROUP BY source, target ORDER BY mysum DESC LIMIT 50000

Inserting NULL values into Hive

Hi, thanks for making this software available to all. I am reaching out to see if I can get help with an issue I am having. I am trying to upload a pandas dataframe to Hive, but I run into a problem when the dataframe has None values.

from sqlalchemy import Column, Table, MetaData, types
from sqlalchemy.engine import create_engine
import contextlib

import pandas as pd
df = pd.DataFrame([['a', 'b', 'c'],['d', None, 'f'],['g', 'h', 'i']], 
                  columns=['col1', 'col2', 'col3'])

engine = create_engine('hive://user@host:10000/default')
try:
    with contextlib.closing(engine.connect()) as connection:

        cols = []
        for name, dtype in df.dtypes.iteritems():
            cols.append(Column(name, getattr(types, 'String')))  # 'types' is already imported

        table = Table('test_table', MetaData(bind=engine), *cols, schema='default')
        table.drop(checkfirst=True)
        table.create()

        ins = table.insert(df.to_dict('records'))
        connection.execute(ins)

        result = table.select().execute().fetchall()
        print(result)
finally:
    engine.dispose()

The code above results in the following error:

ProgrammingError: (pyhive.exc.ProgrammingError) Unsupported object None [SQL: u'INSERT INTO TABLE `default`.`test_table` VALUES (%(col1_0)s, %(col2_0)s, %(col3_0)s), (%(col1_1)s, %(col2_1)s, %(col3_1)s), (%(col1_2)s, %(col2_2)s, %(col3_2)s)'] [parameters: {u'col2_2': 'h', u'col2_1': None, u'col2_0': 'b', u'col1_0': 'a', u'col1_1': 'd', u'col1_2': 'g', u'col3_2': 'i', u'col3_0': 'c', u'col3_1': 'f'}]

Any help would be greatly appreciated. Thanks!

Session properties via REST API

Is it currently possible to modify session properties via the REST API? If not, would it be possible to add support for it?

Unsupported mechanism type PLAIN

When connecting to Hive (using Kerberos), this exception is thrown:
thrift.transport.TTransport.TTransportException: Bad status: 3 (Unsupported mechanism type PLAIN)

INSERT INTO TABLE

Insert statements use the syntax

INSERT INTO A (x, y, z)
SELECT B.x, B.y, B.z
FROM B

which, while standard in SQL, lacks the extra TABLE keyword that Hive seems to expect:

INSERT INTO TABLE A (x, y, z)
SELECT B.x, B.y, B.z
FROM B

can't read Hive NULLs to Decimal

There does not seem to be a check for None in HiveDecimal.process_result_value().
If there is a NULL in a Hive column expected as Decimal, it causes TypeError: Cannot convert None to Decimal

Shouldn't this return NaN instead?
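A None check at the top of process_result_value is the usual fix, returning None (SQL NULL) rather than NaN so the value round-trips. A sketch of a patched converter using SQLAlchemy's TypeDecorator; this is an illustration, not the shipped HiveDecimal:

import decimal
from sqlalchemy import types

class NullSafeDecimal(types.TypeDecorator):
    """DECIMAL result converter that passes NULLs through."""
    impl = types.DECIMAL

    def process_result_value(self, value, dialect):
        if value is None:  # Hive NULL -> Python None
            return None
        return decimal.Decimal(value)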

PyHive and Transport mode - HTTP

Our server is configured with hive.server2.transport.mode set to HTTP. When switching to binary, everything works perfectly. Is there a way to make PyHive work with HTTP transport mode?
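Later PyHive versions allow passing a pre-built Thrift transport, which makes HTTP mode workable. A sketch, assuming HiveServer2 listens on the HTTP port (10001 by default) with the /cliservice path; adjust both for your deployment:

from pyhive import hive
import thrift.transport.THttpClient

# for hive.server2.transport.mode=http
transport = thrift.transport.THttpClient.THttpClient(
    'http://hiveserver.example.com:10001/cliservice')
connection = hive.Connection(thrift_transport=transport)
cursor = connection.cursor()
cursor.execute('SELECT 1')
print(cursor.fetchall())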

setup.py install_requires missing dependencies from dev_requirements

First, thanks for making this. It's appreciated. I think it'll help me out a lot.

After running pip install pyhive I got errors about a missing thrift module. I had to install everything in dev_requirements.txt before the example code would work, e.g.:

requests>=1.0.0
sasl>=0.1.3
sqlalchemy>=0.5.8
thrift>=0.8.0
thrift_sasl>=0.1.0

I'm thinking these should be part of setup.py's install_requires, but it seems like maybe I could be doing something wrong. I can submit a PR with the fix if you agree and don't want to do it yourself.

AttributeError: 'Cursor' object has no attribute 'poll'

For some reason the code breaks on the DB-API example:

cursor = hive.connect(host="xx.xxx.xx.xx", port=10000, username="hdfs").cursor()
cursor.execute('SELECT * FROM investor.api LIMIT 10')

status = cursor.poll().operationState
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Cursor' object has no attribute 'poll'

Requirements for pypi package are not complete

Using pip, pyhive installs correctly, but not all dependencies are included.

     ----> 4 from pyhive import hive

/home/user/.virtualenvs/cluster/local/lib/python2.7/site-packages/pyhive/hive.py in <module>()
      8 from __future__ import absolute_import
      9 from __future__ import unicode_literals
---> 10 from TCLIService import TCLIService
     11 from TCLIService import constants
     12 from TCLIService import ttypes

/home/user/.virtualenvs/cluster/local/lib/python2.7/site-packages/TCLIService/TCLIService.py in <module>()

----> 9 from thrift.Thrift import TType, TMessageType, TException, TApplicationException
     10 from ttypes import *
     11 from thrift.Thrift import TProcessor

ImportError: No module named thrift.Thrift

After installing thrift, I get:

----> 4 from pyhive import hive

/home/user/.virtualenvs/cluster/local/lib/python2.7/site-packages/pyhive/hive.py in <module>()
     18 import getpass
     19 import logging
---> 20 import sasl
     21 import sys
     22 import thrift.protocol.TBinaryProtocol

ImportError: No module named sasl

After installing sasl, it then needs:

----> 5 from pyhive import hive

/home/user/.virtualenvs/cluster/local/lib/python2.7/site-packages/pyhive/hive.py in <module>()
     22 import thrift.protocol.TBinaryProtocol
     23 import thrift.transport.TSocket
---> 24 import thrift_sasl
     25 
     26 # PEP 249 module globals

ImportError: No module named thrift_sasl

and finally, after installing thrift_sasl, I can import pyhive.

Pass a job.queue to Hive connection

I'm connecting to a Hive server that uses queues. I can run show databases using PyHive, but when I try to run a SELECT I get the error

TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState='08S01', infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:29:28'

(I'm using Tez on this cluster). I believe that the error is caused by the queue structure. How can I pass a job.queue to the connect object?

For example, on beeline I have to pass the argument

-n hive job.queue.name=myqueue

to execute my jobs.
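The configuration argument shown in the README is the usual way to do this from PyHive; the exact property name depends on the execution engine (tez.queue.name for Tez, mapreduce.job.queuename for MapReduce). A sketch:

from pyhive import hive

cursor = hive.connect(
    'localhost',
    configuration={'tez.queue.name': 'myqueue'},  # or 'mapreduce.job.queuename'
).cursor()
cursor.execute('SELECT * FROM my_table LIMIT 10')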

Is Hive "CREATE TABLE" execution working ?

I'm trying the following code:

from pyhive import hive
cursor = hive.connect(ip).cursor()
cursor.execute(""""CREATE TABLE test_create_pyhive_1 ( a INT)""")

It gives me this kind of error:
ParseException line 1:43 character '<EOF>' not supported here

You can find the full trace here:

Traceback (most recent call last):
  File "/home/jerome/test/manual/test_pyhive_create_table.py", line 13, in <module>
    cursor.execute(""""CREATE TABLE test_create_pyhive_1 ( a INT)""")
  File "/home/jerome/virtenv/lib/python2.7/site-packages/pyhive/hive.py", line 240, in execute
    _check_status(response)
  File "/home/jerome/virtenv/lib/python2.7/site-packages/pyhive/hive.py", line 362, in _check_status
    raise OperationalError(response)
pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(errorCode=40000, errorMessage="Error while compiling statement: FAILED: ParseException line 1:43 character '<EOF>' not supported here", sqlState='42000', infoMessages=
["*org.apache.hive.service.cli.HiveSQLException:Error while compiling statement: FAILED: ParseException line 1:43 character '<EOF>' not supported here:28:27", 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:374', 'org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:136', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:206', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:316', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:425', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:395', 'sun.reflect.GeneratedMethodAccessor44:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:606', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:415', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1693', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy25:executeStatement::-1', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:245', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:506', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:285', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1145', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:615', 'java.lang.Thread:run:Thread.java:745', "*org.apache.hadoop.hive.ql.parse.ParseException:line 1:43 character '<EOF>' not supported here:33:6", 'org.apache.hadoop.hive.ql.parse.ParseDriver:parse:ParseDriver.java:210', 'org.apache.hadoop.hive.ql.parse.ParseDriver:parse:ParseDriver.java:166', 'org.apache.hadoop.hive.ql.Driver:compile:Driver.java:423', 'org.apache.hadoop.hive.ql.Driver:compile:Driver.java:311', 'org.apache.hadoop.hive.ql.Driver:compileInternal:Driver.java:1194', 'org.apache.hadoop.hive.ql.Driver:compileAndRespond:Driver.java:1181', 'org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:134'], statusCode=3), operationHandle=None)

I saw in the code that in the tests "create table" is executed from the Hive shell. So my question is: is it possible to create a table with PyHive?

Row Transfer performance

Hey, this is not so much an issue as a question about expectations. Against our HiveServer I'm able to transfer approximately 4-8k rows per second once the Hive query is complete and we are transferring results out. However, if I write to an HDFS directory and do a hadoop fs -cat through SSH, I can get 500-700k rows per second over the same network connection. This seems like a tremendous overhead. Is it due to Thrift?

support for brewed hive?

Setting up connections using a Homebrew version of Hive fails because TCLIService is not in the location PyHive expects it to be in.

PyHive expects /usr/lib/hive/lib/py; brew's location is /usr/local/Cellar/hive/version_num/libexec/lib/py.

Or would the solution be to just symlink the py directory specifically, instead of symlinking from the version? (The issue is that brew has a libexec folder which seems to be missing from PyHive's expectation.)

Typo in line 299, where row.tab_name is referenced to fetch table_name

When connecting to Hive with SQLAlchemy, an AttributeError is thrown: Could not locate column in row for column 'tab_name'. I discovered this issue by registering PyHive with SQLAlchemy, then using the SQLAlchemy URL hive://hostname:10000/database_name to connect to my Hive database in airbnb/caravel and testing the connection.

INSERT/executemany fails

I am trying to use pandas to insert a batch of data into a Hive table, and it bombs after the first insert.
PyHive seems to try to get a result set after each insert and does not get one, breaking executemany:

File "/usr/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1160, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 571, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 1250, in to_sql
    table.insert(chunksize)
  File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 770, in insert
    self._execute_insert(conn, keys, chunk_iter)
  File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 745, in _execute_insert
    conn.execute(self.insert_statement(), data)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
    context)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
    exc_info
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 200, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1116, in _execute_context
    context)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 447, in do_executemany
    cursor.executemany(statement, parameters)
  File "/usr/anaconda2/lib/python2.7/site-packages/pyhive/common.py", line 84, in executemany
    self._fetch_more()
  File "/usr/anaconda2/lib/python2.7/site-packages/pyhive/hive.py", line 228, in _fetch_more
    raise ProgrammingError("No result set")
sqlalchemy.exc.ProgrammingError: (pyhive.exc.ProgrammingError) No result set [SQL: u'INSERT INTO TABLE

But there is another issue: the way the batch insert is generated is not performant at all. It generates a separate INSERT per row, which causes Hive to create an MR job for each row. Is there a better way to handle a batch insert like this?
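Until that changes, the common workaround is to build one multi-row INSERT ... VALUES statement yourself instead of going through executemany, so Hive launches a single job. A sketch using PyHive's pyformat parameter style; the table and column layout are illustrative:

rows = [('a', 1), ('b', 2), ('c', 3)]

# one placeholder tuple per row, e.g. (%(name_0)s, %(value_0)s), ...
placeholders = ', '.join(
    '(%%(name_%d)s, %%(value_%d)s)' % (i, i) for i in range(len(rows)))
params = {}
for i, (name, value) in enumerate(rows):
    params['name_%d' % i] = name
    params['value_%d' % i] = value

cursor.execute('INSERT INTO TABLE my_table VALUES ' + placeholders, params)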

can't pip install

I can't pip install pyhive on a Windows box:

C:\Users\atrombley>pip install pyhive
Downloading/unpacking pyhive
  Running setup.py (path:c:\users\atromb~1\appdata\local\temp\2\pip_build_atrombley\pyhive\setup.py) egg_info for package pyhive

Downloading/unpacking future (from pyhive)
  Running setup.py (path:c:\users\atromb~1\appdata\local\temp\2\pip_build_atrombley\future\setup.py) egg_info for package future

    warning: no files found matching '*.au' under directory 'tests'
    warning: no files found matching '*.gif' under directory 'tests'
    warning: no files found matching '*.txt' under directory 'tests'

Installing collected packages: pyhive, future
  Running setup.py install for pyhive

    Could not find .egg-info directory in install record for pyhive
Cleaning up...
Exception:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\pip\basecommand.py", line 122, in main
    status = self.run(options, args)
  File "C:\Python27\lib\site-packages\pip\commands\install.py", line 283, in run
    requirement_set.install(install_options, global_options, root=options.root_path)
  File "C:\Python27\lib\site-packages\pip\req.py", line 1435, in install
    requirement.install(install_options, global_options, *args, **kwargs)
  File "C:\Python27\lib\site-packages\pip\req.py", line 749, in install
    os.remove(record_filename)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\users\atromb~1\appdata\local\temp\2\pip-ndmnvn-record\install-record.txt'

Storing debug log for failure in C:\Users\atrombley\pip\pip.log

fetchmany argument ignored?

I am using PyHive and reading from a table with 527,000 rows, which takes quite a long time to read.
In trying to optimize the process, I found the following timings:

fetchmany(1000) takes 4.2s
fetchmany(2000) takes 8.4s
fetchmany(500) takes 4.2s
fetchmany(500) takes 0.02s if directly preceded by the other fetchmany(500)

It seems like the batch size is 1000 regardless of the argument to fetchmany(). Is this the prescribed behavior? Is there an "under the hood" way to change this to optimize batched reads? Is there a way to "prefetch" so that data can be pipelined?

Thanks!
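That matches how the client works: the Hive cursor fetches from the server in chunks of cursor.arraysize (default 1000), and fetchmany merely drains that buffer, which is why a fetchmany(500) right after another one returns almost instantly. Raising arraysize is the usual lever; a sketch:

from pyhive import hive

connection = hive.connect('localhost')
cursor = connection.cursor(arraysize=10000)  # rows per Thrift round trip
cursor.execute('SELECT * FROM big_table')
batch = cursor.fetchmany(10000)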

Hive UDF in Python is forced to use Python2

My team is developing a Hive-based workflow using a Python UDF and running it through PyHive. The UDF is written in Python 3, but it didn't work (it worked fine from the Hive shell). After long hours of debugging, we finally figured out that it is forced to use Python 2 somehow (the shebang line in the Python script was ignored). Any idea why this is the case?

BTW thanks for the great work here!

Thanks!
Keeyong
