crflynn / databricks-dbapi
DBAPI and SQLAlchemy dialect for Databricks Workspace and SQL Analytics clusters
License: MIT License
I use Windows with Python 3.9, so I can't install the sasl package. SASL is extremely hard to compile on Windows, and there are no compiled wheels for Python newer than 3.7.
I spent a few days on this and found out that the hive connection on Azure Databricks DOES NOT USE sasl. I then tried to open a hive connection with SQLAlchemy directly and it worked, but the same code from this package didn't work.
Solution to run a Databricks connection on Windows:
1. Install this package with the --no-deps option so pip skips dependency resolution.
2. Install the dependencies manually (thrift, six, pyhive, etc.). Install pyhive with the --no-deps option as well, because pyhive will try to install sasl.
3. Open the site-packages\PyHive-0.6.4.dist-info\METADATA file and comment out the one line which tells Python that sasl is required, so the result looks like this: <!-- Requires-Dist: sasl (>=0.2.1) ; extra == 'hive' -->
After that the connection can be verified as sketched below.
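A quick connectivity check, assuming the connect signature shown in this package's readme (the host, cluster id, and token below are placeholders):
from databricks_dbapi import databricks
# Placeholders: fill in your workspace host, cluster id, and personal access token.
connection = databricks.connect(
    host="<workspace>.azuredatabricks.net",
    cluster="<cluster-id>",
    token="<personal-access-token>",
)
cursor = connection.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())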
@crflynn You could add this solution to the readme, because the whole Internet points to precompiled sasl packages, which end at Python 3.7.
Hello! Thanks for making this, it has been very helpful, and I have a suggestion for an improvement.
We have found the pyodbc SQLAlchemy dialect quite problematic; e.g., case statements on boolean columns and inserts don't seem to work:
query = session.query(case([(Model.some_boolean_column, 1)], else_=0))
session.add(Model(column="value"))
session.commit()
I believe this is because most of the pyodbc dialect is just inherited from the pyhive library. We want to use Databricks SQL endpoints, so I put together a workaround adding support for SQL endpoints to the pyhive mode: https://github.com/Tom-Newton/databricks-dbapi . It was quite simple to do, and I thought you might want to make a similar change to this primary repo so others can benefit. I just used the Databricks SQL Connector for Python for making the connections; it supports both general-purpose compute clusters and SQL endpoints.
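For reference, a minimal sketch of connecting through the Databricks SQL Connector for Python (the databricks-sql-connector package; the hostname, HTTP path, and token are placeholders):
from databricks import sql
# SQL endpoints expose an HTTP path like /sql/1.0/endpoints/<endpoint-id>.
connection = sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/endpoints/<endpoint-id>",
    access_token="<personal-access-token>",
)
cursor = connection.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
cursor.close()
connection.close()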
Hello,
The module is awesome. Does it support HTTPS without certificate validation? Connecting to a workspace behind a self-signed certificate fails with:
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1091)
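One possible workaround is to hand pyhive a custom transport. This is an untested sketch, assuming pyhive's thrift_transport keyword and the ssl_context parameter of thrift 0.13's THttpClient; the URL and token are placeholders:
import base64
import ssl
from pyhive import hive
from thrift.transport import THttpClient
# An unverified context disables certificate validation entirely; use with care.
context = ssl._create_unverified_context()
transport = THttpClient.THttpClient(
    "https://<workspace-host>:443/sql/protocolv1/o/0/<cluster-id>",
    ssl_context=context,
)
# Databricks HTTP transports authenticate with Basic auth of "token:<PAT>".
auth = base64.standard_b64encode(b"token:<personal-access-token>").decode()
transport.setCustomHeaders({"Authorization": "Basic " + auth})
connection = hive.connect(thrift_transport=transport)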
SQL Analytics has become the most important cluster type for us to connect to from tools like Apache Superset. We are currently able to use this driver to connect our Superset instance to Databricks on both AWS and Azure; however, it doesn't seem to support SQL Analytics clusters yet.
Are there any plans to add support for them? Also, would it make sense to switch to the ODBC drivers?
In other SQLAlchemy dialects, the name and driver name are just strings, even in the hive dialect: https://github.com/dropbox/PyHive/blob/b21c507a24ed2f2b0cf15b0b6abb1c43f31d3ee0/pyhive/sqlalchemy_hive.py#L381
This applies to both the hive and odbc dialects here; these names should be strings instead.
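A minimal sketch of the suggested change (the class name and base here are illustrative, following the pattern in pyhive's sqlalchemy_hive.py):
from pyhive.sqlalchemy_hive import HiveDialect
# Illustrative only: declare the identifiers as plain str, as pyhive does.
class DatabricksPyhiveDialect(HiveDialect):
    name = "databricks"
    driver = "pyhive"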
Thanks for creating this package! We're investigating using it in OpenSAFELY.
This may not be an easy issue to solve, and it certainly won't be high priority for you, but I just wanted to flag that the way this package's dependencies are configured has caused us some problems. Specifically, the dependency on pyhive[hive] pulls in sasl, which we don't need and which is awkward to install because it has no wheel and needs the libsasl2 headers to compile.
To work around this we've had to fork the package and edit the dependencies (opensafely-core#1):
https://github.com/opensafely-core/databricks-dbapi/pull/1/files
It's possible that future updates to pip and pip-tools will give us other ways to work around this issue without needing to fork, but I thought it was worth mentioning in case an easier solution occurs to you.
Hi there!
I'm going through your example of using this with SQLAlchemy. I've got a Databricks cluster spun up, have generated a PAT, and am able to successfully connect. However, whenever I try to do something like
Table("my_table", MetaData(bind=engine), autoload=True)
I get
expected string or bytes-like object
Traceback (most recent call last):
File "/scripts/pipelines/airflow/operators/mysql_to_detabricks_operator.py", line 137, in databricks_table_sa
self._databricks_table, MetaData(bind=self.databricks_sqla), autoload=True
File "<string>", line 2, in __new__
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/deprecations.py", line 139, in warned
return fn(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 559, in __new__
metadata._remove_table(name, schema)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
exc_value, with_traceback=exc_tb,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
raise exception
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 554, in __new__
table._init(name, metadata, *args, **kw)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 648, in _init
resolve_fks=resolve_fks,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 689, in _autoload
_extend_on=_extend_on,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2221, in run_callable
return conn.run_callable(callable_, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1660, in run_callable
return callable_(self, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 467, in reflecttable
table, include_columns, exclude_columns, resolve_fks, **opts
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 666, in reflecttable
table_name, schema, **table.dialect_kwargs
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 392, in get_columns
self.bind, table_name, schema, info_cache=self.info_cache, **kw
File "/usr/local/lib/python3.7/site-packages/pyhive/sqlalchemy_hive.py", line 319, in get_columns
col_type = re.search(r'^\w+', col_type).group(0)
File "/usr/local/lib/python3.7/re.py", line 185, in search
return _compile(pattern, flags).search(string)
Digging a little bit, I've discovered that DESCRIBE my_table returns something like:
whereas this line tells me pyhive expects to break on "# Partition Information". Therefore col_type ends up being None by the time the re.search is run.
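A minimal repro of that failure mode (illustrative only):
import re
# When the DESCRIBE output never hits the row pyhive breaks on, col_type is
# None, and re.search raises TypeError: expected string or bytes-like object.
col_type = None
re.search(r"^\w+", col_type)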
What am I missing? I've inherited this code and I'm new to the whole pyspark/hive/etc. space, so I'm not sure this library is even the culprit. Please let me know what other information I can provide. FWIW, here's my pip freeze:
alabaster==0.7.12
alembic==1.4.2
amqp==2.6.1
ansiwrap==0.8.4
apache-airflow==1.10.9
apispec==1.3.3
appdirs==1.4.3
appnope==0.1.2
argcomplete==1.11.1
arrow==0.14.2
asn1crypto==0.24.0
astroid==2.2.5
async-generator==1.10
attrs==19.3.0
Authlib==0.14.3
autopep8==1.5.4
avro-python3==1.10.0
aws-xray-sdk==0.95
azure-common==1.1.8
azure-nspkg==2.0.0
azure-storage-blob==0.37.1
azure-storage-common==0.37.1
azure-storage-nspkg==2.0.0
Babel==2.8.0
backcall==0.2.0
bcrypt==3.1.7
billiard==3.6.3.0
black==20.8b1
boto==2.49.0
boto3==1.7.84
botocore==1.10.84
bumpversion==0.5.3
cached-property==1.5.1
cachetools==4.1.0
cattrs==0.9.2
celery==4.4.4
certifi==2020.4.5.1
cffi==1.14.0
cfgv==3.2.0
chardet==3.0.4
click==7.1.2
colorama==0.4.3
colorlog==4.0.2
configparser==3.5.3
cookies==2.2.1
coverage==4.5.3
croniter==0.3.32
cryptography==2.8
cycler==0.10.0
databricks-connect==6.5.1
databricks-dbapi==0.3.0
databricks-test==0.0.4
decorator==4.4.2
defusedxml==0.6.0
dill==0.3.1.1
distlib==0.3.1
dnspython==1.16.0
doc8==0.8.1
docker==4.4.1
docopt==0.6.2
docutils==0.16
ecdsa==0.16.1
email-validator==1.1.1
entrypoints==0.3
factory-boy==2.12.0
Faker==1.0.8
filelock==3.0.12
flake8==3.8.4
Flask==1.1.2
Flask-Admin==1.5.4
Flask-AppBuilder==2.3.4
Flask-Babel==1.0.0
Flask-Bcrypt==0.7.1
Flask-Caching==1.3.3
Flask-JWT-Extended==3.24.1
Flask-Login==0.4.1
Flask-OpenID==1.2.5
Flask-SQLAlchemy==2.4.3
flask-swagger==0.2.13
Flask-WTF==0.14.3
flower==0.9.4
freezegun==1.0.0
funcsigs==1.0.2
future==0.16.0
futures==3.1.1
google-api-core==1.14.2
google-api-python-client==1.7.11
google-auth==1.6.3
google-auth-httplib2==0.0.3
google-oauth==1.0.1
googleapis-common-protos==1.51.0
graphviz==0.14
gunicorn==19.10.0
hmsclient==0.1.1
httplib2==0.18.1
humanize==0.5.1
identify==1.5.11
idna==2.8
ijson==3.0.4
imagesize==1.2.0
importlib-metadata==1.7.0
iniconfig==1.0.0
ipykernel==5.1.4
ipython==7.19.0
ipython-genutils==0.2.0
iso8601==0.1.12
isort==4.3.21
itsdangerous==1.1.0
JayDeBeApi==1.2.1
jedi==0.18.0
Jinja2==2.10.3
jmespath==0.10.0
joblib==1.0.0
JPype1==0.7.5
json-merge-patch==0.2
jsondiff==1.1.1
jsonpickle==1.4.2
jsonschema==3.2.0
jupyter-client==6.1.11
jupyter-core==4.7.0
kiwisolver==1.3.1
kombu==4.6.11
lazy-object-proxy==1.4.1
lockfile==0.12.2
m2r==0.2.1
Mako==1.1.3
Markdown==2.6.11
MarkupSafe==1.1.1
marshmallow==2.21.0
marshmallow-enum==1.5.1
marshmallow-sqlalchemy==0.23.1
matplotlib==3.1.2
mccabe==0.6.1
mirakuru==2.3.0
mistune==0.8.4
mock==4.0.3
moto==1.3.4
mypy-extensions==0.4.3
mysqlclient==1.4.4
natsort==7.0.1
nbclient==0.5.1
nbformat==5.0.8
nest-asyncio==1.4.3
nodeenv==1.5.0
numpy==1.18.4
packaging==20.8
pandas==0.25.3
papermill==2.0.0
paramiko==2.7.1
parso==0.8.1
pathspec==0.8.1
pbr==5.5.1
pendulum==1.4.4
pexpect==4.8.0
pickleshare==0.7.5
pluggy==0.13.1
port-for==0.4
pre-commit==2.9.3
prison==0.1.3
prompt-toolkit==3.0.10
protobuf==3.12.2
psutil==5.7.0
ptyprocess==0.7.0
py==1.10.0
py4j==0.10.7
pyaml==20.4.0
pyarrow==0.17.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
PyAthena==1.6.1
pycodestyle==2.6.0
pycparser==2.20
pycryptodome==3.9.9
pycryptodomex==3.9.7
pyflakes==2.2.0
Pygments==2.6.1
PyHive==0.6.2
PyJWT==1.7.1
pykwalify==1.7.0
pylint==2.3.1
pymongo==3.8.0
PyNaCl==1.4.0
pyOpenSSL==19.1.0
pyparsing==2.4.7
pyrsistent==0.16.0
pysftp==0.2.9
pytest==6.2.1
pytest-freezegun==0.4.2
pytest-mongo==1.2.1
pytest-mysql==2.0.1
pytest-sftpserver==1.3.0
python-daemon==2.1.2
python-dateutil==2.8.1
python-decouple==3.4
python-editor==1.0.4
python-jose==2.0.2
python3-openid==3.1.0
pytz==2020.1
pytzdata==2019.3
PyYAML==5.1.1
pyzmq==20.0.0
redis==3.5.3
regex==2020.11.13
requests==2.23.0
requests-mock==1.7.0
responses==0.12.1
restructuredtext-lint==1.3.2
rsa==4.0
s3transfer==0.1.13
sasl==0.2.1
scikit-learn==0.22.1
scipy==1.6.0
setproctitle==1.1.10
simple-salesforce==1.10.1
six==1.15.0
slackclient==1.3.0
snowballstemmer==2.0.0
snowflake-connector-python==2.0.0
Sphinx==3.3.1
sphinx-rtd-theme==0.5.0
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-confluencebuilder==1.3.0
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
SQLAlchemy==1.3.17
SQLAlchemy-JSONField==0.9.0
SQLAlchemy-Utils==0.36.6
sqlparse==0.3.0
sshtunnel==0.1.5
stevedore==3.3.0
tabulate==0.8.7
tenacity==4.12.0
termcolor==1.1.0
text-unidecode==1.2
textwrap3==0.9.2
thrift==0.13.0
thrift-sasl==0.4.2
toml==0.10.1
tornado==5.1.1
tqdm==4.55.1
traitlets==5.0.5
typed-ast==1.4.0
typing-extensions==3.7.4.2
tzlocal==1.5.1
unicodecsv==0.14.1
uritemplate==3.0.1
urllib3==1.25.10
vine==1.3.0
virtualenv==20.2.2
wcwidth==0.2.5
websocket-client==0.54.0
Werkzeug==0.16.0
wrapt==1.11.1
WTForms==2.2.1
xmltodict==0.12.0
zdesk==2.7.1
zipp==3.1.0
zope.deprecation==4.4.0
Hi there,
I am getting the following error when trying to install databricks-dbapi on my Databricks Cluster on Azure:
List(/databricks/python/bin/pip, install, databricks-dbapi[sqlalchemy]==0.5.0, --disable-pip-version-check) exited with code 1. ERROR: Command errored out with exit status 1:
In file included from sasl/saslwrapper.cpp:254:0:
sasl/saslwrapper.h:22:10: fatal error: sasl/sasl.h: No such file or directory
#include <sasl/sasl.h>
^~~~~~~~~~~~~
compilation terminated.
It seems the sasl library headers are missing; any idea how this can be solved? I tried Databricks runtime versions 7.5 and 8.0 Beta, and the problem is the same for both.
Hello.
I have been trying to install databricks-dbapi, to no avail. I tried installing libsasl2 as you recommended here, but there continues to be a barrage of errors and warnings (please see ErrorLog.txt).
Can you please assist? Thank you.
macOS version 11.3.1
Python version 3.9.5