
pydbtools's Introduction

pydbtools

A package that is used to run SQL queries specifically configured for the Analytical Platform. This package uses AWS Wrangler's Athena module but adds additional functionality (like Jinja templating and creating temporary tables) and alters some configuration to our specification.

Installation

Requires a pip release above 20.

## To install from pypi
pip install pydbtools

## Or install from git with a specific release
pip install "pydbtools @ git+https://github.com/moj-analytical-services/[email protected]"

Quickstart guide

The examples directory contains more detailed notebooks demonstrating the use of this library, many of which are borrowed from the mojap-aws-tools-demo repo.

Read an SQL Athena query into a pandas dataframe

import pydbtools as pydb
df = pydb.read_sql_query("SELECT * from a_database.table LIMIT 10")

Run a query in Athena

response = pydb.start_query_execution_and_wait("CREATE DATABASE IF NOT EXISTS my_test_database")

Create a temporary table to do further separate SQL queries on later

pydb.create_temp_table("SELECT a_col, count(*) as n FROM a_database.table GROUP BY a_col", table_name="temp_table_1")
df = pydb.read_sql_query("SELECT * FROM __temp__.temp_table_1 WHERE n < 10")

pydb.dataframe_to_temp_table(my_dataframe, "my_table")
df = pydb.read_sql_query("select * from __temp__.my_table where year = 2022")

Notes

  • Amazon Athena uses a flavour of SQL called Presto; docs can be found here
  • To query a date column in Athena you need to specify that your value is a date e.g. SELECT * FROM db.table WHERE date_col > date '2018-12-31'
  • To query a datetime or timestamp column in Athena you need to specify that your value is a timestamp e.g. SELECT * FROM db.table WHERE datetime_col > timestamp '2018-12-31 23:59:59'
  • Note the date and datetime formatting used above. See more specifics around dates and datetimes here
  • To specify a string in the SQL query, always use '' not "". Using "" means that you are referencing a database, table or column, etc.
  • If you are working in an environment where you cannot change the default AWS region environment variables, you can set AWS_ATHENA_QUERY_REGION, which will override them (a minimal example follows below)
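A minimal sketch of that override (the region value here is only a placeholder; set it before importing/querying so it is picked up):

import os

# Set the override before using pydbtools; "eu-west-1" is just an example value.
os.environ["AWS_ATHENA_QUERY_REGION"] = "eu-west-1"

import pydbtools as pydb

df = pydb.read_sql_query("SELECT * FROM a_database.table LIMIT 10")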

See changelog for release changes.

pydbtools's People

Contributors

mratford, isichei, thomas-hirsch, gwionap, gustavmoller, mralecjohnson, robinl, nicholsondarius, davidbridgwood, s-ducks


pydbtools's Issues

Support for Python 3.11

There's no reason for this library not to be available for Python 3.11 now that awswrangler supports it - except that arrow_pd_parser has a similar constraint.

CCDE-332: Create ability to drop glue table and contents in one function

pydbtools should have a function to drop a table and its S3 contents; you can do this in aws-wrangler.

I wrote a basic version just in a random script for my current project but it would be good to be properly in pydbtools.

import awswrangler as wr

def big_drop(table, db_name):
    # Find the table's S3 location, delete the objects, then drop the Glue table.
    path = wr.catalog.get_table_location(database=db_name, table=table)
    wr.s3.delete_objects(path)
    wr.catalog.delete_table_if_exists(database=db_name, table=table)

Would also be good to…

  • Add a function to drop a partition rather than a table. I think aws-wrangler does the same thing as above but just for a specific partition within a table.

  • Add a function to drop a whole database. This basically gets all the table names from a database, iterates through them calling the delete table function as above, then finally deletes the Glue database (a sketch follows below).
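A rough sketch of the database-level version (not an existing pydbtools function; it assumes the same awswrangler calls as above, so treat it as a starting point rather than a finished implementation):

import awswrangler as wr

def big_drop_database(db_name):
    # Hypothetical helper: drop every table in a Glue database along with
    # its S3 contents, then drop the database itself.
    for table in wr.catalog.get_tables(database=db_name):
        table_name = table["Name"]
        path = wr.catalog.get_table_location(database=db_name, table=table_name)
        wr.s3.delete_objects(path)
        wr.catalog.delete_table_if_exists(database=db_name, table=table_name)
    wr.catalog.delete_database(name=db_name)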

CCDE-293 pydbtools needs to be updated to latest awswrangler

There was an error where aws wrangler would not convert parquet dates correctly. This has been fixed in AWS Data Wrangler 2.12.0 (PR that fixed it for context) so we need to point pydbtools to at least this release.

To check that it functions correctly, run the workaround that I posted in the PR:

import awswrangler as wr
df = wr.athena.read_sql_query(
    "SELECT mojap_end_datetime xhibit_v1.cases where mojap_latest_record LIMIT 10",
    database="database",
    pyarrow_additional_kwargs={"coerce_int96_timestamp_unit": "ms", "timestamp_as_object": True}
)
df.head()

You can use any SQL query that has a mojap_end_datetime column, as the default value is 2999-01-01 00:00:00 for latest records. The query should give back correct timestamps (2999-01-01 00:00:00).

If it works you will need to update the pyproject.toml to the relevant awswrangler dependency.

Be able to add parquet files to temp_table

Something like:

import pydbtools as pydb

pydb.add_table_to_temp_db(s3_path="s3://path-to-parquet-file/", table_name="table1")

Code already doing something similar

This might be possible already (I think this should exist, but adding an example in the short term if you want to do this with the v3 release). Need to check if this works, but I think it should.

import awswrangler as wr
import pydbtools as pydb

# Get the user's temporary database and its S3 location
user_id, out_path = pydb.get_user_id_and_table_dir()
temp_db_name = pydb.get_database_name_from_userid(user_id)

# Register the existing parquet files as a table in the temporary database
columns_types, partitions_types, partitions_values = wr.s3.store_parquet_metadata(
    path="s3://path-to-parquet-file/",
    database=temp_db_name,
    table="table1",
    dataset=True
)

Protect the master branch

I inadvertently pushed a change to the master branch, now reverted - it was only black formatting, but it probably shouldn't be possible.

pydbtools doesn't currently support ARRAY or STRUCT type

This is because Athena's default output is CSV, which doesn't really encode that 🤷‍♂

This will be fixed in a future update, but we probably need to check for R compatibility, i.e. how/if it encodes lists and/or vectors in its Dataframe/Tibble/data.table.

If it doesn't, it's not the end of the world - just throw an error saying "why are you using R?"

CREATE TEMP TABLE

We should add this functionality:

import pydbtools as pydb

pydb.create_temp_table(sql, table_name)

The above function would wrap their sql into:

f"""
CREATE TABLE user_name.{table_name} AS
WITH properties (
s3_location, <Same location as where we dump user tables>
format='parquet'
)
{sql}
"""

Thoughts on the above:

  • Database name collisions.
    • When this package runs a query it uses the boto3 STS client to get the IAM role. That should probably be used to create the database username, which would easily avoid database collisions.
    • With the above, would you still expect users to specify the database name every time they want to reference a temporary table? If so, you could create your own string naming like FROM #temp.table1, where pydbtools replaces #temp with the actual database name (see the sketch after this list).
  • You should add context management to this package so that when you exit, pydbtools will delete the data and the database from the Glue catalogue.
  • We are going to need a retention policy on temp tables in the Glue catalogue. This is interesting (but definitely solvable); we should add this as an issue to Trello once this functionality is added.
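As a rough illustration of the placeholder idea (pydbtools ended up using __temp__ rather than #temp; the helper below is only a sketch, not the package's implementation):

def replace_temp_prefix(sql, temp_db_name):
    # Sketch only: swap the __temp__ placeholder for the user's actual
    # temporary database, e.g. "__temp__.table1" -> "mojap_de_temp_xyz.table1".
    return sql.replace("__temp__.", f"{temp_db_name}.")

replace_temp_prefix("SELECT * FROM __temp__.temp_table_1", "mojap_de_temp_xyz")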

Single line (--) comments cannot be parsed by get_sql_from_file

This could just be me, but using pydbtools 4.0.1 (the current release at the time of writing), queries read in using get_sql_from_file fail to parse if they include single-line -- comments:

import pydbtools as pydb

with open('test.sql', 'w') as f:
    f.write('--\nselect 1 AS one')
    
qry = pydb.get_sql_from_file("test.sql")
response = pydb.start_query_execution_and_wait(qry)
print(response.get("Status"))
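One possible interim workaround (a sketch only, not current pydbtools behaviour) is to strip the comments before handing the query to pydbtools, for example with sqlparse:

import sqlparse
import pydbtools as pydb

# Read the file directly and strip single-line (--) comments before execution.
with open("test.sql") as f:
    raw_sql = f.read()

clean_sql = sqlparse.format(raw_sql, strip_comments=True).strip()
response = pydb.start_query_execution_and_wait(clean_sql)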

CCDE-295: Allow CREATE TEMP TABLE query to be run from a SQL file

At the moment pydbtools has a create_temp_table method. It would be useful to allow users to do this via an SQL file.

CREATE TEMPORARY TABLE <table name> AS (
  <normal SQL query>
)

Would then trigger:

pydb.create_temp_table(sql="<normal SQL query>", table_name = "<table name>")
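A rough sketch of how that dispatch might work (regex-based and purely illustrative, not the eventual implementation; run_sql_from_file is a hypothetical name):

import re
import pydbtools as pydb

def run_sql_from_file(filepath):
    # Illustrative only: detect a CREATE TEMPORARY TABLE wrapper and route it
    # to create_temp_table, otherwise run the query as normal.
    sql = pydb.get_sql_from_file(filepath)
    match = re.match(
        r"\s*CREATE\s+TEMPORARY\s+TABLE\s+(\w+)\s+AS\s*\((.*)\)\s*$",
        sql,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if match:
        table_name, inner_sql = match.group(1), match.group(2)
        pydb.create_temp_table(sql=inner_sql, table_name=table_name)
    else:
        pydb.start_query_execution_and_wait(sql)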

CCDE-300: Add CRON / AIRFLOW job to remove temp databases

On our platform, when pydbtools creates a temp database it prefixes the name of the glue schema as mojap_de_temp_<ts>. Currently these are not cleaned up.

You should write a Python script using AWS Wrangler that gets the database names and filters them by the prefix mojap_de_temp. Any database whose timestamp is more than 24 hours older than the current script run timestamp is then deleted. Put this script on a daily run, at say 4am, on Airflow.
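A rough sketch of what that script might look like (it assumes the <ts> suffix is a unix timestamp in seconds; adjust the parsing and any S3 clean-up to whatever is actually used):

import time
import awswrangler as wr

PREFIX = "mojap_de_temp_"
MAX_AGE_SECONDS = 24 * 60 * 60

now = time.time()
databases = wr.catalog.databases(limit=10000)  # DataFrame of Glue databases

for db_name in databases["Database"]:
    if not db_name.startswith(PREFIX):
        continue
    # Assumption: the suffix after the prefix is a unix timestamp in seconds.
    try:
        created_ts = float(db_name[len(PREFIX):])
    except ValueError:
        continue
    if now - created_ts > MAX_AGE_SECONDS:
        # Note: this only removes the Glue database; any underlying S3 data
        # would need deleting separately.
        wr.catalog.delete_database(name=db_name)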

pandas column-type overwrite

Allow option to overwrite standard conversion e.g.

{
  "int": "character",
  "long": "character"
}

Should also enable specific column type overwrite:

{
  "col2": "int"
}

☝️ First thinking is to always use agnostic metadata types to make it easier for the user to understand. However, there is probably more flexibility in pandas datatypes, so maybe there should be the option for both.
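A sketch of what the overrides could look like today with plain pandas casts (the column names are hypothetical; this is not an existing pydbtools option):

import pydbtools as pydb

df = pydb.read_sql_query("SELECT col1, col2 FROM a_database.table LIMIT 10")

# Hypothetical overrides applied after the read, until/unless pydbtools
# grows a native column-type override option.
column_overrides = {"col1": "string", "col2": "int64"}
df = df.astype(column_overrides)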

CCDE-294: need to provide database name to updated `read_sql_query` method

In the latest awswrangler package you need to provide a database in your queries instead of None (what we used to do).

To work around this, you will need to use the existing SQL parser functionality in the package to get a database name and then pass that database name into awswrangler if the user provides database=None (if they specify the database value then the package doesn't need to do anything).
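A rough sketch of that workaround (the regex here is illustrative only; the package's real SQL parsing may differ):

import re
import awswrangler as wr

def read_sql_query(sql, database=None, **kwargs):
    # Sketch: if the caller doesn't name a database, pull the first
    # "database.table" reference out of the query and use that database.
    if database is None:
        match = re.search(r"\b(?:FROM|JOIN)\s+(\w+)\.\w+", sql, flags=re.IGNORECASE)
        if match is None:
            raise ValueError("Could not infer a database from the query")
        database = match.group(1)
    return wr.athena.read_sql_query(sql, database=database, **kwargs)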

Data types and speed

  • Need job that just does term frequency adjustment
  • Save out two join tables so we know how big they are
  • See whether they are broadcasting and whether we should hint the broadcast
  • How do we make the small tables shuffle to the big tables?

Data types and speed

The current implementation saves results out to CSV in S3 (Athena's default behaviour) and then reads them in from S3.

However, it is possible to save results out to parquet using a CREATE TABLE AS statement.

This has two benefits:

  • Read speeds of parquet from S3 are a lot faster (double or more) than reading CSV
  • The resultant dataframe in pandas is guaranteed to have the right data types.

One potential issue with this approach is that the user must submit a select statement (not e.g. a delete table statement). So, if we're worried about this, we would need to somehow parse the sql statement to make sure it's a select statement.

I previously had a very rough go at this here, which does work in most situations, but it's very rough and ready.
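For illustration, a check along those lines could use sqlparse (a sketch, not the rough version linked above):

import sqlparse

def is_select_statement(sql):
    # Parse the query and confirm every statement is a SELECT before
    # routing it through the CTAS/parquet path.
    statements = sqlparse.parse(sql)
    return bool(statements) and all(s.get_type() == "SELECT" for s in statements)

is_select_statement("SELECT * FROM db.table LIMIT 10")  # True
is_select_statement("DROP TABLE db.table")              # False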

Once we've done this, we should probably deprecate the python_athena_tools repo.

STRING TYPE not supported

Relates to #3

SHOW TABLES FROM database

Returns an Athena datatype of string. Also, the output file is a .txt file.

pydbtools should recognise text outputs (instead of CSVs) and read them in appropriately.

CCDE-388: Allow users to provide/edit utils

In the AP move to EKS we need to provide a different way to set temp_database_name/location; see the changes here.

This hotfix branch should be the default in the next release.

At the moment a lot of the utils are hardcoded or cannot be changed, so others cannot change how these things are defined. I think it would be good to allow users to update/change utils at runtime of pydbtools. Have a look at how aws-wrangler changes its settings (config): they essentially have a settings class in a private module, create an object of that class, and expose that in a public module as the utils/settings. Then you can just edit the values/functions of that utils object.

Docs

Code
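A minimal sketch of that pattern (module and attribute names here are hypothetical, not pydbtools' current code):

# _settings.py (private module)
class Settings:
    def __init__(self):
        self.temp_database_name_prefix = "mojap_de_temp_"
        self.aws_region = "eu-west-1"
        self.ctas_approach = True

# utils.py (public module) exposes a single instance that users can edit
# at runtime, mirroring how aws-wrangler's config object works.
settings = Settings()

# User code would then do something like:
# import pydbtools as pydb
# pydb.utils.settings.temp_database_name_prefix = "my_team_temp_"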

Query caching issues - query returning results different to Athena GUI

I've run into an issue a few times where I run a query using read_sql_query and get completely different results from the Athena GUI.

Restarting kernel and re-running fixes it.

I think this is likely to be a result of caching behaviour - possibly from awswrangler.

I haven't yet been able to create a fully reproducible example. The query I was running this morning was:

sql = """
select *
from data_linking_temp.nomis_csv_qa
limit 10
"""

df = pydb.read_sql_query(sql)

Note: This isn't simply a result of different rows being returned (which might be expected due to the distributed nature of the query). The query was returning the wrong columns and the wrong number of rows

FileNotFoundError when using s3fs >= 0.3.0

Currently s3fs version >= 0.3.0 is causing issues (see pandas and s3fs related issues).

Note to reproduce each error restart your kernel before each run.

Error 1:

import pydbtools as pydb

 # Works
df1 = pydb.read_sql("SELECT * FROM db.table limit 10")

# Errors (same traceback as issues referenced above)
df2 = pydb.read_sql("SELECT * FROM db.table limit 10") 

Error 2:

from gluejobutils import s3
import pandas as pd

no_file = "s3://alpha-everyone/does_not_exist.csv"
is_file = "s3://alpha-everyone/iris.csv"

pd.read_csv(no_file) # Same error as above

Error 3:

from gluejobutils import s3
import pandas as pd

no_file = "s3://alpha-everyone/does_not_exist.csv"
is_file = "s3://alpha-everyone/iris.csv"

df1 = pd.read_csv(is_file) # works as expected
s3.copy_s3_object(is_file, no_file)

df2 = pd.read_csv(no_file) # same error

Issue seems to be that s3fs is caching the list of objects in the bucket. So the first call works fine as there is no cache. The second call to the same bucket refers to the cache and sees that there is no file (as it was created post caching).

I think the interim solution for now is to get pydbtools to set up its own s3fs.S3FileSystem and pass that to pandas.read_csv so that we can force clearing the cache in s3fs (i.e. copy this).

Currently using s3fs 0.3.4, pandas 0.25., pydbtools 1.0.2, gluejobutils 3.0.0
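A rough sketch of that interim fix (illustrative only; behaviour may differ across s3fs versions):

import pandas as pd
import s3fs

def read_csv_no_cache(s3_path, **kwargs):
    # Build a fresh filesystem and clear any cached directory listings so
    # recently written objects are visible to pandas.
    fs = s3fs.S3FileSystem()
    fs.invalidate_cache()
    with fs.open(s3_path, "rb") as f:
        return pd.read_csv(f, **kwargs)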

document iam policy requirements

pydbtools (and indeed dbtools) outputs to the bucket alpha-athena-query-dump. While I think it will mostly be used interactively, for those who want to use it in an Airflow task, it's worth adding a note in the readme that the bucket will need to be added to the IAM policy in order for it to work.

DECIMAL presto type not supported

import pydbtools as pydb

sql = """
SELECT CAST(0 AS DECIMAL(1,1)) as test
"""
out = pydb.read_sql(sql)

=== Error ===

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-1-f290daede81f> in <module>
      4 SELECT CAST(0 AS DECIMAL(1,1)) as test
      5 """
----> 6 out = pydb.read_sql(sql)

~/.local/lib/python3.6/site-packages/pydbtools/read_sql.py in read_sql(sql_query, timeout, *args, **kwargs)
     17 
     18     # Read in the SQL query
---> 19     dtype, parse_dates = _pd_dtype_dict_from_metadata(response['meta'])
     20     s3_path = response['s3_path'].replace('s3://', 's3a://')
     21     df = pd.read_csv(s3_path, dtype=dtype, parse_dates=parse_dates, *args, **kwargs)

~/.local/lib/python3.6/site-packages/pydbtools/utils.py in _pd_dtype_dict_from_metadata(athena_meta)
     57     for c in athena_meta:
     58         colname = c["name"]
---> 59         coltype = _athena_meta_conversions[c["type"]]['pandas']
     60         dtype[colname] = np.typeDict[coltype]
     61         if c["type"] in ["date", "timestamp"]:

KeyError: 'decimal'
