
pydbtools's Introduction

pydbtools

A package that is used to run SQL queries specifically configured for the Analytical Platform. This package uses AWS Wrangler's Athena module but adds additional functionality (like Jinja templating and creating temporary tables) and alters some configuration to our specification.

Installation

Requires a pip release above 20.

## To install from pypi
pip install pydbtools

## Or install from git with a specific release
pip install "pydbtools @ git+https://github.com/moj-analytical-services/[email protected]"

Quickstart guide

The examples directory contains more detailed notebooks demonstrating the use of this library, many of which are borrowed from the mojap-aws-tools-demo repo.

Read an SQL Athena query into a pandas dataframe

import pydbtools as pydb
df = pydb.read_sql_query("SELECT * from a_database.table LIMIT 10")

Run a query in Athena

response = pydb.start_query_execution_and_wait("CREATE DATABASE IF NOT EXISTS my_test_database")

Create a temporary table to do further separate SQL queries on later

pydb.create_temp_table("SELECT a_col, count(*) as n FROM a_database.table GROUP BY a_col", table_name="temp_table_1")
df = pydb.read_sql_query("SELECT * FROM __temp__.temp_table_1 WHERE n < 10")

pydb.dataframe_to_temp_table(my_dataframe, "my_table")
df = pydb.read_sql_query("select * from __temp__.my_table where year = 2022")

Notes

  • Amazon Athena uses a flavour of SQL called Presto; docs can be found here
  • To query a date column in Athena you need to specify that your value is a date e.g. SELECT * FROM db.table WHERE date_col > date '2018-12-31'
  • To query a datetime or timestamp column in Athena you need to specify that your value is a timestamp e.g. SELECT * FROM db.table WHERE datetime_col > timestamp '2018-12-31 23:59:59'
  • Note the date and datetime formatting used above. See more specifics around dates and datetimes here
  • To specify a string in the SQL query, always use '' not "". Using "" means that you are referencing a database, table or column, etc.
  • If you are working in an environment where you cannot change the default AWS region environment variables, you can set AWS_ATHENA_QUERY_REGION, which will override them (a minimal example follows below)
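A minimal sketch of that override (the region value here is only a placeholder; set it before importing/querying so it is picked up):

import os

# Set the override before using pydbtools; "eu-west-1" is just an example value.
os.environ["AWS_ATHENA_QUERY_REGION"] = "eu-west-1"

import pydbtools as pydb

df = pydb.read_sql_query("SELECT * FROM a_database.table LIMIT 10")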

See changelog for release changes.

pydbtools's People

Contributors

mratford, isichei, thomas-hirsch, gwionap, gustavmoller, mralecjohnson, robinl, nicholsondarius, davidbridgwood, s-ducks


pydbtools's Issues

Support for Python 3.11

There's no reason for this library not to be available for Python 3.11 now that awswrangler supports it - except that arrow_pd_parser has a similar constraint.

CCDE-332: Create ability to drop glue table and contents in one function

pydbtools should have a function to drop a table and its S3 contents; you can do this in aws-wrangler.

I wrote a basic version just in a random script for my current project but it would be good to be properly in pydbtools.

import awswrangler as wr

def big_drop(table, db_name):
    # Find the table's S3 location, delete the objects, then drop the Glue table.
    path = wr.catalog.get_table_location(database=db_name, table=table)
    wr.s3.delete_objects(path)
    wr.catalog.delete_table_if_exists(database=db_name, table=table)

Would also be good to…

  • Add a function to drop a partition rather than a table. I think aws-wrangler does the same thing as above but just for a specific partition within a table.

  • Add a function to drop a whole database. This basically gets all the table names from a database, iterates through them calling the delete table function as above, then finally deletes the Glue database (a sketch follows below).
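A rough sketch of the database-level version (not an existing pydbtools function; it assumes the same awswrangler calls as above, so treat it as a starting point rather than a finished implementation):

import awswrangler as wr

def big_drop_database(db_name):
    # Hypothetical helper: drop every table in a Glue database along with
    # its S3 contents, then drop the database itself.
    for table in wr.catalog.get_tables(database=db_name):
        table_name = table["Name"]
        path = wr.catalog.get_table_location(database=db_name, table=table_name)
        wr.s3.delete_objects(path)
        wr.catalog.delete_table_if_exists(database=db_name, table=table_name)
    wr.catalog.delete_database(name=db_name)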

CCDE-293 pydbtools needs to be updated to latest awswrangler

There was an error where aws wrangler would not convert parquet dates correctly. This has been fixed in AWS Data Wrangler 2.12.0 (PR that fixed it for context) so we need to point pydbtools to at least this release.

To check that it functions correctly, run the workaround that I posted in the PR:

import awswrangler as wr
df = wr.athena.read_sql_query(
    "SELECT mojap_end_datetime xhibit_v1.cases where mojap_latest_record LIMIT 10",
    database="database",
    pyarrow_additional_kwargs={"coerce_int96_timestamp_unit": "ms", "timestamp_as_object": True}
)
df.head()

You can use any SQL query that has a mojap_end_datetime column, as the default value is 2999-01-01 00:00:00 for latest records. The query should give back correct timestamps (2999-01-01 00:00:00).

If it works you will need to update the pyproject.toml to the relevant awswrangler dependency.

Be able to add parquet files to temp_table

Something like:

import pydbtools as pydb

pydb.add_table_to_temp_db(s3_path="s3://path-to-parquet-file/", table_name="table1")

Code already doing something similar

This might be possible already (I think this should exist, but adding an example in the short term if you want to do this with the v3 release). Need to check if this works, but I think it should.

import awswrangler as wr
import pydbtools as pydb

# Get the user's temporary database and its S3 location
user_id, out_path = pydb.get_user_id_and_table_dir()
temp_db_name = pydb.get_database_name_from_userid(user_id)

# Register the existing parquet files as a table in the temporary database
columns_types, partitions_types, partitions_values = wr.s3.store_parquet_metadata(
    path="s3://path-to-parquet-file/",
    database=temp_db_name,
    table="table1",
    dataset=True
)

Protect the master branch

I inadvertently pushed a change to the master branch, now reverted - it was only black formatting, but it probably shouldn't be possible.

pydbtools doesn't currently support ARRAY or STRUCT type

This is because Athena's default output is CSV, which doesn't really encode that 🤷‍♂

This will be fixed in a future update, but we probably need to check for R compatibility, i.e. how/if it encodes lists and/or vectors in its Dataframe/Tibble/data.table.

If it doesn't, it's not the end of the world - just throw an error saying "why are you using R?"

CREATE TEMP TABLE

We should add this functionality:

import pydbtools as pydb

pydb.create_temp_table(sql, table_name)

The above function would wrap their sql into:

f"""
CREATE TABLE user_name.{table_name} AS
WITH properties (
s3_location, <Same location as where we dump user tables>
format='parquet'
)
{sql}
"""

Thoughts on the above:

  • Database name collisions.
    • When this package runs a query it uses the boto3 STS client to get the IAM role. That should probably be used to create the database username, which would easily avoid database collisions.
    • With the above, would you still expect users to specify the database name every time they want to reference a temporary table? If so, you could create your own string naming like FROM #temp.table1, where pydbtools replaces #temp with the actual database name (see the sketch after this list).
  • You should add context management to this package so that when you exit, pydbtools will delete the data and the database from the Glue catalogue.
  • We are going to need a retention policy on temp tables in the Glue catalogue. This is interesting (but definitely solvable); we should add this as an issue to Trello once this functionality is added.
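As a rough illustration of the placeholder idea (pydbtools ended up using __temp__ rather than #temp; the helper below is only a sketch, not the package's implementation):

def replace_temp_prefix(sql, temp_db_name):
    # Sketch only: swap the __temp__ placeholder for the user's actual
    # temporary database, e.g. "__temp__.table1" -> "mojap_de_temp_xyz.table1".
    return sql.replace("__temp__.", f"{temp_db_name}.")

replace_temp_prefix("SELECT * FROM __temp__.temp_table_1", "mojap_de_temp_xyz")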

Single line (--) comments cannot be parsed by get_sql_from_file

This could just be me, but using pydbtools 4.0.1 (the current release at the time of writing), queries read in using get_sql_from_file fail to parse if they include single-line -- comments:

import pydbtools as pydb

with open('test.sql', 'w') as f:
    f.write('--\nselect 1 AS one')
    
qry = pydb.get_sql_from_file("test.sql")
response = pydb.start_query_execution_and_wait(qry)
print(response.get("Status"))
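One possible interim workaround (a sketch only, not current pydbtools behaviour) is to strip the comments before handing the query to pydbtools, for example with sqlparse:

import sqlparse
import pydbtools as pydb

# Read the file directly and strip single-line (--) comments before execution.
with open("test.sql") as f:
    raw_sql = f.read()

clean_sql = sqlparse.format(raw_sql, strip_comments=True).strip()
response = pydb.start_query_execution_and_wait(clean_sql)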

CCDE-295: Allow CREATE TEMP TABLE query to be run from a SQL file

At the moment pydbtools has a create_temp_table method. It would be useful to allow users to do this via an SQL file.

CREATE TEMPORARY TABLE <table name> AS (
  <normal SQL query>
)

Would then trigger:

pydb.create_temp_table(sql="<normal SQL query>", table_name = "<table name>")
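A rough sketch of how that dispatch might work (regex-based and purely illustrative, not the eventual implementation; run_sql_from_file is a hypothetical name):

import re
import pydbtools as pydb

def run_sql_from_file(filepath):
    # Illustrative only: detect a CREATE TEMPORARY TABLE wrapper and route it
    # to create_temp_table, otherwise run the query as normal.
    sql = pydb.get_sql_from_file(filepath)
    match = re.match(
        r"\s*CREATE\s+TEMPORARY\s+TABLE\s+(\w+)\s+AS\s*\((.*)\)\s*$",
        sql,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if match:
        table_name, inner_sql = match.group(1), match.group(2)
        pydb.create_temp_table(sql=inner_sql, table_name=table_name)
    else:
        pydb.start_query_execution_and_wait(sql)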

CCDE-300: Add CRON / AIRFLOW job to remove temp databases

On our platform, when pydbtools creates a temp database it prefixes the name of the glue schema as mojap_de_temp_<ts>. Currently these are not cleaned up.

You should write a Python script using AWS Wrangler that gets the database names and filters them by the prefix mojap_de_temp. Any database whose timestamp is more than 24 hours older than the current script run timestamp is then deleted. Put this script on a daily run, at say 4am, on Airflow.
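A rough sketch of what that script might look like (it assumes the <ts> suffix is a unix timestamp in seconds; adjust the parsing and any S3 clean-up to whatever is actually used):

import time
import awswrangler as wr

PREFIX = "mojap_de_temp_"
MAX_AGE_SECONDS = 24 * 60 * 60

now = time.time()
databases = wr.catalog.databases(limit=10000)  # DataFrame of Glue databases

for db_name in databases["Database"]:
    if not db_name.startswith(PREFIX):
        continue
    # Assumption: the suffix after the prefix is a unix timestamp in seconds.
    try:
        created_ts = float(db_name[len(PREFIX):])
    except ValueError:
        continue
    if now - created_ts > MAX_AGE_SECONDS:
        # Note: this only removes the Glue database; any underlying S3 data
        # would need deleting separately.
        wr.catalog.delete_database(name=db_name)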

pandas column-type overwrite

Allow option to overwrite standard conversion e.g.

{
  "int": "character",
  "long": "character"
}

Should also enable specific column type overwrite:

{
  "col2": "int"
}

☝️ First thinking is to always use agnostic metadata types to make it easier for the user to understand. However, there is probably more flexibility in pandas datatypes, so maybe there should be the option for both.
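A sketch of what the overrides could look like today with plain pandas casts (the column names are hypothetical; this is not an existing pydbtools option):

import pydbtools as pydb

df = pydb.read_sql_query("SELECT col1, col2 FROM a_database.table LIMIT 10")

# Hypothetical overrides applied after the read, until/unless pydbtools
# grows a native column-type override option.
column_overrides = {"col1": "string", "col2": "int64"}
df = df.astype(column_overrides)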

CCDE-294: need to provide database name to updated `read_sql_query` method

In the latest awswrangler package you need to provide a database in your queries instead of None (what we used to do).

To work around this, you will need to use the existing SQL parser functionality in the package to get a database name and then pass that database name into awswrangler if the user provides database=None (if they specify the database value then the package doesn't need to do anything).
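A rough sketch of that workaround (the regex here is illustrative only; the package's real SQL parsing may differ):

import re
import awswrangler as wr

def read_sql_query(sql, database=None, **kwargs):
    # Sketch: if the caller doesn't name a database, pull the first
    # "database.table" reference out of the query and use that database.
    if database is None:
        match = re.search(r"\b(?:FROM|JOIN)\s+(\w+)\.\w+", sql, flags=re.IGNORECASE)
        if match is None:
            raise ValueError("Could not infer a database from the query")
        database = match.group(1)
    return wr.athena.read_sql_query(sql, database=database, **kwargs)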

Data types and speed

  • Need job that just does term frequency adjustment
  • Save out two join tables so we know how big they are
  • See whether they are broadcasting and whether we should hint the broadcast
  • How do we make the small tables shuffle to the big tables?

Data types and speed

The current implementation saves results out to CSV in S3 (Athena's default behaviour) and then reads them in from S3.

However, it is possible to save results out to parquet using a CREATE TABLE AS statement.

This has two benefits:

  • Read speeds of parquet from S3 are a lot faster (double or more) than reading CSV
  • The resultant dataframe in pandas is guaranteed to have the right data types.

One potential issue with this approach is that the user must submit a select statement (not e.g. a delete table statement). So, if we're worried about this, we would need to somehow parse the sql statement to make sure it's a select statement.

I previously had a very rough go at this here, which does work in most situations, but it's very rough and ready.
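For illustration, a check along those lines could use sqlparse (a sketch, not the rough version linked above):

import sqlparse

def is_select_statement(sql):
    # Parse the query and confirm every statement is a SELECT before
    # routing it through the CTAS/parquet path.
    statements = sqlparse.parse(sql)
    return bool(statements) and all(s.get_type() == "SELECT" for s in statements)

is_select_statement("SELECT * FROM db.table LIMIT 10")  # True
is_select_statement("DROP TABLE db.table")              # False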

Once we've done this, we should probably deprecate the python_athena_tools repo.

STRING TYPE not supported

Relates to #3

SHOW TABLES FROM database

Returns an Athena datatype of string. Also, the output file is a .txt file.

pydbtools should recognise text outputs (instead of CSVs) and read them in appropriately.

CCDE-388: Allow users to provide/edit utils

In the AP move to EKS we need to provide a different way to set temp_database_name/location; see the changes here.

This hotfix branch should be the default in the next release.

At the moment a lot of the utils are hardcoded or cannot be changed, so others cannot change how these things are defined. I think it would be good to allow users to update/change utils at runtime of pydbtools. Have a look at how aws-wrangler changes its settings (config): they essentially have a settings class in a private module, create an object of that class, and expose that in a public module as the utils/settings. Then you can just edit the values/functions of that utils object.

Docs

Code
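A minimal sketch of that pattern (module and attribute names here are hypothetical, not pydbtools' current code):

# _settings.py (private module)
class Settings:
    def __init__(self):
        self.temp_database_name_prefix = "mojap_de_temp_"
        self.aws_region = "eu-west-1"
        self.ctas_approach = True

# utils.py (public module) exposes a single instance that users can edit
# at runtime, mirroring how aws-wrangler's config object works.
settings = Settings()

# User code would then do something like:
# import pydbtools as pydb
# pydb.utils.settings.temp_database_name_prefix = "my_team_temp_"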

Query caching issues - query returning results different to Athena GUI

I've run into an issue a few times where I run a query using read_sql_query and get completely different results from the Athena GUI.

Restarting kernel and re-running fixes it.

I think this is likely to be a result of caching behaviour - possibly from awswrangler.

I haven't yet been able to create a fully reproducible example. The query I was running this morning was:

sql = """
select *
from data_linking_temp.nomis_csv_qa
limit 10
"""

df = pydb.read_sql_query(sql)

Note: This isn't simply a result of different rows being returned (which might be expected due to the distributed nature of the query). The query was returning the wrong columns and the wrong number of rows

FileNotFoundError when using s3fs >= 0.3.0

Currently s3fs version >= 0.3.0 is causing issues (see pandas and s3fs related issues).

Note to reproduce each error restart your kernel before each run.

Error 1:

import pydbtools as pydb

 # Works
df1 = pydb.read_sql("SELECT * FROM db.table limit 10")

# Errors (same traceback as issues referenced above)
df2 = pydb.read_sql("SELECT * FROM db.table limit 10") 

Error 2:

from gluejobutils import s3
import pandas as pd

no_file = "s3://alpha-everyone/does_not_exist.csv"
is_file = "s3://alpha-everyone/iris.csv"

pd.read_csv(no_file) # Same error as above

Error 3:

from gluejobutils import s3
import pandas as pd

no_file = "s3://alpha-everyone/does_not_exist.csv"
is_file = "s3://alpha-everyone/iris.csv"

df1 = pd.read_csv(is_file) # works as expected
s3.copy_s3_object(is_file, no_file)

df2 = pd.read_csv(no_file) # same error

Issue seems to be that s3fs is caching the list of objects in the bucket. So the first call works fine as there is no cache. The second call to the same bucket refers to the cache and sees that there is no file (as it was created post caching).

I think the interim solution for now is to get pydbtools to set up its own s3fs.S3FileSystem and pass that to pandas.read_csv so that we can force clearing the cache in s3fs (i.e. copy this).

Currently using s3fs 0.3.4, pandas 0.25., pydbtools 1.0.2, gluejobutils 3.0.0
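A rough sketch of that interim fix (illustrative only; behaviour may differ across s3fs versions):

import pandas as pd
import s3fs

def read_csv_no_cache(s3_path, **kwargs):
    # Build a fresh filesystem and clear any cached directory listings so
    # recently written objects are visible to pandas.
    fs = s3fs.S3FileSystem()
    fs.invalidate_cache()
    with fs.open(s3_path, "rb") as f:
        return pd.read_csv(f, **kwargs)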

document iam policy requirements

pydbtools (and indeed dbtools) outputs to the bucket alpha-athena-query-dump. While I think it will mostly be used interactively, for those who want to use it in an Airflow task, it's worth adding a note in the readme that the bucket will need to be added to the IAM policy in order for it to work.

DECIMAL presto type not supported

import pydbtools as pydb

sql = """
SELECT CAST(0 AS DECIMAL(1,1)) as test
"""
out = pydb.read_sql(sql)

=== Error ===

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-1-f290daede81f> in <module>
      4 SELECT CAST(0 AS DECIMAL(1,1)) as test
      5 """
----> 6 out = pydb.read_sql(sql)

~/.local/lib/python3.6/site-packages/pydbtools/read_sql.py in read_sql(sql_query, timeout, *args, **kwargs)
     17 
     18     # Read in the SQL query
---> 19     dtype, parse_dates = _pd_dtype_dict_from_metadata(response['meta'])
     20     s3_path = response['s3_path'].replace('s3://', 's3a://')
     21     df = pd.read_csv(s3_path, dtype=dtype, parse_dates=parse_dates, *args, **kwargs)

~/.local/lib/python3.6/site-packages/pydbtools/utils.py in _pd_dtype_dict_from_metadata(athena_meta)
     57     for c in athena_meta:
     58         colname = c["name"]
---> 59         coltype = _athena_meta_conversions[c["type"]]['pandas']
     60         dtype[colname] = np.typeDict[coltype]
     61         if c["type"] in ["date", "timestamp"]:

KeyError: 'decimal'
