This is a simple package that lets you query databases using Amazon Athena and get the S3 path to the Athena output (as a CSV). This is significantly faster than using the database drivers provided by Amazon, so it can be a good option when pulling in large data.
Note: this package works alongside user IAM policies on the Analytical Platform and requires you to be given standard database access. If you are in our GitHub organisation you will be able to access the repo to request standard database access here.
This package uses the Python package pydbtools under the hood. Make sure your RStudio deployment is up to date and has Python 3.6 or higher installed.
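If you want to confirm which Python your R session will pick up, a quick check with reticulate (already a dependency of this package) might look like this:

```r
# Show the Python interpreter and version that reticulate will use;
# the version reported should be 3.6 or higher
reticulate::py_config()
```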
Before using dbtools you'll need to install pydbtools v2.0.2. Do this from the R terminal:

```bash
# in terminal
python -m pip install -U pydbtools==2.0.2
```
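Once installed, you can check from R that reticulate can see the module (a quick sanity check, not a required step):

```r
# Should return TRUE once pydbtools has been installed successfully
reticulate::py_module_available("pydbtools")
```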
Then install dbtools itself. The best way to do this is via conda:

```bash
# in terminal
conda install -c moj-analytical-services r-dbtools
```
Package requirements are:

- `s3tools` (preinstalled)
- `reticulate`
- `python` (preinstalled - version 3.6 or higher)
- `boto3` (preinstalled)
- `readr` (preinstalled)
- `data.table` (version 1.11.8 or above)
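If you want to check that your data.table install meets the minimum version, one way is:

```r
# Should return TRUE if the installed data.table is new enough
packageVersion("data.table") >= "1.11.8"
```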
The easiest way to read in the data:

```r
# Returns the SQL query results, with matching data types, as a tibble
df <- dbtools::read_sql("SELECT * from crest_v1.flatfile limit 10000")

# Read the results as a data.table instead
dt <- dbtools::read_sql("SELECT * from crest_v1.flatfile limit 10000", return_df_as = "data.table")
```
### Read SQL query using your own read csv method

If you want to read in your data using a specific method:

```r
response <- dbtools::get_athena_query_response("SELECT * from crest_v1.flatfile limit 10000")

# Print out the path to the Athena query output (as a CSV)
print(response$s3_path)

# Print out the metadata
print(response$meta)

# Read in the data using whatever CSV reader you want
# (here data.table::fread, reading everything as a string)
s3_path_stripped <- gsub("s3://", "", response$s3_path)
df <- s3tools::read_using(FUN = data.table::fread, s3_path = s3_path_stripped)
```
When using the `read_sql` function you can specify the type of dataframe to return:

- tibble (default)
- data.table
- dataframe

Note: to find out more about this function, see the function documentation, i.e. `?dbtools::read_sql`.
Each is a type of dataframe in R, and each has different quirks when converting from Athena data types to R data types.
- tibble: This is the default choice, as it is the only dataframe that converts dates and datetimes (aka timestamps) on read, rather than requiring a second parse of the data to convert dates and timestamps from strings to their correct types. This is a good option if your data is not that large and you like those tidyverse things. One downside is that long integers are actually stored as doubles (this is because tibbles currently don't support 64-bit integers - see issue).
- data.table: This dataframe class is really good for larger datasets, as it's more memory efficient. Long integers are read in as int64. Dates and datetimes are read in as strings, since `data.table::fread` doesn't convert them on read (see documentation); feel free to cast those columns afterwards - see the sketch after this list.
- dataframe: Support for this was added because it's the base dataframe type in R. However, Athena exports CSVs with every value in double quotes, and because of this the `scan` function that is called internally by `read.csv` throws an error unless you specify columns as character (see issue). Therefore the returned dataframe has every column typed as character. Feel free to cast the columns afterwards.
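Since data.table reads dates and datetimes as strings, a minimal casting sketch might look like the following (the column names `date_col` and `datetime_col` are illustrative):

```r
library(data.table)

# Read the query results as a data.table (table name from the examples above)
dt <- dbtools::read_sql(
  "SELECT * from crest_v1.flatfile limit 10000",
  return_df_as = "data.table"
)

# Dates and datetimes arrive as strings; cast them yourself.
# date_col and datetime_col are hypothetical column names.
dt[, date_col := as.Date(date_col)]
dt[, datetime_col := as.POSIXct(datetime_col, tz = "UTC")]
```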
Below is a table showing the conversion from our data types to each supported dataframe in R (using the `read_sql` function):

| data type | tibble type (R atomic type) | data.table type (R atomic type) | dataframe type (R atomic type) |
|---|---|---|---|
| character | readr::col_character() (character) | character | character |
| int | readr::col_integer() (integer) | integer | character |
| long | readr::col_double() (double) | bit64::integer64() (double) | character |
| date | readr::col_date() (double) | character | character |
| datetime | readr::col_datetime() (double) | character | character |
| boolean | readr::col_logical() (logical) | logical | character |
| float | readr::col_double() (double) | double | character |
| double | readr::col_double() (double) | double | character |
Note: where the R atomic type is not listed in the table above, it is the same as the type specified.
The output from `dbtools::get_athena_query_response(...)` is a list; one of its keys is `meta`. The `meta` key is a list where each element gives the name (`name`) and data type (`type`) of each column in your Athena query output. For example, this table output:
| col1 | col2 |
|---|---|
| 1 | 2018-01-01 |
| 2 | 2018-01-02 |
| ... | ... |
would have a meta like:

```r
response$meta[[1]]$name # col1
response$meta[[1]]$type # int
response$meta[[2]]$name # col2
response$meta[[2]]$type # date
```
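To see all column names and types at once, you could loop over the meta list of an earlier `get_athena_query_response` call, e.g.:

```r
# Print the name and generic data type of every column in the query output
for (m in response$meta) {
  cat(m$name, ":", m$type, "\n")
}
```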
The meta types follow the generic metadata types used in etl_manager. If you want the actual Athena meta data types instead of the generic ones, set the `return_athena_types` input parameter to `TRUE`, e.g.

```r
response <- dbtools::get_athena_query_response("SELECT * from crest_v1.flatfile limit 10000", return_athena_types = TRUE)
print(response$meta)
```
- Amazon Athena uses a flavour of SQL called Presto; docs can be found here
- To query a date column in Athena you need to specify that your value is a date, e.g. `SELECT * FROM db.table WHERE date_col > date '2018-12-31'`
- To query a datetime or timestamp column in Athena you need to specify that your value is a timestamp, e.g. `SELECT * FROM db.table WHERE datetime_col > timestamp '2018-12-31 23:59:59'`
- Note the date and datetime formatting used above. See more specifics around dates and datetimes here
- To specify a string in an SQL query, always use `''` not `""`. Using `""` means you are referencing a database, table or column, etc.
- When data is pulled back into RStudio, the column types are either R characters (for any column that was a date, datetime or character) or doubles (for everything else)
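Putting the date filtering together with `read_sql`, a query might look like this (`db.table` and `date_col` are placeholder names for your own table and column):

```r
# Date literals must be tagged with `date` in Athena/Presto SQL;
# db.table and date_col are hypothetical names
df <- dbtools::read_sql(
  "SELECT * FROM db.table WHERE date_col > date '2018-12-31' LIMIT 100"
)
```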
When you run a query in SQL against our databases you are using Athena. When Athena produces the output of an SQL query, it always writes it to a location in S3 as a CSV. dbtools defines this S3 location based on your AWS role: it writes the output CSV into a folder only you have read/write access to, and then reads it in using `s3tools`. Once the data has been read into a dataframe, dbtools deletes the CSV from your folder.
Note: dbtools requires you to have the StandardDatabaseAccess group policy attached to your role. If you want to use dbtools please ask the data engineering team (on Slack, ideally via the #analytical_platform channel).
- Fixes prompts to install Miniconda - now automatically uses the main Analytical Platform Conda Python, based on the sys path
- Fixed an issue where credentials would not refresh
- Is now dependent on the `pydbtools` package
- SQL queries like `SHOW COLUMNS FROM db.table` now work for `read_sql` and return a df.
- Updated the version in the DESCRIPTION file to the correct version
- Removed input parameters `bucket` and `output_folder` from the `read_sql` and `get_athena_query_response` functions. A new section in the README named 'Under The Hood' explains why.
- Note: the package now requires the group policy `StandardDatabaseAccess` to be attached to the role that uses this package.
- Added function `read_sql`, which reads an SQL query directly into an R dataframe. See the R documentation (i.e. `?read_sql`)
- Input parameter `out_path` in the function `get_athena_query_response` has been replaced by two input parameters: `bucket` and `output_folder`. E.g. if your `out_path` was `"s3://my-bucket/__temp__"` then the new input params are `bucket="my-bucket"` and `output_folder="__temp__"`. Note that `output_folder` defaults to the value `__athena_temp__`; it is recommended that you leave this unchanged.
- `timeout` is now an input parameter to `get_athena_query_response`; if not set, there is no timeout for the Athena query.
- `get_athena_query_response` will now print out the athena_client response if the Athena query fails.