
rdbtools's Introduction

Rdbtools

Please note that this is not officially supported by the AP team and is intended to be community supported.

What is Rdbtools?

This is an extension of the noctua package, for interacting with AWS Athena through the MoJ's analytical platform. See https://dyfanjones.github.io/noctua/reference/index.html

The primary purpose of this package is to allow MoJ analysts to easily access data on Athena, without having to understand anything about the underlying authentication systems. This access is provided through the R database interface DBI, and so works with the standard database functions used in R. It also works with dbplyr, an extension of dplyr which allows you to use familiar tidyverse functions on data in Athena itself (reducing the need for large data pre-processing steps in R, and without having to learn SQL).

In addition, this package extends the methods defined in the noctua package to allow users easy access to a safe temporary database for intermediate processing steps.

The secondary purpose of this package is to provide backwards compatibility with dbtools, which does not work on the new AP infrastructure. For this the package provides a few convenience functions for MoJ users. The key difference of this package over dbtools is that it is implemented entirely in R and doesn't require a Python dependency.

Installing Rdbtools

Install Rdbtools with one of the following commands:

  • If using renv:
    • install Rdbtools: renv::install("moj-analytical-services/Rdbtools") or if that doesn't work try renv::install("[email protected]:moj-analytical-services/Rdbtools.git")
  • If not using renv:
    • install Rdbtools: devtools::install_github("moj-analytical-services/Rdbtools") (you may need to install devtools first)

You can use the same command to update the package if it is changed on GitHub later.

How to use

Basic connecting a session and querying

With SQL commands (using DBI)

See https://dyfanjones.github.io/noctua/reference/index.html for the full list of functions you can call to interact with Athena.

To query a database, use:

library(Rdbtools)
con <- connect_athena() # creates a connection with sensible defaults
data <- dbGetQuery(con, "SELECT * FROM database.table") # queries and puts data in R environment
dbDisconnect(con) # disconnects the connection

Using dbplyr

See https://dbplyr.tidyverse.org/index.html

As an example:

library(tidyverse)
library(dbplyr)
library(Rdbtools)

con <- connect_athena()
datadb <- tbl(con, sql("select * from database.name")) # create the dbplyr link
# use dplyr as usual on this dataframe link
datadb %>%
  filter(size < 10) %>%
  group_by() %>%
  summarise(n = n(),
            total = sum(total))

dbDisconnect(con) # disconnects the connection

Note that if you need any function within dbplyr which does a copy (e.g. joining a local table to a remote table) then you need to ensure you have the right permissions for the staging directory you are using. See the help page for dbWriteTable by running ?dbWriteTable in the console.
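For example, here is a minimal sketch of such a copy, assuming a hypothetical local lookup table with an id column (the table and column names are illustrative, not from the package):

library(tidyverse)
library(dbplyr)
library(Rdbtools)

con <- connect_athena()
datadb <- tbl(con, sql("select * from database.name"))

# Hypothetical local lookup table to join onto the remote Athena table
lookup <- tibble(id = c(1, 2, 3), label = c("a", "b", "c"))

# copy = TRUE uploads the local table behind the scenes, so the staging
# directory permissions described above must allow the write
datadb %>%
  inner_join(lookup, by = "id", copy = TRUE) %>%
  collect()

dbDisconnect(con)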

The temporary database

Each user can have a database which can store temporary tables.

Note that the tables created here will have their underlying data stored in the default staging directory (which is different for each new connection) or that specified by the staging directory argument (which will remain the same for each new connection). The permissions of the staging directory will determine who can access the data in the temporary tables.

With SQL commands (using DBI)

Wherever you put the special string __temp__ in SQL commands then this will refer to a database which is specific to your user and where you can write temporary tables before you read them out. This works with the DBI functions (which are updated in this package for connections made via connect_athena()) and the convenience functions (e.g. read_sql()).

library(Rdbtools)
con <- connect_athena() # creates a connection with sensible defaults
dbExecute(con, "CREATE TABLE __temp__.name AS SELECT * FROM database.table") # queries and puts in temp space
data <- dbGetQuery(con, "SELECT * FROM __temp__.name") # queries and puts data in R environment
dbDisconnect(con) # disconnects the connection

The __temp__ string substitution is implemented for:

  • dbGetQuery
  • dbExecute
  • dbGetTables
  • dbListTables
  • dbExistsTable
  • dbListFields
  • dbRemoveTable
  • dbWriteTable (but note the permission issue in the help for this function by running ?dbWriteTable in the console)

If there are further noctua/DBI functions where the __temp__ string substitution would be useful then open an issue or pull request and the Rdbtools community can try to arrange an implementation.
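For example, a brief sketch using some of the housekeeping functions listed above (the table name tab_example is purely illustrative):

library(Rdbtools)
con <- connect_athena()
dbExecute(con, "CREATE TABLE __temp__.tab_example AS SELECT * FROM database.table")
dbExistsTable(con, "__temp__.tab_example")  # TRUE for the lifetime of this session
dbRemoveTable(con, "__temp__.tab_example")  # tidy up the temporary table
dbDisconnect(con)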

Using dbplyr (or other packages)

The __temp__ string is not understood by dbplyr functions, so to use the temporary database for this or other packages you have two options:

  • When creating the connection, you can specify the temporary database as the default schema: connect_athena(schema_name = "__temp__"). In this case dbplyr commands which do not specify a database will default to the temporary database (e.g. compute("temp_tbl") at the end of a dbplyr chain will create a table in the temporary database with the name "temp_tbl").
  • Alternatively, the athena_temp_db function will return a string with the name of the temporary database, which can be used to manually compose specific SQL commands or passed to other functions not listed above (see the sketch below).

The temporary database is the same each way, so you can mix dbplyr, DBI, and other packages in the same code.
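As a rough sketch of both options (the table name temp_tbl is illustrative, and whether athena_temp_db() takes the connection as an argument is an assumption here - check its help page):

library(tidyverse)
library(dbplyr)
library(Rdbtools)

# Option 1: make the temporary database the default schema for this connection
con <- connect_athena(schema_name = "__temp__")
tbl(con, sql("select * from database.name")) %>%
  filter(size < 10) %>%
  compute("temp_tbl")  # materialises the result as __temp__.temp_tbl

# Option 2: resolve the temporary database name and use it explicitly
temp_db <- athena_temp_db(con)  # assumed to return the database name as a string
datadb <- tbl(con, in_schema(temp_db, "temp_tbl"))

dbDisconnect(con)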

Advanced use

The connection object

The connection object returned by connect_athena() contains all the information about a single authenticated session which allows access to the databases for which you have permission. By default the authenticated session will last for one hour, after which you will have to create a new connection or else refresh your connection. For most purposes creating a new connection will be sufficient, however you will lose access to any tables created in the __temp__ database (as these are only accessible under the same session). To refresh a connection, please use the refresh_athena_connection() function, or in a long script the refresh_if_expired() function may also be useful (see the help pages in RStudio for further details of these functions).
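A sketch of refreshing in a long-running script is below; the exact call signatures of refresh_athena_connection() and refresh_if_expired() (and whether they modify the connection in place or return a new one) should be checked in the help pages, so treat these calls as assumptions:

library(Rdbtools)
con <- connect_athena()
dbExecute(con, "CREATE TABLE __temp__.name AS SELECT * FROM database.table")

# ... long-running processing that may take more than an hour ...

# Assumed usage: refresh the credentials on the existing connection so that
# tables already written to __temp__ in this session remain accessible
refresh_if_expired(con)

data <- dbGetQuery(con, "SELECT * FROM __temp__.name")
dbDisconnect(con)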

The region argument when creating connection object

The region passed into connect_athena() will be used to:

  • get a temporary token for connecting to the Athena service
  • run the query and store the query result in the staging directory

In order to run the query successfully, the region needs to be the region where the query will be run and where the query result will be stored in the staging directory. You can pass the value explicitly when calling connect_athena(); by default, the region is decided based on the environment variables below:

  • AWS_ATHENA_QUERY_REGION: an environment variable for specifying the region when the region where the query will be run differs from the default region of the underlying running environment.

  • AWS_DEFAULT_REGION and AWS_REGION: the default region, which is usually set up by the underlying running environment (e.g. the cluster) and cannot be amended.

Otherwise eu-west-1 is used as the default.

In most cases you do not need to worry about the region: the default region (AWS_DEFAULT_REGION or AWS_REGION) should be the one where the query runs and where your staging directory is. When there is a cross-region situation in your running environment and you want to avoid passing the region every time you create a connection, you can use AWS_ATHENA_QUERY_REGION to specify it.
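For example, a minimal sketch of the environment-variable route (the region value is illustrative):

# Set before creating the connection, e.g. in .Renviron or at the top of a script,
# when the query must run in a region other than the default one
Sys.setenv(AWS_ATHENA_QUERY_REGION = "eu-west-2")

library(Rdbtools)
con <- connect_athena()  # picks up AWS_ATHENA_QUERY_REGION, otherwise falls back to
                         # AWS_DEFAULT_REGION / AWS_REGION, then eu-west-1
data <- dbGetQuery(con, "SELECT * FROM database.table")
dbDisconnect(con)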

Single queries (deprecated)

The function read_sql is provided which replicates the same function from dbtools - this is kept for backwards compatibility only. It creates a database connection, reads the data and then closes the connection on every call, so if you want to make more than one call to Athena the connection-based method described above is probably better. Also note that since authentication has moved to WebIdentity, any new temporary tables created under one connection will only be accessible by that same connection, so read_sql cannot be used to read a table created by another function unless the relevant connection object is supplied to the con argument (this is different to previous usage of read_sql).
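A minimal sketch, assuming the dbtools-style interface where the query string is the first argument and the optional con argument carries an existing connection:

library(Rdbtools)

# Deprecated convenience wrapper: opens a connection, runs the query, closes it
data <- read_sql("SELECT * FROM database.table")

# Reading a temporary table created elsewhere requires passing the connection
# that created it via the con argument (as described above)
con <- connect_athena()
dbExecute(con, "CREATE TABLE __temp__.name AS SELECT * FROM database.table")
data <- read_sql("SELECT * FROM __temp__.name", con = con)
dbDisconnect(con)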

rdbtools's People

Contributors

pjrh-moj, mratford, andreassoteriadesmoj


rdbtools's Issues

temporary tables not temporary

Hi,

I'd like to create a temporary table, i.e. one that doesn't persist, from a data frame, however this does not currently seem possible out of the box. The closest thing to it I currently do is:

library(Rdbtools)
con <- connect_athena()
dbWriteTable(
  conn = con,
  value = df,
  name = "__temp__.temp_table",
  overwrite = TRUE,
  s3.location = "s3://alpha-my-bucket/"
)
dbDisconnect(con)

where s3://alpha-my-bucket/ is an S3 bucket I have access to. This adds the table inside a new folder in that bucket, so I would have to delete the table myself to remove it.

Could the function reset the s3.location to something like the following automatically, when __temp__ appears in name?:

user_id <- paws::sts()$get_caller_identity()$UserId
s3_location <- paste("s3://mojap-athena-query-dump", user_id, sep = "/")

Or another solution that creates a temporary table?

noctua dependency issue

Hi just installed this package into a new project and had an error for the first time:

renv::install("moj-analytical-services/Rdbtools")

This gave an error:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace ‘noctua’ 2.6.2 is being loaded, but <= 2.6.1 is required

A workaround was to install an older version of the required dependency using:

renv::install("[email protected]")

Then the original command succeeded:

renv::install("moj-analytical-services/Rdbtools")

User access removed, access is now via a team

Hi there

The user mratford had Direct Member access to this repository and access via a team.

Access is now only via a team.

You may have less access; it is dependent upon the team's access to the repo.

If you have any questions, please post in #ask-operations-engineering (https://mojdt.slack.com/archives/C01BUKJSZD4) on Slack.

This issue can be closed.

Cannot create dplyr::tbl object from temporary database

I would like to use R’s dplyr on database tables instead of in-memory data frames. I want to do something very simple, like in the following example:

con <- Rdbtools::connect_athena()
my_data <- dplyr::tbl(con, dbplyr::in_schema('some_existing_database', 'some_existing_table'))

In the example above, dplyr will create a tbl object from table some_existing_table which is in some_existing_database.

The example above works just fine and returns a tbl object. However, when I try to create the tbl object with temporary tables, it fails:

# Create temporary table
Rdbtools::dbExecute(
  con, 
  "CREATE TABLE __temp__.my_data AS SELECT * FROM some_existing_database.some_existing_table limit 10"
)

# Create tbl from temporary table
my_data <- dplyr::tbl(con, dbplyr::in_schema('__temp__', 'my_data'))

# Error: EntityNotFoundException (HTTP 400). Database __temp__ not found.

Any tips please?

Write tables

Hi @mratford and @ymao2 ,

I'm just trying to tidy up Rdbtools a bit as more people are using it. The last bit of functionality I haven't got working is to write data to Athena - I don't know if this is a permissions issue?

The code currently lets you create a temporary database, and create new tables in it from existing tables on Athena (e.g. CREATE TABLE *** AS ***), but not using the dbWriteTable function from noctua. I'd like to get dbWriteTable working - see my attempts in the branch table_write.

I should be able to do:

library(tidyverse)
test_table <- tibble(col1 = c(1,2,3,4), col2 = c("a","b","c","d"))
con <- connect_athena()
dbWriteTable(con, "__temp__.tab_test", test_table)

However I try this, I get: Error: AccessDenied (HTTP 403). Access Denied.

The reason I want this is twofold

  • one is simple: I often have a list of ids that I want to use to select rows out of an existing table. It would be really useful to be able to write these to a temporary table and then join in the required records. Composing a SELECT ... WHERE id = ... statement doesn't work with too many ids.
  • secondly, in dbplyr (of which I'm a big fan!), if you want to join a local table to one in Athena (this could be to do the same as above, or as a lookup) then internally this needs to write a table to Athena

Just wondering if either of you could shed some light on this!
