
botor's Introduction

botor: Reticulate wrapper on 'boto3'

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

This R package provides raw access to the 'Amazon Web Services' ('AWS') 'SDK' via the 'boto3' Python module and some convenient helper functions (currently for S3 and KMS) and workarounds, eg taking care of spawning new resources in forked R processes.

Installation

This package requires Python to be installed along with the boto3 Python module, which can be installed from R via:

reticulate::py_install('boto3')

If that fails or results in technical problems that you cannot solve, it's probably easier to install a standalone Python along with the system dependencies etc via rminiconda.

Once the Python dependencies are resolved, you can either install botor from CRAN, or install the most recent development version from GitHub:

remotes::install_github('daroczig/botor')

Loading the package

Loading the botor package might take a while as it will also import the boto3 Python module in the background:

system.time(library(botor))
#>    user  system elapsed 
#>   1.131   0.250   1.191

Getting started

Quick examples:

  1. Check the currently used AWS user's name:

    iam_whoami()
    #> [1] "gergely-dev"
  2. Read a csv file stored in S3 using a helper function:

    s3_read('s3://botor/example-data/mtcars.csv', read.csv)
    #>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    #> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    #> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    #> ...
  3. Encrypt a string via KMS using a helper function:

    kms_encrypt('alias/key', 'secret')
    #> [1] "QWERTY..."
  4. Get more info on the currently used AWS user calling the IAM client directly:

    iam()$get_user()
  5. Create a new client to a service without helper functions:

    ec2 <- botor_client('ec2')
    ec2$describe_vpcs()

AWS Auth

The botor package by default will use the credentials and related options set in environment variables or in the ~/.aws/config and ~/.aws/credentials files. If you need to specify a custom profile or AWS region etc, there are various options with different complexity and flexibility:

  • set the related environment variable(s) before loading botor
  • call the botor() function with the relevant argument to set the config of the default session for the botor helper functions, eg
botor(region_name = 'eu-west-42')
  • if you need to manage multiple sessions, then use the raw boto3 object from the botor package and boto3.session.Session to init these custom sessions and the required clients/resources on top of those, eg
my_custom_session1 <- boto3$Session(region_name = 'us-west-1')
my_custom_s3_client1 <- my_custom_session1$client('s3')
my_custom_session2 <- boto3$Session(region_name = 'us-west-2')
my_custom_s3_client2 <- my_custom_session2$client('s3')

Using the raw boto3 module

The botor package provides the boto3 object with full access to the boto3 Python SDK. Quick example on listing all S3 buckets:

library(botor)
s3 <- boto3$resource('s3')
library(reticulate)
iter_next(s3$buckets$pages())

Note that this approach requires a stable understanding of the boto3 Python module, plus a decent familiarity with reticulate as well (see eg iter_next) -- so you might rather consider using the helper functions described below.

Using the default botor session

Calling botor() will provide you a default boto3 session that is cached internally. You can always override the default session by calling botor() again with new arguments. See eg setting the default boto3 session to use us-west-2:

botor(region_name = 'us-west-2')
botor()$resource('s3')

A great advantage of using botor() instead of custom sessions is that it's fork-safe. See eg:

attr(botor(), 'pid')
#> [1] 31225
attr(botor(), 'pid')
#> [1] 31225

lapply(1:2, function(i) attr(botor(), 'pid'))
#> [[1]]
#> [1] 31225
#>
#> [[2]]
#> [1] 31225

mclapply(1:2, function(i) attr(botor(), 'pid'), mc.cores = 2)
#> [[1]]
#> [1] 13209
#> 
#> [[2]]
#> [1] 13210

Convenient helper functions

Besides the botor pre-initialized default Boto3 session, the package also provides some further R helper functions for the most common AWS actions, like interacting with S3 or KMS. Note that the list of these functions is pretty limited for now, but you can always fall back to the raw Boto3 functions if needed. PRs on new helper functions are appreciated :)

Examples:

  1. Listing all S3 buckets takes some time as it will first initialize the S3 Boto3 client in the background:

    system.time(s3_list_buckets())[['elapsed']]
    #> [1] 1.426
  2. But the second query is much faster, as it reuses the same S3 Boto3 resource:

    system.time(s3_list_buckets())[['elapsed']]
    #> [1] 0.323
  3. Unfortunately, sharing the same Boto3 resource between (forked) processes is not ideal, so botor takes care of that by spawning new resources in the forked processes:

    library(parallel)
    simplify2array(mclapply(1:4, function(i) system.time(s3_list_buckets())[['elapsed']], mc.cores = 2))
    #> [1] 1.359 1.356 0.406 0.397
  4. Want to speed it up more?

    library(memoise)
    s3_list_buckets <- memoise(s3_list_buckets)
    simplify2array(mclapply(1:4, function(i) system.time(s3_list_buckets())[['elapsed']], mc.cores = 2))
    #> [1] 1.330 1.332 0.000 0.000

The currently supported resources and features via helper functions: https://daroczig.github.io/botor/reference/index.html

Error handling

The convenient helper functions try to suppress the boring Python traceback and provide you only the most relevant information on the error. If you want to see the full tracelog and more details after an error, call reticulate::py_last_error(). When working with the raw boto3 wrapper, you may find botor:::trypy useful as well.

s3_download_file('s3://bottttor/example-data/mtcars.csv', tempfile())
#> Error in s3_download_file("s3://bottttor/example-data/mtcars.csv", tempfile()) : 
#>   Python `ClientError`: An error occurred (404) when calling the HeadObject operation: Not Found

s3_read('s3://botor/example-data/mtcars2.csv', read.csv)
#> Error in s3_download(object, t) : 
#>   Python `ClientError`: An error occurred (403) when calling the HeadObject operation: Forbidden

botor(region_name = 'us-west-2')
s3_read('s3://botor/example-data/mtcars.csv', read.csv)
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> ...

Logging

botor uses the logger package to write log messages to the console by default with the following log level standards:

  • TRACE start of an AWS query (eg just about to start listing all S3 buckets in an AWS account)
  • DEBUG summary on the result of an AWS query (eg number of S3 buckets found in an AWS account)
  • INFO currently not used
  • WARN currently not used
  • ERROR something bad happened and logging extra context besides what's being returned in the error message
  • FATAL currently not used

The default log level threshold is set to DEBUG. If you want to update that, use the package name for the namespace argument of log_threshold from the logger package, eg to enable all log messages:

library(logger)
log_threshold(TRACE, namespace = 'botor')

s3_download_file('s3://botor/example-data/mtcars.csv', tempfile())
#> TRACE [2019-01-11 14:48:07] Downloading s3://botor/example-data/mtcars.csv to '/tmp/RtmpCPNrOk/file6fac556567d4' ...
#> DEBUG [2019-01-11 14:48:09] Downloaded 1303 bytes from s3://botor/example-data/mtcars.csv and saved at '/tmp/RtmpCPNrOk/file6fac556567d4'

Or raise the threshold so that messages less important than warnings are not fired:

library(logger)
log_threshold(WARN, namespace = 'botor')

You can use the same approach to set custom (or more than one) log appenders, eg writing the log messages to files, a database etc -- check the logger docs for more details.
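For example, an extra appender could route botor's log messages to a file instead of the console (a sketch; the log file path below is arbitrary):

```r
library(logger)

# write botor's log messages to a file; the path is just an example
log_appender(appender_file('/tmp/botor.log'), namespace = 'botor')
```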

Why the name?

botor means "goofy" in Hungarian. This is how I feel when looking back to all the dev hours spent on integrating the AWS Java SDK in R -- this includes AWR.KMS, where I ended up debugging and fixing many issues in forked processes, but AWR.Kinesis still rocks :)

The name also reminds you that it's not exactly boto3, as eg you have to use $ instead of . to access methods.
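For example, what would be `boto3.client('s3').list_buckets()` in Python looks like this in R (assuming valid AWS credentials are configured):

```r
library(botor)

# Python: boto3.client('s3').list_buckets()
s3 <- boto3$client('s3')
s3$list_buckets()
```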

botor's People

Contributors: amy17519, daroczig, jburos


botor's Issues

`s3_ls` fails without a useful error if uri does not end in `/`

Actual behavior: s3_ls("s3://my-bucket") returns NULL even when my-bucket is valid (ie s3_ls("s3://my-bucket/") would work).

My expectation would be that this would either work (returning a data frame of objects) or raise an error (similar to if the regex matching fails).
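Until that is fixed, a minimal workaround is to make sure the URI ends in a slash before listing (a sketch; `s3_ls_safe` is a hypothetical wrapper, not part of botor):

```r
library(botor)

# hypothetical wrapper: append a trailing slash if missing, then list
s3_ls_safe <- function(uri) {
  s3_ls(sub('/?$', '/', uri))
}
```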

`s3_ls()` gives an error when accessing a specific bucket

Hi,

I am trying to list all the objects in an S3 bucket so I can read them later but I get the following error:

(screenshot of the error message)

My code is below; I want to read the file with the most recent modification date and also show the structure of the bucket:

(screenshots of the code and bucket structure)

Thanks!

s3_ls fails when object owner is not provided

We have encountered a situation where the following line in botor/R/s3.R (line 274 as of commit 652996b):

owner = object$data$Owner$DisplayName,

is failing with

Error in data.frame(bucket_name = uri_parts$bucket_name, key = object$data$Key,  : 
  arguments imply differing number of rows: 1, 0

I ran browser() in the above function and discovered that $Owner was not defined in objects[[1]]$meta$__dict__. Commenting out the above line (L274) solved the issue. I did not create the object in question, so I'm not sure if it's possible to create an object without an owner, or if perhaps the credentials I'm using do not have permission to view who the owner is. (All of the other metadata referenced in that code chunk is there.)

cc @orenscarmeli
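A possible defensive fix (a sketch, not the actual patch; `extract_owner` is a hypothetical helper and it assumes the object metadata converts to a plain R list) would fall back to NA when the Owner field is missing:

```r
# sketch of a guard against a missing Owner element in an S3 object summary
extract_owner <- function(object) {
  owner <- object$data$Owner$DisplayName
  if (is.null(owner)) NA_character_ else owner
}
```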

Persisting the boto3 session across different s3_read / s3_write calls

Hi all,
The helper read and write functions are really helpful when interacting with R and AWS S3. However, there is a problem when reading or writing a large number of files/R objects to AWS S3: if the user has AWS KMS enabled on their S3 bucket, then AWS KMS is called each time s3_read/s3_write is used. This means KMS could be called hundreds of times if s3_read/s3_write is put in a loop. If another boto_session or s3_client could be passed as a parameter, then the same session could be reused over multiple calls to s3_read/s3_write. This should reduce the AWS KMS cost dramatically.

# example of s3_read
s3_read(uri, fun, ..., extract = c("none", "gzip", "bzip2", "xz"), s3_client = s3())

# Example code in practice:
library(botor)
library(data.table)

# create s3 client
s3 <- s3()

# list files to be read:
s3_files <- s3_ls("s3://mybucket/my_files/")

# persisting s3 client session
df = rbindlist(lapply(s3_files$uri, s3_read, fun = fread, s3_client = s3))

s3_ls function not working in CRAN version 0.3.0

First of all, great idea to create an R-wrapper for boto3 to directly communicate with s3 within scripts!

I have found the s3_ls function not working in version 0.3.0 with R version 4.0.3 on Windows 10 (tested also in Linux environments with the same effect). The error message is:

Error in data.frame(bucket_name = uri_parts$bucket_name, key = object$data$Key, :
arguments imply differing number of rows: 1, 0

botor does not play nice with `aws sso login`

When I am working with boto3 I can connect to AWS by typing aws sso login in the terminal.

However, when I try this with botor, it's not working.

Example

# Go through the sso process via browser
system("aws sso login")

s3 <- botor::botor_client("s3")
pr <- s3$list_objects_v2(Bucket = "<BUCKET>", Prefix = "<PREFIX>", Delimiter = "/")
Error: botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist

Write to S3 with KMS encryption

Hi,

Thanks a lot for this great library.
It works fine, but I am not sure how to achieve the following use case: upload a file to S3 with KMS key encryption. Eg in Python Boto3 I would do the following:

s3.Bucket('test-bucket').upload_file('test.txt', 'experiments/test.txt', ExtraArgs={"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": "alias/test-key" })

Can this be done already? Or would it be possible to add?
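Through botor's raw boto3 object this might translate to something like the following (an untested sketch based on the Python call above; reticulate converts the named R list to a Python dict):

```r
library(botor)

s3 <- boto3$resource('s3')
# Python: s3.Bucket('test-bucket').upload_file('test.txt',
#   'experiments/test.txt', ExtraArgs={...})
s3$Bucket('test-bucket')$upload_file(
  'test.txt', 'experiments/test.txt',
  ExtraArgs = list(
    ServerSideEncryption = 'aws:kms',
    SSEKMSKeyId = 'alias/test-key'
  )
)
```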

Thank you a lot.

Best regards

`s3_read` doesn't play nice with `openxlsx`

It seems that trying to read an xlsx file with the openxlsx package doesn't work. botor::s3_read("s3://mybucket/example_file.xlsx", fun = openxlsx::read.xlsx) results in Error: openxlsx can only read .xlsx files

I can confirm that reading the file locally (openxlsx::read.xlsx("example_file.xlsx")) works fine. And readxl's read_xlsx also works with botor. I would probably prefer readxl over openxlsx anyway, but it would be useful to understand why these aren't working nicely together.

Add Logging

Option to have logging of s3 paths when running s3_read/s3_write etc
