
delta-sharing-r's Introduction

R Delta Sharing Connector

Note: Working on an updated version, coming soon :)
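The package is not on CRAN, so installation is from source. A minimal sketch, assuming the repository lives at zacdav-db/delta-sharing-r (the contributor and repo names suggest so, but verify the path before running):

# install from GitHub -- the repository path is an assumption
# install.packages("remotes")
remotes::install_github("zacdav-db/delta-sharing-r")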

# connect to client
library(delta.sharing)
client <- sharing_client("~/Desktop/config.share")

# see what data is accessible
client$list_shares()
client$list_all_schemas()
client$list_schemas(share = "deltasharingr")
client$list_tables(share = "deltasharingr", schema = "simple")
client$list_tables_in_share(share = "deltasharingr")

# table class
ds_tbl <- client$table(share = "deltasharingr", schema = "simple", table = "all_types")

# (optional) specify a limit (best effort to enforce)
ds_tbl$set_limit(limit = 1000)
ds_tbl$limit

# (optional) where to download files (before arrow kicks in)
ds_tbl$set_download_path("~/Desktop/share-download/")

# load data in as arrow::Dataset 
ds_tbl_arrow <- ds_tbl$load_as_arrow()
# if schema mapping is causing problems, infer the schema
# ds_tbl_arrow <- ds_tbl$load_as_arrow(infer_schema = TRUE)

# do standard {dplyr} things if you like that
library(dplyr)
ds_tbl_arrow %>%
  select(1, 2) %>%
  mutate(x = column1 + column2) %>%
  collect()

# just want a tibble? (alias for collect on arrow)
ds_tbl_tibble <- ds_tbl$load_as_tibble()
# ds_tbl_tibble <- ds_tbl$load_as_tibble(infer_schema = TRUE)
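
Not covered in the README above, but worth noting: the object returned by load_as_arrow() is a regular arrow Dataset, so it can be filtered lazily and written back out locally with arrow's own writer. A sketch (the output path and filter column are illustrative):

# persist an (optionally filtered) copy of the shared table as local parquet
library(arrow)
ds_tbl_arrow %>%
  filter(column1 > 0) %>%   # dplyr verbs are evaluated lazily by arrow
  write_dataset("~/Desktop/share-export/", format = "parquet")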


delta-sharing-r's Issues

error when using CDF functionality in `dev` branch

@zacdav-db Thanks so much for this package!

I first tried out the standard functionality that is currently available in main, which works great. I also wanted to test the change data feed functionality, so I installed the dev version.

I'm getting the following error message when trying to load data:

> ds_tbl_cdf$set_cdf_options(starting_version = 1)
> ds_tbl_cdf_tibble <- ds_tbl_cdf$load_tibble(changes = TRUE)
deleting 1 files that are no longer referenced
Error: rapi_prepare: Failed to prepare query create or replace table '/var/folders/bs/m184ytk15hddvjmg4zqrq8rh0000gq/T//RtmptNIogA/8e685855-71f0-43a3-9f6f-e176a36655f7/_table_changes/' as
with changes as (
  select *
  from read_parquet('/var/folders/bs/m184ytk15hddvjmg4zqrq8rh0000gq/T//RtmptNIogA/8e685855-71f0-43a3-9f6f-e176a36655f7/_table_changes//*__cdf_*', filename=true)
),
other as (
  select * exclude (filename), null as _change_type, filename
  from read_parquet('/var/folders/bs/m184ytk15hddvjmg4zqrq8rh0000gq/T//RtmptNIogA/8e685855-71f0-43a3-9f6f-e176a36655f7/_table_changes//*__[i|d]*_*', filename=true)
),
all_data as (
  select * from changes
  union
  select * from other
),
dataset as (
  select
    *,
    str_split(regexp_extract(filename, '.*\/(.*)\..*', 1), '_') as metadata,
    to_timestamp(metadata[5]::bigint/1000)::string as _change_timestamp,
    metadata[6]::int as _change_version,
    from all_data
)
select
  *
  exclude (filename, metadata)
  replace (coalesce(_chang

Unable to use with delta-sharing.io reference data

Hello. I'm new to R, so forgive the beginner question, but I'm unable to use the connector against the reference data provided by delta-sharing.io.

I've built the package and I'm running it this way:

install.packages("C:\\temp\\delta.sharing_0.1.1.zip", repos = NULL, type="source")
library("arrow")

profile_path = "C:\\temp\\open-datasets.share"
download_path <- "C:\\temp\\share-download"
share <- "delta_sharing"
schema <- "default"
table <- "lending_club"

library(delta.sharing)
client <- sharing_client(profile_path)

# table class
ds_tbl <- client$table(share, schema, table)

# (optional) specify a limit (best effort to enforce)
ds_tbl$set_limit(limit = 1000)
ds_tbl$limit

# (optional) where to download files (before arrow kicks in)
ds_tbl$set_download_path(download_path)

# just want a tibble? (alias for collect on arrow)
ds_tbl_tibble <- ds_tbl$load_as_tibble()

# write the tibble out to CSV
write.table(ds_tbl_tibble, file = file.path(download_path, "tibble.csv"))

However, I get the following output:

Error in `arrow::open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'C:/temp/share-download/fe9ef647-d848-476b-afbe-9083b532ec24/0df8a546325957122d72659e2ca8edc1.parquet': Could not open Parquet input source 'C:/temp/share-download/fe9ef647-d848-476b-afbe-9083b532ec24/0df8a546325957122d72659e2ca8edc1.parquet': Couldn't deserialize thrift: don't know what type: �
. Is this a 'parquet' file?

Any help would be appreciated. Thanks!
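
One quick diagnostic (not from the original report) is to check the downloaded file's magic bytes: a valid Parquet file both starts and ends with the 4-byte marker "PAR1", so anything else usually means the server returned an error payload or a truncated file rather than real Parquet. A sketch:

# check whether a downloaded file is plausibly Parquet ("PAR1" at both ends)
check_parquet_magic <- function(path) {
  size <- file.size(path)
  con  <- file(path, "rb")
  on.exit(close(con))
  head_bytes <- readBin(con, what = "raw", n = 4L)
  seek(con, where = size - 4, origin = "start")
  tail_bytes <- readBin(con, what = "raw", n = 4L)
  identical(rawToChar(head_bytes), "PAR1") && identical(rawToChar(tail_bytes), "PAR1")
}

check_parquet_magic("C:/temp/share-download/fe9ef647-d848-476b-afbe-9083b532ec24/0df8a546325957122d72659e2ca8edc1.parquet")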

Feature request: add option to load DeltaSharing credentials/token from a list instead of a file

Hey @zacdav-db, I was wondering whether you would be willing to implement an alternative way to load the DeltaSharing credentials.

Currently, the sharing_client() function expects a JSON file (typically config.share). While that works well locally, it doesn't work as well when running jobs through GitHub Actions, where I don't want to check the file into version control but would rather work with GitHub Secrets / environment variables.

I worked around it by writing a wrapper function that creates the JSON from the Secrets and saves it as a temporary file that is then read in using sharing_client() (see below). But I was wondering whether you could just add another parameter to sharing_client(), e.g. credentials_list (and rename the current parameter to credentials_file), that would allow passing a list with the credentials directly, i.e. skipping the jsonlite::read_json() step and providing its output object creds directly.

Thanks for your consideration!

Here my wrapper function for reference:

create_deltasharing_client <- function() {
  temp_file <- "config-temp.share"
  
  list(
    shareCredentialsVersion = 1,
    bearerToken             = Sys.getenv("TOKEN_DELTASHARING"),
    endpoint                = Sys.getenv("URL_DELTASHARING"),
    expirationTime          = "2024-09-20T22:16:14.933Z"
  ) |> 
    jsonlite::toJSON(auto_unbox = TRUE) |> 
    writeLines(temp_file)
  
  client <- delta.sharing::sharing_client(temp_file)
  
  file.remove(temp_file)
  
  client
}
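
For reference, a slightly more general version of the same workaround, roughly what a credentials_list argument could wrap internally (a sketch only; sharing_client_from_list is a hypothetical name, and tempfile() avoids leaving the secret in the working directory):

sharing_client_from_list <- function(creds) {
  # hypothetical helper: write the credentials list to a throwaway .share file,
  # build the client from it, then delete the file again
  temp_file <- tempfile(fileext = ".share")
  on.exit(unlink(temp_file))

  creds |>
    jsonlite::toJSON(auto_unbox = TRUE) |>
    writeLines(temp_file)

  delta.sharing::sharing_client(temp_file)
}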

P.S. Are you still planning to release a new version that will enable the CDF functionality?

Fails to Open Dataset

Thanks for the work on this project so far! It's looking really promising!

Getting set up was relatively easy and everything was going fine up until I tried to load the data as arrow.

library(delta.sharing)
client <- sharing_client("C:\\Users\\wld0303\\Downloads\\config.share")

client$list_shares()
client$list_all_schemas()

client$list_tables(share = "test", schema = "raw")

data <- client$table(share='test', schema='raw', table='synthea_patients')
data$set_download_path('C:\\Users\\wld0303\\Downloads')

data_tbl_arrow <- data$load_as_arrow()

The resulting error is below:
[screenshot of the error message attached to the original issue]

Do you have any suggestions on what I might have done wrong?
