Giter VIP home page Giter VIP logo

civis-r's People

Contributors

arthir avatar elsander avatar emmacooperpeterson avatar jacksonllee avatar jameslamb avatar jusblan331 avatar keithing avatar madhobbs avatar mheilman avatar nik-bladey avatar patr1ckm avatar wlattner avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

civis-r's Issues

ENH: `await_all` should be vectorized over two arguments

Since platform polling calls like scripts_get_custom_runs require two arguments (e.g. a job and script id) this function should be vectorized over two arguments not just one. Adding an optional .y parameter to await_all should do it.

BUG: 'length(x) = 3 > 1' in coercion to 'logical(1)' - R CMD check++

FYI, adding the following to ~/.Renviron:

## Supported since R (>= 3.4.0)
_R_CHECK_LENGTH_1_CONDITION_=true

## Supported since R (>= 3.6.0)
_R_CHECK_LENGTH_1_LOGIC2_=true

produced the following R CMD check --as-cran error:

* using log directory ‘/home/hb/repositories/future/revdep/checks/civis/old/civis.Rcheck’
* using R version 3.6.0 (2019-04-26)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* using options ‘--no-manual --no-build-vignettes’
* checking for file ‘civis/DESCRIPTION’ ... OK
* this is package ‘civis’ version ‘1.6.1’
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘civis’ can be installed ... NOTE
Found the following notes/warnings:
  Non-staged installation was used
See ‘/home/hb/repositories/future/revdep/checks/civis/old/civis.Rcheck/00install.out’ for details.
* checking installed package size ... OK
* checking package directory ... OK
* checking ‘build’ directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking R/sysdata.rda ... OK
* checking line endings in shell scripts ... OK
* checking installed files from ‘inst/doc’ ... OK
* checking files in ‘vignettes’ ... OK
* checking examples ... OK
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ... ERROR
  Running ‘testthat.R’
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  Actual message: "'length(x) = 3 > 1' in coercion to 'logical(1)'"
  
  ── 3. Failure: retry on GET/PUT and 429 (@test_client_base.R#128)  ─────────────
  `call_api("GET", path, path_params, query_params, body_params)` threw an error with unexpected mes
sage.
  Expected match: "429"
  Actual message: "'length(x) = 3 > 1' in coercion to 'logical(1)'"
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  OK: 952 SKIPPED: 0 WARNINGS: 21 FAILED: 3
  1. Failure: no retry on GET/PUT and code 403 (@test_client_base.R#117) 
  2. Failure: no retry on GET/PUT and code 403 (@test_client_base.R#117) 
  3. Failure: retry on GET/PUT and 429 (@test_client_base.R#128) 
  
  Error: testthat unit tests failed
  Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
* checking running R code from vignettes ... NONE
* checking re-building of vignette outputs ... SKIPPED
* DONE
Status: 1 ERROR, 1 NOTE

This could be upstreams - I haven't investigated.

read_civis / download_civis produce cryptic errors on NA

read_civis(as.numeric(Sys.getenv("asdf")))

Produces

Error in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: api.civisanalytics.comNA
Request failed [ERROR]. Retrying in 1 seconds...
Error in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: api.civisanalytics.comNA
Request failed [ERROR]. Retrying in 2 seconds...
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Could not resolve host: api.civisanalytics.comNA

Which is misleading. Like in the example, the error message occurs e.g. if you try to read an empty environment variable.

Basically the NA is of type numeric so it looks like there's a correct IO method to dispatch to, i.e. read_civis.numeric. From there, who knows how NA is dealt with in the API call construction!

ENH: await should automatically fetch script logs on failure

The default error handling in await looks for error or exception in the returned response. This is NULL for e.g. scripts_get_custom_runs.

To increase the odds we fetch the logs automatically on error we can

  1. first check for error or exception
  2. if NULL, attempt to look up the logging function via civis:::get_script_fun or fetch_logs.

Here is an example.

checks fail on DBItest dev

  ══ testthat results ═══════════════════════════════════════════════════════════
      OK: 982 SKIPPED: 4 FAILED: 5
      1. Error: DBItest: Driver: connect_format (@spec-driver-connect.R#21)
      2. Error: DBItest: Driver: connect_bigint_integer (@spec-driver-connect.R#52)
      3. Error: DBItest: Driver: connect_bigint_numeric (@spec-driver-connect.R#62)
      4. Error: DBItest: Driver: connect_bigint_character (@spec-driver-connect.R#72)
      5. Error: DBItest: Driver: connect_bigint_integer64 (@spec-driver-connect.R#85)

civis and roxygen2 6.1.0.

Passing on report from CRAN — with the latest version of civis, CRAN saw this failure in the revdep check:

Check: whether package can be installed
New result: WARNING
  Found the following significant warnings:
    Warning: roxygen2 requires Encoding: UTF-8

ENH: Add list of API calls used in each function

When users want to scope API keys it can be challenging to figure out exactly which verbs and endpoints are required for a specific API client function. For example write_civis() includes a POST to /files, a POST to /jobs/:id/runs, and a POST to /imports, but the documentation doesn't make it clear all that's involved so the user has to do a lot of tinkering to figure out exactly what's involved.

[BUG] civis::files_get() returns bad URL with link_expires_at param

I had a container script running successfully with this line:
deck_url <- civis::files_get(file_id, link_expires_at = (Sys.Date() + 2))

After the civis-r package was updated to version 2.0.0, my container script started failing (tagged to the latest docker image).

This was the error message in the logs:

2019-06-25 07:21:19 PM Error in call_api("GET", path, path_params, query_params, body_params) : 
2019-06-25 07:21:19 PM   Not Found (HTTP 404). Failed to GET https://api.civisanalytics.com/files/40998851?link_expires_at=2019-06-28. The requested endpoint could not be found.
2019-06-25 07:21:19 PM Calls: <Anonymous> -> call_api -> stop_for_status
2019-06-25 07:21:19 PM Execution halted
2019-06-25 07:21:19 PM Title: QC Ballot Toplines 2019-06-26

It seems like having an argument for link_expires_at caused it to get pasted to the output fileUrl link?

When I re-ran the same script and removed the link_expires_at argument with the line changed to this:
deck_url <- civis::files_get(file_id)
The container ran successfully with the civisanalytics/civisverse Docker image and latest image tag.

ENH: generalize and export fetch output

We have an un-exported utility function fetch_output. It seems useful to generalize and expose it to users:

fetch_output(job_id, run_id, name = NULL, using = readRDS)

Providing the name returns an object of that name. The file id is passed to read_civis with an appropriate argument to using.

Does it matter if scripts, containers or custom is used?

BUG: await_all fails

await_all is failing this example:

ids <- c(23820523, 23820523)
runs <- c(130390319, 130390319)
await_all(scripts_get_r_runs, .x = ids, .y = runs)

I think this has to do with passing args to call_once and safe_call_once. This will require a little refactoring of the existing call_once and maybe the tests as well.

@Zephyraith would you be up to taking a look?

`expires_at` default could be confusing

Due to the nature of the API, in some circumstances explicitly passing parameter=NULL is different than not passing the parameter at all. This isn't the convention of the R language and is particularly problematic with the write_civis_file, since the default is have files expire in 30 days.

@patr1ckm, I can think of two options to get around this:

  1. Make this distinction more clear in the docs
  2. Change the default in write_civis_file to a sentinel value like "DEFAULT"

Thoughts on this?

cc @harader for raising this issue offline.

remove devtools as a dependency

devtools depends on git2r, which has required system dependencies. This can cause local installs of civis-r to fail, and make installation more difficult.

We can replace devtools with just a call to roxygen for generating the documentation.

ENH: add `escaped` argument to `write_civis`

Adding this argument allows for more successful imports easily because " can be escaped either by "" (double quote) or \ (escaped).

write.csv by default uses "", so this feature flag will not lead to an internal inconsistency for csv files written by R destined for Redshift import.

For files generated by other systems, having this feature flag allows for successful imports of \ escaped files.

A cleaner solution would be to pass ... in write_civis to imports_post_syncs rather than write.csv, but that's a breaking change.

BUG: write_civis.numeric needs to specify firstRowIsHeader

write_civis.numeric imports a file id into redshift directly from redshift, however it doesn't expose the argument firstRowIsHeader as an advanced option for the

You can try it by uploading file 26320073.

For write_civis.numeric we actually don't use ... for anything, and could pass it to imports post syncs.

ENH: `write_civis` supports large file sizes.

We have a report where uploading large csv directly to redshift via write_civis fails when the file is large, at least 12 gb. The error is

Error in curl::curl_fetch_memory(url, handle = handle) :
 SSLRead() return error -36

I think in principle we could upload the file using the multipart upload endpoints instead of directly.

ENH: Error messages in failed scripts should be the job log

job <- scripts_post_r(name = "test", source = "stop('err')")
run <- scripts_post_r_runs(job$id)
await(scripts_get_r_runs, job$id, run$id)

Produces:

 Error: scripts_get_r_runs(id = 10602571, NA = 77083111): 

The object is:

scripts_get_r_runs(job$id, run$id)
<civis_api>
List of 7
 $ id               : int 77083111
 $ rId              : int 10602571
 $ state            : chr "failed"
 $ isCancelRequested: logi FALSE
 $ startedAt        : chr "2018-03-30T15:33:59.000Z"
 $ finishedAt       : chr "2018-03-30T15:34:37.000Z"
 $ error            : NULL
 - attr(*, "class")= chr [1:2] "civis_api" "list"

ENH: read_civis.civis_script

Read civis interface to scripts

I'm proposing that read_civis.civis_script have two additional arguments: regex and using. This design mimics the current behavior of read_civis which is to return values whenever it is unambiguous to do so.

Default

The default is using = NULL, which returns a named list of file ids. This doesn't return values. It does return a helpful result if you don't know what the object types are that are posted in the script.

out <- read_civis(civis_script(234)) 
val <- read_civis(out$name, using = reader))
> out
$name
[1] 1

With a regex, read_civis.civis_script returns a list of ids maching the regex.

out <- read_civis(civis_script(234), regex = ‘name’)
# always returns a list
val <- read_civis(out[[1]], using = reader)
vals <- lapply(out, read_civis, using = reader)

using = reader

With using = reader, read_civis returns the values directly.

Here reader is used to read all outputs:

val <- read_civis(civis_script(234), using = reader)

A more relevant case is with regex, which you can use to filter outputs that match a file extension, and then parse them with the right using:

val <- read_civis(civis_script(234), regex = ‘.csv’, using = read.csv))
val <- read_civis(civis_script(234), regex = ‘.rds’)
val <- read_civis(civis_script(234), regex = ‘.json’, using = jsonlite::fromJSON)

JsonValues

A really helpful feature of the API is that json values are simply returned as-is in out$value. So any non-null using can simply return JsonValues directly without attempting to parse/download the output.

Thoughts on this @keithing ? Is it too magical?

ENH: deploy_service

Write deploy_service that

  • deploys the a service if it isn't deployed and blocks until completion
  • redeploys the service if it is deployed, and blocks until completion
  • Only redeploys if a service isn't currently deploying

Only question is whether trying to redeploy an app that is currently deploying should

  1. throw an error and fail
  2. generate a warning and skip the redeploy request
  3. block until the current deployment has been completed and then continue with the requested redeploy.

I'm leaning toward option 3 so that a user doesn't have to write error handling / signalling / retrying around this state. The risk having a bunch of requests pile up to redeploy the app. I don't think it's a big risk though, it might just be confusing.

@keithing, @skomanduri thoughts?

ENH helper function to share models

A CivisML model has File and Project run outputs, each of which need to be shared to effectively share the "model". This helper will transparently handle sharing the run outputs. Include JSONValue in case a future version of CivisML uses that. The functions are patterned after the autogenerated API sharing endpoints.

See: civisanalytics/civis-python#315 in civis-python.

ENH: write_civis_file.data.frame

What would you think about having this generic? It would write the data frame as a csv to S3 by default. Extra args get passed to write.csv, and row.names = FALSE will be the default.

This would also work for tbl_df, tbl produced from e.g. read_csv since the objects inherit from data.frame:

xx <- read_csv("~/test/data.csv")
str(xx)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	20 obs. of  3 variables:
 $ X1      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ genre_id: int  13 9 4 18 10 16 19 3 8 7 ...
 $ genre   : chr  "Musical" "Fantasy" "Children" "War" ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 3
  .. ..$ X1      : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ genre_id: list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ genre   : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"

And

str(dplyr::as.tbl(data.frame(a=1)), 1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1 obs. of  1 variable:
 $ a: num 1

Since we know that data.frame is a tabular thing, it makes sense to store it on S3 in a platform independent way by default rather than using the R specific serialization. Some space would be traded off.

An alternative is feather, which would preserve factors better. However, the other side still has to know that feather is being used, and I'm not sure this really solves the problem.

@keithing for comments.

ENH: `fetch_logs` should poll until all logs are fetched

It can take some time for logs to populate from e.g. child processes. fetch_logs could poll the logging endpoint by default to make sure all logs are fetched. Since this only happens on errors, the time spent polling is probably not too troublesome.

DEPRECATION: Use result() method for CivisFuture instead of value()

Hi, author of future here. Since future 1.8.0 (2018-04-08) we have support for returning FutureResult objects from futures. A FutureResult object can hold:

  • 'value' - the value (of a non-errored evaluation)
  • 'stdout' - any captured stdout output
  • 'conditions' - any captured conditions, including messages, warnings, and an error (the latter if evaluation resulted in an error)

The default is still that futures return the old-style "value", i.e. the value of the evaluation unless there was a runtime error, then an error object is returned.

All the future backends that I maintain have been migrated to make use of FutureResult since a while. I'm now planning to deprecate the old-style "value", which CivisFuture is still using.

The outline to make use of FutureResult is:

  • Create your Future objects using future::Future(..., version = "1.8") (default is version = "1.7"). This will automatically result in a FutureResult object being returned.

  • Drop value.CivisFuture() - the default future:::value.Future() will take care of value() calls.

  • Instead, implement a result.CivisFuture() that receives the FutureResult object. You can most likely tweak your existing value.CivisFuture() to become this new method. Have a look at future.batchtools:::result.BatchtoolsFuture(), and future.callr:::result.CallrFuture() for examples. Hopefully, it's not too complicated.

In addition to the deprecation itself, there are several advantages of moving to FutureResult:s, e.g. standard output, messages and warnings produced when evaluating the future expression will be automatically relayed in the main R session in a uniform and consistent manner across future backends (without the future backend maintainer having to do anything). In the upcoming releases, there will also be support for basic benchmarking of futures and more.

Let me know if you have any questions.

ENH: fetch_logs should work for containers

This probably never really worked, but I'll report anyway:

job <- scripts_post_r(name = "test", source = "2+2")
run <- scripts_post_r_runs(job$id)
x <- scripts_get_r_runs(job$id, run$id)
fetch_logs(x)

Produces

 Error in do.call(f, c(args, limit = limit)) : 
  second argument must be a list 

dbplyr connection for Civis Redshift tables

Hey, I like being able to read in data from schemas within Civis. However, for large schemas and analyses I think it would be very helpful to have a connection that could be accessed from RPostgresSQL, as is shown in the Amazon documentation.

This would be very helpful when querying large databases/schema and would prevent using large amounts of ram / disk space.

BUG: Calling other functions from the local global environment fails in the CivisFuture platform environment

Trying to use plan("civis_platform") to parallelize my code and running into an issue where the function i'm wrapping in future() is not recognizing another function. Here is a simplified example that won't work

plan("civis_platform")
g <- function(df, x){dplyr::sample_n(df, x)}
f <- function(n_to_sample){g(data.frame(a = 1:20), n_to_sample)}

futs <- purrr::map(1:10, function(x){
  future(
    expr = f(x) #, 
    # globals = list(f = f, x = x, g = g) #Doesn't work if you use globals either :-/
    )
  })
purrr::map_dfr(futs, value)

ENH: minimize dependencies

We've accrued some dependencies over the course of development that aren't critical. Currently, we have 93 total recursive dependencies. That's too many for an infrastructure package, IMO. Civis employees can see a full dependency report here.

Minimizing the dependencies of the package will make it easier to install for our users by requiring a much smaller subset of CRAN. It will also enable creating smaller/leaner production R images for applications beyond the images we've used so far.

The current dependency list is:

Imports:
    dplyr (>= 0.7.0),
    dbplyr (>= 1.0.0),
    devtools,
    feather,
    future (>= 1.8.0),
    ggplot2,
    httr,
    jsonlite,
    lubridate,
    methods,
    memoise,
    purrr,
    roxygen2,
    stats,
    stringr,
    testthat,
    utils
Suggests:
    knitr,
    rmarkdown,
    mockery,
    R.utils,
    rstudioapi,
    yaml

Proposed list:

Imports:
    future (>= 1.8.0),
    httr,
    jsonlite,
    methods,
    memoise,
    stats,
    utils
Suggests:
   ggplot2, (civisML plots) 
   feather, (civisML)
   knitr,
   mockery,
   rmarkdown,
   roxygen2,
   R.utils,
   rstudioapi,
   testthat,
   yaml

Removals: devtools, dplyr,DBI, dbplyr, lubridate, purrr, stringr. I'm not sure about stringr, but it would be nice to remove.

There are some downstream risk in modifying common environments. In particular, removing dependencies from civis modifies the package environment of datascience-r. However, stable environment management should be handled in that repo, not this one. For example, datascience-r should handle the issue of moving feather and future to Suggests rather than Imports.

After this change, the only system dependency will be on xml2, which httr uses to parse HTML.

The impact on existing users should be low. Whenever they update R to a new minor version, they may have to manually install e.g. feather.

  • Move optional features to suggests
  • Remove DBI interface
  • Remove lubridate, purrr, stringr
  • Remove devtools, move roxygen2 to suggests

BUG: empty error messages on 503s

image (3)

Should not signal the error $ operator is invalid for automatic vectors (which is a confusing non-sequitur).

It's hard to catch these in the wild, so I'd probably try to reverse engineer the structure of return object that generates this error as a test case.

ggplot 2.3.0 compatibility

The tests for the CivisML plots fail with the development release of ggplot 2.3.0.

ERROR
    Running the tests in ‘tests/testthat.R’ failed.
    Last 13 lines of output:
      7: all.equal.numeric(c(m$metrics$metrics$y_yhat_plot$values),
p$data[[var_name]])
      8: attr.all.equal(target, current, tolerance = tolerance, scale
= scale, ...)
      9: mode(current)
      10: p$data[[var_name]]
      11: `[[.data.frame`(p$data, var_name)
      12: (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]]
else .subset2(x, i, exact = exact))(x,
             ..., exact = exact)

      ══ testthat results
═══════════════════════════════════════════════════════════
      OK: 919 SKIPPED: 0 FAILED: 2
      1. Error: decile plot for classification is produced
(@test_civis_ml_plot.R#22)
      2. Error: y_yhat plot for reg is produced (@test_civis_ml_plot.R#53)

      Error: testthat unit tests failed
      Execution halted

Civis-R Logo!

For the first issue on civis-r, I'd like to get submissions for a logo to represent civis-r. The logo will be used for documentation and possibly for creating stickers. Mainly, having a logo will just be fun 🎈 . Please put your submissions in Github comments below. The comment with the most ❤️ reactions will be the winner! Voting ends September 1st.

(to add a reaction, click on the smiley face in the upper right corner of the comment, then choose the heart for your vote! You can vote on as many submissions as you like.)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.