civisanalytics / civis-r
Civis API Client for R: https://civisanalytics.com/products/civis-platform/
License: Other
Since platform polling calls like `scripts_get_custom_runs` require two arguments (e.g. a script id and a run id), this function should be vectorized over two arguments, not just one. Adding an optional `.y` parameter to `await_all` should do it.
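A minimal sketch of what the `.y` extension could look like, assuming `await_all` currently maps `await` over a single vector (the dispatch below is illustrative, not the actual internals):

```r
# Sketch only: vectorize await_all over an optional second argument .y.
# `f` is a platform polling function such as scripts_get_custom_runs.
await_all <- function(f, .x, .y = NULL, ...) {
  if (is.null(.y)) {
    lapply(.x, function(x) await(f, x, ...))
  } else {
    stopifnot(length(.x) == length(.y))
    Map(function(x, y) await(f, x, y, ...), .x, .y)
  }
}
```

Usage would then be `await_all(scripts_get_custom_runs, .x = ids, .y = run_ids)`.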
FYI, adding the following to `~/.Renviron`:
## Supported since R (>= 3.4.0)
_R_CHECK_LENGTH_1_CONDITION_=true
## Supported since R (>= 3.6.0)
_R_CHECK_LENGTH_1_LOGIC2_=true
produced the following `R CMD check --as-cran` error:
* using log directory ‘/home/hb/repositories/future/revdep/checks/civis/old/civis.Rcheck’
* using R version 3.6.0 (2019-04-26)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* using options ‘--no-manual --no-build-vignettes’
* checking for file ‘civis/DESCRIPTION’ ... OK
* this is package ‘civis’ version ‘1.6.1’
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘civis’ can be installed ... NOTE
Found the following notes/warnings:
Non-staged installation was used
See ‘/home/hb/repositories/future/revdep/checks/civis/old/civis.Rcheck/00install.out’ for details.
* checking installed package size ... OK
* checking package directory ... OK
* checking ‘build’ directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking R/sysdata.rda ... OK
* checking line endings in shell scripts ... OK
* checking installed files from ‘inst/doc’ ... OK
* checking files in ‘vignettes’ ... OK
* checking examples ... OK
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ... ERROR
Running ‘testthat.R’
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
Actual message: "'length(x) = 3 > 1' in coercion to 'logical(1)'"
── 3. Failure: retry on GET/PUT and 429 (@test_client_base.R#128) ─────────────
`call_api("GET", path, path_params, query_params, body_params)` threw an error with unexpected mes
sage.
Expected match: "429"
Actual message: "'length(x) = 3 > 1' in coercion to 'logical(1)'"
══ testthat results ═══════════════════════════════════════════════════════════
OK: 952 SKIPPED: 0 WARNINGS: 21 FAILED: 3
1. Failure: no retry on GET/PUT and code 403 (@test_client_base.R#117)
2. Failure: no retry on GET/PUT and code 403 (@test_client_base.R#117)
3. Failure: retry on GET/PUT and 429 (@test_client_base.R#128)
Error: testthat unit tests failed
Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
* checking running R code from vignettes ... NONE
* checking re-building of vignette outputs ... SKIPPED
* DONE
Status: 1 ERROR, 1 NOTE
This could be upstream - I haven't investigated.
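For context, the failure message in the test output is the error R raises when an `if ()` condition has length greater than one while `_R_CHECK_LENGTH_1_CONDITION_` is set (exact wording varies by R version; length-1 conditions became an unconditional error in R 4.2.0). A minimal reproduction:

```r
# With _R_CHECK_LENGTH_1_CONDITION_=true (or R >= 4.2.0), a condition
# of length > 1 is an error rather than silently using element one.
x <- c(TRUE, FALSE, TRUE)
res <- tryCatch(if (x) "yes", error = function(e) conditionMessage(e))
# The fix is to reduce the vector to a single logical explicitly:
if (any(x)) message("at least one TRUE")
if (all(x)) message("all TRUE")  # not reached here
```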
read_civis(as.numeric(Sys.getenv("asdf")))
Produces:
Error in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: api.civisanalytics.comNA
Request failed [ERROR]. Retrying in 1 seconds...
Error in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: api.civisanalytics.comNA
Request failed [ERROR]. Retrying in 2 seconds...
Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: api.civisanalytics.comNA
This is misleading. As in the example, the error occurs e.g. if you try to read an empty environment variable. Basically, the `NA` is of type numeric, so it looks like there's a correct IO method to dispatch to, i.e. `read_civis.numeric`. From there, who knows how `NA` is dealt with in the API call construction!
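One possible fix is an early argument check before any API call is constructed; this is only a sketch, and the helper name is hypothetical:

```r
# Sketch: validate the id before dispatching to the API.
check_id <- function(id) {
  if (length(id) != 1 || is.na(id) || !is.numeric(id)) {
    stop("`id` must be a single, non-missing numeric file or table id, got: ",
         deparse(id), call. = FALSE)
  }
  invisible(id)
}
```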
We often use the lower-case versions of `get`, `post`, etc. These should be upper-case (`GET`, `POST`) for correctness.
The default error handling in `await` looks for `error` or `exception` in the returned response. This is `NULL` for e.g. `scripts_get_custom_runs`. To increase the odds we fetch the logs automatically on error, we can check for more than just `error` or `exception`, e.g. via `civis:::get_script_fun` or `fetch_logs`. Here is an example.
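The run output shown later in this thread has `$state: chr "failed"` with `$error: NULL`, so a more permissive failure check might look like this (field names taken from that output; the helper name is hypothetical):

```r
# Sketch: treat a failed/cancelled run state as an error even when
# the response carries no $error or $exception field.
is_failed_response <- function(resp) {
  !is.null(resp$error) ||
    !is.null(resp$exception) ||
    isTRUE(resp$state %in% c("failed", "cancelled"))
}
```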
══ testthat results ═══════════════════════════════════════════════════════════
OK: 982 SKIPPED: 4 FAILED: 5
1. Error: DBItest: Driver: connect_format (@spec-driver-connect.R#21)
2. Error: DBItest: Driver: connect_bigint_integer (@spec-driver-connect.R#52)
3. Error: DBItest: Driver: connect_bigint_numeric (@spec-driver-connect.R#62)
4. Error: DBItest: Driver: connect_bigint_character (@spec-driver-connect.R#72)
5. Error: DBItest: Driver: connect_bigint_integer64 (@spec-driver-connect.R#85)
In R, the type of the object that API functions expect as a parameter is usually a list, not an object. We can replace `object` with `list` when we regenerate the API docs.
Passing on report from CRAN — with the latest version of civis, CRAN saw this failure in the revdep check:
Check: whether package can be installed
New result: WARNING
Found the following significant warnings:
Warning: roxygen2 requires Encoding: UTF-8
It's failing the last CRAN checks here:
https://cran.r-project.org/web/checks/check_results_civis.html
It looks like the doc website isn't indexed by Google. IT pointed me to this:
https://help.github.com/en/articles/search-engine-optimization-for-github-pages
which might enable better docs discoverability.
When users want to scope API keys, it can be challenging to figure out exactly which verbs and endpoints a specific API client function requires. For example, write_civis() includes a POST to /files, a POST to /jobs/:id/runs, and a POST to /imports, but the documentation doesn't make this clear, so the user has to do a lot of tinkering to discover exactly what's involved.
I had a container script running successfully with this line:
deck_url <- civis::files_get(file_id, link_expires_at = (Sys.Date() + 2))
After the `civis-r` package was updated to version 2.0.0, my container script started failing (tagged to the `latest` Docker image).
This was the error message in the logs:
2019-06-25 07:21:19 PM Error in call_api("GET", path, path_params, query_params, body_params) :
2019-06-25 07:21:19 PM Not Found (HTTP 404). Failed to GET https://api.civisanalytics.com/files/40998851?link_expires_at=2019-06-28. The requested endpoint could not be found.
2019-06-25 07:21:19 PM Calls: <Anonymous> -> call_api -> stop_for_status
2019-06-25 07:21:19 PM Execution halted
2019-06-25 07:21:19 PM Title: QC Ballot Toplines 2019-06-26
It seems like having an argument for `link_expires_at` caused it to get appended to the request URL as a query parameter? When I re-ran the same script with the `link_expires_at` argument removed, i.e. the line changed to this:
deck_url <- civis::files_get(file_id)
the container ran successfully with the `civisanalytics/civisverse` Docker image and `latest` image tag.
We have an un-exported utility function `fetch_output`. It seems useful to generalize and expose it to users:
fetch_output(job_id, run_id, name = NULL, using = readRDS)
Providing the `name` returns an object of that name. The file id is passed to `read_civis` with an appropriate argument to `using`. Does it matter if `scripts`, `containers` or `custom` is used?
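A rough sketch of what a generalized `fetch_output` could look like; the run-outputs endpoint and the `objectId` field below are assumptions for illustration, not the actual internals:

```r
# Sketch: list a run's outputs, optionally filter by name, and read
# each file id back with read_civis(using = ...).
fetch_output <- function(job_id, run_id, name = NULL, using = readRDS) {
  outputs <- scripts_list_containers_runs_outputs(job_id, run_id)
  if (!is.null(name)) {
    outputs <- Filter(function(o) identical(o$name, name), outputs)
  }
  lapply(outputs, function(o) read_civis(o$objectId, using = using))
}
```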
`await_all` is failing this example:
ids <- c(23820523, 23820523)
runs <- c(130390319, 130390319)
await_all(scripts_get_r_runs, .x = ids, .y = runs)
I think this has to do with passing args to `call_once` and `safe_call_once`. This will require a little refactoring of the existing `call_once` and maybe the tests as well.
@Zephyraith would you be up to taking a look?
Due to the nature of the API, in some circumstances explicitly passing `parameter=NULL` is different than not passing the parameter at all. This isn't the convention of the R language, and it is particularly problematic with `write_civis_file`, since the default is to have files expire in 30 days. @patr1ckm, I can think of two options to get around this; one is to set the default in `write_civis_file` to a sentinel value like `"DEFAULT"`. Thoughts on this? cc @harader for raising this issue offline.
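The sentinel approach could be sketched like this (the helper name and the `expiresAt` field are illustrative assumptions, not the client's actual internals):

```r
# Sketch: distinguish "not passed" from an explicit NULL via a sentinel.
build_expiry_param <- function(expires_at = "DEFAULT") {
  if (identical(expires_at, "DEFAULT")) {
    # Omit the parameter entirely; the API applies its 30-day default.
    list()
  } else {
    # Pass the explicit value through, including NULL.
    list(expiresAt = expires_at)
  }
}
```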
`devtools` depends on `git2r`, which has required system dependencies. This can cause local installs of `civis-r` to fail and make installation more difficult. We can replace `devtools` with just a call to `roxygen` for generating the documentation.
This gets some use.
Adding this argument allows for more successful imports because `"` can be escaped either by `""` (a doubled quote) or `\` (a backslash). `write.csv` by default uses `""`, so this feature flag will not lead to an internal inconsistency for csv files written by R destined for Redshift import. For files generated by other systems, having this feature flag allows for successful imports of `\`-escaped files. A cleaner solution would be to pass `...` in `write_civis` to `imports_post_syncs` rather than `write.csv`, but that's a breaking change.
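The two escaping conventions can be seen directly in base R: `write.csv` always doubles embedded quotes, while `write.table` can emit backslash escapes via its `qmethod` argument:

```r
# write.csv quotes embedded quotes by doubling them:
df <- data.frame(x = 'say "hi"')
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
doubled <- readLines(tmp)[2]   # "say ""hi"""

# write.table can backslash-escape them instead (qmethod = "escape"):
write.table(df, tmp, row.names = FALSE, sep = ",", qmethod = "escape")
escaped <- readLines(tmp)[2]   # "say \"hi\""
```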
plot.civis_ml_classifier() currently makes it easy to create a decile plot for single-output CivisML classification objects. I'd like to expand this function so it can also support multi-output.
`write_civis.numeric` imports a file id into Redshift directly; however, it doesn't expose the argument `firstRowIsHeader` as an advanced option for the import. You can try it by uploading file `26320073`. For `write_civis.numeric` we actually don't use `...` for anything, and could pass it to `imports_post_syncs`.
Since the models endpoints are deprecated, it would be helpful to provide helper functions in the API clients to assist with these tasks.
See: civisanalytics/civis-python#314 in civis-python.
The future package will release 1.8.0 with changes to error handling and return format. We should make sure the future backend still works with the new changes; see here.
Full notes:
https://github.com/HenrikBengtsson/future/blob/develop/NEWS
We have a report where uploading a large csv directly to Redshift via `write_civis` fails when the file is large, at least 12 GB. The error is:
Error in curl::curl_fetch_memory(url, handle = handle) :
SSLRead() return error -36
I think in principle we could upload the file using the multipart upload endpoints instead of directly.
Just a little blurb and a pointer to the longer-form blog posts should work well.
job <- scripts_post_r(name = "test", source = "stop('err')")
run <- scripts_post_r_runs(job$id)
await(scripts_get_r_runs, job$id, run$id)
Produces:
Error: scripts_get_r_runs(id = 10602571, NA = 77083111):
The object is:
scripts_get_r_runs(job$id, run$id)
<civis_api>
List of 7
$ id : int 77083111
$ rId : int 10602571
$ state : chr "failed"
$ isCancelRequested: logi FALSE
$ startedAt : chr "2018-03-30T15:33:59.000Z"
$ finishedAt : chr "2018-03-30T15:34:37.000Z"
$ error : NULL
- attr(*, "class")= chr [1:2] "civis_api" "list"
I'm proposing that `read_civis.civis_script` have two additional arguments: `regex` and `using`. This design mimics the current behavior of `read_civis`, which is to return values whenever it is unambiguous to do so. The default is `using = NULL`, which returns a named list of file ids. This doesn't return values, but it does return a helpful result if you don't know what object types are posted in the script.
out <- read_civis(civis_script(234))
val <- read_civis(out$name, using = reader)
> out
$name
[1] 1
With a `regex`, `read_civis.civis_script` returns a list of ids matching the regex.
out <- read_civis(civis_script(234), regex = "name")
# always returns a list
val <- read_civis(out[[1]], using = reader)
vals <- lapply(out, read_civis, using = reader)
With `using = reader`, `read_civis` returns the values directly. Here `reader` is used to read all outputs:
val <- read_civis(civis_script(234), using = reader)
A more relevant case is with `regex`, which you can use to filter outputs that match a file extension and then parse them with the right `using`:
val <- read_civis(civis_script(234), regex = ".csv", using = read.csv)
val <- read_civis(civis_script(234), regex = ".rds")
val <- read_civis(civis_script(234), regex = ".json", using = jsonlite::fromJSON)
A really helpful feature of the API is that JSON values are simply returned as-is in `out$value`. So any non-null `using` can simply return JSON values directly without attempting to parse/download the output.
Thoughts on this, @keithing? Is it too magical?
Should have transfer_table, get_table_id, and refresh_table in there.
We can now use hyperband with stacking; we should not throw an error in `create_and_run_model`. @elsander, can we use hyperband with `sparse_logistic` or `sparse_linear_regressor`?
Write `deploy_service`. The only question is what trying to redeploy an app that is currently deploying should do. I'm leaning toward option 3 so that a user doesn't have to write error handling / signalling / retrying around this state. The risk is having a bunch of requests pile up to redeploy the app. I don't think it's a big risk though; it might just be confusing.
@keithing, @skomanduri thoughts?
A CivisML model has File and Project run outputs, each of which need to be shared to effectively share the "model". This helper will transparently handle sharing the run outputs. Include JSONValue in case a future version of CivisML uses that. The functions are patterned after the autogenerated API sharing endpoints.
See: civisanalytics/civis-python#315 in civis-python.
What would you think about making this a generic? It would write the data frame as a csv to S3 by default. Extra args get passed to `write.csv`, and `row.names = FALSE` will be the default. This would also work for `tbl_df` and `tbl` objects produced from e.g. `read_csv`, since they inherit from `data.frame`:
xx <- read_csv("~/test/data.csv")
str(xx)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 20 obs. of 3 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ genre_id: int 13 9 4 18 10 16 19 3 8 7 ...
$ genre : chr "Musical" "Fantasy" "Children" "War" ...
- attr(*, "spec")=List of 2
..$ cols :List of 3
.. ..$ X1 : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ genre_id: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ genre : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
And
str(dplyr::as.tbl(data.frame(a=1)), 1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 1 variable:
$ a: num 1
Since we know that a `data.frame` is a tabular thing, it makes sense to store it on S3 in a platform-independent way by default rather than using the R-specific serialization. Some space would be traded off. An alternative is `feather`, which would preserve factors better. However, the other side still has to know that `feather` is being used, and I'm not sure this really solves the problem. @keithing for comments.
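The proposed generic could be sketched like this; the argument names and the existence of a path-based `write_civis.character` method are assumptions for illustration:

```r
# Sketch: S3 generic with a data.frame method that round-trips via csv.
write_civis <- function(x, ...) UseMethod("write_civis")

write_civis.data.frame <- function(x, tablename, database = NULL, ...) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  # row.names = FALSE by default; extra args flow through to write.csv
  utils::write.csv(x, tmp, row.names = FALSE, ...)
  # hand off to the (assumed) character method, which uploads the csv
  write_civis(tmp, tablename = tablename, database = database)
}
```

A `tbl_df` needs no separate method: S3 dispatch falls through to the `data.frame` method because tibbles inherit from `data.frame`.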
get_metric() currently makes it easy to extract model metric metadata from single-output CivisML objects. I'd like to expand this function so it can also support multi-output (particularly for deciles).
It can take some time for logs to populate from e.g. child processes. `fetch_logs` could poll the logging endpoint by default to make sure all logs are fetched. Since this only happens on errors, the time spent polling is probably not too troublesome.
Hi, author of future here. Since future 1.8.0 (2018-04-08) we have support for returning FutureResult objects from futures. A FutureResult object can hold, in addition to the value, captured standard output, messages, and warnings.
The default is still that futures return the old-style "value", i.e. the value of the evaluation unless there was a runtime error, then an error object is returned.
All the future backends that I maintain were migrated to make use of FutureResult a while ago. I'm now planning to deprecate the old-style "value", which CivisFuture is still using.
The outline to make use of FutureResult is:
1. Create your Future objects using `future::Future(..., version = "1.8")` (the default is `version = "1.7"`). This will automatically result in a FutureResult object being returned.
2. Drop `value.CivisFuture()` - the default `future:::value.Future()` will take care of `value()` calls.
3. Instead, implement a `result.CivisFuture()` that receives the FutureResult object. You can most likely tweak your existing `value.CivisFuture()` to become this new method. Have a look at `future.batchtools:::result.BatchtoolsFuture()` and `future.callr:::result.CallrFuture()` for examples. Hopefully, it's not too complicated.
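A very rough sketch of those steps for CivisFuture (the polling internals and the `result` field are placeholders, not the real CivisFuture code; see the referenced backends for working implementations):

```r
# Sketch: opt in to FutureResult and replace value() with result().
CivisFuture <- function(expr, envir = parent.frame(), ...) {
  future::Future(expr, envir = envir, version = "1.8", ...)
}

result.CivisFuture <- function(future, ...) {
  # ... poll the platform run to completion here (placeholder) ...
  # then return the FutureResult captured when the expression ran.
  future$result
}
```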
In addition to the deprecation itself, there are several advantages of moving to FutureResult objects: e.g. standard output, messages, and warnings produced when evaluating the future expression will be automatically relayed in the main R session in a uniform and consistent manner across future backends (without the future backend maintainer having to do anything). In the upcoming releases, there will also be support for basic benchmarking of futures and more.
Let me know if you have any questions.
The models endpoints have been deprecated. It is possibly confusing to users who try to use them after discovering the endpoints in the documentation.
This probably never really worked, but I'll report it anyway:
job <- scripts_post_r(name = "test", source = "2+2")
run <- scripts_post_r_runs(job$id)
x <- scripts_get_r_runs(job$id, run$id)
fetch_logs(x)
Produces
Error in do.call(f, c(args, limit = limit)) :
second argument must be a list
Hey, I like being able to read in data from schemas within Civis. However, for large schemas and analyses I think it would be very helpful to have a connection that could be accessed from `RPostgreSQL`, as is shown in the Amazon documentation.
This would be very helpful when querying large databases/schema and would prevent using large amounts of ram / disk space.
`scripts_get_sql_runs` returns the file id in `$output`; we can utilize this in `query_civis` to automatically return the file id from an unloaded query. This would likely spare some IO in a few places.
The first link on this page, https://civisanalytics.github.io/civis-r/ ("The full documentation is hosted here."), just links back to the same page.
Trying to use `plan("civis_platform")` to parallelize my code and running into an issue where the function I'm wrapping in `future()` is not recognizing another function. Here is a simplified example that won't work:
plan("civis_platform")
g <- function(df, x){dplyr::sample_n(df, x)}
f <- function(n_to_sample){g(data.frame(a = 1:20), n_to_sample)}
futs <- purrr::map(1:10, function(x){
future(
expr = f(x) #,
# globals = list(f = f, x = x, g = g) #Doesn't work if you use globals either :-/
)
})
purrr::map_dfr(futs, value)
`publish_addin` needs some documentation!
hist.ml() currently makes it easy to plot a histogram of the predicted scores or probabilities for single-output CivisML objects. I'd like to expand this function so it can also support multi-output.
We've accrued some dependencies over the course of development that aren't critical. Currently, we have 93 total recursive dependencies. That's too many for an infrastructure package, IMO. Civis employees can see a full dependency report here.
Minimizing the dependencies of the package will make it easier to install for our users by requiring a much smaller subset of CRAN. It will also enable creating smaller/leaner production R images for applications beyond the images we've used so far.
The current dependency list is:
Imports:
dplyr (>= 0.7.0),
dbplyr (>= 1.0.0),
devtools,
feather,
future (>= 1.8.0),
ggplot2,
httr,
jsonlite,
lubridate,
methods,
memoise,
purrr,
roxygen2,
stats,
stringr,
testthat,
utils
Suggests:
knitr,
rmarkdown,
mockery,
R.utils,
rstudioapi,
yaml
Proposed list:
Imports:
future (>= 1.8.0),
httr,
jsonlite,
methods,
memoise,
stats,
utils
Suggests:
ggplot2, (civisML plots)
feather, (civisML)
knitr,
mockery,
rmarkdown,
roxygen2,
R.utils,
rstudioapi,
testthat,
yaml
Removals: devtools, dplyr, DBI, dbplyr, lubridate, purrr, stringr. I'm not sure about `stringr`, but it would be nice to remove.
There are some downstream risks in modifying common environments. In particular, removing dependencies from `civis` modifies the package environment of `datascience-r`. However, stable environment management should be handled in that repo, not this one. For example, `datascience-r` should handle the issue of moving `feather` and `future` to `Suggests` rather than `Imports`.
After this change, the only system dependency will be on `xml2`, which `httr` uses to parse HTML. The impact on existing users should be low. Whenever they update R to a new minor version, they may have to manually install e.g. `feather`.
In the documentation, add instructions about creating the `.Renviron` file and restarting RStudio if necessary.
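For example, the instructions could show a minimal `~/.Renviron` (the `CIVIS_API_KEY` variable is the one the client reads; the value here is a placeholder):

```
# ~/.Renviron
CIVIS_API_KEY=your_api_key_here
```

R only reads `.Renviron` at session startup, hence the need to restart RStudio after editing it.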
The tests for the CivisML plots fail with the development release of ggplot2 2.3.0.
ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
7: all.equal.numeric(c(m$metrics$metrics$y_yhat_plot$values),
p$data[[var_name]])
8: attr.all.equal(target, current, tolerance = tolerance, scale
= scale, ...)
9: mode(current)
10: p$data[[var_name]]
11: `[[.data.frame`(p$data, var_name)
12: (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]]
else .subset2(x, i, exact = exact))(x,
..., exact = exact)
══ testthat results
═══════════════════════════════════════════════════════════
OK: 919 SKIPPED: 0 FAILED: 2
1. Error: decile plot for classification is produced
(@test_civis_ml_plot.R#22)
2. Error: y_yhat plot for reg is produced (@test_civis_ml_plot.R#53)
Error: testthat unit tests failed
Execution halted
For the first issue on `civis-r`, I'd like to get submissions for a logo to represent `civis-r`. The logo will be used for documentation and possibly for creating stickers. Mainly, having a logo will just be fun 🎈. Please put your submissions in GitHub comments below. The comment with the most ❤️ reactions will be the winner! Voting ends September 1st.
(to add a reaction, click on the smiley face in the upper right corner of the comment, then choose the heart for your vote! You can vote on as many submissions as you like.)