greenplum-db / pivotalr Goto Github PK
View Code? Open in Web Editor NEWAn convenient R tool for manipulating tables in PostgreSQL type databases and a wrapper of Apache MADlib.
Home Page: https://pivotalsoftware.github.io/gp-r/
An convenient R tool for manipulating tables in PostgreSQL type databases and a wrapper of Apache MADlib.
Home Page: https://pivotalsoftware.github.io/gp-r/
To make test-driven-development easier, we need a continuous.test function. When running, it automatically detects any changes in the source code, testing files and user doc, which contains the examples, and re-run tests or examples when necessary.
The code is in "continuous_test" branch, but not finished.
May I use predict function with other output type?
like predict(object, newdata, type="response")
How do I do in fucntion generic.cv?
Hi again. Wondering if something like:
WHERE date_a-date_b>365
using
Data[Data$date_a-Data$date_b>365,]
or
SELECT MAX(date) FROM X
using
Data$maxdate <- max(Data$date)
will be supported in the future.
Hi,
Quick test case :
`# connected already to a postgres 12 db
x <- as.db.data.frame(abalone)
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: invalid value for parameter "client_min_messages": "panic"
HINT: Available values: debug5, debug4, debug3, debug2, debug1, log, notice, warning, error.
)`
Now, looking in the code I can see this in utilities.R:
.suppress.warnings <- function (conn.id, level = "panic") { msg.level <- .set.msg.level(level, conn.id = conn.id) warn.r <- getOption("warn") options(warn = -1) list(msg.level = msg.level, warn.r = warn.r, conn.id = conn.id) }
(Sorry, can't get the hang of code formatting on this thing). Anwyay, my postgresql.conf shows this:
#client_min_messages = notice # values in order of decreasing detail: # debug5 # debug4 # debug3 # debug2 # debug1 # log # notice # warning # error
So no more 'panic'.
This is just a mild annoyance, the commands get executed, it's just that the screen fills up with warnings.
Cheers,
Iulian
I'm getting behavior I don't understand. When using ggplot2 package w/ mapproj dependencies (ggmap or coordinate_map functions) I'm getting odd errors.
This only occurs after PivotalR has been loaded and persists after
detach("package:PivotalR",unload=TRUE)
However, if I restart R and do above mapproj functions without PivotalR loaded the code executes correctly... I'm not sure how to reproduce other than suggesting to
library(ggmap)
Map <- get_map()
ggmap(Map)
Should pull up a map of Houston if PivotalR is not invoked and error out if it has been...
(Just want to add that I think this PivotalR is wonderful and progressing nicely!)
I was trying to run a logistic regression with grouping columns using madlib.glm and it was much slower than running it directly on the database. I wondered why and after looking at the code and the activity on the database, it seems that to populate the field "nobs" in the regression result there is a count done for each group, which can take a long time. Wouldn't it be better to simply use the fields "num_rows_processed" and "num_rows_skipped" from the madlib output table?
Add a command to automatically detect GitHub update and prompt the user to install the latest code from master branch.
Or provide a command pivotalr.update to let the user choose to check the updates.
Hi Guys,
I am seeing some issues when I try to run madlib related commands via PivotalR. For example when I hit madlib.summary
Command
madlib.summary(trainingData)
Error
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: )
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: )
Error in db.data.frame(out.tbl, conn.id = conn.id(x), verbose = FALSE) :
No such object in the connection 6
At the same time I am able to run madlib.version. Also I am able to execute dim(trainingData). What am i doing wrong? Any help would be much appreciated
vcov
does not have NULL handling in versions < 0.1.15.19. This also causes margins
to raise an error when it is applied onto data sets that have NULL values.
I have tested PivotalR on a variety of platforms and can never get past establishing a connection. In the case of R 3.6.3 on Ubuntu, I get the follow core dump.
> conn.id <- db.connect(conn.pkg = "RPostgreSQL", dbname = 'dbname=mydb sslmode=require', host = 'hostname', port = 5432, user = 'username', password = 'password)
Loading required package: DBI
Created a connection to database with ID 1
Warning: Madlib does not exist in database dbname=mydb sslmode=require schema madlib.
So all functions starting with 'madlib.' will not work.
But you can still use other functions with just a few exceptions.
*** caught segfault ***
address 0x4, cause 'memory not mapped'
Traceback:
1: postgresqlFetch(res, n, ...)
2: func(res, n)
3: .db.fetch.rpostgresql(res = res$res, n = n)
4: eval(parse(text = command))
5: eval(parse(text = command))
6: .db.fetch(res, nrows)
7: doTryCatch(return(expr), name, parentenv, handler)
8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(.db.fetch(res, nrows), error = function(c) c, finally = { .db.clearResult(res) .restore.warnings(warns)})
11: db.q(paste("set application_name = '", .this.pkg.name, "'", sep = ""), conn.id = result, verbose = FALSE)
12: db.connect(conn.pkg = "RPostgreSQL", dbname = "dbname=mydb sslmode=require", host = "hostname", port = 5432, user = "username", password = "password"
Dont support R again? Got below message from R 3.6.3:
> install.packages("PivotalR")
Warning in install.packages :
package ‘PivotalR’ is not available (for R version 3.6.3)
pmml(model) should work with madlib.lm and madlib.glm. Maybe madlib.elnet too
I know that MADLib supports association rules, and i would like to use it from R. Is there a possibility to implement this feature?
Currently PivotalR cannot support sorting multiple columns, esp. with different ordering constraints like one asc, one desc.
For example, the equivalent form of PivotalR cannot be done:
select name from tbl order by score desc, date asc;
Please assign the fix to @victorfang
hello,
I'm using freshly compiled version of PivotalR, on mac os, french locale
when running summary on a table object, I get
"Error in db.data.frame(tbl.output, conn.id = conn.id, verbose = FALSE) :
No such object in the connection 1"
I think the problem comes from the .unique.string() function used in temp table creation.
Specifically in the use of strptime(date(),"%c")
The point is that this function is dependent of the computer locale LC_TIME, so mine was fr_FR.UTF-8, and strptime(date(),"%c") returned "NA"
After this, the name found in Greenplum seems to be lowercased in "na", so the temp table name is not recognized when retrieving it from R
If I change with Sys.setlocale("LC_TIME", "C") it works :)
Solution might be to be locale-independent in the construction of temp table names
Hope it helps
(btw, kudos for your package. discovered it today & it really rocks)
Cheers
Julien
The arraydb.to.arrayr function from the utility-generic.R file has an erroneous parameter list or implementation. The parameter "n" is not used in the code at all. The document claims the following:
n: An integer, default is 1. If the input has ‘NA’ instead of a
string as one element of a string array, how many ‘NA’'s
should be returned so that a valid array can be returned.
There should be as many ‘NA’ as the number of elements in
other output rows without ‘NA’.
But this does not reflect the implementation.
When estimating large models, it would be nice to get some feedback on progress like the verbose flag in MADlib. E.g. printing the iteration number for logistic regression.
Happy to make the necessary changes.
Passing a list of factors to INDICES results in an error.
preview(by(x,x[1],mean))
works as expected
preview(by(x,x[1:2],mean))
does not. I don't see any documentation that suggests that using mutliple arguments in a GROUP BY clause isn't yet supported.
my specific error is as follows:
Error in preview(by(Geo, list(Geo$state_code, Geo$county_code), count)) :
error in evaluating the argument 'x' in selecting a method for function 'preview': Error in x[state_code == "12" && county_code == "000", ] :
error in evaluating the argument 'i' in selecting a method for function '[': Error in state_code == "12" && county_code == "000" :
invalid 'x' type in 'x && y'
Not sure why the x type would be invalid...
db.list()
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: no connection to the server
)
Error in if (gsub(".*(HAWQ).*", "\\1", dbms.str, perl = T) == "HAWQ") { :
argument is of length zero
In addition: Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
Could not create executeselect version()
Implement plr.map, plr.reduce, plr.filter, plr.flatmap functions, which utilize PL/R to create functions directly in Postgres databases.
Using these functions as the basis for building more complicated algorithms.
Currently a single dbdriver of Rpostgresql allows 16 max connections by default. PivotalR uses this default value, leading to the below error.
"Error in postgresqlNewConnection(drv, ...) :
RS-DBI driver: (cannot allocate a new connection -- maximum of 16 connections already opened)"
It would be useful to increase this since Rpostgresql has an actual limit of 100 connections.
I am currently working on a model that contains many categorical features. To cast them as factors, I need to call pivotalr_df$factor_column <- as.factor(pivotalr_df$factor_column)
on 90 columns. Usually one would for example loop over an array of factors using the df[, factor_column]
subsetting mechanism. If I try to call pivotalr_df[, factor_column] <- as.factor(pivotalr_df[, factor_column])
I get the following error: Error in is(i, "db.Rquery") (from <text>#1) : argument "i" is missing
.
Is it possible to add an convenience function to transform a vector of column names to factors?
Thanks and best
For older versions of MADlib (<= 1.4.1), crossprod creates a temporary plpgsql function in database. During this process, /tmp directory is used. However, if the user does not have the permission to write in /tmp, the function fails.
The straightforward fix is to completely avoid using /tmp at all.
This has been done in v0.1.15.24.
Bug report from Mark Malamut:
I believe I encountered what appears to be a small bug in this package that I cannot get around when trying to incorporate it into my own package which provides a db access layer for our team. Here is what I believe is the problem.
Note that I am using the most recent version from CRAN 0.1.18.3
The PivotalR package implements cbind and therefore overrides the base cbind
when the fully library is loaded using the library(PivotalR)
directive. From the behavior I have seen it appears that a reference to cbind
in your package uses whatever cbind
is available in the current environment as opposed to explicitly using PivotalR::cbind
. I believe this to be the case as my code works fine when specifying library(PivotalR)
but fails during db.disconnect
with the following when the library is not fully loaded.
Sample Code
#library(PivotalR)
connect = function(){
PivotalR::db.connect(host='x1', user='x2', password='x3', dbname='x4', quick=TRUE)
}
disconnect = function(id){
PivotalR::db.disconnect(conn.id=id, verbose=TRUE)
}
query = function(str){
id = connect()
result = PivotalR::db.q(str, conn.id=id, verbose=TRUE, nrows='all')
disconnect(id)
result
}
X = query(“SELECT * FROM <table>”)
Error in .localVars$conn.type[[conn.pkg]] <- .localVars$conn.type[[conn.pkg]][-which(.localVars$conn.type[[conn.pkg]] == : replacement has length zero
When the library PivotalR is loaded, I see the following informational messages and my code works without error:
> library("PivotalR", lib.loc="~/R/win-library/3.3")
Attaching package: ‘PivotalR’
The following objects are masked from ‘package:stats’:
sd, var
The following object is masked from ‘package:base’:
cbind
Which is why I believe there is an implicit call to cbind in your code ( It doesn’t seem that sd or var are the issue). Note that the first time it fails in db.disconnect but if you rerun without resetting the environment subsequent calls will fail in db.connect with the following error message:
Error in .localVars$conn.type[["rpostgresql"]] <- c(.localVars$conn.type[["rpostgresql"]], : more elements supplied than there are to replace
Is any method like sql method insert into one row or a query?
Or like rbind method?
File R/db_conn-generic.R
has line
eval(parse(text = paste("suppressMessages(library(", conn.pkg, "))")))
,
where conn.pkg
can currently only be RPostgreSQL
. This pollutes the namespace by attaching all objects from RPostgreSQL
. We should avoid this by using the scope operator ::
instead of attaching the objects.
When I try to do a db.connect (host = host, etc. etc.) I am getting an error that the pg_hba.conf has no entry for my connection with SSL = off. Which is correct, we use SSL to connect to the database. However, i can't seem to find where one sets SSL = TRUE
I need to write an R data.frame back to a GreenPlum database after doing some work on it. This is usually no problem in something like RMYSql, you use the method dbWriteTable(connection, "table name", df, append = T, , row.names = FALSE , overwrite = F, allow.keywords = FALSE) and boom, data into the table.
When I try to do this from the PivotalR connection I get the following error
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"numeric", "character"'
To reproduce
require(PivotalR)
cid <- db.connect(port = 5432, host = host, dbname = dbname, user = user, password = password)
db.q('create table test_table as select * from schema.tableA limit 10;', conn.id = cid)
results_to_append <- db.q('select * from schema.tableA limit 10;', conn.id = cid)
dbWriteTable(cid, "schema.test_table", results_to_append, append = T, , row.names = FALSE , overwrite = F, allow.keywords = FALSE)
Does the PivotalR package actually implement data loading? I can't find any documentation.
The current error as shown in R:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘as.db.data.frame’ for signature ‘"function
While debugging the error using as.db.data.frame I learned the dataframe I was using was not in the environment. Would suggest making this error more specific to missing data.
People see errors in versions < 0.1.15.14 when doing regressions (lm and glm) on a db.data.frame object that points to a VIEW in the database.
> madlib.lm(first_cell_lac ~ first_cell_idsac, data=rti)
Error in .local(x, table.name, verbose, ...) :
unused argument (factor.full = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))
This is introduced by a previous change.
When I used generic.cv for madlib.elnet, I got some problem, as follow.
x <- matrix(rnorm(100*20),100,20)
y <- rbinom(100,1,0.5)
dat <- data.frame(x, y)
delete("eldata")
z <- as.db.data.frame(dat, "eldata", verbose = FALSE)
g <- generic.cv(
train = function (data, alpha, lambda) {
madlib.elnet(y ~ ., data = data, family = "binomial",
alpha = alpha, lambda = lambda)
},
predict = predict, #function(fit,newdata) {predict(fit,newdata,type="response")} ,
metric = function (predicted, data) {
lk(mean((data$y - predicted)^2))
},
data = z,
params = list(alpha=1, lambda=seq(0,0.2,0.1)),
k = 3, find.min = TRUE)
Computation in-database ...
Cutting the data row-wise into 3 pieces ...
Running on fold 1 now ...
parameters 1, 0 ...
parameters 1, 0.1 ...
parameters 1, 0.2 ...
Running on fold 2 now ...
parameters 1, 0 ...
parameters 1, 0.1 ...
parameters 1, 0.2 ...
Running on fold 3 now ...
parameters 1, 0 ...
parameters 1, 0.1 ...
parameters 1, 0.2 ...
Done.
Fitting the best model using the whole data set ...
Executing in database connection 1:
select madlib.elastic_net_train('"eldata"', 'madlib_temp_5d38a1b6_056d_a5aba2_90e7cb4f218a', '("y")::boolean', 'array["X1","X2","X3","X4","X5","X6","X7","X8","X9","X10","X11","X12","X13","X14","X15","X16","X17","X18","X19","X20"]', 'binomial', , , TRUE, NULL, 'fista', 'use_active_set = f', NULL, 100, 1e-04)
hi,
when I run a by processing in Hawq using a custom function, i only get results for 100 rows. Any ideas as to why that occurs?
if i do something like this
samples_tissue_mu_logcpm
<- by(gtex_df[, "logCpm"], c( gtex_df$gene, gtex_df$Tissue_type),mean )`
Then i get the correct number of rows 112636
however if i submit my own function or a non-standard function as below I only get 100 results retrieved
samples_tissue_logcpm_q_lo <- by(gtex_df[, "logCpm"], c( gtex_df$gene, gtex_df$Tissue_type), FUN=function(x) { y <- lookat(x, nrows=NULL) return(quantile(y, prob=0.25 )) })
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.