greenplum-db / pivotalr Goto Github PK

View Code? Open in Web Editor NEW

122.0 122.0 47.0 4.95 MB

An convenient R tool for manipulating tables in PostgreSQL type databases and a wrapper of Apache MADlib.

Home Page: https://pivotalsoftware.github.io/gp-r/

R 97.17% Shell 0.01% Makefile 0.04% C++ 2.67% C 0.12%

pivotalr's People

Contributors

Stargazers

Watchers

pivotalr's Issues

Continuous test support

To make test-driven-development easier, we need a continuous.test function. When running, it automatically detects any changes in the source code, testing files and user doc, which contains the examples, and re-run tests or examples when necessary.

The code is in "continuous_test" branch, but not finished.

generic.cv.R

May I use predict function with other output type?
like predict(object, newdata, type="response")
How do I do in fucntion generic.cv?

date operations in sql wrapper

Hi again. Wondering if something like:

WHERE date_a-date_b>365

using

Data[Data$date_a-Data$date_b>365,]

SELECT MAX(date) FROM X

using

Data$maxdate <- max(Data$date)

will be supported in the future.

Postgresql 12 doesn't have a 'panic' level any more

Hi,

Quick test case :
`# connected already to a postgres 12 db

x <- as.db.data.frame(abalone)
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: invalid value for parameter "client_min_messages": "panic"
HINT: Available values: debug5, debug4, debug3, debug2, debug1, log, notice, warning, error.
)`

Now, looking in the code I can see this in utilities.R:
.suppress.warnings <- function (conn.id, level = "panic") { msg.level <- .set.msg.level(level, conn.id = conn.id) warn.r <- getOption("warn") options(warn = -1) list(msg.level = msg.level, warn.r = warn.r, conn.id = conn.id) }

(Sorry, can't get the hang of code formatting on this thing). Anwyay, my postgresql.conf shows this:
#client_min_messages = notice # values in order of decreasing detail: # debug5 # debug4 # debug3 # debug2 # debug1 # log # notice # warning # error
So no more 'panic'.
This is just a mild annoyance, the commands get executed, it's just that the screen fills up with warnings.
Cheers,
Iulian

conflict with mapproj???

I'm getting behavior I don't understand. When using ggplot2 package w/ mapproj dependencies (ggmap or coordinate_map functions) I'm getting odd errors.

This only occurs after PivotalR has been loaded and persists after

detach("package:PivotalR",unload=TRUE)

However, if I restart R and do above mapproj functions without PivotalR loaded the code executes correctly... I'm not sure how to reproduce other than suggesting to

load PivotalR and generate dataset from preview()
try the following

library(ggmap)
Map <- get_map()
ggmap(Map)

Should pull up a map of Houston if PivotalR is not invoked and error out if it has been...

(Just want to add that I think this PivotalR is wonderful and progressing nicely!)

nobs in madlib regressions

I was trying to run a logistic regression with grouping columns using madlib.glm and it was much slower than running it directly on the database. I wondered why and after looking at the code and the activity on the database, it seems that to populate the field "nobs" in the regression result there is a count done for each group, which can take a long time. Wouldn't it be better to simply use the fields "num_rows_processed" and "num_rows_skipped" from the madlib output table?

Add a command to automatically detect GitHub update

Add a command to automatically detect GitHub update and prompt the user to install the latest code from master branch.

Or provide a command pivotalr.update to let the user choose to check the updates.

PIvotalR - Error executing madlib summary (and other) commands

Hi Guys,

I am seeing some issues when I try to run madlib related commands via PivotalR. For example when I hit madlib.summary

Command
madlib.summary(trainingData)

Error
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: )
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: )
Error in db.data.frame(out.tbl, conn.id = conn.id(x), verbose = FALSE) :
No such object in the connection 6

At the same time I am able to run madlib.version. Also I am able to execute dim(trainingData). What am i doing wrong? Any help would be much appreciated

NULL handling of vcov and margins

vcov does not have NULL handling in versions < 0.1.15.19. This also causes margins to raise an error when it is applied onto data sets that have NULL values.

Core Dump on db.connect

I have tested PivotalR on a variety of platforms and can never get past establishing a connection. In the case of R 3.6.3 on Ubuntu, I get the follow core dump.

> conn.id <- db.connect(conn.pkg = "RPostgreSQL", dbname = 'dbname=mydb sslmode=require', host = 'hostname', port = 5432, user = 'username', password = 'password)

Loading required package: DBI
Created a connection to database with ID 1 

Warning: Madlib does not exist in database dbname=mydb sslmode=require schema madlib.
So all functions starting with 'madlib.' will not work.
But you can still use other functions with just a few exceptions.

 *** caught segfault ***
address 0x4, cause 'memory not mapped'

Traceback:
 1: postgresqlFetch(res, n, ...)
 2: func(res, n)
 3: .db.fetch.rpostgresql(res = res$res, n = n)
 4: eval(parse(text = command))
 5: eval(parse(text = command))
 6: .db.fetch(res, nrows)
 7: doTryCatch(return(expr), name, parentenv, handler)
 8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(.db.fetch(res, nrows), error = function(c) c, finally = {    .db.clearResult(res)    .restore.warnings(warns)})
11: db.q(paste("set application_name = '", .this.pkg.name, "'", sep = ""),     conn.id = result, verbose = FALSE)
12: db.connect(conn.pkg = "RPostgreSQL", dbname = "dbname=mydb sslmode=require",     host = "hostname", port = 5432,     user = "username", password = "password"

package ‘PivotalR’ is not available (for R version 3.6.3)

Dont support R again? Got below message from R 3.6.3:

> install.packages("PivotalR")
Warning in install.packages :
  package ‘PivotalR’ is not available (for R version 3.6.3)

Support pmml package

pmml(model) should work with madlib.lm and madlib.glm. Maybe madlib.elnet too

Association rule mining

I know that MADLib supports association rules, and i would like to use it from R. Is there a possibility to implement this feature?

Enhance the sorting multiple columns

Currently PivotalR cannot support sorting multiple columns, esp. with different ordering constraints like one asc, one desc.
For example, the equivalent form of PivotalR cannot be done:

select name from tbl order by score desc, date asc;

Please assign the fix to @victorfang

.unique.string() : problem with LC_TIME locale

hello,
I'm using freshly compiled version of PivotalR, on mac os, french locale
when running summary on a table object, I get

"Error in db.data.frame(tbl.output, conn.id = conn.id, verbose = FALSE) :
No such object in the connection 1"

I think the problem comes from the .unique.string() function used in temp table creation.
Specifically in the use of strptime(date(),"%c")

The point is that this function is dependent of the computer locale LC_TIME, so mine was fr_FR.UTF-8, and strptime(date(),"%c") returned "NA"

After this, the name found in Greenplum seems to be lowercased in "na", so the temp table name is not recognized when retrieving it from R

If I change with Sys.setlocale("LC_TIME", "C") it works :)

Solution might be to be locale-independent in the construction of temp table names

Hope it helps

(btw, kudos for your package. discovered it today & it really rocks)

Cheers

Julien

arraydb.to.arrayr function takes a parameter (n) and doesn't use it

The arraydb.to.arrayr function from the utility-generic.R file has an erroneous parameter list or implementation. The parameter "n" is not used in the code at all. The document claims the following:

n: An integer, default is 1. If the input has ‘NA’ instead of a
string as one element of a string array, how many ‘NA’'s
should be returned so that a valid array can be returned.
There should be as many ‘NA’ as the number of elements in
other output rows without ‘NA’.

But this does not reflect the implementation.

Verbose output feature request

When estimating large models, it would be nice to get some feedback on progress like the verbose flag in MADlib. E.g. printing the iteration number for logistic regression.

Happy to make some changes

Happy to make the necessary changes.

by (group-by

Passing a list of factors to INDICES results in an error.

preview(by(x,x[1],mean))

works as expected

preview(by(x,x[1:2],mean))

does not. I don't see any documentation that suggests that using mutliple arguments in a GROUP BY clause isn't yet supported.

my specific error is as follows:

Error in preview(by(Geo, list(Geo$state_code, Geo$county_code), count)) :
error in evaluating the argument 'x' in selecting a method for function 'preview': Error in x[state_code == "12" && county_code == "000", ] :
error in evaluating the argument 'i' in selecting a method for function '[': Error in state_code == "12" && county_code == "000" :
invalid 'x' type in 'x && y'

Not sure why the x type would be invalid...

Connection Failed

db.list()
Error in postgresqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not run statement: no connection to the server
)
Error in if (gsub(".*(HAWQ).*", "\\1", dbms.str, perl = T) == "HAWQ") { : 
  argument is of length zero
In addition: Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
  Could not create executeselect version()

PL/R support

Implement plr.map, plr.reduce, plr.filter, plr.flatmap functions, which utilize PL/R to create functions directly in Postgres databases.

Using these functions as the basis for building more complicated algorithms.

Increase maximum limit on concurrent connections

Currently a single dbdriver of Rpostgresql allows 16 max connections by default. PivotalR uses this default value, leading to the below error.

"Error in postgresqlNewConnection(drv, ...) :
RS-DBI driver: (cannot allocate a new connection -- maximum of 16 connections already opened)"

It would be useful to increase this since Rpostgresql has an actual limit of 100 connections.

Creating many factors with as.factor

I am currently working on a model that contains many categorical features. To cast them as factors, I need to call pivotalr_df$factor_column <- as.factor(pivotalr_df$factor_column) on 90 columns. Usually one would for example loop over an array of factors using the df[, factor_column] subsetting mechanism. If I try to call pivotalr_df[, factor_column] <- as.factor(pivotalr_df[, factor_column]) I get the following error: Error in is(i, "db.Rquery") (from <text>#1) : argument "i" is missing.

Is it possible to add an convenience function to transform a vector of column names to factors?

Thanks and best

Avoid using /tmp in 'crossprod'

For older versions of MADlib (<= 1.4.1), crossprod creates a temporary plpgsql function in database. During this process, /tmp directory is used. However, if the user does not have the permission to write in /tmp, the function fails.

The straightforward fix is to completely avoid using /tmp at all.

This has been done in v0.1.15.24.

Namespace for cbind not set correctly

Bug report from Mark Malamut:

I believe I encountered what appears to be a small bug in this package that I cannot get around when trying to incorporate it into my own package which provides a db access layer for our team. Here is what I believe is the problem.

Note that I am using the most recent version from CRAN 0.1.18.3

The PivotalR package implements cbind and therefore overrides the base cbind when the fully library is loaded using the library(PivotalR) directive. From the behavior I have seen it appears that a reference to cbind in your package uses whatever cbind is available in the current environment as opposed to explicitly using PivotalR::cbind. I believe this to be the case as my code works fine when specifying library(PivotalR) but fails during db.disconnect with the following when the library is not fully loaded.

Sample Code

#library(PivotalR)
 
connect = function(){
  PivotalR::db.connect(host='x1', user='x2', password='x3', dbname='x4', quick=TRUE)
}
 
disconnect = function(id){
  PivotalR::db.disconnect(conn.id=id, verbose=TRUE)
}
 
query = function(str){
  id = connect()
  result = PivotalR::db.q(str, conn.id=id, verbose=TRUE, nrows='all')
  disconnect(id)
 
  result
}
 
X = query(“SELECT * FROM <table>”)
Error in .localVars$conn.type[[conn.pkg]] <- .localVars$conn.type[[conn.pkg]][-which(.localVars$conn.type[[conn.pkg]] == : replacement has length zero
 
 
 
When the library PivotalR is loaded, I see the following informational messages and my code works without error:
> library("PivotalR", lib.loc="~/R/win-library/3.3")
 
Attaching package: ‘PivotalR’
 
The following objects are masked from ‘package:stats’:
 
    sd, var
 
The following object is masked from ‘package:base’:
 
    cbind

Which is why I believe there is an implicit call to cbind in your code ( It doesn’t seem that sd or var are the issue). Note that the first time it fails in db.disconnect but if you rerun without resetting the environment subsequent calls will fail in db.connect with the following error message:

Error in .localVars$conn.type[["rpostgresql"]] <- c(.localVars$conn.type[["rpostgresql"]], : more elements supplied than there are to replace

append data

Is any method like sql method insert into one row or a query?
Or like rbind method?

Avoid attaching RPostgreSQL to connect to database

File R/db_conn-generic.R has line
eval(parse(text = paste("suppressMessages(library(", conn.pkg, "))"))),

where conn.pkg can currently only be RPostgreSQL. This pollutes the namespace by attaching all objects from RPostgreSQL. We should avoid this by using the scope operator :: instead of attaching the objects.

How do I set the db.connect to use SSL?

When I try to do a db.connect (host = host, etc. etc.) I am getting an error that the pg_hba.conf has no entry for my connection with SSL = off. Which is correct, we use SSL to connect to the database. However, i can't seem to find where one sets SSL = TRUE

dbWriteTable seems to be broken when using PivotalR package

I need to write an R data.frame back to a GreenPlum database after doing some work on it. This is usually no problem in something like RMYSql, you use the method dbWriteTable(connection, "table name", df, append = T, , row.names = FALSE , overwrite = F, allow.keywords = FALSE) and boom, data into the table.

When I try to do this from the PivotalR connection I get the following error

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"numeric", "character"'

To reproduce

require(PivotalR)
cid <- db.connect(port = 5432, host = host, dbname = dbname, user = user, password = password)
db.q('create table test_table as select * from schema.tableA limit 10;', conn.id = cid)
results_to_append <- db.q('select * from schema.tableA limit 10;', conn.id = cid)
dbWriteTable(cid, "schema.test_table", results_to_append, append = T, , row.names = FALSE , overwrite = F, allow.keywords = FALSE)

Does the PivotalR package actually implement data loading? I can't find any documentation.

Missing dataframe in as.db.data.frame error not specific

The current error as shown in R:

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘as.db.data.frame’ for signature ‘"function

While debugging the error using as.db.data.frame I learned the dataframe I was using was not in the environment. Would suggest making this error more specific to missing data.

Using VIEW as the data causes regressions error

People see errors in versions < 0.1.15.14 when doing regressions (lm and glm) on a db.data.frame object that points to a VIEW in the database.

> madlib.lm(first_cell_lac ~ first_cell_idsac, data=rti)
Error in .local(x, table.name, verbose, ...) : 
  unused argument (factor.full = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))

This is introduced by a previous change.

generic.cv.R

When I used generic.cv for madlib.elnet, I got some problem, as follow.

x <- matrix(rnorm(100*20),100,20)
y <- rbinom(100,1,0.5)
dat <- data.frame(x, y)
delete("eldata")
z <- as.db.data.frame(dat, "eldata", verbose = FALSE)

g <- generic.cv(
  train = function (data, alpha, lambda) {
    madlib.elnet(y ~ ., data = data, family = "binomial",
                 alpha = alpha, lambda = lambda)
  },
  predict = predict, #function(fit,newdata) {predict(fit,newdata,type="response")} ,
  metric = function (predicted, data) {
    lk(mean((data$y - predicted)^2))
  },
  data = z,
  params = list(alpha=1, lambda=seq(0,0.2,0.1)),
  k = 3, find.min = TRUE)

Computation in-database ...
Cutting the data row-wise into 3 pieces ...
Running on fold 1 now ...
parameters 1, 0 ...
parameters 1, 0.1 ...
parameters 1, 0.2 ...
Running on fold 2 now ...
parameters 1, 0 ...
parameters 1, 0.1 ...
parameters 1, 0.2 ...
Running on fold 3 now ...
parameters 1, 0 ...
parameters 1, 0.1 ...
parameters 1, 0.2 ...
Done.
Fitting the best model using the whole data set ...
Executing in database connection 1:

select madlib.elastic_net_train('"eldata"', 'madlib_temp_5d38a1b6_056d_a5aba2_90e7cb4f218a', '("y")::boolean', 'array["X1","X2","X3","X4","X5","X6","X7","X8","X9","X10","X11","X12","X13","X14","X15","X16","X17","X18","X19","X20"]', 'binomial', , , TRUE, NULL, 'fista', 'use_active_set = f', NULL, 100, 1e-04)

Results from by processing limited to 100 rows

hi,
when I run a by processing in Hawq using a custom function, i only get results for 100 rows. Any ideas as to why that occurs?

if i do something like this
samples_tissue_mu_logcpm <- by(gtex_df[, "logCpm"], c( gtex_df$gene, gtex_df$Tissue_type),mean )`

Then i get the correct number of rows 112636
however if i submit my own function or a non-standard function as below I only get 100 results retrieved
samples_tissue_logcpm_q_lo <- by(gtex_df[, "logCpm"], c( gtex_df$gene, gtex_df$Tissue_type), FUN=function(x) { y <- lookat(x, nrows=NULL) return(quantile(y, prob=0.25 )) })

greenplum-db / pivotalr Goto Github PK

pivotalr's People

Contributors

Stargazers

Watchers

Forkers

pivotalr's Issues

Recommend Projects

Recommend Topics

Recommend Org