tidyverse / dbplyr
Database (DBI) backend for dplyr
Home Page: https://dbplyr.tidyverse.org
License: Other
@pssguy commented on Aug 29, 2017, 10:22 PM UTC:
I am attempting to use an MSSQL connection and hitting this issue
I first replicate the example from the dbplyr intro
library(odbc)
library(DBI)
library(tidyverse)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)
iris2 <- tbl(con, "iris")
iris2 %>%
arrange(Species) %>%
select(Sepal.Length)
This works fine
Now with an MSSQL connection
con2 <- dbConnect(odbc::odbc(), DSN = "premier")
DBI::dbWriteTable(con2, "iris", iris, overwrite=TRUE)
iris9 <- tbl(con2, "iris")
test1 <- iris9 %>%
arrange(Species) %>%
select(Sepal.Length)
test1
# Error: <SQL> 'SELECT TOP 1000 "Sepal.Length" AS "Sepal.Length" FROM (SELECT * FROM "iris" ORDER BY "Species") "odeiuzmtqh"' nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
Reversing the select and arrange commands
test2 <- iris9 %>%
select(Sepal.Length) %>%
arrange(Species)
test2 # No error
This works in a simple example, but I will sometimes need to arrange data prior to other operations in a pipe.
When I look at the generated SQL, it does not exactly replicate the error, i.e. there is no mention of TOP 1000:
test1 %>% show_query()
# <SQL>
# SELECT "Sepal.Length" AS "Sepal.Length"
# FROM (SELECT *
# FROM "iris"
# ORDER BY "Species") "omlcgfsrjt"
Trying several alternatives directly in SQL:
SELECT "Sepal.Length" AS "Sepal.Length"
FROM (SELECT *
FROM "iris"
ORDER BY "Species") "omlcgfsrjt"
nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
Failed to execute SQL chunk
Something that looks like the SQL from the error message:
SELECT TOP 1000 "Sepal.Length" AS "Sepal.Length"
FROM (SELECT *
FROM "iris"
ORDER BY "Species") "omlcgfsrjt"
nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
Failed to execute SQL chunk
Moving TOP 1000 into the sub-query produces the desired output:
SELECT "Sepal.Length" AS "Sepal.Length"
FROM (SELECT TOP 1000 *
FROM "iris"
ORDER BY "Species") "omlcgfsrjt"
Not sure if this is an error or just something that has not yet been addressed for MSSQL.
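A practical workaround in the meantime, since SQL does not guarantee row order inside subqueries anyway, is to keep arrange() as the last step of the pipe, or to sort locally after collecting; a minimal sketch:
iris9 %>%
  select(Sepal.Length, Species) %>%
  collect() %>%     # bring the rows into R first
  arrange(Species)  # then sort locally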
p.s. Why no issues option under dbplyr?
This issue was moved by hadley from tidyverse/dplyr/issues/3062.
@sverchkov commented on Jul 20, 2018, 7:10 PM UTC:
Using the .data pronoun in filter, mutate, and transmute does not work with database tables
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
### Using mutate
# works
mutate(mtcars, foo = .data$cyl)
# also works
mutate(mtcars2, foo = cyl)
# doesn't
mutate(mtcars2, foo = .data$cyl)
# Error: Column `cyl` not found in `.data`
### Using transmute
# works
transmute(mtcars, blah = .data$cyl)
# also works
transmute(mtcars, blah = cyl)
# doesn't
transmute(mtcars2, blah = .data$cyl)
# Error: Column `cyl` not found in `.data`
### Using transmute to select
# works
transmute(mtcars, .data$cyl)
# also works
transmute(mtcars2, cyl)
# doesn't
transmute(mtcars2, .data$cyl)
# Error: Column `cyl` not found in `.data`
# select doesn't have this issue, this works
select( mtcars2, .data$cyl )
### Using filter
# works
filter( mtcars, .data$cyl > 4 )
# also works
filter( mtcars2, cyl > 4 )
# doesn't
filter( mtcars2, .data$cyl > 4 )
# Error: Column `cyl` not found in `.data`
(I actually ran into this when running code with a Postgres database, so it isn't DB-specific)
This is different from issue #3370, since everything works without the .data pronoun, but it might be somehow related.
This issue was moved by krlmlr from tidyverse/dplyr#3722.
In the "Introduction to dbplyr" the author recommends the following when deciding to use a database:
As well as working with local in-memory data stored in data frames, dplyr also works with remote on-disk data stored in databases. This is particularly useful in two scenarios:
Your data is already in a database.
You have so much data that it does not all fit into memory simultaneously and you need to use some external storage engine.
(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating).
This is not completely accurate as literally written.
I suggest this should be qualified; or alternatively, make no mention of when a database should or should not be used, as this is a large subject that depends on too many variables to cover in an introduction.
@mungojam commented on Oct 24, 2017, 7:28 PM UTC:
copy_to now supports a src_sql as a source as well as a destination, but currently does this via a tibble, which makes it performant only for datasets that fit into available RAM.
It would be good if some sort of streaming or batching were implemented so that the peak RAM usage is much smaller. This is so that we can use dplyr to do an initial large extract from a large remote table into a local SQLite database (for example).
Perhaps this fits better as a DBI feature but it's nice to be able to do a custom select on the source table using dplyr.
library(dplyr, warn.conflicts = FALSE)
src1 <- src_sqlite("fileDb1.db", create = TRUE)
src2 <- src_sqlite("fileDb2.db", create = TRUE)
# Create from dataframe
iris_1 <- copy_to(src1, iris, "iris1")
# Create from a tbl in another database; currently goes via an in-memory tibble containing the full dataset
db2 <- copy_to(src2, iris_1, 'data2', temporary = FALSE)
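For reference, a minimal sketch of the kind of batching being requested, written against plain DBI rather than an existing dplyr API; copy_in_chunks and its arguments are hypothetical:
library(DBI)
# Hypothetical helper: stream rows from a source query into a destination
# table in fixed-size chunks, so peak RAM is bounded by chunk_size rows.
copy_in_chunks <- function(src_con, dest_con, query, dest_name, chunk_size = 10000) {
  res <- dbSendQuery(src_con, query)
  on.exit(dbClearResult(res))
  first <- TRUE
  while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = chunk_size)
    if (nrow(chunk) == 0) break
    dbWriteTable(dest_con, dest_name, chunk, overwrite = first, append = !first)
    first <- FALSE
  }
}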
This issue was moved by hadley from tidyverse/dplyr/issues/3163.
@dhait commented on Jun 21, 2018, 4:19 PM UTC:
Neither case_when() nor if_else() works correctly if the target MSSQL database is SQL Server 2008 or lower: the IIF function to which they are translated was not implemented until SQL Server 2012.
This issue was moved by batpigandme from tidyverse/dplyr/issues/3690.
@hadley commented on Sep 8, 2017, 9:19 PM UTC:
library(dplyr)
library(RPostgreSQL)
myRedshift <- src_postgres(Sys.getenv("DBNAME"),
host = Sys.getenv("JDBCURL"),
port = Sys.getenv("PORT"),
user = Sys.getenv("USER"),
password = Sys.getenv("PW"))
experiment_id <- '455145790a164522d1f4f47f2df40c76728c84dd'
date_start <- '2017-06-08'
date_end <- '2017-06-15'
experiment_data <- tbl(myRedshift, sql("SELECT *, DATE_TRUNC('day', ts)::date AS dt FROM p_experiments WHERE name LIKE '%variation::select%'")) %>%
dplyr::filter(experimentid == experiment_id &
ts >= date_start &
ts < date_end &
!is.na(userid) &
variation != 'unknown')
cohorts <- experiment_data %>%
group_by(userid, variation) %>%
summarise(enrolment_date = min(dt))
reading_time <- tbl(myRedshift, "etl_reading") %>%
dplyr::filter(createdate >= date_start &
createdate < date_end &
time_spent >= 0) %>%
inner_join(cohorts, by = "userid") %>%
dplyr::filter(createdate >= enrolment_date) %>%
group_by(variation, userid) %>%
summarise(rt = sum(time_spent)) %>%
ungroup()
reading_time_all_allocated <- cohorts %>%
left_join(reading_time, by = c("userid", "variation")) %>%
collect(n = Inf)
produces this SQL (via explain()):
SELECT "TBL_LEFT"."userid" AS "userid", "TBL_LEFT"."variation" AS "variation", "TBL_LEFT"."enrolment_date" AS "enrolment_date", "TBL_RIGHT"."rt" AS "rt"
FROM (SELECT "userid", "variation", MIN("dt") AS "enrolment_date"
FROM (SELECT *
FROM (SELECT *, DATE_TRUNC('day', ts)::date AS dt FROM p_experiments WHERE name LIKE '%variation::select%') "ihznadkpcf"
WHERE ("experimentid" = '455145790a164522d1f4f47f2df40c76728c84dd' AND "ts" >= '2017-06-08' AND "ts" < '2017-06-15' AND NOT((("userid") IS NULL)) AND "variation" != 'unknown')) "biptkfpowo"
GROUP BY "userid", "variation") "TBL_LEFT"
LEFT JOIN (SELECT "variation", "userid", SUM("time_spent") AS "rt"
FROM (SELECT *
FROM (SELECT "TBL_LEFT"."platform" AS "platform", "TBL_LEFT"."uuid" AS "uuid", "TBL_LEFT"."userid" AS "userid", "TBL_LEFT"."storyid" AS "storyid", "TBL_LEFT"."partid" AS "partid", "TBL_LEFT"."createdate" AS "createdate", "TBL_LEFT"."read_percent" AS "read_percent", "TBL_LEFT"."time_spent" AS "time_spent", "TBL_RIGHT"."variation" AS "variation", "TBL_RIGHT"."enrolment_date" AS "enrolment_date"
FROM (SELECT *
FROM "etl_reading"
WHERE ("createdate" >= '2017-06-08' AND "createdate" < '2017-06-15' AND "time_spent" >= 0.0)) "TBL_LEFT"
INNER JOIN (SELECT "userid", "variation", MIN("dt") AS "enrolment_date"
FROM (SELECT *
FROM (SELECT *, DATE_TRUNC('day', ts)::date AS dt FROM p_experiments WHERE name LIKE '%variation::select%') "bvqcfsxath"
WHERE ("experimentid" = '455145790a164522d1f4f47f2df40c76728c84dd' AND "ts" >= '2017-06-08' AND "ts" < '2017-06-15' AND NOT((("userid") IS NULL)) AND "variation" != 'unknown')) "nnwriawvlp"
GROUP BY "userid", "variation") "TBL_RIGHT"
ON ("TBL_LEFT"."userid" = "TBL_RIGHT"."userid")
) "xmtbxcquyw"
WHERE ("createdate" >= "enrolment_date")) "ostngvabdq"
GROUP BY "variation", "userid") "TBL_RIGHT"
ON ("TBL_LEFT"."userid" = "TBL_RIGHT"."userid" AND "TBL_LEFT"."variation" = "TBL_RIGHT"."variation")
Comparing this to the actual SQL I'd write to get these results:
reading_time_all_allocated <- dbGetQuery(con, glue(
"WITH cohorts AS (
SELECT
p.variation,
p.userid,
DATE(MIN(p.ts)) AS enrolment_date
FROM p_experiments p
WHERE
p.experimentid = '{experiment_id}'
AND p.name LIKE '%variation::select%'
AND p.userid IS NOT NULL
AND p.ts BETWEEN '{date_start}' AND '{date_end}'
AND p.variation != 'unknown'
GROUP BY p.variation, p.userid),
reading_time AS (
SELECT
c.userid,
c.variation,
COALESCE(SUM(er.time_spent), 0) AS rt
FROM etl_reading er
JOIN cohorts c ON er.userid = c.userid
WHERE
er.createdate >= c.enrolment_date
AND er.createdate BETWEEN '{date_start}' AND ('{date_end}' - INTEGER '1')
AND er.time_spent >= 0
GROUP BY c.userid, c.variation)
SELECT
c.userid,
c.variation,
COALESCE(r.rt, 0) AS rt
FROM cohorts c
LEFT OUTER JOIN reading_time r ON c.userid = r.userid AND c.variation = r.variation",
experiment_id = experiment_id,
date_start = date_start,
date_end = date_end))
The difference is probably because of the extra subqueries inside the joins.
This issue was moved by hadley from tidyverse/dplyr/issues/3091.
I noticed that I get weird results when I count a column with very different value counts. This is a super minor issue. Please don't feel compelled to address it!
The issue doesn't seem to happen with SQLite. I'm assuming it's something about printing the S3: integer64 column.
Basic setup of a table to count:
con <- DBI::dbConnect(RPostgres::Postgres(),
user = "my_user", dbname = "my_db")
df <- data.frame(x = c(rep(1, 1000), rep(2, 10)))
dplyr::copy_to(con, df, temporary = TRUE)
remote_df <- dplyr::tbl(con, "df")
dplyr::count(remote_df, x)
Result:
# Source: lazy query [?? x 2]
# Database: postgres [[email protected]:5432/my_db]
x n
<dbl> <S3: integer64>
1 2 " 10" <- What's up with the extra spaces?
2 1 1000
Expected result:
# Source: lazy query [?? x 2]
# Database: postgres [[email protected]:5432/my_db]
x n
<dbl> <S3: integer64>
1 2 10
2 1 1000
@sz-cgt commented on Nov 9, 2017, 11:01 PM UTC:
Simple example
db <- dbConnect(odbc::odbc(), "PRD")
db %>%
tbl(dbplyr::in_schema("1001_2", "dxcg_raw")) %>%
select_all() %>%
show_query()
produces
<SQL>
SELECT `zzz10.mcidx` AS `zzz10.mcidx`, `zzz10.dcg_dmgrphc_risk_nbr` AS `zzz10.dcg_dmgrphc_risk_nbr`, `zzz10.dcg_unwgtd_retrospctv_risk_nbr` AS `zzz10.dcg_unwgtd_retrospctv_risk_nbr`, `zzz10.dcg_wgtd_retrospctv_risk_nbr` AS `zzz10.dcg_wgtd_retrospctv_risk_nbr`, `zzz10.dcg_unwgtd_prsptv_risk_nbr` AS `zzz10.dcg_unwgtd_prsptv_risk_nbr`, `zzz10.dcg_wgtd_prsptv_risk_nbr` AS `zzz10.dcg_wgtd_prsptv_risk_nbr`, `zzz10.dcg_incrd_prd_bgn_dt` AS `zzz10.dcg_incrd_prd_bgn_dt`, `zzz10.dcg_incrd_prd_end_dt` AS `zzz10.dcg_incrd_prd_end_dt`, `zzz10.file_name` AS `zzz10.file_name`, `zzz10.load_id` AS `zzz10.load_id`
FROM 1001_2.dxcg_raw
Note the missing table alias zzz10
Remove the schema reference and it works
db <- dbConnect(odbc::odbc(), "PRD")
odbc::dbSendStatement(db, "use 1001_2")
db %>%
tbl("dxcg_raw") %>%
select_all() %>%
show_query()
produces
<SQL>
SELECT `mcidx` AS `mcidx`, `dcg_dmgrphc_risk_nbr` AS `dcg_dmgrphc_risk_nbr`, `dcg_unwgtd_retrospctv_risk_nbr` AS `dcg_unwgtd_retrospctv_risk_nbr`, `dcg_wgtd_retrospctv_risk_nbr` AS `dcg_wgtd_retrospctv_risk_nbr`, `dcg_unwgtd_prsptv_risk_nbr` AS `dcg_unwgtd_prsptv_risk_nbr`, `dcg_wgtd_prsptv_risk_nbr` AS `dcg_wgtd_prsptv_risk_nbr`, `dcg_incrd_prd_bgn_dt` AS `dcg_incrd_prd_bgn_dt`, `dcg_incrd_prd_end_dt` AS `dcg_incrd_prd_end_dt`, `file_name` AS `file_name`, `load_id` AS `load_id`
FROM `dxcg_raw`
Swap to PostgreSQL and it works
con <- dbConnect(odbc::odbc(), "Validation")
con %>%
tbl(dbplyr::in_schema("c1022_1", "clmdx")) %>%
select_all() %>%
show_query()
produces
<SQL>
SELECT "claim_id" AS "claim_id", "sv_stat" AS "sv_stat", "serviceyearmonth" AS "serviceyearmonth", "claim_in_network" AS "claim_in_network", "dx" AS "dx"
FROM c1022_1.clmdx
Here's the connection information for the two systems (minus any personal details of course)
> odbc::dbGetInfo(db)
$dbname
[1] "HIVE"
$dbms.name
[1] "Hive"
$db.version
[1] "1.2.1.2.3.4.7-4"
$sourcename
[1] "PRD"
$servername
[1] "Hive"
$drivername
[1] "Hortonworks Hive ODBC Driver"
$odbc.version
[1] "03.80.0000"
$driver.version
[1] "2.1.10.1014"
$odbcdriver.version
[1] "03.80"
$supports.transactions
[1] FALSE
attr(,"class")
[1] "Hive" "driver_info" "list"
> odbc::dbGetInfo(con)
$dbname
[1] "validation"
$dbms.name
[1] "PostgreSQL"
$db.version
[1] "9.6.3"
$sourcename
[1] "Validation"
$servername
[1] "lprdawswks0001.cedargatepartners.pvc"
$drivername
[1] "PSQLODBC35W.DLL"
$odbc.version
[1] "03.80.0000"
$driver.version
[1] "10.00.0000"
$odbcdriver.version
[1] "03.51"
$supports.transactions
[1] TRUE
attr(,"class")
[1] "PostgreSQL" "driver_info" "list"
This issue was moved by krlmlr from tidyverse/dplyr/issues/3200.
@OssiLehtinen commented on Jun 7, 2018, 10:29 AM UTC:
Using copy_to from dbplyr with an in_schema name definition results in an error message when working with DBI 1.0.0:
copy_to(con, mtcars, name=in_schema("TEMP", "MTCARS"))
#> Error: Can't unquote TEMP.MTCARS
The code for method dbUnquoteIdentifier in DBI's quote.R seems to have changed, which I suspect is causing the problem. I think it doesn't like the "." produced by in_schema.
The previous version I was using (DBI 0.8.0) didn't have this issue.
Some technical specs: I'm running under Redhat 7, and the database is IBM db2 via odbc.
Please let me know if some essential info is missing, and thanks in advance for any help!
Br, Ossi
This issue was moved by krlmlr from tidyverse/dplyr/issues/3646.
I'm not sure if this will be fixed with tidyverse/dplyr#3433. Also, I don't understand why the second example gives a warning. Thanks @Romanik!
library(tidyverse)
dbplyr::memdb_frame(a = 1, b = 2) %>%
summarize_all(~mean(.))
#> Error in mean(.): object 'a' not found
dbplyr::memdb_frame(a = 1, b = 2) %>%
summarize_all(mean)
#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.22.0 [:memory:]
#> a b
#> <dbl> <dbl>
#> 1 1 2
dbplyr::memdb_frame(a = 1, b = 2) %>%
summarize_if(funs(is.numeric), ~mean(.))
#> Applying predicate on the first 100 rows
#> Error in mean(.): object 'a' not found
Created on 2018-07-25 by the reprex package (v0.2.0).
@chrnin commented on Feb 12, 2018, 3:51 PM UTC:
When I try to insert a fairly large data frame (from a ~80k-row CSV file, around 140 MB of data), PostgreSQL exhausts memory (8 GB RAM + 2 GB swap), because copy_to sends one single large INSERT query with all the data inline, causing my system to randomly kill processes (rsession, postmaster, or anything else with large memory consumption).
The example below crashes on my computer with 8 GB of RAM; to reproduce on a bigger system, increase the range of a accordingly.
tibble(
a=1:400000,
b="Just some boring data to make the dataset grow faster, ok.. That's pretty huge, but I have huge CSV files sometimes.",
c="Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I'm sorry for that…",
d="I wonder what I could say here in order to make this mildly interesting, so I'm gonna share my thoughts (as a r beginner)",
e="I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…",
f="I also think that a solution would be to chunk the data frame into pieces in order to limit query size.",
g="By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…",
h="As a fair new user, I have no idea if this is the kind of things you intend to do with this function.",
i="Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement",
j="But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…",
k="Feel free to reject this issue, I don't even know if it's dumb… I'm going to survive using something else for large files… Thanks for reading :)"
) %>% dplyr::copy_to(
dest = my_beloved_postgresql_database,
df=.,
name='kamikaze_table',
temporary=FALSE,
overwrite=TRUE
)
edit: I'm using dplyr 0.7.4 on ubuntu 17.10.
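Along the lines of the chunking suggested in the tibble above, a sketch of a client-side workaround using plain DBI (write_in_chunks is hypothetical):
library(DBI)
# Hypothetical helper: send a large local data frame as several moderate
# INSERTs instead of one giant statement.
write_in_chunks <- function(con, name, df, chunk_rows = 50000) {
  pieces <- split(df, (seq_len(nrow(df)) - 1) %/% chunk_rows)
  dbWriteTable(con, name, pieces[[1]], overwrite = TRUE)
  for (piece in pieces[-1]) dbWriteTable(con, name, piece, append = TRUE)
}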
This issue was moved by krlmlr from tidyverse/dplyr/issues/3355.
@bkkkk commented on Dec 27, 2017, 8:07 AM UTC:
When collecting the results of an SQL query with duplicated column names, it would be nice to get the warning as soon as possible, so you don't waste time rerunning long queries after fixing the column names.
I will often write and test queries in DataGrip or another SQL client and then move them into R. In the SQL client duplicate column names are fine, but for good reason dplyr doesn't allow them, raising
Error: Columns [COLUMN NAMES] must have unique names
and then discarding the result.
sql("SELECT
main.id,
main.phone,
others.phone
FROM consumers main
INNER JOIN consumers others ON others.id = main.id") %>%
tbl(src = my_source) %>%
collect(n = Inf)
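For reference, the usual workaround is to alias the duplicated columns in the SQL itself before collecting (the aliases here are illustrative):
sql("SELECT
  main.id,
  main.phone AS phone_main,
  others.phone AS phone_other
FROM consumers main
INNER JOIN consumers others ON others.id = main.id") %>%
  tbl(src = my_source) %>%
  collect(n = Inf)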
It would be nice to have dplyr warn the user, automatically assign column names (phone_v1, phone_v2), and collect the results into a tibble; or, in lieu of that, immediately warn the user upon executing the command if possible.
This issue was moved by krlmlr from tidyverse/dplyr/issues/3262.
@ablack3 commented on Dec 1, 2017, 4:48 PM UTC:
I have an ODBC connection to a Vertica database and am using dplyr to query it.
Vertica has types DOUBLE PRECISION and NUMERIC but does not have type DOUBLE.
When I try to convert a variable of type CHARACTER to NUMERIC dplyr generates incorrect SQL code.
Unfortunately I cannot create a fully reproducible example because the issue involves a Vertica database connection.
con <- DBI::dbConnect(odbc::odbc(), dsn = "Vertica")
dbplyr::translate_sql(as.numeric(myVar), con = con)
CAST("myVar" AS DOUBLE)
This will not work since DOUBLE is not a valid type in Vertica.
DOUBLE PRECISION or NUMERIC should work.
I get the same issue when I use as.double()
dbplyr::translate_sql(as.double(myVar), con = con)
CAST("myVar" AS DOUBLE)
However if I leave off the connection argument the correct SQL is generated
dbplyr::translate_sql(as.numeric(myVar))
CAST("myVar" AS NUMERIC)
How do I fix this issue and get started adding support for Vertica?
Thanks!
We’re working to add support for more databases over time, but adding support on your own is surprisingly easy. Submit an issue to dplyr and we’ll help you get started.
https://blog.rstudio.com/2017/06/27/dbplyr-1-1-0/
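As a starting point, a minimal sketch of the kind of translation method a backend can define; the connection class name "Vertica" and the choice of DOUBLE PRECISION are assumptions, not an existing dbplyr API:
library(dbplyr)
# Hypothetical method: override the numeric casts so they emit
# DOUBLE PRECISION instead of DOUBLE.
sql_translate_env.Vertica <- function(con) {
  sql_variant(
    sql_translator(
      .parent = base_scalar,
      as.numeric = sql_cast("DOUBLE PRECISION"),
      as.double = sql_cast("DOUBLE PRECISION")
    ),
    base_agg,
    base_win
  )
}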
This issue was moved by batpigandme from tidyverse/dplyr/issues/3236.
@jnolis commented on Nov 28, 2017, 6:36 PM UTC:
Occasionally, I will need to use pmin or pmax to replace values above or below a threshold, for instance:
require(dplyr)
data_frame(X = 1:10) %>%
mutate(Y = pmax(X,5))
However, when I use dbplyr to try and do this using a SQL Server table, I get the following error:
tbl(con,"OrderInfo") %>%
mutate(Test = pmax(Revenue,0))
Error: <SQL> 'SELECT TOP 10 "CustomerId", "OrderId", "ClientOrderId", "Channel", "OrderDate", "Revenue", "Cost", MAX("Revenue", 0.0) AS "Test"
FROM "OrderInfo"'
nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]The MAX function requires 1 argument(s).
As you can see, the SQL code generator translates pmax() to MAX(); however, SQL Server's MAX() is an aggregate over a single column, not a pairwise maximum.
A cursory Stack Overflow search suggests that a CASE statement could be used for this operation (see the sketch below). I think the SQL code generator should either be updated to do this, or an error should be raised at the time of calling mutate().
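While a proper translation is pending, one workaround sketch is to express the clamp as a conditional; note that on this backend if_else() translates to IIF, so this assumes SQL Server 2012+:
tbl(con, "OrderInfo") %>%
  mutate(Test = if_else(Revenue > 0, Revenue, 0))  # same result as pmax(Revenue, 0)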
This issue was moved by krlmlr from tidyverse/dplyr/issues/3227.
@namarkus commented on Jun 20, 2018, 2:47 PM UTC:
case_when works just fine with dataframes. Using it in a database environment, I'm missing the possibility of defining an else-clause:
translate_sql(case_when(
height > 200 | mass > 200 ~ "large",
species == "Droid" ~ "robot",
TRUE ~ "other"
))
translates to:
(...) WHEN (TRUE) THEN ('other') END
whereas it should translate to:
(...) ELSE ('other') END
This issue was moved by batpigandme from tidyverse/dplyr/issues/3688.
Oracle DBs have a limit of 1000 items on the length of lists (such as those used with IN clauses). See the discussion here.
Is this something that dbplyr would be open to addressing? If so, would one of the two methods suggested in the article above be preferable?
For example this throws an error:
lkp <- as.character(1:1001)
oracle_tbl %>%
filter(ID %in% lkp)
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: HY000: ORA-01795: maximum number of expressions in a list is 1000
While this is ok:
lkp <- as.character(1:1000)
oracle_tbl %>%
filter(ID %in% lkp)
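One client-side workaround sketch, pending support in dbplyr itself: split the lookup vector into chunks of at most 1000 items and combine the per-chunk filters with union_all():
library(dplyr)
library(purrr)
lkp <- as.character(1:1001)
chunks <- split(lkp, ceiling(seq_along(lkp) / 1000))
result <- chunks %>%
  map(~ filter(oracle_tbl, ID %in% .x)) %>%  # each chunk stays under the limit
  reduce(union_all)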
@andreaspano commented on Oct 29, 2017, 8:23 PM UTC:
I would like to contribute to the translation of dplyr for IBM Netezza.
This issue was moved by hadley from tidyverse/dplyr/issues/3178.
I'm running into an issue when using mutate and ifelse (or if) on a SQL Server backend. The following sample query:
df <- tbl(con, in_schema("schema", "table_name")) %>%
  mutate(field = ifelse(field %in% c('test1', 'test2', 'test3'), 'string1', 'string2')) %>%
  collect()
... will give the following translation:
SELECT CASE WHEN (("field" IN ('test1', 'test2', 'test3')) = 'TRUE') THEN ('string1')
WHEN (("field" IN ('test1', 'test2', 'test3')) = 'FALSE') THEN ('string2')
END AS "field"
Which will return an error. The correct translation should be:
SELECT CASE WHEN ("field" IN ('test1', 'test2', 'test3')) THEN 'string1'
WHEN ("field" NOT IN ('test1', 'test2', 'test3')) THEN 'string2'
END AS "field"
Will correcting this in the mssql_if function create any issues in other queries, or is there a better way to do mutate(ifelse()) in mssql?
Line 225 in 477480e
@AjarKeen commented on Jan 8, 2018, 9:44 PM UTC:
cc @javierluraschi since I've only tested this with sparklyr, not with other dbplyr backends.
library(dplyr)
library(sparklyr)
library(nycflights13)
# local version
flights %>%
group_by(carrier) %>%
summarize(count_num = n(),
mean_dep_delay = mean(dep_delay, na.rm = TRUE),
ratio = mean_dep_delay / count_num) %>%
arrange(carrier)
#> # A tibble: 16 x 4
#> carrier count_num mean_dep_delay ratio
#> <chr> <int> <dbl> <dbl>
#> 1 9E 18460 16.7 0.000906
#> 2 AA 32729 8.59 0.000262
#> 3 AS 714 5.80 0.00813
#> 4 B6 54635 13.0 0.000238
#> 5 DL 48110 9.26 0.000193
#> 6 EV 54173 20.0 0.000368
#> 7 F9 685 20.2 0.0295
#> 8 FL 3260 18.7 0.00574
#> 9 HA 342 4.90 0.0143
#> 10 MQ 26397 10.6 0.000400
#> 11 OO 32 12.6 0.393
#> 12 UA 58665 12.1 0.000206
#> 13 US 20536 3.78 0.000184
#> 14 VX 5162 12.9 0.00249
#> 15 WN 12275 17.7 0.00144
#> 16 YV 601 19.0 0.0316
# Spark version
sc <- spark_connect(master = "local")
flights_sdf <- copy_to(sc, flights, "flights")
flights_sdf %>%
group_by(carrier) %>%
summarize(count_num = n(),
mean_dep_delay = mean(dep_delay),
ratio = mean_dep_delay / count_num) %>%
collect()
#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`mean_dep_delay`' given input columns: [dest, dep_delay, distance, dep_time, minute, carrier, origin, sched_arr_time, month, arr_time, day, flight, sched_dep_time, time_hour, arr_delay, air_time, hour, tailnum, year]; line 1 pos 81;
#> 'Aggregate [carrier#38], [carrier#38, count(1) AS count_num#447L, avg(dep_delay#34) AS mean_dep_delay#448, ('mean_dep_delay / 'count_num) AS ratio#449]
#> +- SubqueryAlias flights
#> +- LogicalRDD [year#29, month#30, day#31, dep_time#32, sched_dep_time#33, dep_delay#34, arr_time#35, sched_arr_time#36, arr_delay#37, carrier#38, flight#39, tailnum#40, origin#41, dest#42, air_time#43, distance#44, hour#45, minute#46, time_hour#47]
#>
#> at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
#> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
#> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
#> at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$transformUp$1.apply(TreeNode.scala:289)
#> at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$transformUp$1.apply(TreeNode.scala:289)
#> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
#> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
#> at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#> at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#> at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$4.apply(TreeNode.scala:306)
#> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
#> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
#> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
#> at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#> at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#> at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$4.apply(TreeNode.scala:306)
#> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
#> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
#> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
#> at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
#> at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
#> at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
#> at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$recursiveTransform$1(QueryPlan.scala:289)
#> at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$recursiveTransform$1$1.apply(QueryPlan.scala:293)
#> at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:234)
#> at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:234)
#> at scala.collection.immutable.List.foreach(List.scala:381)
#> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
#> at scala.collection.immutable.List.map(List.scala:285)
#> at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$recursiveTransform$1(QueryPlan.scala:293)
#> at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$6.apply(QueryPlan.scala:298)
#> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
#> at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
#> at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
#> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
#> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
#> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
#> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
#> at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
#> at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
#> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
#> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
#> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
#> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
#> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
#> at java.lang.reflect.Method.invoke(Method.java:498)
#> at sparklyr.Invoke$.invoke(invoke.scala:102)
#> at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
#> at sparklyr.StreamHandler$.read(stream.scala:62)
#> at sparklyr.BackendHandler.channelRead0(handler.scala:52)
#> at sparklyr.BackendHandler.channelRead0(handler.scala:14)
#> at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
#> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
#> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
#> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
#> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
#> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
#> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
#> at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
#> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
#> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
#> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
#> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
#> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
#> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
#> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
#> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
#> at java.lang.Thread.run(Thread.java:748)
# Spark workaround
flights_sdf %>%
group_by(carrier) %>%
mutate(count_num = n(),
mean_dep_delay = mean(dep_delay),
ratio = mean_dep_delay / count_num) %>%
summarize(count_num = mean(count_num),
mean_dep_delay = mean(mean_dep_delay),
ratio = mean(ratio)) %>%
arrange(carrier) %>%
collect()
#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> Warning: Missing values are always removed in SQL.
#> Use `avg(x, na.rm = TRUE)` to silence this warning
#> # A tibble: 16 x 4
#> carrier count_num mean_dep_delay ratio
#> <chr> <dbl> <dbl> <dbl>
#> 1 9E 18460 16.7 0.000906
#> 2 AA 32729 8.59 0.000262
#> 3 AS 714 5.80 0.00813
#> 4 B6 54635 13.0 0.000238
#> 5 DL 48110 9.26 0.000193
#> 6 EV 54173 20.0 0.000368
#> 7 F9 685 20.2 0.0295
#> 8 FL 3260 18.7 0.00574
#> 9 HA 342 4.90 0.0143
#> 10 MQ 26397 10.6 0.000400
#> 11 OO 32.0 12.6 0.393
#> 12 UA 58665 12.1 0.000206
#> 13 US 20536 3.78 0.000184
#> 14 VX 5162 12.9 0.00249
#> 15 WN 12275 17.7 0.00144
#> 16 YV 601 19.0 0.0316
My session info after I run the above:
> devtools::session_info()
Session info ---------------------------------------------------------------------
setting value
version R version 3.4.1 (2017-06-30)
system x86_64, linux-gnu
ui RStudio (1.1.383)
language (EN)
collate en_US.UTF-8
tz <NA>
date 2018-01-08
Packages -------------------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.4.1)
backports 1.1.2 2017-12-13 cran (@1.1.2)
base * 3.4.1 2017-09-07 local
base64enc 0.1-3 2015-07-28 CRAN (R 3.4.1)
bindr 0.1 2016-11-13 CRAN (R 3.4.1)
bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.1)
broom 0.4.3 2017-11-20 cran (@0.4.3)
callr 1.0.0 2016-06-18 CRAN (R 3.4.1)
cli 1.0.0 2017-11-05 CRAN (R 3.4.1)
clipr 0.4.0 2017-11-03 CRAN (R 3.4.1)
compiler 3.4.1 2017-09-07 local
config 0.2 2016-08-02 CRAN (R 3.4.1)
crayon 1.3.4 2017-09-16 CRAN (R 3.4.1)
datasets * 3.4.1 2017-09-07 local
DBI 0.7 2017-06-18 CRAN (R 3.4.1)
dbplyr 1.2.0 2018-01-03 cran (@1.2.0)
devtools 1.13.4 2017-11-09 CRAN (R 3.4.1)
digest 0.6.13 2017-12-14 cran (@0.6.13)
dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.1)
evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
foreign 0.8-69 2017-06-22 CRAN (R 3.4.1)
glue 1.2.0 2017-10-29 CRAN (R 3.4.1)
graphics * 3.4.1 2017-09-07 local
grDevices * 3.4.1 2017-09-07 local
grid 3.4.1 2017-09-07 local
htmltools 0.3.6 2017-04-28 CRAN (R 3.4.1)
httpuv 1.3.5 2017-07-04 CRAN (R 3.4.1)
httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
jsonlite 1.5 2017-06-01 CRAN (R 3.4.1)
knitr 1.17 2017-08-10 CRAN (R 3.4.1)
lattice 0.20-35 2017-03-25 CRAN (R 3.4.1)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.1)
magrittr 1.5 2014-11-22 CRAN (R 3.4.1)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.1)
methods * 3.4.1 2017-09-07 local
mime 0.5 2016-07-07 CRAN (R 3.4.1)
mnormt 1.5-5 2016-10-15 CRAN (R 3.4.1)
nlme 3.1-131 2017-02-06 CRAN (R 3.4.1)
nycflights13 * 0.2.2 2017-01-27 CRAN (R 3.4.1)
openssl 0.9.9 2017-11-10 cran (@0.9.9)
parallel 3.4.1 2017-09-07 local
pillar 1.0.1 2017-11-27 cran (@1.0.1)
pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.1)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.1)
psych 1.7.8 2017-09-09 CRAN (R 3.4.1)
purrr 0.2.4 2017-10-18 CRAN (R 3.4.1)
R6 2.2.2 2017-06-17 CRAN (R 3.4.1)
Rcpp 0.12.14 2017-11-23 cran (@0.12.14)
reprex * 0.1.1 2017-01-13 CRAN (R 3.4.1)
reshape2 1.4.3 2017-12-11 cran (@1.4.3)
rlang 0.1.6 2017-12-21 cran (@0.1.6)
rmarkdown 1.7 2017-11-10 CRAN (R 3.4.1)
rprojroot 1.3-2 2018-01-03 cran (@1.3-2)
rstudioapi 0.7 2017-09-07 CRAN (R 3.4.1)
shiny 1.0.5 2017-08-23 CRAN (R 3.4.1)
sparklyr * 0.7.0-9106 2018-01-08 Github (rstudio/sparklyr@41d145a)
stats * 3.4.1 2017-09-07 local
stringi 1.1.5 2017-04-07 CRAN (R 3.4.1)
stringr 1.2.0 2017-02-18 CRAN (R 3.4.1)
tibble 1.4.1 2017-12-25 cran (@1.4.1)
tidyr 0.7.2 2017-10-16 CRAN (R 3.4.1)
tools 3.4.1 2017-09-07 local
utf8 1.1.3 2018-01-03 cran (@1.1.3)
utils * 3.4.1 2017-09-07 local
whisker 0.3-2 2013-04-28 CRAN (R 3.4.1)
withr 2.1.1 2017-12-19 cran (@2.1.1)
xtable 1.8-2 2016-02-05 CRAN (R 3.4.1)
yaml 2.1.14 2016-11-12 CRAN (R 3.4.1)
I'm not sure if this is related to a similar-sounding issue with mutate / rename that was resolved in dbplyr 1.2.0. I can work around it for now, but it will get clunky fast as I get into more complex aggregation operations with sparklyr.
This issue was moved by krlmlr from tidyverse/dplyr/issues/3295.
@jnolis commented on Jan 16, 2018, 10:22 PM UTC:
Currently, when converting from R to a SQL Server query, d[b]plyr incorrectly handles as.integer64(). This is the function I would expect to use to cast to a bigint within SQL Server. What ends up happening is that dplyr uses "AS.INTEGER64" as a SQL command:
CREATE TABLE Example (Number int)
tbl(con,"Example") %>%
mutate(Number64 = as.integer64(Number))
Error: <SQL> 'SELECT TOP 10 "Number", AS.INTEGER64("Number") AS "Number64"
FROM "Example"'
nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Incorrect syntax near the keyword 'AS'.
Which wouldn't happen if the SQL Query was the correct:
SELECT TOP 10 Number, CAST(Number AS BIGINT) AS NUMBER64
FROM EXAMPLE
This issue was moved by krlmlr from tidyverse/dplyr/issues/3305.
@mkirzon commented on Apr 5, 2018, 11:38 PM UTC:
When using dplyr directly against databases, it'd be helpful to have a NOLOCK option when configuring table connections.
This is a common practice in MSSQL to ensure complex queries don't lock the database for other reads/writes. The tradeoff is that the read may be uncommitted, so this should be a toggleable option.
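For reference, a workaround sketch that works today is to embed the hint in raw SQL when creating the tbl (the table name is illustrative):
tbl(con, sql("SELECT * FROM dbo.OrderInfo WITH (NOLOCK)"))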
This issue was moved by krlmlr from tidyverse/dplyr/issues/3484.
@CerebralMastication commented on Apr 15, 2018, 1:56 PM UTC:
This is a cross post from the RStats Community, where I initially posted it: https://community.rstudio.com/t/dplyr-arrange-by-group-true-fails-with-sql-backend/7232
I've either got a misunderstanding or a bug... I think it's a bug.
It seems that dplyr::arrange() with the .by_group = TRUE parameter set produces SQL with an error. Here's how to reprex it:
On the DB (Redshift in my case) set up a dummy table:
drop TABLE sandbox.testorder;
CREATE TABLE sandbox.testorder (
grp varchar(255),
n DOUBLE PRECISION
);
INSERT INTO sandbox.testorder (grp , n) VALUES ('a',3.3);
INSERT INTO sandbox.testorder (grp , n) VALUES ('a',1.1);
INSERT INTO sandbox.testorder (grp , n) VALUES ('b',2.2);
INSERT INTO sandbox.testorder (grp , n) VALUES ('b',4.4);
Then from R
(presuming a connection to the DB called con and dbplyr already loaded)
testorder <- tbl(con, "testorder")
testorder %>%
group_by( grp ) %>%
arrange( n, .by_group=TRUE) ->
out_test
show_query(out_test)
which generates the following SQL:
SELECT *
FROM "testorder"
ORDER BY "n", TRUE
which fails if I try to collect(out_test)
with the following error:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: non-integer constant in ORDER BY
The rub seems to be the , TRUE there at the end. If I remove it, I get the following runnable SQL:
SELECT *
FROM "testorder"
ORDER BY "n"
My guess is that the routine that generates the SQL has a glitch. It looks like it's just passing , TRUE instead of adding in the group_by variables.
This issue was moved by krlmlr from tidyverse/dplyr/issues/3515.
@hadley commented on Oct 23, 2017, 10:12 PM UTC:
Need to verify which connection methods actually get used (I think just escaping, and the rest is just S3 dispatch) and then provide some lightweight way for users to select their own translation layer.
Maybe something like:
con <- custom_translation(jdbcConnection, "Oracle")
with arguments to override default quoting for identifiers and strings?
All methods would need to redispatch to wrapped object.
This issue was moved by hadley from tidyverse/dplyr/issues/3160.
Calling compute on a Teradata connection does not work and throws the following error:
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42000: [Teradata][ODBC Teradata Driver][Teradata Database](-3707)Syntax error, expected something like a 'METHOD' keyword between the 'CREATE' keyword and the 'TEMPORARY' keyword.
I think you need to use the keyword VOLATILE instead of TEMPORARY for Teradata, right?
@geotheory commented on Aug 28, 2018, 12:02 PM UTC:
I've detailed this question on this SO page.
I can't figure out how to apply a non-case-sensitive filter query to a remote PostgreSQL table using dplyr. To demonstrate:
require(dplyr)
require(stringr)
require(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="mydb", host="localhost", port=5432, user="username")
# create db table
copy_to(con, iris, "iris", temporary = FALSE)
# dplyr remote database table
iris_pg <- tbl(con, "iris")
iris_pg %>% filter(str_detect(Species, 'setosa')) %>% head(3) %>% collect()
# A tibble: 3 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
* <dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
iris_pg %>% filter(str_detect(Species, 'Setosa')) %>% head(3) %>% collect()
# A tibble: 0 x 0
To ignore case, stringr::fixed('Setosa', ignore_case = TRUE) works with tibble filtering. But with the postgres table it has no effect:
iris_pg %>% filter(str_detect(Species, stringr::fixed('SETOSA', ignore_case=TRUE))) %>% head(3) %>% collect()
# A tibble: 0 x 0
Is there any way around this, or possibly dev plans to accommodate the ignore_case argument of fixed() or regex() for database table filtering?
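One workaround sketch, assuming tolower() and str_detect() both translate on your backend: normalise the case yourself, which dbplyr turns into LOWER() in the generated SQL:
iris_pg %>%
  filter(str_detect(tolower(Species), 'setosa')) %>%
  head(3) %>%
  collect()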
This issue was moved by romainfrancois from tidyverse/dplyr#3783.
@sirallen commented on May 25, 2018, 4:46 PM UTC:
A dplyr database query appears to fail when the same window function is called twice. The following query is meant to find up to the first four unique destinations (by calendar date/time) for each carrier in the nycflights13::flights dataset:
library(dbplyr)
library(dplyr)
library(nycflights13)
localdb <- src_postgres(dbname = 'dbname',
host = 'localhost',
port = 5432,
user = 'user',
password = 'password')
copy_to(localdb, flights, "flights",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
# Has row_number() twice
incorrect <- tbl(localdb, 'flights') %>%
group_by(carrier, dest) %>%
arrange(carrier, dest, time_hour) %>%
filter(row_number() == 1L) %>%
group_by(carrier) %>%
arrange(carrier, time_hour) %>%
filter(row_number() < 5L) %>%
select(carrier, dest, time_hour) %>%
collect()
head(incorrect, 10)
# A tibble: 10 x 3
# Groups: carrier [4]
# carrier dest time_hour
# <chr> <chr> <dttm>
# 1 9E ATL 2013-01-03 22:00:00
# 2 9E AUS 2013-02-01 08:00:00
# 3 9E AVL 2013-04-14 02:00:00
# 4 9E BGR 2013-10-17 14:00:00
# 5 AA AUS 2013-01-01 09:00:00
# 6 AA BOS 2013-01-01 00:00:00
# 7 AA DFW 2012-12-31 22:00:00
# 8 AA EGE 2013-01-01 09:00:00
# 9 AS SEA 2012-12-31 23:00:00
#10 B6 ABQ 2013-04-22 09:00:00
# Has row_number() and min_rank() -- selects ties as well, so not exactly what is intended,
# but the results are ordered correctly compared to above
correct <- tbl(localdb, 'flights') %>%
group_by(carrier, dest) %>%
arrange(carrier, dest, time_hour) %>%
filter(row_number() == 1L) %>%
group_by(carrier) %>%
arrange(carrier, time_hour) %>%
filter(min_rank(time_hour) < 5L) %>%
select(carrier, dest, time_hour) %>%
collect()
head(correct, 10)
# A tibble: 10 x 3
# Groups: carrier [2]
# carrier dest time_hour
# <chr> <chr> <dttm>
# 1 9E MSP 2013-01-01 00:00:00
# 2 9E BUF 2013-01-01 06:00:00
# 3 9E ROC 2013-01-01 07:00:00
# 4 9E IAD 2013-01-01 07:00:00
# 5 9E SYR 2013-01-01 07:00:00
# 6 9E BWI 2013-01-01 07:00:00
# 7 9E ORD 2013-01-01 07:00:00
# 8 9E IND 2013-01-01 07:00:00
# 9 AA MIA 2012-12-31 21:00:00
#10 AA SJU 2012-12-31 22:00:00
This issue was moved by krlmlr from tidyverse/dplyr/issues/3602.
add support for tidyr::fill
@EdwardJRoss commented on Jan 17, 2018, 5:10 AM UTC:
I am trying to send a table to an Oracle database via ODBC, overwriting it if it already exists.
The copy_to method in Oracle via ODBC fails when overwrite=TRUE because "DROP TABLE IF EXISTS" isn't valid syntax in Oracle.
Example (without specific credentials), tested with Oracle 11.2:
con <- DBI::dbConnect(odbc::odbc(), ...)
dplyr::copy_to(con, iris, overwrite=TRUE)
#> Error: <SQL> 'DROP TABLE IF EXISTS iris'
#> nanodbc/nanodbc.cpp:1587: HY000: [Oracle][ODBC][Ora]ORA-00933: SQL command not properly ended
dplyr::copy_to(con, iris, overwrite=TRUE)
#> Error in connection_begin(conn@ptr) : Double begin
Expected result:
The table iris is loaded to the database, being overwritten if it already exists.
Actual Result:
There is an error, and the connection can no longer be used (it seems the transaction isn't aborted on failure).
I also tried using ROracle but copy_to seems to fail when attempting to start a transaction:
con <- DBI::dbConnect(DBI::dbDriver("Oracle"), ...)
dplyr::copy_to(con, iris)
#> Error in (function (classes, fdef, mtable) :
#> unable to find an inherited method for function 'dbBegin' for signature '"OraConnection"'
This issue was moved by krlmlr from tidyverse/dplyr/issues/3306.
@cderv commented on Apr 18, 2018, 5:20 PM UTC:
I work with an Oracle connection, so it is difficult to make a reprex, sorry. I will just use the code as-is, as I think it is enough to see the issue in the SQL translation.
When using dplyr on an Oracle connection, I hit an error with this code:
tu_cc_pt %>%
mutate(niv_u = case_when(
PT_NIVEAU_TENSION == "90 KV" ~ "HT",
PT_NIVEAU_TENSION == "63 KV" ~ "HT",
TRUE ~ PT_NIVEAU_TENSION)) %>%
collect()
The error message is
Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: ORA-00920: invalid relational operator
A call to show_query instead of collect gives the following SQL:
SELECT "ID_POSTE_TECHNIQUE", "PT_IDR", "PT_CUR", "PT_TYPE", "PT_STATUT_SEQ", "PT_STATUT_NON_SEQ", "PT_NIVEAU_TENSION", "PT_FILERIE_LIBELLE", "PT_TECHNOLOGIE", "PT_PLAN_PROTECTION", "PT_DT_PREM_MISE_SS_TENSION", "PT_CM", "PT_GMR", CASE
WHEN ("PT_NIVEAU_TENSION" = '90 KV') THEN ('HT')
WHEN ("PT_NIVEAU_TENSION" = '63 KV') THEN ('HT')
WHEN (TRUE) THEN ("PT_NIVEAU_TENSION")
END AS "niv_u"
FROM (DMS_SIDONI_DL_RO.DMBC_CC_PT)
This has been mentioned in #2928 but it seems it was not fixed.
I think the problem is the WHEN (TRUE), which is not valid in Oracle. It should be an ELSE:
SELECT "ID_POSTE_TECHNIQUE", "PT_IDR", "PT_CUR", "PT_TYPE", "PT_STATUT_SEQ", "PT_STATUT_NON_SEQ", "PT_NIVEAU_TENSION", "PT_FILERIE_LIBELLE", "PT_TECHNOLOGIE", "PT_PLAN_PROTECTION", "PT_DT_PREM_MISE_SS_TENSION", "PT_CM", "PT_GMR", CASE
WHEN ("PT_NIVEAU_TENSION" = '90 KV') THEN ('HT')
WHEN ("PT_NIVEAU_TENSION" = '63 KV') THEN ('HT')
ELSE ("PT_NIVEAU_TENSION")
END AS "niv_u"
FROM (DMS_SIDONI_DL_RO.DMBC_CC_PT)
What do you think? If you need another, simpler example, please tell me and I will work on one.
This issue was moved by krlmlr from tidyverse/dplyr/issues/3521.
@CarolineBarret commented on Aug 2, 2018, 1:14 PM UTC:
I am working with R 3.4.3 and dplyr 0.7.4.
I am trying to apply the case_when() function to a tibble object from a database. But when I combine the case_when() function with the mutate() function, I get an error:
dir.create("tmp")
# The datasets.sqlite is just a sqlite database
# It contains a table called iris with the iris data frame in it
file.copy("data/datasets.sqlite", "tmp")
# Connect to the database
con <- DBI::dbConnect(RSQLite::SQLite(), "tmp/datasets.sqlite")
cake <- dplyr::mutate(dplyr::tbl(con, "iris"),
type = dplyr::case_when(
Sepal.Length > 5.8 | Petal.Length > 3.7 ~ "long",
Sepal.Width > 3 | Petal.Width > 1.2 ~ "wide",
TRUE ~ "other"
)
)
This gives me the following error: Error in eval_bare(f[[2]], env) : object 'Sepal.Length' not found
I think this has to do with the fact that I use mutate(), because I do not get an error with the following:
ir <- dplyr::tbl(con, "iris")
cake <- dplyr::case_when(
ir$Sepal.Length > 5.8 | ir$Petal.Length > 3.7 ~ "long",
ir$Sepal.Width > 3 | ir$Petal.Width > 1.2 ~ "wide",
TRUE ~ "other"
)
It also has to do with the lazy tbl output (and not with data.frames), because this works fine too:
cake <- dplyr::mutate(DBI::dbReadTable(con, "iris"),
type = dplyr::case_when(
Sepal.Length > 5.8 | Petal.Length > 3.7 ~ "long",
Sepal.Width > 3 | Petal.Width > 1.2 ~ "wide",
TRUE ~ "other"
)
)
This issue was moved by krlmlr from tidyverse/dplyr#3736.
@edoardomichielon commented on Jun 20, 2018, 11:10 AM UTC:
When I use a connection to an SQLite database on disk, the verb n_distinct() returns an error if given two or more columns. The same code works properly if I use just one column, or if I collect the data before counting records.
# remove objects
rm(list = ls())
# require packages
require(dplyr)
require(dbplyr)
require(RSQLite)
# create a sqlite db and connect to it
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# copy data into sqlite
copy_to(con, mtcars)
# point to table
my_tbl <- tbl(con, "mtcars")
# n_distinct with one column (THIS WORKS)
my_tbl %>% group_by(gear) %>% summarise(n_distinct(mpg))
# if I use the local (or collected) data it is ok (THIS WORKS)
my_tbl %>% collect() %>% group_by(gear) %>% summarise(n_distinct(mpg, cyl))
# n_distinct with two columns (THIS DOES NOT WORKS)
my_tbl %>% group_by(gear) %>% summarise(n_distinct(mpg, cyl))
## Error in result_create(conn@ptr, statement) :
## wrong number of arguments to function COUNT()
The code is correctly translated into SQL:
# Show query
my_tbl %>% group_by(gear) %>% summarise(n_distinct(mpg, cyl)) %>% show_query()
## SELECT `gear`, COUNT(DISTINCT `mpg`, `cyl`) AS `n_distinct(mpg, cyl)`
## FROM `mtcars`
## GROUP BY `gear`
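A workaround sketch: take distinct() rows first and then count per group, which avoids the multi-argument COUNT(DISTINCT ...) that SQLite rejects:
my_tbl %>%
  distinct(gear, mpg, cyl) %>%
  count(gear)  # rows per gear == n_distinct(mpg, cyl) within gear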
This issue was moved by batpigandme from tidyverse/dplyr/issues/3687.
I understand there has been a decision to have dbplyr issues in their own repo (change in c98ff1f).
There is still a link in the DESCRIPTION file to report bugs in dplyr.
Line 16 in 79d2a03
Also, the pkgdown site is not up to date, but as the README has been modified, this should be OK on the next build. I do not know your timeframe for updating the website.
I've been trying to join two tables which have slightly different variable names, for instance start_date and day_start_date.
I've been trying to write something like:
inner_join(calendar.dim, by = c("start_date" = "day_start_date"))
I have googled quite a bit and read through the documentation and can't seem to find where this exists.
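For reference, the named-vector form of by does exist in dplyr's join verbs; the call just needs both tables, e.g. (fact_table is illustrative):
fact_table %>%
  inner_join(calendar.dim, by = c("start_date" = "day_start_date"))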
Here is a reprex using SQLite's date command:
library(odbc)
library(DBI)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dates_tbl <- copy_to(con, mtcars)
dates_tbl %>%
mutate(b = date(mpg))
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.22.0 [:memory:]
#> mpg cyl disp hp drat wt qsec vs am gear carb b
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 -471~
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 -471~
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 -471~
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 -471~
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 -471~
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 -471~
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 -471~
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 -471~
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 -471~
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 -471~
#> # ... with more rows
dates_tbl %>%
select(b = date(mpg))
#> Error in date(mpg): unused argument (mpg)
dbDisconnect(con)
Created on 2018-08-20 by the reprex package (v0.2.0).
@colearendt commented on Jul 20, 2018, 10:33 AM UTC:
Definition of n_distinct:
function (..., na.rm = FALSE)
{
n_distinct_multi(list(...), na.rm)
}
<environment: namespace:dplyr>
And the translation:
n_distinct = function(...) {
vars <- sql_vector(list(...), parens = FALSE, collapse = ", ")
build_sql("COUNT(DISTINCT ", vars, ")")
}
Lines 199 to 203 in 79d2a03
Lines 272 to 275 in 79d2a03
Which results in:
pd %>% summarise(n_distinct(fld_integer, na.rm = TRUE)) %>% show_query()
#<SQL>
#SELECT COUNT(DISTINCT "fld_integer", TRUE AS "na.rm") AS "n_distinct(fld_integer, na.rm = TRUE)"
#FROM "testdata"
na.rm should not be included in the splicing.
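Based on the definition and translation quoted above, a sketch of one possible fix is to accept na.rm as a named argument so it never reaches the splice:
n_distinct = function(..., na.rm = FALSE) {
  # na.rm is accepted for compatibility but ignored:
  # COUNT(DISTINCT ...) already skips NULLs in SQL
  vars <- sql_vector(list(...), parens = FALSE, collapse = ", ")
  build_sql("COUNT(DISTINCT ", vars, ")")
}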
This issue was moved by krlmlr from tidyverse/dplyr#3720.
SQLite 3.25 adds support for window functions; according to the developers, the implementation mimics PostgreSQL.
https://www.sqlite.org/draft/windowfunctions.html
https://www.sqlite.org/draft/releaselog/3_25_0.html
Would also need to update the vignette section, "Creating your own database", as it mentions the current lack of window function support.
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
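Once support lands (and with an RSQLite linked against SQLite >= 3.25), window-function translations like the following sketch should start working:
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  mutate(rank = min_rank(desc(mpg))) %>%  # translates to RANK() OVER (...)
  show_query()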
The first() function in dbplyr returns whatever expression it's given as a string, instead of returning the first value (which should respect previous group_by() and arrange()).
It was discovered in this Stack Overflow question. That table (and the version I reproduced locally) is PostgreSQL, but the problem can be reproduced in SQLite:
library(dplyr)
batting <- tbl(dbplyr::lahman_sqlite(), "Batting")
tbl(dbplyr::lahman_sqlite(), "Batting") %>%
arrange(yearID) %>%
group_by(playerID) %>%
summarize(first = first(G))
Returns:
# Source: lazy query [?? x 2]
# Database: sqlite 3.22.0 [/var/folders/8p/xzrrqphx2qb3d2s_fgqrk5xr0000gn/T//Rtmpe0JeOa/lahman.sqlite]
playerID first
<chr> <chr>
1 aardsda01 `G`
2 aaronha01 `G`
3 aaronto01 `G`
4 aasedo01 `G`
5 abadan01 `G`
6 abadfe01 `G`
7 abadijo01 `G`
8 abbated01 `G`
9 abbeybe01 `G`
10 abbeych01 `G`
# ... with more rows
This matches the result of show_query() on the above.
<SQL>
SELECT `playerID`, '`G`' AS `first`
FROM (SELECT *
FROM `Batting`
ORDER BY `yearID`)
GROUP BY `playerID`
It has a similar behavior with first(blabla), or another expression that's not a column, and without group_by() (it can thus be reproduced with just tbl(dbplyr::lahman_sqlite(), "Batting") %>% summarize(first(G))).
Hi,
When I use DBI::dbExecute(con, statement) to send a CREATE VIEW / CREATE TABLE statement to Teradata and then run a SHOW VIEW query to pull the statement back, the text consistently gets messed up: the \n characters are missing. I have tested sending an identical .sql file to the database using another IDE, and the \n were kept fine.
DBI * 1.0.0 2018-05-02 cran (@1.0.0)
dbplyr 1.2.1 2018-02-19 cran (@1.2.1)
Teradata Version: 15.10.0711 15.10.07.11B
@janusvm commented on Jun 25, 2018, 8:26 AM UTC:
For an SQL backend, using desc() or between() inside an arrange() or filter(), respectively, returns an error if the functions are namespace-qualified.
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
lf <- dbplyr::lazy_frame(x = 1, y = 1,
src = dbplyr::simulate_dbi())
# Works
lf %>%
arrange(desc(y)) %>%
show_query()
#> <SQL> SELECT *
#> FROM "df"
#> ORDER BY "y" DESC
lf %>%
filter(between(y, 0, 1)) %>%
show_query()
#> <SQL> SELECT *
#> FROM "df"
#> WHERE ("y" BETWEEN 0.0 AND 1.0)
# Doesn't work
lf %>%
dplyr::arrange(dplyr::desc(y)) %>%
dplyr::show_query()
#> Error in dplyr::desc(y): object 'y' not found
lf %>%
dplyr::filter(dplyr::between(y, 0, 1)) %>%
dplyr::show_query()
#> Error in dplyr::between(y, 0, 1): object 'y' not found
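A speculative illustration of the likely cause (an assumption, not dbplyr's actual implementation): the translator matches calls by their bare function name, and a qualified call has `::` at its head, so no translation is found and the call ends up evaluated in R, where the column y doesn't exist.
call1 <- quote(desc(y))
call2 <- quote(dplyr::desc(y))
is.symbol(call1[[1]])  # TRUE: the symbol `desc` can match a translation
is.symbol(call2[[1]])  # FALSE: the head is the call `dplyr::desc`,
                       # so it falls through to evaluation in R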
This issue was moved by batpigandme from tidyverse/dplyr/issues/3696.
Casting to string in BigQuery uses CAST(column_name AS STRING). When dbplyr translates as.character(column_name), it writes CAST(column_name AS TEXT), which fails.
> library(dbplyr)
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dbplyr_1.2.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 dplyr_0.7.6 assertthat_0.2.0 R6_2.2.2
[5] DBI_1.0.0 magrittr_1.5 pillar_1.2.3 rlang_0.2.2
[9] bindrcpp_0.2.2 tools_3.4.4 glue_1.3.0 purrr_0.2.5
[13] yaml_2.1.19 compiler_3.4.4 pkgconfig_2.0.2 bindr_0.1.1
[17] tidyselect_0.2.4 tibble_1.4.2
Example:
project <- "project-name"
billing <- "billing-name"
con <- dbConnect(
bigrquery::bigquery(),
project = project,
billing = billing
)
# a date dimensional table
date_dim <- tbl(con, "project.schema.table")
## illustrative of date_dim
#date_dim <- data.frame(date_key = c(20180101:20180131), fiscal_month_name = rep("JAN", 31))
date_dim %>%
select(DATE_KEY, FISCAL_MONTH_NAME) %>%
mutate(date_key_str = as.character(DATE_KEY)) %>%
show_query()
<SQL>
SELECT `DATE_KEY`, `FISCAL_MONTH_NAME`, CAST(`DATE_KEY` AS TEXT) AS `date_key_str`
FROM (SELECT `DATE_KEY`, `FISCAL_MONTH_NAME`
FROM `project.schema.table`) `vnzsigtmhn`
date_dim %>%
select(DATE_KEY, FISCAL_MONTH_NAME) %>%
mutate(date_key_str = as.character(DATE_KEY)) %>%
collect()
Attempting to collect then produces this error:
Error: Type not found: TEXT at [2:66] [invalidQuery]
Unfortunately, hard-coding the CAST into the mutate does not work either, since SQL's AS keyword is not valid R syntax:
> date_dim %>%
+ select(DATE_KEY, FISCAL_MONTH_NAME) %>%
+ mutate(date_key_str = CAST(DATE_KEY as STRING))
Error: unexpected symbol in:
" select(DATE_KEY, FISCAL_MONTH_NAME) %>%
mutate(date_key_str = CAST(DATE_KEY as"
EDIT: The original script was running on an old version of R; I updated, and the issue still persists.
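A possible workaround sketch (an assumption that the raw SQL is passed through unchanged, not a confirmed fix): inject the correct cast with dbplyr::sql():
date_dim %>%
  select(DATE_KEY, FISCAL_MONTH_NAME) %>%
  # sql() embeds a literal SQL fragment in place of a translated expression
  mutate(date_key_str = sql("CAST(DATE_KEY AS STRING)")) %>%
  show_query()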
@hadley commented on May 20, 2018, 1:44 PM UTC:
Show how to use lazy_frame() plus simulate_xyz() to generate a SQL translation reprex.
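For instance, a translation reprex needs no live connection. A minimal example (simulate_mssql() is one of the real simulate_*() functions; the column names here are made up):
library(dplyr, warn.conflicts = FALSE)
lf <- dbplyr::lazy_frame(x = 1, src = dbplyr::simulate_mssql())
lf %>%
  mutate(z = as.character(x)) %>%
  show_query()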
This issue was moved by krlmlr from tidyverse/dplyr/issues/3585.
@foundinblank commented on Apr 25, 2018, 12:28 PM UTC:
Hi,
I'm unable to pull views via my RMariaDB connection. Here's an example below where I'm trying to pull a view called video_product_lookup, but I can replicate the issue with all other views. I can pull regular tables just fine.
Thanks,
Adam
# Load packages
library(RMariaDB)
#> Warning: package 'RMariaDB' was built under R version 3.4.4
library(tidyverse)
library(DBI)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
# Connect to database
reporting <- dbConnect(RMariaDB::MariaDB(), group = "reporting")
dbListTables(reporting) # This does list the table (view) I want to connect to
#> ...
#> [67] "video_product_lookup"
#> ...
# Get the view
df <- tbl(reporting, "video_product_lookup")
#> Error in result_create(conn@ptr, statement, is_statement): Prepared statement needs to be re-prepared [1615]
# Get the view a different way
df <- dbReadTable(reporting, "video_product_lookup")
#> Error in result_create(conn@ptr, statement, is_statement): Prepared statement needs to be re-prepared [1615]
# Get a regular table - this works fine.
df <- tbl(reporting, "biz_orders")
Created on 2018-04-25 by the reprex package (v0.2.0).
This issue was moved by batpigandme from tidyverse/dplyr/issues/3541.
@edgararuiz commented on May 24, 2018, 4:00 PM UTC:
Adding support in the dbplyr::simulate_###() functions for the new src/tbl class being passed in the dev version of dbplyr will allow us to add tests for the customizations (ref: tidyverse/dbplyr@81419ca).
Using a PR's version of dbplyr, this should use MINUS instead of EXCEPT:
library(dplyr, warn.conflicts = FALSE)
packageVersion("dbplyr")
#> [1] '1.2.1.9001'
lf1 <- dbplyr::lazy_frame(x = 1:4, src = dbplyr::simulate_oracle())
lf2 <- dbplyr::lazy_frame(x = 1:4, src = dbplyr::simulate_oracle())
lf1 %>% setdiff(lf2) %>% show_query()
#> <SQL> (SELECT *
#> FROM `df`)
#> EXCEPT
#> (SELECT *
#> FROM `df`)
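The expected query for Oracle would presumably be the same shape with MINUS in place of EXCEPT:
#> <SQL> (SELECT *
#> FROM `df`)
#> MINUS
#> (SELECT *
#> FROM `df`)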
This issue was moved by krlmlr from tidyverse/dplyr/issues/3598.
@tslumley commented on Apr 27, 2018, 1:46 AM UTC:
When summarise_at uses a variable in .vars that it also uses in .funs, the result is different in dplyr and dbplyr, and I think the dbplyr result is the right one -- though there's a case for just making it an error.
library(dplyr)
library(MonetDBLite)
library(DBI)
mydb <- dbConnect(MonetDBLite::MonetDBLite())
DBI::dbWriteTable(mydb, "mtcarsdb", mtcars)
mtcars.db = tbl(mydb, "mtcarsdb")
## wrong
mtcars %>% summarise_at(vars(cyl,disp,hp,drat), funs(sum(. * cyl * mpg)))
## right
mtcars.db %>% summarise_at(vars(cyl,disp,hp,drat), funs(sum(. * cyl * mpg)))
## also right
mtcars %>% mutate(cyl1=cyl) %>% summarise_at(vars(cyl,disp,hp,drat), funs(sum(. * cyl1 * mpg)))
# intended result
with(mtcars, cbind(cyl * cyl * mpg,
disp * cyl * mpg,
hp * cyl * mpg,
drat * cyl * mpg) %>% colSums())
It looks as though in R the variable cyl is updated before it is used for the 2nd through 4th sums, but in SQL it isn't.
This issue was moved by krlmlr from tidyverse/dplyr/issues/3547.
@simon-anasta commented on Aug 27, 2018, 2:14 AM UTC:
When writing queries using dbplyr for SQL Server, ifelse and case_when produce different SQL code, and the SQL code from case_when results in an exception. See also this question.
Sorry the below is not a perfect reprex; security restrictions prevent me from including business-specific details in my example.
# libraries
library(DBI)
library(dplyr)
library(dbplyr)
# establish connection to database table
connection_string = "database.specific.string"
# mine looks something like "DRIVER=...; Trusted_Connection=...; DATABASE=...; SERVER=..."
db_connection = dbConnect(odbc::odbc(), .connection_string = connection_string)
my_table = tbl(db_connection, from = my_table_name)
# attempted query
tmp = my_table %>%
mutate(new_col = case_when(col1 == col2 ~ "a",
TRUE ~ "b"))
# check SQL code for query
show_query(tmp)
# compared to
tmp = my_table %>%
mutate(new_col = ifelse(col1 == col2, "a", "b"))
# check SQL code for query
show_query(tmp)
The resulting SQL query from the case_when command takes the form:
SELECT
col1, col2,
CASE
WHEN CONVERT(BIT, IIF(col1 = col2, 1.0, 0.0))) THEN ('a')
WHEN (TRUE) THEN ('b')
END AS new_col
FROM my_database.my_table_name
This produces an exception: "An expression of non-boolean type specified in a context where a condition is expected, near 'THEN'"
The resulting SQL query from the ifelse command takes the form:
SELECT
col1, col2,
CASE
WHEN (CONVERT(BIT, IIF(col1 = col2, 1.0, 0.0)))) = TRUE THEN ('a')
WHEN (CONVERT(BIT, IIF(col1 = col2, 1.0, 0.0)))) = FALSE THEN ('b')
END AS new_col
FROM my_database.my_table_name
This works as expected.
As best I can determine, this is a bug. I suspect the SQL syntax from case_when should be more similar to the SQL syntax from ifelse; in particular, the additional brackets and the "= TRUE" in the ifelse version should probably also appear in the case_when version.
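For reference, a CASE expression that SQL Server accepts keeps a boolean predicate after each WHEN. A hand-written example of the intended query (not dbplyr output):
SELECT
col1, col2,
CASE
WHEN col1 = col2 THEN 'a'
ELSE 'b'
END AS new_col
FROM my_database.my_table_name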
This issue was moved by romainfrancois from tidyverse/dplyr#3774.
'::' or ':::' import not declared from: ‘withr’
@emilyriederer commented on Aug 27, 2017, 7:16 PM UTC:
Base R integer division (%/%) is translated to normal division (/) in SQL.
Apologies for the non-reprex; I couldn't think of a way around the fact that a user must establish a connection to run this. When run against an AWS Redshift DB, data$z contains the values 0, 1, and 2, while data_db$z contains floating-point values.
The cause is that %/% is translated as / without accounting for integer division. One correct SQL translation (at least for Redshift) would be CAST("x" / 5 AS INTEGER) or FLOOR("x" / 5).
library(dplyr)
data <- data.frame(x=as.integer(1:10))
data_db <- copy_to(con, data, "data_db", temporary = FALSE)
data <- mutate(data, z = x %/% 5)
data_db <- mutate(data_db, z = x %/% 5)
data
collect(data_db)
show_query(data_db)
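A sketch of a possible corrected translation entry (an assumption, not the actual dbplyr fix), in the same build_sql() style as the n_distinct translation shown earlier:
library(dbplyr)
# hypothetical translation for %/%: floor the quotient so the result
# stays integral, matching R's integer-division semantics
`%/%` <- function(x, y) {
  build_sql("FLOOR(", x, " / ", y, ")")
}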
This issue was moved by krlmlr from tidyverse/dplyr/issues/3057.