rdatatable / data.table

R's data.table package extends data.frame:

Home Page: http://r-datatable.com

License: Mozilla Public License 2.0

R 63.32% C 35.27% Batchfile 1.21% Shell 0.08% Makefile 0.03% C++ 0.04% CSS 0.01% Dockerfile 0.04%

data.table's Issues

[R-Forge #5442] Possible wrong assignment to list columns

Submitted by: Michele Carriero; Assigned to: Arun ; R-Forge link

Hello, please consider the following two assignments to an element of a list column, which I think should not give the same result:

library(data.table)
dt <- data.table(id=1:2, comment=vector(mode="list", length=2))

dt[1L, comment := 1]
a<-dt$comment
dt[1L, comment := list(list(1))]
b<-dt$comment  

> identical(a, b)
# [1] TRUE

This possible bug came up after Arun's answer on SO.

[R-Forge #5380] error with sum(.SD) and .SDcols

Submitted by: Jonathan Owen; Assigned to: Nobody; R-Forge link

I believe this didn't generate an error with earlier revisions. I'm using 1.8.11, r1158.

dt=data.table(id = 1:10, val1=10:1, val2=10:1)
dt[, sum(.SD), .SDcols="val1", by=id]
Error in gsum(.SD) : grpn [10] != length(x) [0] in gsum
5 gsum(.SD)
4 eval(expr, envir, enclos)
3 eval(jsub, thisEnv)
2 [.data.table(dt, , sum(.SD), .SDcols = "val1", by = id)
1 dt[, sum(.SD), .SDcols = "val1", by = id]
dt[, sum(.SD[, "val1", with=FALSE]), by=id]
id V1
1: 1 10
2: 2 9
3: 3 8
4: 4 7
5: 5 6
6: 6 5
7: 7 4
8: 8 3
9: 9 2
10: 10 1

[R-Forge #5415] .BY seems to be empty for version 1.9.2 (including latest 1.9.3 release)

Submitted by: Stephane Vernede; Assigned to: Arun ; R-Forge link

From version 1.9.2 (including the latest 1.9.3 release), .BY is empty in all the cases I have tested.

Consider

A<-data.table(fruit=c("apple","peach","pear"))

test<-function(x){
paste(x," ")
}

A[,test(.BY$fruit),by=c("fruit")]

returns

fruit V1
1: apple
2: peach
3: pear

Whereas

A[,test(fruit),by=c("fruit")]

returns the right answer

fruit V1
1: apple apple
2: peach peach
3: pear pear

My platform is R 3.0.2 on Win 7
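
The two calls above can be compared directly. This is a minimal sketch that assumes a data.table version without the regression; on an affected version the comparison would be expected to print FALSE. The names `via_by` and `via_col` are illustrative and not part of the report.

```r
library(data.table)

A <- data.table(fruit = c("apple", "peach", "pear"))

# Same helper as in the report: paste() joins x and " " with its default
# separator, so each value ends in two spaces.
test <- function(x) paste(x, " ")

via_by  <- A[, test(.BY$fruit), by = "fruit"]  # group value taken from .BY
via_col <- A[, test(fruit), by = "fruit"]      # group value taken from the column

# Both routes should agree when .BY is populated correctly.
print(identical(via_by$V1, via_col$V1))
```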

[R-Forge #5377] data.table(data.frame()) creates an empty V1 column, should be 0 columns

Submitted by: shubh bansal; Assigned to: Arun ; R-Forge link

data.frame()
data frame with 0 columns and 0 rows

data.table(data.frame())
Empty data.table (0 rows) of 1 col: V1

Ideally the data.table should have 0 columns

data.table header (.h) file

Organise current src/ by putting together a data.table.h header file.

[R-Forge #5714] DT[, j, with=FALSE] errors when j is of length 0 - regression in 1.9.3

Submitted by: Arun ; Assigned to: Arun ; R-Forge link

This is a regression due to recent changes to [.data.table, I think. Basically,

data.table(1:10)[, -1L, with=FALSE] should return null data.table, but instead errors with lhs not found - it's in the wrong place (much further than where it should've already stopped).

[R-Forge #5672] error when merging zero row data.table

Submitted by: Garrett See; Assigned to: Arun ; R-Forge link

The following did not give an error in 1.9.2, but does in 1.9.3. Using svn revision 1263:

library(data.table)
# data.table 1.9.3  For help type: help("data.table")
a <- data.table(BOD, key="Time")
b <- data.table(BOD, key="Time")[Time < 0] # zero row data.table
merge(a,b, all=TRUE) # works fine
#    Time demand.x demand.y
#1:    1      8.3       NA
#2:    2     10.3       NA
#3:    3     19.0       NA
#4:    4     16.0       NA
#5:    5     15.6       NA
#6:    7     19.8       NA
merge(b,a, all=TRUE) # error
# Error in setcolorder(dt, c(setdiff(names(dt), end), end)) :
#   neworder is length 2 but x has 3 columns.

Originally reported on the mailing list

[R-Forge #5421] Update operation with row suppression crashes R

Submitted by: William Constantine; Assigned to: Nobody; R-Forge link

Want to crash your R session? Then read on …

In R, negative indexing is often used to get rid of the corresponding elements/rows in a vector/data frame. With data table objects, however, you have to be careful:

library(data.table)
data.table 1.8.10 For help type: help("data.table")
DT <- data.table(x = 1:5, y = 5:1)
DT
x y
1: 1 5
2: 2 4
3: 3 3
4: 4 2
5: 5 1
ix <- c(1L, 3L, 5L)
DT[ix]
x y
1: 1 5
2: 3 3
3: 5 1

# remove the rows corresponding to ix
# conversely, keep rows not specified by ix
DT[-ix]
x y
1: 2 4
2: 4 2

# The trouble starts when you want to create a new
# column z that contains values only for those rows
# not corresponding to ix

# DON’T DO THIS: IT SOMEHOW CONTAMINATES THE UNDERLYING STRUCTURE
# OF THE DATA TABLE OBJECT AND A SUBSEQUENT PRINTING OF THAT OBJECT (ONCE OR TWICE)
# WILL (LIKELY) CAUSE R TO CRASH!
DT[-ix, z := 5L]
DT # prepare for meltdown now …
DT # … or now

# Instead, use the ! sign, which signifies the NOT condition

DT[!ix, z:= 5L]
DT
x y z
1: 1 5 NA
2: 2 4 5
3: 3 3 NA
4: 4 2 5
5: 5 1 NA
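
The `!` form can be cross-checked against an explicit positive index, which avoids negation in `i` altogether. A minimal sketch; `keep` and `DT2` are illustrative names, not part of the report.

```r
library(data.table)

DT <- data.table(x = 1:5, y = 5:1)
ix <- c(1L, 3L, 5L)

# Update only the rows NOT listed in ix, using the documented `!` form.
DT[!ix, z := 5L]

# Alternative: compute the complementary row numbers first, then update
# those rows with an ordinary positive index.
keep <- setdiff(seq_len(nrow(DT)), ix)
DT2 <- data.table(x = 1:5, y = 5:1)
DT2[keep, z := 5L]

print(identical(DT$z, DT2$z))
```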

While I understand the use of ! is documented and preferred, it shouldn't bring down R if someone mistakenly uses minus (-) instead! Here are additional specs:

Thanks!

William Constantine

Package: data.table
Version: 1.8.10
Maintainer: Matthew Dowle
Built: R 3.0.2; x86_64-w64-mingw32; 2013-10-07 15:31:26 UTC; windows

R Version:
platform = x86_64-w64-mingw32
arch = x86_64
os = mingw32
system = x86_64, mingw32
status =
major = 3
minor = 0.2
year = 2013
month = 09
day = 25
svn rev = 63987
language = R
version.string = R version 3.0.2 (2013-09-25)
nickname = Frisbee Sailing

Windows 7 x64 (build 7601) Service Pack 1

Locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

Search Path:
.GlobalEnv, package:data.table, package:stats, package:graphics,
package:grDevices, package:utils, package:datasets, package:methods,
Autoloads, package:base

This package has a bug submission web page, which we will now attempt
to open. The information above may be useful in your report. If the web
page doesn't work, you should send email to the maintainer,
Matthew Dowle.

[R-Forge #5408] as.data.table.table creates names in an incorrect order

Submitted by: Benjamin Barnes; Assigned to: Arun ; R-Forge link

The names of the resulting data.table are in the incorrect order. For example,

set.seed(123)

DT <- data.table(XX = sample(LETTERS[1:5], 1000, replace = TRUE),
    yy = sample(1:5, 1000, replace = TRUE))

as.data.table(DT[, table(XX, yy)])
## Names are c("yy", "XX", "N")
## Should be c("XX", "yy", "N"), as in as.data.frame

sessionInfo()

# R version 3.0.2 (2013-09-25)
# Platform: x86_64-w64-mingw32/x64 (64-bit)

# locale:
# [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
# [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
# [5] LC_TIME=German_Germany.1252    

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.2

# loaded via a namespace (and not attached):
# [1] plyr_1.8.1     Rcpp_0.11.0    reshape2_1.2.2 stringr_0.6.2  tools_3.0.2

[R-Forge #5527] Different scoping rule when assigning by reference a new column

Submitted by: Michele Carriero; Assigned to: Arun ; R-Forge link

Given the following table dt:

dt<-data.table(id=1:5, var=letters[1:5])

If I need to retrieve a column programmatically I use eval(parse(text=variable)). But when the variable containing (part of) the column name has the same name as another column in the table, I get two different outcomes:

id<-"va"
dt[, eval(parse(text=paste0(id,"r")))]
# [1] "a" "b" "c" "d" "e"

dt[, id2:=eval(parse(text=paste0(id,"r")))]
# Error in parse(text = paste0(id, "r")) : <text>:1:2: unexpected symbol
# 1: 1r
#     ^

I thought that the above was not intended to be so. I would expect the same result (probably the second).
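
One way to sidestep the ambiguity, whichever scoping rule applies, is to build the column name outside of `[` and to store it in a variable that cannot collide with a column. A minimal sketch; `col_prefix` and `col_name` are hypothetical names chosen for illustration.

```r
library(data.table)

dt <- data.table(id = 1:5, var = letters[1:5])

# The name is assembled outside `[`, so no column lookup is involved.
col_prefix <- "va"
col_name <- paste0(col_prefix, "r")

# Fetch the column by name and assign it by reference.
dt[, id2 := dt[[col_name]]]

print(dt)
```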

[R-Forge #5424] duplicated.data.table doesn't work with by=FALSE

Submitted by: Arun ; Assigned to: Arun ; R-Forge link

duplicated.data.table documentation states that when the by argument is FALSE or NULL, it'll consider all the columns. However, by=FALSE will end up in error.

Even more, by=TRUE works the way by=FALSE should.

In an effort to fix this discrepancy, I'll remove the TRUE/FALSE dependency of 'by=' altogether and stick to by=NULL. No reason to have one logical argument working and the other ending in an error. Just more confusing.

Fix on the way.

[R-Forge #5583] order doesn't sort correctly in data.table's i

Submitted by: Garrett See; Assigned to: Arun ; R-Forge link

This is with revision 1260. When order() is in i, it does not respect abs()

library(data.table)
data.table 1.9.3 For help type: help("data.table")
order(abs(c(1, -2, 3)))
[1] 1 2 3
data.table(x=c(1, -2, 3))[order(abs(x))]
x
1: -2
2: 1
3: 3
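
Until the ordering inside `i` is fixed, the index can be computed with base R outside of `[.data.table` and then used as a plain row index. A minimal sketch; `idx` and `sorted` are illustrative names.

```r
library(data.table)

DT <- data.table(x = c(1, -2, 3))

# Base order() on the absolute values, evaluated outside the data.table call.
idx <- order(abs(DT$x))

# Plain integer row subsetting, with no order() call inside i.
sorted <- DT[idx]

print(sorted)
```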

[R-Forge #5435] Using "digits" in the "print" function has no effect on data.table objects

Submitted by: Matthew Beckers; Assigned to: Arun ; R-Forge link

I came across this issue using version 1.8.10, and it is also discussed here by others http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-June/001882.html

[R-Forge #5376] When DT is empty, DT[, newcol:=max(b), by=a] does not add the column

Submitted by: shubh bansal; Assigned to: Arun ; R-Forge link

In an empty data.table, defining a new column by reference by group does not add the column to the table.

In the example below, I would expect an empty column 'c' to be added to the data.table

dt1=data.table(a=character(0),b=numeric(0))
dt1
# Empty data.table (0 rows) of 2 cols: a,b
dt1[, c:=max(b), by='a']
dt1
# Empty data.table (0 rows) of 2 cols: a,b

[R-Forge #5519] dcast.data.table fails from within a package that imports data.table

Submitted by: K Davis; Assigned to: Arun ; R-Forge link

I create a test package with the following setup

NAMESPACE:

import(data.table)
export("testFunction")

DESCRIPTION:

Package: testPackage
Type: Package
Title: What the package does (short line)
Version: 1.0
Date: 2014-03-31
Author: Who wrote it
Maintainer: Who to complain to <[email protected]>
Description: More about what it does (maybe more than one line)
License: GPL (>= 2)
Imports: data.table (>= 1.9.3)

R/testFunction.R:

testFunction <- function() {
  dt <- data.table(customerId=c(1:10),categoryId=c(1:10),unitSales=c(1:10))
  invisible(dcast.data.table(dt,customerId~categoryId,value.var="unitSales",fill=0L))
}

For the full package see the attached tar.gz. I build the package, install it, and then use it as follows:

# R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
# Platform: x86_64-pc-linux-gnu (64-bit)

library(testPackage)
testFunction()
# Error in do.call("CJ", list(1:10, 1:10)) : could not find function "CJ"

This is the bug. The function testFunction() can not find the data.table function CJ() despite the fact that data.table was imported.

However, if I have the package depend upon data.table, the problem goes away.

Looking at the sources for data.table 1.9.3, I see only one call of the form do.call(CJ...). It is in the file data.table.R, in the function as.data.table.table(), which provides some of the as() functionality for data.table and is registered with R through setAs().

From this behaviour my guess is that functions registered through setAs() only have access to the globally available packages (The problem goes away when the import is replaced with a depends.). So, any functions registered through setAs() should not depend upon functions exported from data.table as these functions need not be available in the environment setAs() executed in.

So, my guess as to a solution would be to either re-implement the functionality of CJ() in the function as.data.table.table() or to find some other way of implementing as.data.table.table() that does not require a call to CJ().

[R-Forge #5375] CJ and setkey sort character data differently

Submitted by: Malcolm Hawkes; Assigned to: Nobody; R-Forge link

vec1 <- c("Corp", "CORP")
vec2 <- 1:3
dt <- CJ(vec1, vec2)
setkey(dt, V1, V2)

Creates a warning while doing setkey:

Warning in setkeyv(x, cols, verbose = verbose) :
Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed.

CJ creates

     V1 V2
1: Corp  1
2: Corp  2
3: Corp  3
4: CORP  1
5: CORP  2
6: CORP  3

And it’s keyed as you would expect by V1 then V2

key(dt)
[1] "V1" "V2"

But after doing setkey you have

     V1 V2
1: CORP  1
2: CORP  2
3: CORP  3
4: Corp  1
5: Corp  2
6: Corp  3
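
One likely explanation, stated here as an assumption rather than a confirmed diagnosis, is a difference in collation: setkey() orders character columns in the C locale, where upper-case letters sort before lower-case ones, whereas a locale-aware sort may place "Corp" first. The sketch below shows the C-locale order using base R's radix method and setkey(); `c_order` is an illustrative name.

```r
library(data.table)

vec1 <- c("Corp", "CORP")

# Base R: method = "radix" always sorts character vectors in the C locale.
c_order <- sort(vec1, method = "radix")

# setkey() uses the same C-locale ordering for character columns.
dt <- data.table(V1 = rep(vec1, each = 3L), V2 = rep(1:3, times = 2L))
setkey(dt, V1, V2)

print(c_order)
print(dt)
```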

Extend 2 billion row benchmarks e.g. memory usage, sorting, joining, by-reference

We've currently gone to 2E9 rows (the 32bit index limit) with 9 columns (100GB). See benchmarks page on wiki.

Ideally it would be great to compare all available tools that are either specifically developed for large in-memory data manipulation or are capable of handling data at these sizes much better than base. Of course base-R should also be included, typically as control.

Aspect of benchmarking should be to highlight not just run time (speed), but also memory usage. The sorting/ordering by reference, sub-assignment by reference etc.. features, for example, at this data size should display quite clearly on speed and memory gains attainable.

[R-Forge #5688] Column subset on DTs with duplicate names by index still returns first dup column

Submitted by: Arun ; Assigned to: Arun ; R-Forge link

Best with an example:

require(data.table) ## 1.9.3
DT <- data.table(x=1:5, x=6:10)
DT[, 1:2, with=FALSE]

It's clear that the columns to choose are 1st and 2nd. Still, it gives back just the first column:

data.frame does this very nicely, although it renames the columns (which we don't have to do).

as.data.frame(DT)[, 1:2]
  x x.1
1 1   6
2 2   7
3 3   8
4 4   9
5 5  10

[R-Forge #5443] join and get don't like each other

Submitted by: Eduard Antonyan; Assigned to: Arun ; R-Forge link

I've been getting unpredictable behaviours when using joins and get together. Here's an example that results in an error (in my real-life example the situation is actually much worse, in that I don't get an error but get incorrect results instead - I wasn't able to replicate it in a small example and am hoping it's the same issue as the one below):

dt1 = data.table(a = 1:2, b = 1:2, key = 'a')
dt2 = data.table(a = 1:2, c = 2:1, key = 'a')

dt1[dt2, list(c, get('b'))]
#Error in rep(x[[i]], length.out = mn) : 
#  attempt to replicate an object of type 'builtin'

These all work though:

dt1[dt2, list(c, b)]
dt1[dt2, list(a, get('b'))]
dt1[dt2, list(b, get('b'))]
dt1[dt2][, list(c, get('b'))]

[R-Forge #5405] unique fails when data.table is NULL

Submitted by: agstudy agstudy; Assigned to: Arun ; R-Forge link

Amazing but this

unique(data.table(NULL))

returns an error:

Error in duplicated.data.table(x, incomparables, tolerance, by, ...) :
invalid subscript type 'list'

should return NULL data.table.

[R-Forge #5682] data.table seems to get columns confused in a simple assignment using J

Submitted by: David Slate; Assigned to: Nobody; R-Forge link

I hope the following log is self-explanatory. Either I am confused, or data.table_1.9.2 apparently gets mixed up about the correspondence between column names and columns:

Script started on Wed May 14 01:02:01 2014
david@LC2430HD:~$ date
Wed May 14 01:02:04 CDT 2014
david@LC2430HD:~$ uname -a
Linux LC2430HD 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
david@LC2430HD:~$ cat dt-bug.r
#!/usr/bin/env Rscript

options( echo = TRUE)
library( data.table)
sessionInfo()
DT <- data.table( x = 5, a = "blah", y = 8)
setkey( DT, a)
DT
DT[ J( "blah")]
DT[ J( "blah")]$y
DT[ J( "blah")]$y <- 3
DT
david@LC2430HD:~$ ./dt-bug.r

library( data.table)
sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets base

other attached packages:
[1] data.table_1.9.2

loaded via a namespace (and not attached):
[1] methods_3.0.0 plyr_1.8 reshape2_1.2.1 stringr_0.6

DT <- data.table( x = 5, a = "blah", y = 8)
setkey( DT, a)
DT
x a y
1: 5 blah 8
DT[ J( "blah")]
a x y
1: blah 5 8
DT[ J( "blah")]$y
[1] 8
DT[ J( "blah")]$y <- 3
Warning messages:
1: In [<-.data.table(*tmp*, J("blah"), value = list(a = "blah", :
NAs introduced by coercion
2: In [<-.data.table(*tmp*, J("blah"), value = list(a = "blah", :
Coerced 'double' RHS to 'character' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 1 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
DT
x a y
1: NA 5 3

david@LC2430HD:~$ exit
Script done on Wed May 14 01:02:33 2014
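
Independently of the `$<-` behaviour shown in the log, the usual data.table idiom for this kind of update is an assignment by reference with `:=` inside the join, which never rebuilds the table from the join result. A minimal sketch using the same data as the script:

```r
library(data.table)

DT <- data.table(x = 5, a = "blah", y = 8)
setkey(DT, a)

# Update y in place for the rows whose key matches "blah".
DT[J("blah"), y := 3]

print(DT)
```

Because only column y is touched, x and a keep their values and types.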

[R-Forge #5735] Grouping integer64 Bug

Submitted by: Nico S; Assigned to: Nobody; R-Forge link

It seems as grouping integer64 does not work for some numbers. Try:

dt1 <- data.table(id=as.integer64(c(86200,86201, 4797277036,9218868437227407266,47975577036, 86200,9218868437227307266, 4797277036,9218868437227407266,47975577036)), info=rep("a",10))

dt2 <- data.table(id=as.integer64(c(86200,4797277034, 4797277036,9218868437227407266)), info=c("d","e","f","g") )

dt3 <- merge(dt1,dt2,by="id")

dt3[,length(info.x),by=id]

It should be 2 for ID 86200. Found it by accident because it completely screwed the results in very big dataset.

But anyway, really great package. Appreciate it.

[R-Forge #5575] Grouping by a date, converts the date column to integer inside j

Submitted by: Mike Crowe; Assigned to: Arun ; R-Forge link

A minimal example that shows it is given below:

X = data.table(d = c(as.Date("2008-01-01"),as.Date("2008-01-01")), x = c(1,2))
X[, .SD[, d], by = d]

Result

            d    V1
1: 2008-01-01 13879

The .SD object has a column of numeric for d instead of date. Why does this matter? It matters when you are using the date in the j field calculation, for example:

X[, day(d) * sum(x), by = d]

returns an error which doesn't make a lot of sense at first (until you understand the above):

Error in as.POSIXlt.numeric(x, tz = tz(x)) : 'origin' must be supplied

This seems to be a new bug introduced recently.
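
A possible workaround is to aggregate first and derive the date part from the grouped result, where the grouping column is an ordinary Date column again. This is a sketch under assumptions: it uses data.table's own mday() in place of the day() call in the report (whose origin is not stated), and `res`, `total` and `scaled` are illustrative names.

```r
library(data.table)

X <- data.table(d = as.Date(c("2008-01-01", "2008-01-01")), x = c(1, 2))

# Step 1: aggregate per date; the result keeps d as a Date column.
res <- X[, .(total = sum(x)), by = d]

# Step 2: compute the day of month on the result, outside the grouped j.
res[, scaled := mday(d) * total]

print(res)
```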

[R-Forge #5733] Persistent error after trying to use levels of non-factors for CJ

Submitted by: Anja Mirenska; Assigned to: Nobody; R-Forge link

By chance, I attempted to create a combination of "levels" without having converted the columns to factors. This throws an error which is okay. However, after that, no matter what I try to do with this data.table, I get the same error again and again. Here is an MWE:

library(data.table)
# data.table 1.9.2 For help type: help("data.table")

DT <- data.table(Feature1 = c("yes", "yes", "no", "no"),
                 Feature2 = c("yes", "yes", "yes", "no"),
                 Feature3 = c("no", "yes", "yes", "no"),
                 Var1 = c("yes", "no", "no", "yes"),
                 Var2 = c("yes", "no", "yes", "yes"))

setkey(DT, Feature1)

DT[CJ(levels(Feature1), levels(Feature2), levels(Feature3)),
   list(Var1.count = .N)]
# Error in forder(y) : DT is an empty list() of 0 columns

setkey(DT, Feature1)
# Error in typeof(.xi) %chin% c("integer", "logical", "character", "double") :
# Internal error: savetl_init checks failed (0 100 0x0000000012681e08 0x0000000013f27750). Please report to datatable-help.

At this point, I have to restart the R session to get rid of this error.

I'm using Windows 7 x64, R Version 3.0.3, data.table 1.9.2
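
The initial error comes from calling levels() on character columns, which returns NULL and leaves CJ() with nothing to combine. A minimal sketch of building the same grid with unique(), which works for character and factor columns alike; `no_levels` and `grid` are illustrative names, and this does not address the persistent internal error itself.

```r
library(data.table)

DT <- data.table(Feature1 = c("yes", "yes", "no", "no"),
                 Feature2 = c("yes", "yes", "yes", "no"),
                 Feature3 = c("no", "yes", "yes", "no"))

# levels() is NULL for a character vector.
no_levels <- is.null(levels(DT$Feature1))

# unique() returns the distinct values regardless of column type.
grid <- CJ(Feature1 = unique(DT$Feature1),
           Feature2 = unique(DT$Feature2),
           Feature3 = unique(DT$Feature3))

print(no_levels)
print(grid)
```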

[R-Forge #5387] Reproducible SEGFAULT when joining on character column with zero-rows

Submitted by: Ricardo Saporta; Assigned to: Nobody; R-Forge link

If joining on a character column where the inner DT has no rows, a segfault occurs.

Reproducible example below

# Setkey for join
setkey(DT, A)

# Join, with no rows selected. No problem
DT[ DT[FALSE] ]

# HOWEVER, IF A IS CHARACTER, SEGFAULT WILL RESULT
DT[, A:= as.character(A)]

# Setkey for join
setkey(DT, A)

# THIS WILL CAUSE SEGFAULT
DT[ DT[FALSE] ]

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.8.11 sos_1.3-8 brew_1.0-6

loaded via a namespace (and not attached):
[1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2

ByteCompile: TRUE
Repository: R-Forge
Repository/R-Forge/Project: datatable
Repository/R-Forge/Revision: 1160

[R-Forge #5612] rbind.data.table issue in the presence of empty data.tables

Submitted by: Arun ; Assigned to: Arun ; R-Forge link

As shown here:
http://stackoverflow.com/q/23216033/559784

[R-Forge #5423] j-expression such as: y * eval(parse(text="1*2")) should work directly without needing "(" wrap.

Submitted by: Arun ; Assigned to: Arun ; R-Forge link

This SO question explains nicely.

http://stackoverflow.com/questions/22375404/unable-to-use-evalparse-in-data-table-function

[R-Forge #5582] When DT is empty, unique(DT, by=key(DT)) returns a data.table with one row, all NAs

Submitted by: shubh bansal; Assigned to: Arun ; R-Forge link

dt=data.table(a=character(0), b=numeric(0), key='a')
unique(dt)
#     a  b
#1: NA NA

[R-Forge #5379] dcast.data.table doesn't honor `drop=TRUE` param correctly.

Submitted by: Steve Lianoglou; Assigned to: Arun ; R-Forge link

Using v1.8.11 (rev 1170):

The drop parameter isn't being honored as it should. Witness that, with drop=FALSE, rows for the unused levels "d" and "e" are included by dcast.data.frame, but not by dcast.data.table:

library(reshape2)
library(data.table)

# Expected
df <- data.frame(a=factor(sample(letters[1:3], 10, replace=TRUE), letters[1:5]),
                 b=factor(sample(tail(letters, 5), 10, replace=TRUE)))
dcast(df, a ~ b, drop=FALSE)
#    a v w x z
# 1| a 1 0 3 1
# 2| b 0 1 1 0
# 3| c 0 2 0 1
# 4| d 0 0 0 0
# 5| e 0 0 0 0

# Unexpected
dt <- as.data.table(df)
dcast.data.table(dt, a ~ b, drop=FALSE)
#    a v w x z
# 1: a 1 0 3 1
# 2: b 0 1 1 0
# 3: c 0 2 0 1

[R-Forge #5437] By operations with factors

Submitted by: Christophe Dervieux; Assigned to: Arun ; R-Forge link

Hi,

I have updated data.table package to 1.9.2 recently from 1.8.10 and I found errors on my previous code.

See reproducible example below:

On 1.8.10 :

DT <- data.table(X = factor(2006:2012), Y = rep(1:7, 2))
DT[, Z := paste(X, .N, sep = " - "), by = list(X)][]

       X     Y        Z
 1: 2006 1 2006 - 2
 2: 2007 2 2007 - 2
 3: 2008 3 2008 - 2
 4: 2009 4 2009 - 2
 5: 2010 5 2010 - 2
 6: 2011 6 2011 - 2
 7: 2012 7 2012 - 2
 8: 2006 1 2006 - 2
 9: 2007 2 2007 - 2
10: 2008 3 2008 - 2
11: 2009 4 2009 - 2
12: 2010 5 2010 - 2
13: 2011 6 2011 - 2
14: 2012 7 2012 - 2

In column Z, I get the level of the factor column X
pasted with count '.N' as expected

However, in the 1.9.2, with same code :

DT<-data.table(X=factor(2006:2012),Y=rep(1:7,2))
DT[,Z:=paste(X,.N,sep=" - "),by=list(X)][]

        X     Y  Z
 1: 2006 1 1 - 2
 2: 2007 2 2 - 2
 3: 2008 3 3 - 2
 4: 2009 4 4 - 2
 5: 2010 5 5 - 2
 6: 2011 6 6 - 2
 7: 2012 7 7 - 2
 8: 2006 1 1 - 2
 9: 2007 2 2 - 2
10: 2008 3 3 - 2
11: 2009 4 4 - 2
12: 2010 5 5 - 2
13: 2011 6 6 - 2
14: 2012 7 7 - 2

As a result, I do not get the levels of factor column X but the integer codes associated with those levels.

Is this working as intended? Why has it changed? Is it a bug?

I use this kind of procedure to make labels for ggplot. All my previous code is not working anymore. It's kind of annoying.

Thanks

Christophe
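
A workaround that does not depend on how the grouping column is represented inside j is to compute the group count first and paste the label afterwards, outside the grouped call, where X is an ordinary factor column. A minimal sketch; the helper column `N` is an illustrative addition, and the factor is written out at full length rather than relying on recycling.

```r
library(data.table)

DT <- data.table(X = factor(rep(2006:2012, 2)), Y = rep(1:7, 2))

# Step 1: group size per level of X, stored in a helper column.
DT[, N := .N, by = X]

# Step 2: build the label without grouping; paste() on a factor column
# uses the level labels.
DT[, Z := paste(X, N, sep = " - ")]

print(head(DT, 3))
```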

[R-Forge #5647] Mysterious crash with `:=` and NAs while grouping

Submitted by: Arun ; Assigned to: Arun ; R-Forge link

The title may not be appropriate (as I've not looked into the issue yet, just that it crashes the session or gives a warning occasionally):

Code on behalf of Zack:

library(data.table)
dt<-data.table(strip="Nov08",date=c("2006-08-01","2006-08-02","2006-08-03","2006-08-04","2006-08-07",
                                    "2006-08-08","2006-08-09","2006-08-10","2006-08-11","2006-08-14"))
dt[,forward_date:=c(rep(NA,5),date),by='strip']

It crashes or gives error: Link
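
Note that the right-hand side above has 15 elements for a group of 10 rows. If the intent was to lag the dates by five rows within each strip (an assumption about the intent, not something stated in the report), shift() expresses that directly and always returns a vector of the group's length. A minimal sketch:

```r
library(data.table)

dt <- data.table(strip = "Nov08",
                 date = c("2006-08-01", "2006-08-02", "2006-08-03", "2006-08-04",
                          "2006-08-07", "2006-08-08", "2006-08-09", "2006-08-10",
                          "2006-08-11", "2006-08-14"))

# Lag by 5 rows within each strip; the first 5 rows of each group become NA.
dt[, forward_date := shift(date, n = 5L), by = strip]

print(dt)
```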

[R-Forge #5444] eval(parse is not working in (complex) j-expressions

Submitted by: Eduard Antonyan; Assigned to: Nobody; R-Forge link

Here's an example:

rm(dt)
rm(dt1)
dt = data.table()
dt[, {dt1 = data.table(a = 1:4); eval(parse(text = 'dt1[, a := 2]'))}]

# Error in eval(expr, envir, enclos) : object 'dt1' not found

It works fine without the eval/parse:

dt[, {dt1 = data.table(a = 1:4); dt1[, a := 2]}]

# old and bad example, left here for reference
dt = data.table()
dt[, {eval(parse(text = 'print("boo")')); NULL}]
# NULL

Pretty sure this is a recently introduced bug and it used to work before.

[R-Forge #5417] keyby doesn't work in a presence of a key and a filtering condition

Submitted by: Eduard Antonyan; Assigned to: Nobody; R-Forge link

This bug has been introduced some time between 1.8.11 and 1.9.2:

dt = data.table(a = 1:3, c = c("X", "P", "X"), d = 1:3, key = 'a')
dt[TRUE, sum(d), keyby = c]
#   c V1
#1: X  4
#2: P  2

When not filtering, it works correctly, but the filtering screws something up:

dt[, sum(d), keyby = c]
#   c V1
#1: P  2
#2: X  4

A couple more demos of the problem:

key(dt[TRUE, sum(d), keyby = c])
# [1] "c"

setkey(dt[TRUE, sum(d), keyby = c], c)
# Warning message:
# In setkeyv(x, cols, verbose = verbose) :
#   Already keyed by this key but had invalid row order, key rebuilt. If you didn't go #under the hood please let datatable-help know so the root cause can be fixed.
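
A result that claims a key but is not actually in key order can be detected without triggering the rebuild in setkey(). The sketch below uses a hypothetical helper, `key_order_ok()`, which compares the current row order with a C-locale ordering of the key columns; it is an illustration, not part of data.table.

```r
library(data.table)

dt <- data.table(a = 1:3, c = c("X", "P", "X"), d = 1:3, key = "a")

# Hypothetical helper: TRUE when the rows of x are already sorted by key(x),
# using the C-locale ("radix") ordering.
key_order_ok <- function(x) {
  key_cols <- unname(as.list(x)[key(x)])
  ord <- do.call(order, c(key_cols, list(method = "radix")))
  identical(ord, seq_len(nrow(x)))
}

unfiltered <- dt[, sum(d), keyby = c]
filtered   <- dt[TRUE, sum(d), keyby = c]

print(key_order_ok(unfiltered))
print(key_order_ok(filtered))  # FALSE here would indicate the bug described above
```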

[R-Forge #5471] Merge broken

Submitted by: Stefan Fritsch; Assigned to: Nobody; R-Forge link

I'm sorry for the vague summary but I'm not really sure what the problem is.

First off unlike my previous issues I'm currently working with ASCII text, no Encoding issues.

Basically Version 1 (group then merge) fails some joins but not others. There is no obvious difference between the failed and joined values of Region like special characters or whatever.

Version 2 (merge then group), still a 100% data.table solution, works!

So the problem must happen during the grouping operation in A. I have no idea how or why.

Could we please get a slow version of merge as an option that uses a more robust character matching? Perhaps the built-in one? I like the speed but merge on character vectors is so fragile that you have to check every result separately which kinda defeats the purpose. =)

I'm sorry, I'll try to find a minimal working example over the next day, but I don't have the time for that right now.

# Version 1
D<-A[Year==2011,list(Spending=sum(y Spending)),Region]
merge(D,
      B[Year==2011],
      by="Region")

# Version 2
merge(A[Year==2011],
      B[Year==2011],
      by="Region")[,list(Spending=sum(y Spending)),Region]

# Version 3 - what I did
merge(data.frame(A[Year==2011,list(Spending=sum(y Spending)), Region]),
      data.frame(B[Year==2011]),
      by="Region")
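
When a merge on a character key silently drops rows, a quick diagnostic is to list the key values that appear on one side only before merging. The sketch below uses small hypothetical tables standing in for the aggregated A and for B; the trailing space in one key is an invented cause used purely for illustration, since the report does not identify the actual one.

```r
library(data.table)

# Hypothetical stand-ins for the tables in the report.
D <- data.table(Region = c("North", "South ", "East"), Spending = c(10, 20, 30))
B <- data.table(Region = c("North", "South", "East"), Population = c(1, 2, 3))

# Keys present in D but missing from B; empty means every key has a match.
unmatched <- setdiff(D$Region, B$Region)
print(unmatched)

# One possible repair for this invented case: normalise whitespace, then merge.
D[, Region := trimws(Region)]
merged <- merge(D, B, by = "Region")
print(merged)
```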
