mrdwab / splitstackshape
R functions to split concatenated data, conveniently stack columns of data.frames, and conveniently reshape data.frames.
I was attempting to answer this SO question using:
require(splitstackshape)
merged.stack(d, id="PID", var.stubs=c("Cue"), sep="var.stubs")
# PID .time_1 Cue
# 1: 1 1 1
# 2: 1 1 3
# 3: 1 1 5
# 4: 1 2 2
# 5: 1 2 5
# 6: 1 2 5
# 7: 2 1 1
# 8: 2 1 3
# 9: 2 1 5
#10: 2 2 2
#11: 2 2 5
#12: 2 2 5
Unless I'm mistaken as to what the code is supposed to do, this isn't the right output.
sessionInfo()
# R version 3.1.2 (2014-10-31)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] splitstackshape_1.4.2 data.table_1.9.5
# loaded via a namespace (and not attached):
# [1] chron_2.3-45 tools_3.1.2
See this SO question for reference.
trim slows things down quite a bit. Either make it optional or find a much faster alternative.
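For reference, whitespace trimming can be done in a single vectorized gsub pass over the whole vector. This is only a sketch of a possible faster alternative; fast_trim is an illustrative name, not the package's internal function (base R's trimws(), available since R 3.2.0, is similar in spirit):

```r
# Strip leading and trailing whitespace in one gsub call over the whole vector
fast_trim <- function(x) gsub("^\\s+|\\s+$", "", x)

fast_trim(c("  a b  ", "c", " d"))
# [1] "a b" "c"   "d"
```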
The concatenated column, 'response', in my data.table 'dt.gdoc', contains number strings with no separator, e.g. "13464". I wanted to split this column into a list using
cSplit_l(dt.gdoc,'response',sep='',drop=T)
However, when I attempted to run this code, I got the following error:
Error in gsub(sprintf("\\s+[%s]\\s+|\\s+[%s]|[%s]\\s+", delim, delim, :
invalid regular expression '\s+[]\s+|\s+[]|[]\s+', reason 'Missing ']''
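Until empty separators are supported, one base-R workaround is to split on the empty string directly with strsplit. A sketch (the vector below stands in for dt.gdoc$response from the report):

```r
response <- c("13464", "25")  # stand-in for dt.gdoc$response
# sep = "" splits a string into its individual characters
split_list <- strsplit(as.character(response), "", fixed = TRUE)
split_list[[1]]
# [1] "1" "3" "4" "6" "4"
```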
Demo:
library(splitstackshape)
library(dplyr)
CT <- tbl_df(head(concat.test))
CT %>% cSplit("Likes")
# Error in `[.tbl_df`(indt, , splitCols, with = FALSE) :
# unused argument (with = FALSE)
CT %>% data.frame %>% cSplit("Likes")
# Name Siblings Hates Likes_1 Likes_2 Likes_3 Likes_4 Likes_5
#1: Boyd Reynolds , Albert , Ortega 2;4; 1 2 4 5 6
#2: Rufus Cohen , Bert , Montgomery 1;2;3;4; 1 2 4 5 6
#3: Dana Pierce 2; 1 2 4 5 6
#4: Carole Colon , Michelle , Ballard 1;4; 1 2 4 5 6
#5: Ramona Snyder , Joann , 1;2;3; 1 2 5 6 NA
#6: Kelley James , Roxanne , 1;4; 1 2 5 6 NA
Check for update/compatibility with cSplit and replace where possible:
- concat.split.compact
- concat.split.expanded
- concat.split.list
- concat.split.multiple

Other work:
- concat.split.expanded and concat.split.list
- .concat.split.DT (https://gist.github.com/mrdwab/6873058)
- stratified (https://gist.github.com/mrdwab/933ffeaa7a1d718bd10a)
- Break vGrep out as a non-exported function since it is used in a couple of places.
- getanID needs to be made more robust.
- data.tables are modified (by reference) when we use :=. Something like if (!is.data.table(dataset)) dataset <- as.data.table(dataset) else dataset <- copy(dataset) might be safer.

Explore:
- Can id.vars be made redundant?
- Can cSplit be modified to accept vectors as its input (expandRows, for example)?

Using the data from example(Stacked), note that if you do:
set.seed(1)
mydf <- data.frame(id_1 = 1:6, id_2 = c("A", "B"),
varA.1 = sample(letters, 6),
varA.2 = sample(letters, 6),
varA.3 = sample(letters, 6),
varB.2 = sample(10, 6),
varB.3 = sample(10, 6),
varC.3 = rnorm(6))
mydf
Stacked(data = mydf, id.vars = c("id_1", "id_2"),
var.stubs = c("varA", "varB", "varC"),
sep = "\\.")
you'd get a list
(as expected).
You'd also get a list
if you did:
Stacked(data = mydf, id.vars = c("id_1", "id_2"),
var.stubs = "varA", sep = "\\.")
This doesn't make as much sense...
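A possible fix (a sketch only, not current package behavior; maybe_unwrap is an illustrative name) would be to drop the enclosing list when only one stub was requested:

```r
# Hypothetical post-processing step for Stacked(): if only one var.stub was
# supplied, return the single stacked table instead of a length-one list.
maybe_unwrap <- function(out, var.stubs) {
  if (is.list(out) && !is.data.frame(out) && length(var.stubs) == 1L) out[[1L]] else out
}

res <- list(varA = data.frame(id = 1:2, varA = c("a", "b")))
maybe_unwrap(res, "varA")          # a single data.frame
maybe_unwrap(res, c("varA", "varB"))  # still a list
```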
Perhaps something like this will work:
csDataTable <- function(dataset, splitcol, sep, drop = FALSE) {
  if (is.numeric(splitcol)) splitcol <- names(dataset)[splitcol]
  tmp <- file.path(tempdir(), "temp.txt")  # avoid setwd(), which would not be restored if fread() errored
  writeLines(as.character(dataset[[splitcol]]), tmp)
  Split <- fread(tmp, sep = sep)
  setnames(Split, paste(splitcol, seq_along(Split), sep = "_"))
  if (!is.data.table(dataset)) dataset <- data.table(dataset)
  if (isTRUE(drop)) dataset[, (splitcol) := NULL]
  cbind(dataset, Split)
}
First, thanks for this package, it can be very useful, and the code and documentation are very clean.
There is one thing I miss in terms of features: the ability to have a concat.split.expanded for character variables. Suppose you deal with a variable listing the preferred music styles of some people. You end up with something like this:
music
1 rock,electro
2 electro
3 rock,karadi rhymes
If you want to reuse these variables, for example for crossing with gender or age, I find it useful to transform it in the following way:
music.electro music.rock music.karadi_rhymes
1 1 1 0
2 1 0 0
3 0 1 1
A sort of concat.split.expanded for characters in binary mode. That way, you can cross-tabulate with table(music.electro, gender), etc.
That's what the multi.split function from my questionr package does, but it is not as well written or as featureful as your concat.split. That's why I wondered whether you think it could be implemented in splitstackshape, or if it would be incompatible with other functions such as Reshape.
Thanks!
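For readers looking for the requested behavior, the binary expansion can be sketched in base R (the column names below are illustrative; if I recall correctly, later splitstackshape versions also offer cSplit_e with type = "character" and mode = "binary", which may cover this):

```r
music <- c("rock,electro", "electro", "rock,karadi rhymes")
parts <- strsplit(music, ",", fixed = TRUE)
lev <- sort(unique(unlist(parts)))  # all styles that occur anywhere

# One indicator column per style: 1 if the style occurs in that row, else 0
ind <- vapply(lev,
              function(l) as.integer(vapply(parts, function(p) l %in% p, logical(1))),
              integer(length(parts)))
colnames(ind) <- paste("music", gsub(" ", "_", lev), sep = ".")
ind
```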
Hey Ananda Mahto,
I've titled it the same as in stack.
http://stackoverflow.com/questions/20075135/multiple-separators-for-the-same-file-input-r
Cawarnold / splitstring_-df3 / Mobile-App
R code I tried.
df <- results8[150:200,]
View(df)
library(splitstackshape)
# split on "_"
df1 <- concat.split(data = df, split.col = "V3", sep = "_", drop = TRUE)
# split on "-"
df2 <- concat.split(data = df1, split.col = "V3_5", sep = "-", drop = TRUE)
View(df2)
Hi,
I noticed some strange behavior using cSplit.
My input (data) looks like this:
head(data)
SNP1 SNP2 SNP3
1 AKL001 TT CC TT
2 AKL002 TT CC TT
3 BSC001 TT CC TT
4 BSC080 CT CC TT
5 BSC087 TT CC TT
6 BSC114 CT CC TT
df<-cSplit(data, grep("SNP", names(data)) , "", stripWhite = FALSE)
and the result is
head(df)
V1 SNP1_1 SNP1_2 SNP2_1 SNP2_2 SNP3_1 SNP3_2
1: AKL001 T T C C TRUE TRUE
2: AKL002 T T C C TRUE TRUE
3: BSC001 T T C C TRUE TRUE
4: BSC080 C T C C TRUE TRUE
5: BSC087 T T C C TRUE TRUE
6: BSC114 C T C C TRUE TRUE
So it shows weird behavior when a column contains only TT values. Any help would be appreciated!
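The likely cause: after splitting, a column containing only "T" values is coerced to logical during type conversion, since base R accepts "T" as TRUE. A minimal base-R illustration of that coercion:

```r
# After splitting "TT" into characters, the column holds only "T" values;
# type.convert() then reads "T" as the logical TRUE
converted <- utils::type.convert(c("T", "T"), as.is = TRUE)
converted
# [1] TRUE TRUE
```

If cSplit's type.convert argument is set to FALSE, the split pieces should stay character.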
Check for any functions assigning to the global environment, like FacsToChars. Fix so the package passes R CMD check --as-cran.
Not required, but would make things easier....
I'm seeing some weird interactions between using a regex to match on whitespace and the characters '_s'.
d <- data.frame(id=1,list=c('a b_s ccc s_ddd eee_s fff s_g s_hhh iii'))
cSplit(d, c('list'), sep = "\\s", direction = "long", fixed = FALSE,
drop = TRUE, stripWhite = TRUE, makeEqual = FALSE,
type.convert = FALSE)
# WRONG:
# id list
#1: 1 a
#2: 1 b_scccs_ddd
#3: 1 eee_sfffs_gs_hhh
#4: 1 iii
# Contrast with
cSplit(d, c('list'), sep = " ", direction = "long", fixed = TRUE,
drop = TRUE, stripWhite = TRUE, makeEqual = FALSE,
type.convert = FALSE)
# As expected:
# id list
#1: 1 a
#2: 1 b_s
#3: 1 ccc
#4: 1 s_ddd
#5: 1 eee_s
#6: 1 fff
#7: 1 s_g
#8: 1 s_hhh
#9: 1 iii
Instead of 1.10.5, can lower to 1.10.4 and maybe use hasArg(logical01) (or packageVersion("data.table") < "1.10.5" or something) to decide whether to include the argument or not in fread.
Not sure if that's a good idea or not....
Previously, with strsplit
, it was easy to leave the option open to split on a regular expression too. Now?
See this answer on Stack Overflow for some ideas. Switch to the f1
option, perhaps.
See: http://stackoverflow.com/questions/33832085/replicate-entries-in-dataframe-in-r
If expandRows
were used on this, it would just return a factor vector.
I didn't expect to get a result like the one below:
df<-data.frame(x=c(1,1,2,2,2,7),t=1:6)
df
x t
1 1 1
2 1 2
3 2 3
4 2 4
5 2 5
6 7 6
stratified(df,group=1,size=1)
x t
1: 1 1
2: 2 5
3: 1 1
I expected to get exactly one row of values from stratum "1", "2", and "7", respectively. Instead, I got two from "1", one from "2", and none from "7".
How is this possible? Have I misunderstood something?
Best regards,
Magnus
Example of code to reproduce:
dt1 <- fread("V1 V2 V3
x b;c;d 1
y d;ef 2
z d;ef 3
m tmp 4
n tmp 5")
dt1[4, V2:=''] # this record will be lost because V2 value is empty string
dt1[5, V2:=NA_character_] # NA value is processed correctly
cSplit(dt1, splitCols = 'V2', sep = ';', direction = 'long')
As you can see, record 4 (where V2 == '') is lost.
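A workaround until empty strings are handled is to convert them to NA before splitting. A base-R sketch on a plain data.frame (with a data.table, dt1[V2 == '', V2 := NA_character_] would be the analogue):

```r
df <- data.frame(V1 = c("m", "n"), V2 = c("", NA), stringsAsFactors = FALSE)
# Replace empty strings with NA so the record survives the long-format split
df$V2[!is.na(df$V2) & df$V2 == ""] <- NA_character_
df$V2
# [1] NA NA
```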
Following my SO post here, I would appreciate if you could fix the bug.
data1<-structure(list(reason = c("1", "1", NA, "1", "1", "4 5", "1",
"1", "1", "1", "1", "1 2 3 4", "1 2 5", NA, NA)), .Names = "reason", class = "data.frame", row.names = c(NA,
-15L))
#loading packages
library(data.table)
library(splitstackshape)
cSplit_e(setDT(data1),1," ",mode = "value") # with NA's doesn't work
Error in seq.default(min(vec), max(vec)) : 'from' must be a finite number
data2<-na.omit(setDT(data1),cols="reason") # removing NA's
cSplit_e(data2,1," ",mode = "value") # without NA's works
reason reason_1 reason_2 reason_3 reason_4 reason_5
1: 1 1 NA NA NA NA
2: 1 1 NA NA NA NA
3: 1 1 NA NA NA NA
4: 1 1 NA NA NA NA
5: 4 5 NA NA NA 4 5
6: 1 1 NA NA NA NA
7: 1 1 NA NA NA NA
8: 1 1 NA NA NA NA
9: 1 1 NA NA NA NA
10: 1 1 NA NA NA NA
11: 1 2 3 4 1 2 3 4 NA
12: 1 2 5 1 2 NA NA 5
concat.split returns a dense matrix, but sometimes a sparse matrix is needed.
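A sketch of building a sparse indicator matrix from split values with the Matrix package (this is not part of splitstackshape; variable names are illustrative):

```r
library(Matrix)

vals <- c("1,2", "2", "1,3")               # a concatenated column
parts <- strsplit(vals, ",", fixed = TRUE)
lev <- sort(unique(unlist(parts)))

m <- sparseMatrix(
  i = rep(seq_along(parts), lengths(parts)),  # row index of each split value
  j = match(unlist(parts), lev),              # column index of its level
  x = 1,
  dims = c(length(parts), length(lev)),
  dimnames = list(NULL, lev)
)
m
```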
Because of how the function is coded, if you had a set of variables named, say, incA
, valA
, incB
, valB
, setting sep = "inc|val"
will work to strip away that part of the variable name, thus leaving us with A
and B
as the times (as we would want).
Perhaps explore whether this can be made part of NoSep
.
A few variables are created automatically with several functions. Would it be possible to provide a user option to supply a value to override the defaults?
The current charBinaryMat
can be made much faster with matrix indexing:
charBinaryMat <- function(listOfValues, fill = NA) {
A <- unlist(listOfValues, use.names = FALSE)
lev <- sort(unique(A))
m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev),
dimnames = list(NULL, lev))
Row <- vapply(listOfValues, length, 1L)
Row <- rep(seq_along(Row), Row)
Col <- match(A, lev)
m[cbind(Row, Col)] <- 1L
m
}
Some sample data to test it with:
set.seed(1)
A = sample(10, 100000, replace = TRUE)
str <- sapply(seq_along(A), function(x)
paste(sample(LETTERS[1:10], A[x]), collapse = " "))
lov <- strsplit(str, " ", fixed=TRUE)
- move_me
- dist2df
- [.array
- [.ftable
- ftable2dt
- dupe_thresh
- list_unlister
- make_me_NA
- na_last (add a "row + col" option)
- Riffle (but it needs a LOT of work and testing)
- shifter, shuffler (?)
- sort_ends (? -- How useful?)
- tabulate_int (after modifying to work with factors/characters)
- unlist_by_row / unlist_by_col
- lwSplit (name?) for the act of splitting once to a "long" form and then again to a "wide" form? Eg: https://stackoverflow.com/q/49182838/1270695, https://stackoverflow.com/q/29120787/1270695

What about cases like collapseMe(c("a|b", "c"))?
An option for blank.lines.skip
needs to be added to read.concat
.
Here is example table:
dt1 <- fread("V1 V2 V3
x xA;xB;xC x1;x2;x3
y yD y1
z zF;zG z1")
and I want to split it by both V2
and V3
columns. You can see that the last record is "wrong": V2
has 2 values while V3
has only one. And that how cSplit()
treats those cases:
# with default arguments:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long')
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: y NA NA
#6: y NA NA
#7: z zF z1
#8: z zG NA
#9: z NA NA
# with `makeEqual = TRUE`:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = T)
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: y NA NA
#6: y NA NA
#7: z zF z1
#8: z zG NA
#9: z NA NA
So, by default, it works as if makeEqual = TRUE, while the help says it "Defaults to FALSE". Then I tried with FALSE:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = F)
# Warning in `[.data.table`(indt, , `:=`(eval(splitCols), lapply(X, function(x) { :
# Supplied 5 items to be assigned to 6 items of column 'V3' (recycled leaving remainder of 1 items).
# V1 V2 V3
# 1: x xA x1
# 2: x xB x2
# 3: x xC x3
# 4: y yD y1
# 5: z zF z1
# 6: z zG x1
It recycles V3 elements, but it takes them from another group, which is kind of unexpected. I think it would be more logical to give one of the following outputs:
# without recycling, fill with NA:
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: z zF z1
#6: z zG NA
# with recycling:
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: z zF z1
#6: z zG z1
concat.split doesn't appear to like apostrophes in the data column.
> df <- data.frame(experience=c("Did Use, Didn't like"))
> df
experience
1 Did Use, Didn't like
> concat.split(df, 1)
Error in FUN(NA_integer_[[1L]], ...) :
argument must be coercible to non-negative integer
If you try it without the apostrophe, it works fine.
I installed it from the package manager, appears to be v1.2.0.
Mac OS X x64, Mavericks
R 3.0.2
Shouldn't add too much processing time, but would allow char_mat
and num_mat
to directly work on data.frame
s (as they are lists anyway).
Expect to work (after modification):
df <- data.frame(V1 = c("A", "B", "C"),
V2 = c("C", "C", NA),
V3 = c("D", "A", "B"))
char_mat(df)
This will presently generate an error.
Is there a reason you add data.table to Depends rather than to Imports?
https://github.com/Rexamine/stringi
Notably, stri_list2matrix
and the simplify
argument to stri_split*
. 1.8 seconds instead of 6+ seconds on .5M rows.
Is it time for s3r2
?
Hello.
How would you do this with splitstackshape?
https://stackoverflow.com/questions/41163500/r-transform-from-wide-to-long-without-sorting-columns
Transform from Wide to Long all columns with name "name_number" but without reordering/sorting the columns.
Hi, I just noticed your work on Stack Overflow.
Would you like to integrate the feature of expanding into formulas? I have a prototype package to do this:
https://github.com/wush978/FeatureHashing
In this prototype package, I implemented an API so that:
library(FeatureHashing)
data1 <- data.frame(a = c("1,2,3", "2,3,3", "1,3", "3"), type = c("a", "b", "a", "a"), stringsAsFactors = FALSE)
interpret.tag( ~ tag(a, split = ",", type = "existence") + tag(a, split = ",", type = "count"):type, data = data1)
will produce a data.frame with expanded columns and an expanded formula to run some advanced model such as lm.
Do you want to integrate this feature?
Moreover, I noticed that this approach consumes lots of memory, so I am wondering if there is a way to convert such a data.frame directly to a sparse matrix. I am still working on this.
Switch to a stack
+ data.table
solution.
Example for strsplit
after stacking:
for (i in 1:2) {
DT[, paste("ind", i, sep = "_") :=
unlist(strsplit(as.character(ind), "_"))[[i]],
by = 1:nrow(DT)]
}
If you have data like this:
dat <- data.frame(v1 = c("a, b", "b, c", "c, d", "d, e", "e, f", "g, h", "i, j, k"))
R will assume that you only have two columns, so it will read the data incorrectly.
Possible solution:
- Use gregexpr to count the split characters.
- Use sapply and max to determine the max number of columns (minus 1).
- Pass col.names to the read.concat() function to take care of reading the data correctly.

First of all, thank you so much for this package! It has been part of my routine analysis for some time now. I would just like to suggest a convenience option to skip column renaming after splitting. Example:
to_split <- structure(list(Sample = c("N2_wt_rep1_untreated", "N2_wt_rep1_untreated",
"N2_wt_rep1_untreated", "N2_wt_rep2_untreated", "N2_wt_rep2_untreated",
"N2_wt_rep2_untreated"), Reads = c(470987L, 270891L, 56114L,
513902L, 310722L, 67263L)), .Names = c("Sample", "Reads"), class = "data.frame", row.names = c(NA,
-6L))
split <- cSplit(to_split, "Sample", sep="_")
split
# Reads Sample_1 Sample_2 Sample_3 Sample_4
# 1: 470987 N2 wt rep1 untreated
# 2: 270891 N2 wt rep1 untreated
# 3: 56114 N2 wt rep1 untreated
# 4: 513902 N2 wt rep2 untreated
# 5: 310722 N2 wt rep2 untreated
# 6: 67263 N2 wt rep2 untreated
The new col names are not very informative, so I usually rename them in an extra step:
setnames(split,
c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
c("Background", "Allele", "Replicate", "Treatment")
)
This is fine, but I wonder if it would be possible to skip that extra step with something like cSplit(to_split, "Sample", sep="_", new_names=c("Background", "Allele", "Replicate", "Treatment")).
Cheers.
NoSep is very limited in its current form. Explore possibilities to split in different ways.
See this question. Sometimes, the IDs will result in duplicated row.names
so reshape(..., direction = "long")
won't work.
The solution could be to first test whether there are any duplicated rows in the IDs columns, and if yes, use ave
to generate a new ID with seq_along
.
This should only be done when direction = "long"
.
The string being read in from stringi would be UTF-8 encoded. See sample data here: https://stackoverflow.com/q/13773770/1270695
As I am not the author of LinearizeNestedList
, either contact author for inclusion in package, or find an alternative. The LinearizeNestedList
function might be overkill for the purposes of CBIND
.
When specifying multiple concatenated columns to be parsed in big ragged data (~half a million rows), R 2.15 and R 3 on Linux would report memory-related error messages such as "cannot allocate memory" or "long vector not supported". The workaround that I found is to explicitly set makeEqual to FALSE, as failing to do so generates a large number of blank rows, where the number of rows for each unique identifier equals the maximum number of delimiters found among all entries. I suspect this is the source of the memory-exhaustion problem.
This indicates that the default value of makeEqual, at least in my experience, is not FALSE, as suggested by the documentation for the package.
Update: after coercing makeEqual to FALSE, the results are totally off. Leaving makeEqual untouched creates memory problems for large datasets but produces the correct data after all NA rows are removed.
Referring to http://stackoverflow.com/questions/23528882, it seems that splitstackshape:::read.concat
may sometimes open too many textConnection
s at a time.
According to ?connections
, A maximum of 128 connections can be allocated.
Must use title case, not have punctuation, and be shorter than 65 characters....
Let's say I want to pull up columns 3 through 5: I would use dataframe[,3:5], which works perfectly. After using splitstackshape, that same command returns [1] 3 4 5. If I run fix(dataframe) and close it, I can use the references again.
Here is a snippet showing the issue (using IMDB data):
head(movies[,1:5])
X Title Year Runtime Genre
tt0000439 1 The Great Train Robbery 1903 11 Short, Western
tt0003037 2 Juve Against Fantomas 1913 61 Crime, Drama
tt0003740 3 Cabiria 1914 148 Adventure, Drama, History
tt0004707 4 Tillie's Punctured Romance 1914 82 Comedy
tt0005960 5 Regeneration 1915 72 Biography, Crime, Drama
tt0006206 6 Les vampires 1915 399 Action, Adventure, Crime

movies <- cSplit(movies, "Genre", sep=",")
head(movies[,1:5])
[1] 1 2 3 4 5

fix(movies)
head(movies[,1:5])
X Title Year Runtime Released
1 1 The Great Train Robbery 1903 11 1903-12-01
2 2 Juve Against Fantomas 1913 61 1913-10-02
3 3 Cabiria 1914 148 1914-06-01
4 4 Tillie's Punctured Romance 1914 82 1914-12-21
5 5 Regeneration 1915 72 1915-09-13
6 6 Les vampires 1915 399 1916-11-23
From the "data.table" readme for version 1.9.5, []
is now needed to print the results. All functions would need to be checked for this.
Consider the following data.table:
> dt <- data.table(id=1:3, A_2001=c(1,2,3), B_2001=c(1,3,5), B_2007=c(4,3,2), AC_2007=c(8,9,10))
> dt
id A_2001 B_2001 B_2007 AC_2007
1: 1 1 1 4 8
2: 2 2 3 3 9
3: 3 3 5 2 10
Now if we run Stacked
:
> Stacked(dt, id.vars="id", var.stubs=c("A", "B", "AC"), sep="_")
$A
id .time_1 A
1: 1 2001 1
2: 1 2007 8
3: 2 2001 2
4: 2 2007 9
5: 3 2001 3
6: 3 2007 10
$B
id .time_1 B
1: 1 2001 1
2: 1 2007 4
3: 2 2001 3
4: 2 2007 3
5: 3 2001 5
6: 3 2007 2
$AC
id .time_1 AC
1: 1 2007 8
2: 2 2007 9
3: 3 2007 10
The A
column now has also picked up the values for AC
from 2007.
This is due to the use of grep on line 2 of Stacked: any stub that is a subset of another stub will pick up the values of that stub.
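A sketch of the fix: anchor the stub match to the separator so that A does not also match AC (the pattern construction below is illustrative, not the package's exact code):

```r
vars <- c("A_2001", "B_2001", "B_2007", "AC_2007")

# Matching on the bare stub catches "AC_2007" too:
grep("^A", vars, value = TRUE)
# [1] "A_2001"  "AC_2007"

# Anchoring stub + separator isolates the intended columns:
stub <- "A"; sep <- "_"
pattern <- paste0("^", stub, sep)
grep(pattern, vars, value = TRUE)
# [1] "A_2001"
```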