mrdwab / splitstackshape
R functions to split concatenated data, conveniently stack columns of data.frames, and conveniently reshape data.frames.
I was attempting to answer this SO question using:
require(splitstackshape)
merged.stack(d, id="PID", var.stubs=c("Cue"), sep="var.stubs")
# PID .time_1 Cue
# 1: 1 1 1
# 2: 1 1 3
# 3: 1 1 5
# 4: 1 2 2
# 5: 1 2 5
# 6: 1 2 5
# 7: 2 1 1
# 8: 2 1 3
# 9: 2 1 5
#10: 2 2 2
#11: 2 2 5
#12: 2 2 5
Unless I'm mistaken as to what the code is supposed to do, this isn't the right output.
sessionInfo()
# R version 3.1.2 (2014-10-31)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] splitstackshape_1.4.2 data.table_1.9.5
# loaded via a namespace (and not attached):
# [1] chron_2.3-45 tools_3.1.2
See this SO question for reference.
trim slows things down quite a bit. Either make it optional or find a much faster alternative.
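For reference, whitespace trimming can be done in a single vectorized gsub pass over the whole vector. This is only a sketch of a possible faster alternative; fast_trim is an illustrative name, not the package's internal function (base R's trimws(), available since R 3.2.0, is similar in spirit):

```r
# Strip leading and trailing whitespace in one gsub call over the whole vector
fast_trim <- function(x) gsub("^\\s+|\\s+$", "", x)

fast_trim(c("  a b  ", "c", " d"))
# [1] "a b" "c"   "d"
```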
The concatenated column, 'response', in my data.table 'dt.gdoc', contains number strings with no separator, e.g. "13464". I wanted to split this column into a list using
cSplit_l(dt.gdoc,'response',sep='',drop=T)
However, when I attempted to run this code, I got the following error:
Error in gsub(sprintf("\\s+[%s]\\s+|\\s+[%s]|[%s]\\s+", delim, delim, :
invalid regular expression '\s+[]\s+|\s+[]|[]\s+', reason 'Missing ']''
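Until empty separators are supported, one base-R workaround is to split on the empty string directly with strsplit. A sketch (the vector below stands in for dt.gdoc$response from the report):

```r
response <- c("13464", "25")  # stand-in for dt.gdoc$response
# sep = "" splits a string into its individual characters
split_list <- strsplit(as.character(response), "", fixed = TRUE)
split_list[[1]]
# [1] "1" "3" "4" "6" "4"
```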
Demo:
library(splitstackshape)
library(dplyr)
CT <- tbl_df(head(concat.test))
CT %>% cSplit("Likes")
# Error in `[.tbl_df`(indt, , splitCols, with = FALSE) :
# unused argument (with = FALSE)
CT %>% data.frame %>% cSplit("Likes")
# Name Siblings Hates Likes_1 Likes_2 Likes_3 Likes_4 Likes_5
#1: Boyd Reynolds , Albert , Ortega 2;4; 1 2 4 5 6
#2: Rufus Cohen , Bert , Montgomery 1;2;3;4; 1 2 4 5 6
#3: Dana Pierce 2; 1 2 4 5 6
#4: Carole Colon , Michelle , Ballard 1;4; 1 2 4 5 6
#5: Ramona Snyder , Joann , 1;2;3; 1 2 5 6 NA
#6: Kelley James , Roxanne , 1;4; 1 2 5 6 NA
Check for update/compatibility with cSplit and replace where possible:
- concat.split.compact
- concat.split.expanded
- concat.split.list
- concat.split.multiple

Other work:
- concat.split.expanded and concat.split.list
- .concat.split.DT (https://gist.github.com/mrdwab/6873058)
- stratified (https://gist.github.com/mrdwab/933ffeaa7a1d718bd10a)
- Break vGrep out as a non-exported function since it is used in a couple of places.
- getanID needs to be made more robust.
- data.tables are modified (by reference) when we use :=. Something like if (!is.data.table(dataset)) dataset <- as.data.table(dataset) else dataset <- copy(dataset) might be safer.

Explore:
- Can id.vars be made redundant?
- Can cSplit be modified to accept vectors as its input (expandRows, for example)?

Using the data from example(Stacked), note that if you do:
set.seed(1)
mydf <- data.frame(id_1 = 1:6, id_2 = c("A", "B"),
varA.1 = sample(letters, 6),
varA.2 = sample(letters, 6),
varA.3 = sample(letters, 6),
varB.2 = sample(10, 6),
varB.3 = sample(10, 6),
varC.3 = rnorm(6))
mydf
Stacked(data = mydf, id.vars = c("id_1", "id_2"),
var.stubs = c("varA", "varB", "varC"),
sep = "\\.")
you'd get a list
(as expected).
You'd also get a list
if you did:
Stacked(data = mydf, id.vars = c("id_1", "id_2"),
var.stubs = "varA", sep = "\\.")
This doesn't make as much sense...
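A possible fix (a sketch only, not current package behavior; maybe_unwrap is an illustrative name) would be to drop the enclosing list when only one stub was requested:

```r
# Hypothetical post-processing step for Stacked(): if only one var.stub was
# supplied, return the single stacked table instead of a length-one list.
maybe_unwrap <- function(out, var.stubs) {
  if (is.list(out) && !is.data.frame(out) && length(var.stubs) == 1L) out[[1L]] else out
}

res <- list(varA = data.frame(id = 1:2, varA = c("a", "b")))
maybe_unwrap(res, "varA")          # a single data.frame
maybe_unwrap(res, c("varA", "varB"))  # still a list
```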
Perhaps something like this will work:
csDataTable <- function(dataset, splitcol, sep, drop = FALSE) {
  if (is.numeric(splitcol)) splitcol <- names(dataset)[splitcol]
  tmp <- file.path(tempdir(), "temp.txt")  # avoid setwd(), which would not be restored if fread() errored
  writeLines(as.character(dataset[[splitcol]]), tmp)
  Split <- fread(tmp, sep = sep)
  setnames(Split, paste(splitcol, seq_along(Split), sep = "_"))
  if (!is.data.table(dataset)) dataset <- data.table(dataset)
  if (isTRUE(drop)) dataset[, (splitcol) := NULL]
  cbind(dataset, Split)
}
First, thanks for this package, it can be very useful, and the code and documentation are very clean.
There is one thing I miss in terms of features: the ability to have a concat.split.expanded for character variables. Suppose you deal with a variable listing the preferred music styles of some people. You end up with something like this:
music
1 rock,electro
2 electro
3 rock,karadi rhymes
If you want to reuse these variables, for example for crossing with gender or age, I find it useful to transform it in the following way:
music.electro music.rock music.karadi_rhymes
1 1 1 0
2 1 0 0
3 0 1 1
A sort of concat.split.expanded for characters in binary mode. That way, you can cross-tabulate with table(music.electro, gender), etc.
That's what the multi.split function from my questionr package does, but it is not as well written or as featureful as your concat.split. That's why I wondered whether you think it could be implemented in splitstackshape, or if it would be incompatible with other functions such as Reshape.
Thanks!
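For readers looking for the requested behavior, the binary expansion can be sketched in base R (the column names below are illustrative; if I recall correctly, later splitstackshape versions also offer cSplit_e with type = "character" and mode = "binary", which may cover this):

```r
music <- c("rock,electro", "electro", "rock,karadi rhymes")
parts <- strsplit(music, ",", fixed = TRUE)
lev <- sort(unique(unlist(parts)))  # all styles that occur anywhere

# One indicator column per style: 1 if the style occurs in that row, else 0
ind <- vapply(lev,
              function(l) as.integer(vapply(parts, function(p) l %in% p, logical(1))),
              integer(length(parts)))
colnames(ind) <- paste("music", gsub(" ", "_", lev), sep = ".")
ind
```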
Hey Ananda Mahto,
I've titled it the same as in stack.
http://stackoverflow.com/questions/20075135/multiple-separators-for-the-same-file-input-r
Cawarnold / splitstring_-df3 / Mobile-App
R code I tried.
df <- results8[150:200,]
View(df)
library(splitstackshape)
# split on "_"
df1 <- concat.split(data = df, split.col = "V3", sep = "_", drop = TRUE)
# split on "-"
df2 <- concat.split(data = df1, split.col = "V3_5", sep = "-", drop = TRUE)
View(df2)
Hi,
I noticed some strange behavior using cSplit.
My input (data) looks like this:
head(data)
SNP1 SNP2 SNP3
1 AKL001 TT CC TT
2 AKL002 TT CC TT
3 BSC001 TT CC TT
4 BSC080 CT CC TT
5 BSC087 TT CC TT
6 BSC114 CT CC TT
df<-cSplit(data, grep("SNP", names(data)) , "", stripWhite = FALSE)
and the result is
head(df)
V1 SNP1_1 SNP1_2 SNP2_1 SNP2_2 SNP3_1 SNP3_2
1: AKL001 T T C C TRUE TRUE
2: AKL002 T T C C TRUE TRUE
3: BSC001 T T C C TRUE TRUE
4: BSC080 C T C C TRUE TRUE
5: BSC087 T T C C TRUE TRUE
6: BSC114 C T C C TRUE TRUE
So it shows weird behavior when a column contains only TT values. Any help would be appreciated!
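The likely cause: after splitting, a column containing only "T" values is coerced to logical during type conversion, since base R accepts "T" as TRUE. A minimal base-R illustration of that coercion:

```r
# After splitting "TT" into characters, the column holds only "T" values;
# type.convert() then reads "T" as the logical TRUE
converted <- utils::type.convert(c("T", "T"), as.is = TRUE)
converted
# [1] TRUE TRUE
```

If cSplit's type.convert argument is set to FALSE, the split pieces should stay character.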
Check for any functions assigning to the global environment, like FacsToChars. Fix so the package passes R CMD check --as-cran.
Not required, but would make things easier....
I'm seeing some weird interactions between using a regex to match on whitespace and the characters '_s'.
d <- data.frame(id=1,list=c('a b_s ccc s_ddd eee_s fff s_g s_hhh iii'))
cSplit(d, c('list'), sep = "\\s", direction = "long", fixed = FALSE,
drop = TRUE, stripWhite = TRUE, makeEqual = FALSE,
type.convert = FALSE)
# WRONG:
# id list
#1: 1 a
#2: 1 b_scccs_ddd
#3: 1 eee_sfffs_gs_hhh
#4: 1 iii
# Contrast with
cSplit(d, c('list'), sep = " ", direction = "long", fixed = TRUE,
drop = TRUE, stripWhite = TRUE, makeEqual = FALSE,
type.convert = FALSE)
# As expected:
# id list
#1: 1 a
#2: 1 b_s
#3: 1 ccc
#4: 1 s_ddd
#5: 1 eee_s
#6: 1 fff
#7: 1 s_g
#8: 1 s_hhh
#9: 1 iii
Instead of 1.10.5, can lower to 1.10.4 and maybe use hasArg(logical01) (or packageVersion("data.table") < "1.10.5" or something) to decide whether to include the argument or not in fread.
Not sure if that's a good idea or not....
Previously, with strsplit
, it was easy to leave the option open to split on a regular expression too. Now?
See this answer on Stack Overflow for some ideas. Switch to the f1
option, perhaps.
See: http://stackoverflow.com/questions/33832085/replicate-entries-in-dataframe-in-r
If expandRows
were used on this, it would just return a factor vector.
I didn't expect to get a result like the one below:
df<-data.frame(x=c(1,1,2,2,2,7),t=1:6)
df
x t
1 1 1
2 1 2
3 2 3
4 2 4
5 2 5
6 7 6
stratified(df,group=1,size=1)
x t
1: 1 1
2: 2 5
3: 1 1
I expected to get exactly one row of values from stratum "1", "2", and "7", respectively. Instead, I got two from "1", one from "2", and none from "7".
How is this possible? Have I misunderstood something?
Best regards,
Magnus
Example of code to reproduce:
dt1 <- fread("V1 V2 V3
x b;c;d 1
y d;ef 2
z d;ef 3
m tmp 4
n tmp 5")
dt1[4, V2:=''] # this record will be lost because V2 value is empty string
dt1[5, V2:=NA_character_] # NA value is processed correctly
cSplit(dt1, splitCols = 'V2', sep = ';', direction = 'long')
As you can see, record 4 (where V2 == '') is lost.
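A workaround until empty strings are handled is to convert them to NA before splitting. A base-R sketch on a plain data.frame (with a data.table, dt1[V2 == '', V2 := NA_character_] would be the analogue):

```r
df <- data.frame(V1 = c("m", "n"), V2 = c("", NA), stringsAsFactors = FALSE)
# Replace empty strings with NA so the record survives the long-format split
df$V2[!is.na(df$V2) & df$V2 == ""] <- NA_character_
df$V2
# [1] NA NA
```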
Following my SO post here, I would appreciate if you could fix the bug.
data1<-structure(list(reason = c("1", "1", NA, "1", "1", "4 5", "1",
"1", "1", "1", "1", "1 2 3 4", "1 2 5", NA, NA)), .Names = "reason", class = "data.frame", row.names = c(NA,
-15L))
#loading packages
library(data.table)
library(splitstackshape)
cSplit_e(setDT(data1),1," ",mode = "value") # with NA's doesn't work
Error in seq.default(min(vec), max(vec)) : 'from' must be a finite number
data2<-na.omit(setDT(data1),cols="reason") # removing NA's
cSplit_e(data2,1," ",mode = "value") # without NA's works
reason reason_1 reason_2 reason_3 reason_4 reason_5
1: 1 1 NA NA NA NA
2: 1 1 NA NA NA NA
3: 1 1 NA NA NA NA
4: 1 1 NA NA NA NA
5: 4 5 NA NA NA 4 5
6: 1 1 NA NA NA NA
7: 1 1 NA NA NA NA
8: 1 1 NA NA NA NA
9: 1 1 NA NA NA NA
10: 1 1 NA NA NA NA
11: 1 2 3 4 1 2 3 4 NA
12: 1 2 5 1 2 NA NA 5
concat.split returns a dense matrix, but sometimes a sparse matrix is needed.
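A sketch of building a sparse indicator matrix from split values with the Matrix package (this is not part of splitstackshape; variable names are illustrative):

```r
library(Matrix)

vals <- c("1,2", "2", "1,3")               # a concatenated column
parts <- strsplit(vals, ",", fixed = TRUE)
lev <- sort(unique(unlist(parts)))

m <- sparseMatrix(
  i = rep(seq_along(parts), lengths(parts)),  # row index of each split value
  j = match(unlist(parts), lev),              # column index of its level
  x = 1,
  dims = c(length(parts), length(lev)),
  dimnames = list(NULL, lev)
)
m
```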
Because of how the function is coded, if you had a set of variables named, say, incA
, valA
, incB
, valB
, setting sep = "inc|val"
will work to strip away that part of the variable name, thus leaving us with A
and B
as the times (as we would want).
Perhaps explore whether this can be made part of NoSep
.
A few variables are created automatically with several functions. Would it be possible to provide a user option to supply a value to override the defaults?
The current charBinaryMat
can be made much faster with matrix indexing:
charBinaryMat <- function(listOfValues, fill = NA) {
A <- unlist(listOfValues, use.names = FALSE)
lev <- sort(unique(A))
m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev),
dimnames = list(NULL, lev))
Row <- vapply(listOfValues, length, 1L)
Row <- rep(seq_along(Row), Row)
Col <- match(A, lev)
m[cbind(Row, Col)] <- 1L
m
}
Some sample data to test it with:
set.seed(1)
A = sample(10, 100000, replace = TRUE)
str <- sapply(seq_along(A), function(x)
paste(sample(LETTERS[1:10], A[x]), collapse = " "))
lov <- strsplit(str, " ", fixed=TRUE)
- move_me
- dist2df
- [.array
- [.ftable
- ftable2dt
- dupe_thresh
- list_unlister
- make_me_NA
- na_last (add a "row + col" option)
- Riffle (but it needs a LOT of work and testing)
- shifter, shuffler (?)
- sort_ends (? -- How useful?)
- tabulate_int (after modifying to work with factors/characters)
- unlist_by_row / unlist_by_col
- lwSplit (name?) for the act of splitting once to a "long" form and then again to a "wide" form? Eg: https://stackoverflow.com/q/49182838/1270695, https://stackoverflow.com/q/29120787/1270695

What about cases like collapseMe(c("a|b", "c"))?
An option for blank.lines.skip
needs to be added to read.concat
.
Here is example table:
dt1 <- fread("V1 V2 V3
x xA;xB;xC x1;x2;x3
y yD y1
z zF;zG z1")
and I want to split it by both V2
and V3
columns. You can see that the last record is "wrong": V2
has 2 values while V3
has only one. And that how cSplit()
treats those cases:
# with default arguments:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long')
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: y NA NA
#6: y NA NA
#7: z zF z1
#8: z zG NA
#9: z NA NA
# with `makeEqual = TRUE`:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = T)
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: y NA NA
#6: y NA NA
#7: z zF z1
#8: z zG NA
#9: z NA NA
So, by default, it works as if makeEqual = TRUE, while the help says it "Defaults to FALSE". Then I tried with FALSE:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = F)
# Warning in `[.data.table`(indt, , `:=`(eval(splitCols), lapply(X, function(x) { :
# Supplied 5 items to be assigned to 6 items of column 'V3' (recycled leaving remainder of 1 items).
# V1 V2 V3
# 1: x xA x1
# 2: x xB x2
# 3: x xC x3
# 4: y yD y1
# 5: z zF z1
# 6: z zG x1
It recycles V3 elements, but it takes them from another group, which is kind of unexpected. I think it would be more logical to give one of the following outputs:
# without recycling, fill with NA:
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: z zF z1
#6: z zG NA
# with recycling:
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: z zF z1
#6: z zG z1
concat.split doesn't appear to like apostrophes in the data column.
> df <- data.frame(experience=c("Did Use, Didn't like"))
> df
experience
1 Did Use, Didn't like
> concat.split(df, 1)
Error in FUN(NA_integer_[[1L]], ...) :
argument must be coercible to non-negative integer
If you try it without the apostrophe, it works fine.
I installed it from the package manager, appears to be v1.2.0.
Mac OS X x64, Mavericks
R 3.0.2
Shouldn't add too much processing time, but would allow char_mat
and num_mat
to directly work on data.frame
s (as they are lists anyway).
Expect to work (after modification):
df <- data.frame(V1 = c("A", "B", "C"),
V2 = c("C", "C", NA),
V3 = c("D", "A", "B"))
char_mat(df)
This will presently generate an error.
Is there a reason you add data.table to Depends rather than to Imports?
https://github.com/Rexamine/stringi
Notably, stri_list2matrix
and the simplify
argument to stri_split*
. 1.8 seconds instead of 6+ seconds on .5M rows.
Is it time for s3r2
?
Hello.
How would you do this with splitstackshape?
https://stackoverflow.com/questions/41163500/r-transform-from-wide-to-long-without-sorting-columns
Transform from Wide to Long all columns with name "name_number" but without reordering/sorting the columns.
Hi, I just noticed your work on Stack Overflow.
Would you like to integrate the feature of expanding into formulas? I have a prototype package to do this:
https://github.com/wush978/FeatureHashing
In this prototype package, I implemented an API so that:
library(FeatureHashing)
data1 <- data.frame(a = c("1,2,3", "2,3,3", "1,3", "3"), type = c("a", "b", "a", "a"), stringsAsFactors = FALSE)
interpret.tag( ~ tag(a, split = ",", type = "existence") + tag(a, split = ",", type = "count"):type, data = data1)
will produce a data.frame with expanded columns and an expanded formula to run some advanced model such as lm.
Do you want to integrate this feature?
Moreover, I noticed that this approach consumes lots of memory, so I am wondering if there is a way to convert such a data.frame directly to a sparse matrix. I am still working on this.
Switch to a stack
+ data.table
solution.
Example for strsplit
after stacking:
for (i in 1:2) {
DT[, paste("ind", i, sep = "_") :=
unlist(strsplit(as.character(ind), "_"))[[i]],
by = 1:nrow(DT)]
}
If you have data like this:
dat <- data.frame(v1 = c("a, b", "b, c", "c, d", "d, e", "e, f", "g, h", "i, j, k"))
R will assume that you only have two columns, so it will read the data incorrectly.
Possible solution:
- Use gregexpr to count the split characters.
- Use sapply and max to determine the max number of columns (minus 1).
- Pass col.names to the read.concat() function to take care of reading the data correctly.

First of all, thank you so much for this package! It has been part of my routine analysis for some time now. I would just like to suggest a convenience option to skip column renaming after splitting. Example:
to_split <- structure(list(Sample = c("N2_wt_rep1_untreated", "N2_wt_rep1_untreated",
"N2_wt_rep1_untreated", "N2_wt_rep2_untreated", "N2_wt_rep2_untreated",
"N2_wt_rep2_untreated"), Reads = c(470987L, 270891L, 56114L,
513902L, 310722L, 67263L)), .Names = c("Sample", "Reads"), class = "data.frame", row.names = c(NA,
-6L))
split <- cSplit(to_split, "Sample", sep="_")
split
# Reads Sample_1 Sample_2 Sample_3 Sample_4
# 1: 470987 N2 wt rep1 untreated
# 2: 270891 N2 wt rep1 untreated
# 3: 56114 N2 wt rep1 untreated
# 4: 513902 N2 wt rep2 untreated
# 5: 310722 N2 wt rep2 untreated
# 6: 67263 N2 wt rep2 untreated
The new col names are not very informative, so I usually rename them in an extra step:
setnames(split,
c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
c("Background", "Allele", "Replicate", "Treatment")
)
This is fine, but I wonder if it would be possible to skip that extra step with something like cSplit(to_split, "Sample", sep="_", new_names=c("Background", "Allele", "Replicate", "Treatment")).
Cheers.
NoSep is very limited in its current form. Explore possibilities to split in different ways.
See this question. Sometimes, the IDs will result in duplicated row.names
so reshape(..., direction = "long")
won't work.
The solution could be to first test whether there are any duplicated rows in the IDs columns, and if yes, use ave
to generate a new ID with seq_along
.
This should only be done when direction = "long"
.
The string being read in from stringi would be UTF-8 encoded. See sample data here: https://stackoverflow.com/q/13773770/1270695
As I am not the author of LinearizeNestedList
, either contact author for inclusion in package, or find an alternative. The LinearizeNestedList
function might be overkill for the purposes of CBIND
.
When specifying multiple concatenated columns to be parsed in big ragged data (~half a million rows), R 2.15 and R 3 on Linux would report memory-related error messages such as "cannot allocate memory" or "long vector not supported". The workaround that I found is to explicitly set makeEqual to FALSE, as failing to do so generates a large number of blank rows, where the number of rows for each unique identifier equals the maximum number of delimiters found among all entries. I suspect this is the source of the memory-exhaustion problem.
This indicates that the default value of makeEqual, at least in my experience, is not FALSE, as suggested by the documentation for the package.
Update: after coercing makeEqual to FALSE, the results are totally off. Leaving makeEqual untouched creates memory problems for large datasets but produces the correct data after all NA rows are removed.
Referring to http://stackoverflow.com/questions/23528882, it seems that splitstackshape:::read.concat
may sometimes open too many textConnection
s at a time.
According to ?connections
, A maximum of 128 connections can be allocated.
Must use title case, not have punctuation, and be shorter than 65 characters....
Let's say I want to pull up columns 3 through 5: I would use dataframe[,3:5], which works perfectly. After using splitstackshape, that same command returns [1] 3 4 5. If I run fix(dataframe) and close it, I can use the references again.
Here is a snippet showing the issue (using IMDB data):
head(movies[,1:5])
X Title Year Runtime Genre
tt0000439 1 The Great Train Robbery 1903 11 Short, Western
tt0003037 2 Juve Against Fantomas 1913 61 Crime, Drama
tt0003740 3 Cabiria 1914 148 Adventure, Drama, History
tt0004707 4 Tillie's Punctured Romance 1914 82 Comedy
tt0005960 5 Regeneration 1915 72 Biography, Crime, Drama
tt0006206 6 Les vampires 1915 399 Action, Adventure, Crime

movies <- cSplit(movies, "Genre", sep=",")
head(movies[,1:5])
[1] 1 2 3 4 5

fix(movies)
head(movies[,1:5])
X Title Year Runtime Released
1 1 The Great Train Robbery 1903 11 1903-12-01
2 2 Juve Against Fantomas 1913 61 1913-10-02
3 3 Cabiria 1914 148 1914-06-01
4 4 Tillie's Punctured Romance 1914 82 1914-12-21
5 5 Regeneration 1915 72 1915-09-13
6 6 Les vampires 1915 399 1916-11-23
From the "data.table" readme for version 1.9.5, []
is now needed to print the results. All functions would need to be checked for this.
Consider the following data.table:
> dt <- data.table(id=1:3, A_2001=c(1,2,3), B_2001=c(1,3,5), B_2007=c(4,3,2), AC_2007=c(8,9,10))
> dt
id A_2001 B_2001 B_2007 AC_2007
1: 1 1 1 4 8
2: 2 2 3 3 9
3: 3 3 5 2 10
Now if we run Stacked
:
> Stacked(dt, id.vars="id", var.stubs=c("A", "B", "AC"), sep="_")
$A
id .time_1 A
1: 1 2001 1
2: 1 2007 8
3: 2 2001 2
4: 2 2007 9
5: 3 2001 3
6: 3 2007 10
$B
id .time_1 B
1: 1 2001 1
2: 1 2007 4
3: 2 2001 3
4: 2 2007 3
5: 3 2001 5
6: 3 2007 2
$AC
id .time_1 AC
1: 1 2007 8
2: 2 2007 9
3: 3 2007 10
The A
column now has also picked up the values for AC
from 2007.
This is due to the use of grep on line 2 of Stacked: any stub that is a subset of another stub will pick up the values of that stub.
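A sketch of the fix: anchor the stub match to the separator so that A does not also match AC (the pattern construction below is illustrative, not the package's exact code):

```r
vars <- c("A_2001", "B_2001", "B_2007", "AC_2007")

# Matching on the bare stub catches "AC_2007" too:
grep("^A", vars, value = TRUE)
# [1] "A_2001"  "AC_2007"

# Anchoring stub + separator isolates the intended columns:
stub <- "A"; sep <- "_"
pattern <- paste0("^", stub, sep)
grep(pattern, vars, value = TRUE)
# [1] "A_2001"
```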