
splitstackshape's People

Contributors

mrdwab, sritchie73


splitstackshape's Issues

Possible bug in dealing with factor columns?

I was attempting to answer this SO question using:

require(splitstackshape)
merged.stack(d, id="PID", var.stubs=c("Cue"), sep="var.stubs")
#     PID .time_1 Cue
#  1:   1       1   1
#  2:   1       1   3
#  3:   1       1   5
#  4:   1       2   2
#  5:   1       2   5
#  6:   1       2   5
#  7:   2       1   1
#  8:   2       1   3
#  9:   2       1   5
#10:   2       2   2
#11:   2       2   5
#12:   2       2   5

Unless I'm mistaken as to what the code is supposed to do, this isn't the right output.

sessionInfo()
# R version 3.1.2 (2014-10-31)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)

# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] splitstackshape_1.4.2 data.table_1.9.5     

# loaded via a namespace (and not attached):
# [1] chron_2.3-45 tools_3.1.2 

cSplit_l throws an error when the separator is ''

The concatenated column 'response' in my data.table 'dt.gdoc' contains digit strings with no separator, e.g. "13464". I wanted to split this column into a list using

cSplit_l(dt.gdoc,'response',sep='',drop=T)

However, when I attempted to run this code, I got the following error:

Error in gsub(sprintf("\s+[%s]\s+|\s+[%s]|[%s]\s+", delim, delim,  : 
  invalid regular expression '\s+[]\s+|\s+[]|[]\s+', reason 'Missing ']''
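
The failure can be reproduced outside the package. This is a minimal base R sketch, assuming the whitespace-stripping pattern is built with sprintf() as the error message suggests. The variable names are illustrative, and strsplit() is shown only as a possible workaround, not as what cSplit_l does internally:

```r
# Rebuild the pattern from the error message with an empty delimiter.
delim <- ""
pattern <- sprintf("\\s+[%s]\\s+|\\s+[%s]|[%s]\\s+", delim, delim, delim)

# With delim = "", "[%s]" becomes "[]", which leaves a bracket expression
# unterminated, so compiling the pattern is expected to fail.
msg <- tryCatch(
  suppressWarnings(gsub(pattern, delim, "1 3 4")),
  error = function(e) conditionMessage(e)
)

# Possible workaround: base strsplit() accepts "" and splits a string
# into its individual characters.
response <- c("13464", "25")
chars <- strsplit(response, "")
```

A guard such as `if (!nzchar(sep))` before building the pattern would be one way to route the empty-separator case to a character-wise split.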

Make functions compatible with `tbl_df` objects

Demo:

library(splitstackshape)
library(dplyr)

CT <- tbl_df(head(concat.test))

CT %>% cSplit("Likes")
# Error in `[.tbl_df`(indt, , splitCols, with = FALSE) : 
#   unused argument (with = FALSE)

CT %>% data.frame %>% cSplit("Likes")
#      Name                   Siblings    Hates Likes_1 Likes_2 Likes_3 Likes_4 Likes_5
#1:   Boyd Reynolds , Albert , Ortega     2;4;       1       2       4       5       6
#2:  Rufus  Cohen , Bert , Montgomery 1;2;3;4;       1       2       4       5       6
#3:   Dana                     Pierce       2;       1       2       4       5       6
#4: Carole Colon , Michelle , Ballard     1;4;       1       2       4       5       6
#5: Ramona           Snyder , Joann ,   1;2;3;       1       2       5       6      NA
#6: Kelley          James , Roxanne ,     1;4;       1       2       5       6      NA

See http://stackoverflow.com/questions/32173746/error-running-csplit-when-splitstackshape-data-frame-and-tidyr-dplyr-are-loaded

conversion checklist

Check for update/compatibility with cSplit and replace where possible.

  • concat.split.compact
  • concat.split.expanded
  • concat.split.list
  • concat.split.multiple

Other work:

  • Create aliases for concat.split.expanded and concat.split.list.
  • Include concat.split.DT (https://gist.github.com/mrdwab/6873058).
  • Include stratified (https://gist.github.com/mrdwab/933ffeaa7a1d718bd10a).
  • Move vGrep out as a non-exported function since it is used in a couple of places.
  • getanID needs to be made more robust.
  • Check to make sure that we are not overwriting existing data.tables (by reference) when we use :=. Something like if (!is.data.table(dataset)) dataset <- as.data.table(dataset) else dataset <- copy(dataset) might be safer.

Explore:

  • Can the id.vars be made redundant?
  • Can cSplit be modified to accept vectors as its input? Don't see an obvious way...
  • Are there utilities that can be brought in from my other packages (expandRows, for example)?

There is no reason for `Stacked` to create a 1 item list

Using the data from example(Stacked), note that if you do:

set.seed(1)
mydf <- data.frame(id_1 = 1:6, id_2 = c("A", "B"),
                   varA.1 = sample(letters, 6),
                   varA.2 = sample(letters, 6),
                   varA.3 = sample(letters, 6),
                   varB.2 = sample(10, 6),
                   varB.3 = sample(10, 6),
                   varC.3 = rnorm(6))
mydf
Stacked(data = mydf, id.vars = c("id_1", "id_2"),
        var.stubs = c("varA", "varB", "varC"),
        sep = "\\.")

you'd get a list (as expected).

You'd also get a list if you did:

Stacked(data = mydf, id.vars = c("id_1", "id_2"), 
        var.stubs = "varA", sep = "\\.")

This doesn't make as much sense...
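
One possible behaviour is to return the single element itself when only one stub is requested. A base R sketch of that idea; `unwrap_single` is a hypothetical helper, not part of the package:

```r
# Hypothetical helper: if a plain list holds exactly one element,
# return that element; otherwise return the list unchanged.
unwrap_single <- function(x) {
  if (is.list(x) && !is.data.frame(x) && length(x) == 1L) x[[1L]] else x
}

# Stand-ins for what Stacked might return with one stub and with two stubs.
one  <- list(varA = data.frame(id_1 = 1:2, varA = c("a", "b")))
many <- list(varA = data.frame(id_1 = 1:2), varB = data.frame(id_1 = 1:2))

single_result <- unwrap_single(one)   # a data.frame
multi_result  <- unwrap_single(many)  # still a list of two
```

The trade-off is that the return type then depends on the input, which callers would need to know about.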

Switch from read.table to fread?

Perhaps something like this will work:

csDataTable <- function(dataset, splitcol, sep, drop = FALSE) {
  if (is.numeric(splitcol)) splitcol <- names(dataset)[splitcol]
  x <- getwd()
  setwd(tempdir())
  writeLines(as.character(dataset[[splitcol]]), "temp.txt")
  Split <- fread("temp.txt", sep=sep)
  setnames(Split, paste(splitcol, seq_along(Split), sep = "_"))
  setwd(x)
  if (!is.data.table(dataset)) dataset <- data.table(dataset)
  if (isTRUE(drop)) dataset[, eval(splitcol) := NULL]
  cbind(dataset, Split)
}

See: http://stackoverflow.com/q/19228870/1270695

`concat.split.expanded` for character variables

First, thanks for this package. It is very useful, and the code and documentation are very clean.

There is one feature I miss: a concat.split.expanded for character variables. Suppose you have a variable listing the preferred music styles of some people. You end up with something like this:

               music
1       rock,electro
2            electro
3 rock,karadi rhymes

If you want to reuse these variables, for example to cross them with gender or age, I find it useful to transform them as follows:

  music.electro music.rock music.karadi_rhymes
1             1          1                   0
2             1          0                   0
3             0          1                   1

In other words, a concat.split.expanded for characters in binary mode. You can then cross-tabulate with table(music.electro, gender), etc.

That's what the multi.split function from my questionr package does, but it is not as well written or as featureful as your concat.split. So I wondered whether you think it could be implemented in splitstackshape, or whether it would be incompatible with other functions such as Reshape.

Thanks!
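
For reference, the requested transformation can be sketched in base R. This is only an illustration of the desired output, not how concat.split.expanded or multi.split is implemented; the `music.` prefix and the underscore replacement follow the example above:

```r
music <- c("rock,electro", "electro", "rock,karadi rhymes")

# Split each entry on the separator and collect the distinct styles.
parts <- strsplit(music, ",", fixed = TRUE)
styles <- sort(unique(unlist(parts)))

# One row per respondent, one 0/1 column per style.
binary <- t(vapply(parts,
                   function(p) as.integer(styles %in% p),
                   integer(length(styles))))
colnames(binary) <- paste0("music.", gsub(" ", "_", styles))
binary
```

The columns come out in alphabetical order of the styles, so their order differs from the hand-written table above.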

Multiple Separators for the same file input R

Hey Ananda Mahto,

I've given this issue the same title as my Stack Overflow question:
http://stackoverflow.com/questions/20075135/multiple-separators-for-the-same-file-input-r

Cawarnold / splitstring_-df3 / Mobile-App

R code I tried.

df <- results8[150:200,]
View(df)
library(splitstackshape)

# split concatenated column by _

df1 <- concat.split(data = df, split.col = "V3", sep = "_", drop = TRUE)

# split the remaining concatenated part by -

df2 <- concat.split(data = df1, split.col = "V3_5", sep = "-", drop = TRUE)
View(df2)
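
If the two separators never need to be told apart, both splits can be done in one pass with a character class. A base R sketch; the sample strings are made up, since the real contents of V3 are not shown here:

```r
# Made-up values with "_" and "-" both acting as separators.
v3 <- c("aa_bb_cc-dd", "ee_ff_gg-hh")

# "[_-]" matches either an underscore or a hyphen.
parts <- strsplit(v3, "[_-]")

# Bind the pieces into a character matrix, one row per input string.
wide <- do.call(rbind, parts)
wide
```

This assumes every string yields the same number of pieces; ragged input would need padding before rbind.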

cSplit: strange behaviour when only TT in column

Hi,

I noticed some strange behavior using cSplit.

My input (data) looks like this:

head(data)
      SNP1 SNP2 SNP3
1 AKL001   TT   CC   TT
2 AKL002   TT   CC   TT
3 BSC001   TT   CC   TT
4 BSC080   CT   CC   TT
5 BSC087   TT   CC   TT
6 BSC114   CT   CC   TT

df<-cSplit(data, grep("SNP", names(data)) , "", stripWhite = FALSE)

and the result is

head(df)
       V1 SNP1_1 SNP1_2 SNP2_1 SNP2_2 SNP3_1 SNP3_2
1: AKL001      T      T      C      C   TRUE   TRUE
2: AKL002      T      T      C      C   TRUE   TRUE
3: BSC001      T      T      C      C   TRUE   TRUE
4: BSC080      C      T      C      C   TRUE   TRUE
5: BSC087      T      T      C      C   TRUE   TRUE
6: BSC114      C      T      C      C   TRUE   TRUE

so it shows weird behavior when there is only TT present in the column. Any help would be appreciated!
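
One plausible explanation, assuming the split pieces go through a type conversion similar to utils::type.convert(): a column made up entirely of "T" is read as logical TRUE, while a column that also contains "C" stays character. A base R sketch of that behaviour:

```r
all_t <- c("T", "T", "T")   # like SNP3 after splitting
mixed <- c("C", "T", "T")   # like SNP1 after splitting

# "T" and "F" are accepted as logical values by type.convert().
converted_all_t <- type.convert(all_t, as.is = TRUE)
converted_mixed <- type.convert(mixed, as.is = TRUE)

class(converted_all_t)  # logical
class(converted_mixed)  # character
```

If this is the cause, calling cSplit with type.convert = FALSE might keep the columns as character, though that is not verified here.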

Whitespace regex and '_s' interaction

I'm seeing some weird interactions between using a regex to match whitespace and the characters '_s'.

d <- data.frame(id=1,list=c('a b_s ccc s_ddd eee_s fff s_g s_hhh iii'))
cSplit(d, c('list'), sep = "\\s", direction = "long", fixed = FALSE,
                 drop = TRUE, stripWhite = TRUE, makeEqual = FALSE,
                 type.convert = FALSE)
# WRONG:
#   id             list
#1:  1                a
#2:  1      b_scccs_ddd
#3:  1 eee_sfffs_gs_hhh
#4:  1              iii

# Contrast with
cSplit(d, c('list'), sep = " ", direction = "long", fixed = TRUE,
               drop = TRUE, stripWhite = TRUE, makeEqual = FALSE,
               type.convert = FALSE)
# As expected: 
#   id  list
#1:  1     a
#2:  1   b_s
#3:  1   ccc
#4:  1 s_ddd
#5:  1 eee_s
#6:  1   fff
#7:  1   s_g
#8:  1 s_hhh
#9:  1   iii
 

Lower data.table version dependency level?

Instead of requiring 1.10.5, the dependency could be lowered to 1.10.4, perhaps using hasArg(logical01) (or packageVersion("data.table") < "1.10.5", or something similar) to decide whether to pass the argument to fread.

Not sure if that's a good idea or not....

The stratified function gives surprising results

I didn't expect to get a result like the one below:

df<-data.frame(x=c(1,1,2,2,2,7),t=1:6)
df
x t
1 1 1
2 1 2
3 2 3
4 2 4
5 2 5
6 7 6
stratified(df,group=1,size=1)
x t
1: 1 1
2: 2 5
3: 1 1

I expected to get exactly one row of values from stratum "1", "2", and "7", respectively. Instead, I got two from "1", one from "2", and none from "7".

How is this possible? Have I misunderstood something?

Best regards,
Magnus
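
For comparison, the expected result (one row from each stratum) can be sketched in base R. This is not how stratified is implemented; it only shows the outcome described above, and it indexes with sample.int() to avoid the known pitfall where sample(x, 1) treats a single number x as 1:x:

```r
df <- data.frame(x = c(1, 1, 2, 2, 2, 7), t = 1:6)

set.seed(42)
# Row indices belonging to each stratum of x.
rows_by_group <- split(seq_len(nrow(df)), df$x)

# Pick one position within each stratum, then map it back to a row index.
picked <- vapply(rows_by_group,
                 function(idx) idx[sample.int(length(idx), 1L)],
                 integer(1))

one_per_stratum <- df[picked, ]
one_per_stratum
```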

record is lost if splitCols contains empty value

Example of code to reproduce:

dt1 <- fread("V1 V2 V3
             x b;c;d 1
             y d;ef  2
             z d;ef  3
             m tmp   4
             n tmp   5")
dt1[4, V2:=''] # this record will be lost because V2 value is empty string
dt1[5, V2:=NA_character_] # NA value is processed correctly

cSplit(dt1, splitCols = 'V2', sep = ';', direction = 'long')

As you can see, record 4 (where V2 == '') is lost.
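
A possible explanation, assuming the long-format split behaves like base strsplit(): an empty string splits into zero pieces, so the record contributes no rows, whereas NA splits into a single NA piece. A base R sketch, with a guard that is only a suggestion:

```r
v2 <- c("b;c;d", "d;ef", "", NA)

pieces <- strsplit(v2, ";", fixed = TRUE)
n_pieces <- lengths(pieces)
n_pieces  # the empty string yields 0 pieces, NA yields 1

# Suggested guard: give empty inputs one empty piece so the row survives.
pieces[n_pieces == 0L] <- list("")
n_after_guard <- lengths(pieces)
```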

cSplit_e from splitstackshape package not accounting for NA's?

Following up on my SO post, I would appreciate it if you could fix this bug.

data1<-structure(list(reason = c("1", "1", NA, "1", "1", "4 5", "1", 
"1", "1", "1", "1", "1 2 3 4", "1 2 5", NA, NA)), .Names = "reason", class = "data.frame", row.names = c(NA, 
-15L))

 #loading packages
 library(data.table)
 library(splitstackshape)

cSplit_e(setDT(data1),1," ",mode = "value") # with NA's doesn't work

Error in seq.default(min(vec), max(vec)) : 'from' must be a finite number

data2<-na.omit(setDT(data1),cols="reason") # removing NA's 

cSplit_e(data2,1," ",mode = "value") # without NA's works
     reason reason_1 reason_2 reason_3 reason_4 reason_5
 1:       1        1       NA       NA       NA       NA
 2:       1        1       NA       NA       NA       NA
 3:       1        1       NA       NA       NA       NA
 4:       1        1       NA       NA       NA       NA
 5:     4 5       NA       NA       NA        4        5
 6:       1        1       NA       NA       NA       NA
 7:       1        1       NA       NA       NA       NA
 8:       1        1       NA       NA       NA       NA
 9:       1        1       NA       NA       NA       NA
10:       1        1       NA       NA       NA       NA
11: 1 2 3 4        1        2        3        4       NA
12:   1 2 5        1        2       NA       NA        5
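
The error message points at seq(min(vec), max(vec)). A base R sketch of why NA values would break that call, and of a possible fix with na.rm = TRUE; `vec` here is an illustrative stand-in for the values the function collects, not the package's actual variable contents:

```r
vec <- c(1, NA, 4, 5)

# min() and max() return NA when any value is NA, and seq() rejects that.
msg <- tryCatch(seq(min(vec), max(vec)),
                error = function(e) conditionMessage(e))

# Dropping NA values before taking the range avoids the error.
rng <- seq(min(vec, na.rm = TRUE), max(vec, na.rm = TRUE))
rng
```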

Add an option for `sep=var.stubs` or something similar

Because of how the function is coded, if you had a set of variables named, say, incA, valA, incB, valB, setting sep = "inc|val" will work to strip away that part of the variable name, thus leaving us with A and B as the times (as we would want).

Perhaps explore whether this can be made part of NoSep.
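
The effect of such a separator on the variable names can be checked in base R. A small sketch using the names from the example above:

```r
vars <- c("incA", "valA", "incB", "valB")

# Removing the stub leaves the part used as the "time" value.
times <- sub("inc|val", "", vars)

# Extracting the matched part recovers the stub itself.
stubs <- regmatches(vars, regexpr("inc|val", vars))

data.frame(vars, stubs, times)
```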

charBinaryMat improvements

The current charBinaryMat can be made much faster with matrix indexing:

charBinaryMat <- function(listOfValues, fill = NA) {
  A <- unlist(listOfValues, use.names = FALSE)
  lev <- sort(unique(A))
  m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev),
              dimnames = list(NULL, lev))
  Row <- vapply(listOfValues, length, 1L)
  Row <- rep(seq_along(Row), Row)
  Col <- match(A, lev)
  m[cbind(Row, Col)] <- 1L
  m
}

Some sample data to test it with:

set.seed(1)
A = sample(10, 100000, replace = TRUE)
str <- sapply(seq_along(A), function(x)
  paste(sample(LETTERS[1:10], A[x]), collapse = " "))

lov <- strsplit(str, " ", fixed=TRUE)

Functions to add

  • move_me
  • dist2df
  • [.array
  • [.ftable
  • ftable2dt
  • dupe_thresh
  • list_unlister
  • make_me_NA
  • na_last (add a "row + col" option)
  • Riffle (but it needs a LOT of work and testing)
  • shifter, shuffler (?)
  • sort_ends (? -- How useful?)
  • tabulate_int (after modifying to work with factors/characters)
  • unlist_by_row/unlist_by_col
  • lwSplit (name?) for the act of splitting once to a "long" form and then again to a "wide" form? Eg: https://stackoverflow.com/q/49182838/1270695, https://stackoverflow.com/q/29120787/1270695

blank.lines.skip

An option for blank.lines.skip needs to be added to read.concat.

how cSplit() treats multiple splitCols when they contain different number of fields

Here is example table:

dt1 <- fread("V1 V2       V3
              x  xA;xB;xC x1;x2;x3
              y  yD       y1
              z  zF;zG    z1")

I want to split it by both the V2 and V3 columns. Note that the last record is "wrong": V2 has 2 values while V3 has only one. This is how cSplit() treats such cases:

# with default arguments:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long')
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  y NA NA
#6:  y NA NA
#7:  z zF z1
#8:  z zG NA
#9:  z NA NA

# with `makeEqual = TRUE`:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = T)
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  y NA NA
#6:  y NA NA
#7:  z zF z1
#8:  z zG NA
#9:  z NA NA

So by default it behaves as if makeEqual = TRUE, although the help says it defaults to FALSE. Then I tried makeEqual = FALSE:

cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = F)
# Warning in `[.data.table`(indt, , `:=`(eval(splitCols), lapply(X, function(x) { :
#     Supplied 5 items to be assigned to 6 items of column 'V3' (recycled leaving remainder of 1 items).
#      V1 V2 V3
#   1:  x xA x1
#   2:  x xB x2
#   3:  x xC x3
#   4:  y yD y1
#   5:  z zF z1
#   6:  z zG x1

It recycles V3 elements, but takes them from another group, which is unexpected. I think one of the following outputs would be more logical:

# without recycling, fill with NA:
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  z zF z1
#6:  z zG NA

# with recycling:
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  z zF z1
#6:  z zG z1
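
The first proposed output (no recycling, fill with NA) can be sketched in base R by padding within each record rather than across the whole table. This is an illustration of the suggested behaviour, not cSplit's code, and `pad` is a hypothetical helper:

```r
v1 <- c("x", "y", "z")
v2 <- c("xA;xB;xC", "yD", "zF;zG")
v3 <- c("x1;x2;x3", "y1", "z1")

s2 <- strsplit(v2, ";", fixed = TRUE)
s3 <- strsplit(v3, ";", fixed = TRUE)

# Rows needed per record: the longer of the two splits for that record.
n <- pmax(lengths(s2), lengths(s3))

# Hypothetical helper: pad each record's pieces with NA up to n[i].
pad <- function(parts, n) {
  unlist(Map(function(p, k) c(p, rep(NA_character_, k - length(p))), parts, n),
         use.names = FALSE)
}

long <- data.frame(V1 = rep(v1, n), V2 = pad(s2, n), V3 = pad(s3, n),
                   stringsAsFactors = FALSE)
long
```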

Using concat.split on data containing an apostrophe causes error

concat.split doesn't appear to like apostrophes in the data column.

> df <- data.frame(experience=c("Did Use, Didn't like"))
> df
            experience
1 Did Use, Didn't like
> concat.split(df, 1)
Error in FUN(NA_integer_[[1L]], ...) : 
  argument must be coercible to non-negative integer

If you try it without, it works fine.

I installed it from the package manager, appears to be v1.2.0.

Mac OS X x64, Mavericks

R 3.0.2
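
A likely cause, assuming concat.split parses the column with read.table-style defaults: the apostrophe is treated as the start of a quoted string. A base R sketch showing that disabling quote handling reads the same value cleanly:

```r
experience <- "Did Use, Didn't like"

# quote = "" tells read.table not to treat ' or " as quoting characters.
parsed <- read.table(text = experience, sep = ",", quote = "",
                     strip.white = TRUE, stringsAsFactors = FALSE)
parsed
```

Passing quote = "" through to the reader would be one way to handle such data, at the cost of no longer supporting quoted fields that contain the separator.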

Integrate with formula

Hi, I just noticed your work on Stack Overflow.

Would you like to integrate formula-based expansion? I have a prototype package that does this:

https://github.com/wush978/FeatureHashing

In this prototype package, I implemented an API so that:

library(FeatureHashing)
data1 <- data.frame(a = c("1,2,3", "2,3,3", "1,3", "3"), type = c("a", "b", "a", "a"), stringsAsFactors = FALSE)
interpret.tag( ~ tag(a, split = ",", type = "existence") + tag(a, split = ",", type = "count"):type, data = data1)

will produce a data.frame with expanded columns and an expanded formula for fitting a model such as lm.

Do you want to integrate this feature?


Moreover, I noticed that this approach consumes a lot of memory, so I am wondering whether such a data.frame can be converted directly to a sparse matrix. I am still working on this.

Stacked seriously CRAWLS with large datasets

Switch to a stack + data.table solution.

Example for strsplit after stacking:

for (i in 1:2) {
  DT[, paste("ind", i, sep = "_") := 
       unlist(strsplit(as.character(ind), "_"))[[i]],
     by = 1:nrow(DT)]
}
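
The per-row loop above calls strsplit once per row and per piece. A vectorized base R sketch of the same extraction, splitting once and then picking each piece; the `ind` values are made up to resemble stacked variable names:

```r
ind <- c("varA_1", "varA_2", "varB_2")

# Split every value once.
parts <- strsplit(ind, "_", fixed = TRUE)

# Pick the i-th piece from each split result.
ind_1 <- vapply(parts, `[`, character(1), 1L)
ind_2 <- vapply(parts, `[`, character(1), 2L)
```

The resulting vectors could then be assigned as columns in a single step instead of grouping by row.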

Fix the problem that can emerge with `read.table`

If you have data like this:

dat <- data.frame(v1 = c("a, b", "b, c", "c, d", "d, e", "e, f", "g, h", "i, j, k"))

R will assume that you only have two columns, so it will read the data incorrectly.

Possible solution:

  1. Use gregexpr to count the split characters
  2. Run sapply and max to determine the maximum number of split characters, which is the maximum number of columns minus 1
  3. Add col.names to the read.concat() function to take care of reading the data correctly.
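
The three steps can be sketched in base R as follows. This calls read.table directly rather than read.concat, and the `v1_` column prefix is an arbitrary choice for the illustration:

```r
dat <- data.frame(v1 = c("a, b", "b, c", "c, d", "d, e", "e, f", "g, h", "i, j, k"),
                  stringsAsFactors = FALSE)

# Step 1: count the separators in each value (gregexpr returns -1 for no match).
n_sep <- vapply(gregexpr(",", dat$v1, fixed = TRUE),
                function(m) sum(m > 0), integer(1))

# Step 2: the widest row has max(n_sep) + 1 columns.
n_col <- max(n_sep) + 1L

# Step 3: supply col.names so the reader allocates enough columns up front.
out <- read.table(text = dat$v1, sep = ",", fill = TRUE, strip.white = TRUE,
                  col.names = paste0("v1_", seq_len(n_col)),
                  stringsAsFactors = FALSE)
out
```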

Option to add colnames to new columns

First of all, thank you so much for this package! It has been part of my routine analysis for some time now. I would like to suggest a convenience option that skips column renaming after splitting. Example:

to_split <- structure(list(Sample = c("N2_wt_rep1_untreated", "N2_wt_rep1_untreated", 
"N2_wt_rep1_untreated", "N2_wt_rep2_untreated", "N2_wt_rep2_untreated", 
"N2_wt_rep2_untreated"), Reads = c(470987L, 270891L, 56114L, 
513902L, 310722L, 67263L)), .Names = c("Sample", "Reads"), class = "data.frame", row.names = c(NA, 
-6L))
split <- cSplit(to_split, "Sample", sep="_")
split
#     Reads Sample_1 Sample_2 Sample_3  Sample_4
# 1: 470987       N2       wt     rep1 untreated
# 2: 270891       N2       wt     rep1 untreated
# 3:  56114       N2       wt     rep1 untreated
# 4: 513902       N2       wt     rep2 untreated
# 5: 310722       N2       wt     rep2 untreated
# 6:  67263       N2       wt     rep2 untreated

The new col names are not very informative, so I usually rename them in an extra step:

setnames(split,
   c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
   c("Background", "Allele", "Replicate", "Treatment")
)

This is fine, but I wonder if it would be possible to skip that extra step with something like cSplit(to_split, "Sample", sep="_", new_names=c("Background", "Allele", "Replicate", "Treatment")).

Cheers.
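
Until such an option exists, the idea can be sketched as a small base R wrapper. `split_named` and its `new_names` argument are hypothetical and are not part of cSplit:

```r
# Hypothetical helper: split a character vector and name the pieces in one step.
split_named <- function(x, sep, new_names) {
  parts <- do.call(rbind, strsplit(x, sep, fixed = TRUE))
  # Refuse to guess when the number of pieces and names disagree.
  stopifnot(ncol(parts) == length(new_names))
  out <- as.data.frame(parts, stringsAsFactors = FALSE)
  names(out) <- new_names
  out
}

samples <- c("N2_wt_rep1_untreated", "N2_wt_rep2_untreated")
named <- split_named(samples, "_",
                     c("Background", "Allele", "Replicate", "Treatment"))
named
```

The sketch assumes every value splits into the same number of pieces; a real option would also need a rule for ragged input.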

NoSep enhancements

NoSep is very limited in its current form. Explore possibilities to split in different ways.

CBIND dependence on LinearizeNestedList

As I am not the author of LinearizeNestedList, either contact author for inclusion in package, or find an alternative. The LinearizeNestedList function might be overkill for the purposes of CBIND.

Memory exhaustion caused by incorrect default value of "makeEqual"

When specifying multiple concatenated columns to be parsed in a large ragged dataset (about half a million rows), R 2.15 and R 3 on Linux report memory-related errors such as "cannot allocate memory" or "long vector not supported". The workaround I found is to explicitly set makeEqual to FALSE; otherwise a large number of blank rows is generated, because the number of rows for each unique identifier equals the maximum number of delimiters found among all entries. I suspect this is the source of the memory exhaustion.

This indicates that the default value of makeEqual, at least in my experience, is not FALSE, contrary to what the package documentation says.

Update: after setting makeEqual to FALSE, the results are totally off. Leaving makeEqual untouched causes memory problems for large datasets but produces the correct data once all NA rows are removed.

Dataframe references don't work after using splitstackshape

Say I want to pull up columns 3 through 5. I would use dataframe[,3:5], which works perfectly. After using splitstackshape, that same command returns [1] 3, 4, 5. If I run fix(dataframe) and close the editor, the references work again.

Here is a snippet showing the issue (using IMDB data):

head(movies[,1:5])
X Title Year Runtime Genre
tt0000439 1 The Great Train Robbery 1903 11 Short, Western
tt0003037 2 Juve Against Fantomas 1913 61 Crime, Drama
tt0003740 3 Cabiria 1914 148 Adventure, Drama, History
tt0004707 4 Tillie's Punctured Romance 1914 82 Comedy
tt0005960 5 Regeneration 1915 72 Biography, Crime, Drama
tt0006206 6 Les vampires 1915 399 Action, Adventure, Crime

movies <- cSplit(movies, "Genre", sep=",")

head(movies[,1:5])
[1] 1 2 3 4 5

fix(movies)

head(movies[,1:5])
X Title Year Runtime Released
1 1 The Great Train Robbery 1903 11 1903-12-01
2 2 Juve Against Fantomas 1913 61 1913-10-02
3 3 Cabiria 1914 148 1914-06-01
4 4 Tillie's Punctured Romance 1914 82 1914-12-21
5 5 Regeneration 1915 72 1915-09-13
6 6 Les vampires 1915 399 1916-11-23
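
A probable explanation is that cSplit returns a data.table, and data.table versions of that period evaluated a numeric j as an expression rather than as column positions. A hedged sketch of two ways to get columns by position; it uses a made-up two-row table, and runs the data.table part only if that package is installed:

```r
movies <- data.frame(X = 1:2,
                     Title = c("Cabiria", "Regeneration"),
                     Year = c(1914L, 1915L),
                     stringsAsFactors = FALSE)

if (requireNamespace("data.table", quietly = TRUE)) {
  movies_dt <- data.table::as.data.table(movies)

  # Option 1: ask data.table to treat j as column positions.
  first_two_dt <- movies_dt[, 1:2, with = FALSE]

  # Option 2: convert back to a plain data.frame before indexing.
  first_two_df <- as.data.frame(movies_dt)[, 1:2]
}
```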

Stacked behaves incorrectly for var.stubs with overlapping names

Consider the following data.table:

> dt <- data.table(id=1:3, A_2001=c(1,2,3), B_2001=c(1,3,5), B_2007=c(4,3,2), AC_2007=c(8,9,10))
> dt
   id A_2001 B_2001 B_2007 AC_2007
1:  1      1      1      4       8
2:  2      2      3      3       9
3:  3      3      5      2      10

Now if we run Stacked:

> Stacked(dt, id.vars="id", var.stubs=c("A", "B", "AC"), sep="_")
$A
   id .time_1  A
1:  1    2001  1
2:  1    2007  8
3:  2    2001  2
4:  2    2007  9
5:  3    2001  3
6:  3    2007 10

$B
   id .time_1 B
1:  1    2001 1
2:  1    2007 4
3:  2    2001 3
4:  2    2007 3
5:  3    2001 5
6:  3    2007 2

$AC
   id .time_1 AC
1:  1    2007  8
2:  2    2007  9
3:  3    2007 10

The A stack has now also picked up the AC values from 2007.

This is due to the use of grep on line 2 of Stacked: any stub that is a substring of another stub will pick up that stub's values.
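
The difference between a loose and an anchored match can be seen in base R. A sketch using the column names from the example; anchoring on the stub plus the separator is a suggested fix, not the package's current code:

```r
nms <- c("id", "A_2001", "B_2001", "B_2007", "AC_2007")

# Loose match: "A" is found anywhere in the name, so AC_2007 is included.
loose <- grep("A", nms, value = TRUE)

# Anchored match: the name must start with the stub followed by the separator.
stub <- "A"
sep <- "_"
anchored <- grep(paste0("^", stub, sep), nms, value = TRUE)
```

When sep is itself a regular expression (for example "\\."), it can be pasted in unchanged; a literal separator containing regex metacharacters would need escaping first.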
