dplyr's Introduction

dplyr

Overview

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • mutate() adds new variables that are functions of existing variables.
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation “by group”. You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").
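As a brief illustration of a two-table verb, the sketch below joins the band_members and band_instruments example datasets that ship with dplyr. It is only a minimal example; vignette("two-table") covers the full set of joins.

```r
library(dplyr)

# Mutating join: keep every row of band_members and add matching columns
# from band_instruments (unmatched rows get NA).
joined <- band_members %>%
  left_join(band_instruments, by = "name")

# Filtering join: keep only the members with no match in band_instruments.
unmatched <- band_members %>%
  anti_join(band_instruments, by = "name")

joined
unmatched
```

The explicit by = "name" argument is optional here, but stating the key avoids relying on dplyr guessing the common columns.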

If you are new to dplyr, the best place to start is the data transformation chapter in R for Data Science.

Backends

In addition to data frames/tibbles, dplyr makes working with other computational backends accessible and efficient. Below is a list of alternative backends:

  • arrow for larger-than-memory datasets, including on remote cloud storage like AWS S3, using the Apache Arrow C++ engine, Acero.

  • dtplyr for large, in-memory datasets. Translates your dplyr code to high performance data.table code.

  • dbplyr for data stored in a relational database. Translates your dplyr code to SQL.

  • duckplyr for using duckdb on large, in-memory datasets with zero extra copies. Translates your dplyr code to high performance duckdb queries with an automatic R fallback when translation isn’t possible.

  • duckdb for large datasets that are still small enough to fit on your computer.

  • sparklyr for very large datasets stored in Apache Spark.
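To give a feel for what "translates your dplyr code to SQL" means, here is a small sketch using dbplyr. It assumes the dbplyr package is installed and uses a simulated SQLite connection, so no real database is involved. The table contents are made up for illustration, and the exact SQL text may differ between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated SQLite connection; no database is needed.
# The column names and values here are invented for illustration.
characters <- lazy_frame(
  name = "C-3PO", species = "Droid", mass = 75,
  con = simulate_sqlite()
)

# Ordinary dplyr verbs; nothing is computed yet.
query <- characters %>%
  filter(species == "Droid") %>%
  select(name, mass)

# Print the SQL that dbplyr would send to the database.
show_query(query)

# The same SQL as a character string, for inspection.
sql_text <- as.character(sql_render(query))
```

With a real connection you would typically start from tbl(con, "table_name") and call collect() to bring results into R.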

Installation

# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")

Development version

To get a bug fix or to use a feature from the development version, you can install the development version of dplyr from GitHub.

# install.packages("pak")
pak::pak("tidyverse/dplyr")

Usage

library(dplyr)

starwars %>% 
  filter(species == "Droid")
#> # A tibble: 6 × 14
#>   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
#>   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
#> 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
#> 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
#> 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
#> 4 IG-88     200   140 none       metal       red               15 none  masculi…
#> 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
#> # ℹ 1 more row
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

starwars %>% 
  select(name, ends_with("color"))
#> # A tibble: 87 × 4
#>   name           hair_color skin_color  eye_color
#>   <chr>          <chr>      <chr>       <chr>    
#> 1 Luke Skywalker blond      fair        blue     
#> 2 C-3PO          <NA>       gold        yellow   
#> 3 R2-D2          <NA>       white, blue red      
#> 4 Darth Vader    none       white       yellow   
#> 5 Leia Organa    brown      light       brown    
#> # ℹ 82 more rows

starwars %>% 
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)
#> # A tibble: 87 × 4
#>   name           height  mass   bmi
#>   <chr>           <int> <dbl> <dbl>
#> 1 Luke Skywalker    172    77  26.0
#> 2 C-3PO             167    75  26.9
#> 3 R2-D2              96    32  34.7
#> 4 Darth Vader       202   136  33.3
#> 5 Leia Organa       150    49  21.8
#> # ℹ 82 more rows

starwars %>% 
  arrange(desc(mass))
#> # A tibble: 87 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Jabba De…    175  1358 <NA>       green-tan… orange         600   herm… mascu…
#> 2 Grievous     216   159 none       brown, wh… green, y…       NA   male  mascu…
#> 3 IG-88        200   140 none       metal      red             15   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> 5 Tarfful      234   136 brown      brown      blue            NA   male  mascu…
#> # ℹ 82 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(
    n > 1,
    mass > 50
  )
#> # A tibble: 9 × 3
#>   species      n  mass
#>   <chr>    <int> <dbl>
#> 1 Droid        6  69.8
#> 2 Gungan       3  74  
#> 3 Human       35  81.3
#> 4 Kaminoan     2  88  
#> 5 Mirialan     2  53.1
#> # ℹ 4 more rows

Getting help

If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub. For questions and other discussion, please use community.rstudio.com or the manipulatr mailing list.


Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

dplyr's People

Contributors

arunsrinivasan, batpigandme, billdenney, cderv, coolbutuseless, cosinequanon, davisvaughan, earowang, eibanez, hadley, hannes, ilarischeinin, javierluraschi, jennybc, jimhester, kevinushey, krlmlr, leondutoit, lindbrook, lionel-, maurolepore, mine-cetinkaya-rundel, pimentel, romainfrancois, s-fleck, salim-b, sfirke, steveharoz, yutannihilation, zeehio

dplyr's Issues

installation with cygwin

Hi, has anyone tried to install dplyr in R on a Windows machine?
Needless to say, the workstation has Cygwin installed, including the make and gcc commands.

I'm constantly getting the error:
ERROR: compilation failed for package 'dplyr'

Dave

Fail to build on Windows 64bit R-studio

* installing *source* package 'dplyr' ...
** libs
g++ -m64 -I"C:/PROGRA1/R/R-301.1/include" -DNDEBUG -I"C:/Users/jta/Documents/R/win-library/3.0/Rcpp/include" -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -mtune=core2 -c RcppExports.cpp -o RcppExports.o
g++ -m64 -I"C:/PROGRA1/R/R-301.1/include" -DNDEBUG -I"C:/Users/jta/Documents/R/win-library/3.0/Rcpp/include" -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -mtune=core2 -c split-indices.cpp -o split-indices.o
g++ -m64 -shared -s -static-libgcc -o dplyr.dll tmp.def RcppExports.o split-indices.o C:/Users/jta/Documents/R/win-library/3.0/Rcpp/lib/x64/libRcpp.a -Ld:/RCompile/CRANpkg/extralibs64/local/lib/x64 -Ld:/RCompile/CRANpkg/extralibs64/local/lib -LC:/PROGRA1/R/R-301.1/bin/x64 -lR
installing to C:/Users/jta/Documents/R/win-library/3.0/dplyr/libs/x64
** R
** data
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error in namespaceExport(ns, exports) :
  undefined exports: join.tbl_sqlite
Error: loading failed
Execution halted
ERROR: loading failed
* removing 'C:/Users/jta/Documents/R/win-library/3.0/dplyr'

I tried installing RSQLite and later RSQLite.extfuns to see whether a missing SQLite dependency was the cause, but neither solved the problem.

I have the devtools package and Rtools installed.

Safer escaping

  • escape_sql -> name
  • build_sql to automatically escape names/calls.

Strict version of translate_sql

A strict version shouldn't fill in arbitrary functions; it could be activated by a global option. It will need translations filled in for a considerable number of simple mathematical functions.

Cannot install dplyr

I'm having trouble installing the dplyr package. I'm running Mac OS X 10.8.5. I've installed Xcode and the command line tools, uninstalled and reinstalled them three or four times, and restarted my computer numerous times, and each time I get this error:

* installing *source* package 'dplyr' ...
** libs
llvm-g++-4.2 -arch x86_64 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.0/Resources/library/Rcpp/include"   -fPIC  -mtune=core2 -g -O2  -c RcppExports.cpp -o RcppExports.o
/bin/sh: llvm-g++-4.2: command not found
make: *** [RcppExports.o] Error 127
ERROR: compilation failed for package 'dplyr'
* removing '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/dplyr'
Error: Command failed (1)

Any ideas? Thanks!

Installation Error Macosx

I get an error when trying to install dplyr via devtools. I also have the latest XCode (5.0)/Command Line Tools.

> install_github("dplyr")
Installing github repo(s) dplyr/master from hadley
Downloading dplyr.zip from https://github.com/hadley/dplyr/archive/master.zip
Installing package from /var/folders/tx/5yp3lm_j6_l076sm5htmvxxm0000gn/T//RtmpXuhEU8/dplyr.zip
Installing dplyr
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL  \
  '/private/var/folders/tx/5yp3lm_j6_l076sm5htmvxxm0000gn/T/RtmpXuhEU8/dplyr-master'  \
  --library='/Library/Frameworks/R.framework/Versions/3.0/Resources/library' --with-keep.source --install-tests 

* installing *source* package 'dplyr' ...
** libs
llvm-g++-4.2 -arch x86_64 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.0/Resources/library/Rcpp/include"   -fPIC  -mtune=core2 -g -O2  -c RcppExports.cpp -o RcppExports.o
llvm-g++-4.2 -arch x86_64 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.0/Resources/library/Rcpp/include"   -fPIC  -mtune=core2 -g -O2  -c split-indices.cpp -o split-indices.o
llvm-g++-4.2 -arch x86_64 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -L/usr/local/lib -o dplyr.so RcppExports.o split-indices.o /Library/Frameworks/R.framework/Versions/3.0/Resources/library/Rcpp/lib/libRcpp.a -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
installing to /Library/Frameworks/R.framework/Versions/3.0/Resources/library/dplyr/libs
** R
** data
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error in namespaceExport(ns, exports) : 
  undefined exports: join.tbl_sqlite
Error: loading failed
Execution halted
ERROR: loading failed
* removing '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/dplyr'
Error: Command failed (1)

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] devtools_1.3 vimcom_0.9-8

loaded via a namespace (and not attached):
[1] digest_0.6.3   evaluate_0.4.7 httr_0.2       memoise_0.1    parallel_3.0.1 RCurl_1.95-4.1 stringr_0.6.2 
[8] tools_3.0.1    whisker_0.3-2 

Simplify generated sql

By recognising that from can be:

  • table name
  • join
  • select

And when it's a table name or a join, you can omit a layer of selects.

Implement regroup

So that you can easily ungroup, do some operation and then regroup.
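A rough sketch of this ungroup/operate/regroup pattern, written with functions from current dplyr releases rather than the regroup() proposed in this issue: group_vars() records the grouping columns, and group_by(across(all_of(...))) restores them from a character vector. The overall_rank column is a made-up example operation.

```r
library(dplyr)

by_species <- starwars %>%
  group_by(species)

# Remember the grouping variables as a character vector.
vars <- group_vars(by_species)

result <- by_species %>%
  ungroup() %>%
  # An operation that should ignore the groups: rank by mass across all rows.
  mutate(overall_rank = min_rank(desc(mass))) %>%
  # Restore the original grouping from the saved variable names.
  group_by(across(all_of(vars)))

group_vars(result)
```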

Develop flexible SQL backend

Packages reviewed: RPostgreSQL, RMySQL, MonetDB.R, RODBC, RJDBC

Differences:

  • RPostgreSQL: has dbApply which could be used to implement do, multiple open connections, prepared queries use special syntax, windowing, explain output is json, subqueries need to be named, ANALYZE called automatically (but not on temporary tables)
  • MonetDB.R: looks like prepared statements might be supported through dbSendQuery, no semi_joins?
  • RMySQL: also has dbApply, no prepared queries
  • RODBC: non-DBI, manual transactions

Other things that I may need to make generic:

  • variable name escaping
  • translation of R variable types to db types (for creation)

Steps to make dplyr adapt to different sql variants:

  • Make query a base class and add subclasses for other databases
  • Turn tbl_sqlite object into a tbl_sql object - method would dispatch on src/con

Changes needed:

  • Query needs to become base class
  • Existing tbl_sqlite methods should become tbl_sql methods
  • When implementing first non-sqlite sql adaptor, gradually change

Postgresql and MonetDB clearly the most important and should be tackled first. MonetDB might be simpler to start with since it implements a more limited subset of sql.

Bigquery backend

(Some initial code at http://code.google.com/p/google-bigquery-r-client/source/browse/googlebigquery/R/bigquery_client.R)

Querying differences:

  • can't use *
  • can use within/flatten for nested data (out of scope for dplyr?)
  • table unions might be more important
  • join for small databases (<8 meg), join each for larger
  • group by requires all variables to be in select

Other types of grouping

  • bootstrap
  • binning (continuous data)
  • moving window/shingles
  • accumulating window
  • individual rows
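Of the ideas above, binning of continuous data can already be approximated by grouping on a computed column. The sketch below is one possible approach using base R's cut(); the break points are arbitrary values chosen for illustration, not something proposed in this issue.

```r
library(dplyr)

binned <- starwars %>%
  filter(!is.na(height)) %>%
  # Group on a computed factor: heights binned into (0,100], (100,200], (200,300].
  group_by(height_bin = cut(height, breaks = c(0, 100, 200, 300))) %>%
  summarise(
    n = n(),
    mean_mass = mean(mass, na.rm = TRUE)
  )

binned
```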

Error install dplyr package on Mac

I have been trying to install "dplyr" using "devtools" on my Mac and keep getting this output and error.

devtools::install_github("dplyr")

Installing github repo(s) dplyr/master from hadley
Installing dplyr.zip from https://github.com/hadley/dplyr/archive/master.zip
Installing dplyr
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL
'/private/var/folders/Mn/MnBlvG3QH+eJFvQUQ38vvk+++TI/-Tmp-/RtmpwOIX5d/dplyr-master'
--library='/Library/Frameworks/R.framework/Versions/3.0/Resources/library' --with-keep.source

* installing *source* package 'dplyr' ...
** libs
sh: make: command not found
ERROR: compilation failed for package 'dplyr'
* removing '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/dplyr'
Error: Command failed (1)

Here is the sessionInfo()

R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] graphics grDevices utils datasets grid stats methods base

other attached packages:
[1] devtools_1.2

loaded via a namespace (and not attached):
[1] colorspace_1.2-2 dichromat_2.0-0 digest_0.6.3 evaluate_0.4.3 formatR_0.8 gtable_0.1.2
[7] httr_0.2 labeling_0.2 lattice_0.20-15 MASS_7.3-26 memoise_0.1 munsell_0.4
[13] parallel_3.0.1 plyr_1.8 proto_0.3-10 rCharts_0.3.5 RColorBrewer_1.0-5 RCurl_1.95-4.1
[19] reshape2_1.2.2 RJSONIO_1.0-3 stringr_0.6.2 tools_3.0.1 whisker_0.3-2 yaml_2.1.7

Loading "assertthat" seems to work fine using devtools::install_github("assertthat"), so I'm not sure what the error is. "plyr" is loaded, but not attached so I'm not sure if that's the issue. Maybe I'm missing something.

Mutate/summarise bugs

From Dave Cooper:

# mutate and summarize bugs.R

rm(list = ls())

require(plyr)
require(ggplot2)

d = diamonds

#########################
# MUTATE BUG
# this works
e = mutate(d,
  cut2 = as.character(cut) )
e = mutate(e,
  cut2 = ifelse(cut2 == 'Fair', '*Fairest*', cut2))
plyr::count(e, ~cut+cut2)

# this fails
e = mutate(d,
  cut2 = as.character(cut),
  cut2 = ifelse(cut2 == 'Fair', '*Fairest*', cut2))
plyr::count(e, ~cut+cut2)

# but this works!
e = mutate(d,
  cut2 = as.character(cut),
  cut3 = ifelse(cut2 == 'Fair', '*Fairest*', cut2))
plyr::count(e, ~cut+cut2+cut3)

#########################
# SAME PROBLEM, BUT WITH SUMMARIZE
# this works
e = summarize(d,
  cut2 = as.character(cut) )
e = summarize(e,
  cut2 = ifelse(cut2 == 'Fair', '*Fairest*', cut2))
count(e, ~cut2)

# this fails
e = summarize(d,
  cut2 = as.character(cut),
  cut2 = ifelse(cut2 == 'Fair', '*Fairest*', cut2))
count(e, ~cut2)

# but this works!
e = summarize(d,
  cut2 = as.character(cut),
  cut3 = ifelse(cut2 == 'Fair', '*Fairest*', cut2))
count(e, ~cut2+cut3)


#########################
# MUTATE with REVALUE
# this works
e = mutate(d,
 cut2 = revalue(cut, c(Fair = '*Fairest*')) )
e = mutate(e,
 cut2 = revalue(cut2, c('*Fairest*' = '*Fairest of All*')) )
count(e, ~cut+cut2)

# this fails
e = mutate(d,
 cut2 = revalue(cut, c(Good = '*Fairest*')),
 cut2 = revalue(cut2, c('*Fairest*' = '*Fairest of All*')) )
count(e, ~cut+cut2)

# but this works!
e = mutate(d,
 cut2 = revalue(cut, c(Good = '*Fairest*')),
 cut3 = revalue(cut2, c('*Fairest*' = '*Fairest of All*')) )
count(e, ~cut+cut2+cut3)

Failed to install

devtools::install_github("dplyr")

...
** testing if installed package can be loaded
Error : object 'as.data.table' not found whilst loading namespace 'dplyr'
Error: loading failed
Execution halted
ERROR: loading failed
* removing 'D:/Software/R/R/win-library/3.0/dplyr'
Error: Command failed (1)

sessionInfo()

R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
loaded via a namespace (and not attached):
[1] devtools_1.3   digest_0.6.3   evaluate_0.5.1 httr_0.2       memoise_0.1   
[6] parallel_3.0.2 RCurl_1.95-4.1 stringr_0.6.2  tools_3.0.2    whisker_0.3-2 

RTools 3.0

I quickly looked in the code, but could find nothing strange. Any ideas?

Printing sources should display succinct column info

From Dave Cooper:

h = function(x, r=6, c=8) {
  if (is.null(dim(x))) {
    cat(format(x[1:min(r, length(x))], digits=2), '...')
    cat('\nlength=', length(x), ', class=', class(x), '\n', sep='')
  }
  else {
    print(x[1:min(r, nrow(x)), 1:min(c, ncol(x))], digits=2)
    cat('dim = (', dim(x)[1], ', ', dim(x)[2], '), class=', class(x), '\n', sep='')
  }
}
