tidyverse / stringr

A fresh approach to string manipulation in R

Home Page: https://stringr.tidyverse.org

License: Other

R 99.60% CSS 0.21% JavaScript 0.18%

r strings regular-expression

stringr's Introduction

stringr

Overview

Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible. If you’re not familiar with strings, the best place to start is the chapter on strings in R for Data Science.

stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. stringr focusses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine. If you find that stringr is missing a function that you need, try looking in stringi. Both packages share similar conventions, so once you’ve mastered stringr, you should find stringi similarly easy to use.

Installation

# The easiest way to get stringr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just stringr:
install.packages("stringr")

Cheatsheet

Usage

All functions in stringr start with str_ and take a vector of strings as the first argument:

x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x) 
#> [1] 3 5 5 5 4 9
str_c(x, collapse = ", ")
#> [1] "why, video, cross, extra, deal, authority"
str_sub(x, 1, 2)
#> [1] "wh" "vi" "cr" "ex" "de" "au"

Most string functions work with regular expressions, a concise language for describing patterns of text. For example, the regular expression "[aeiou]" matches any single character that is a vowel:

str_subset(x, "[aeiou]")
#> [1] "video"     "cross"     "extra"     "deal"      "authority"
str_count(x, "[aeiou]")
#> [1] 0 3 1 2 2 4

There are eight main verbs that work with patterns:

str_detect(x, pattern) tells you if there’s any match to the pattern:
```
str_detect(x, "[aeiou]")
#> [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
```
str_count(x, pattern) counts the number of matches:
```
str_count(x, "[aeiou]")
#> [1] 0 3 1 2 2 4
```

str_subset(x, pattern) extracts the matching components:

str_subset(x, "[aeiou]")
#> [1] "video"     "cross"     "extra"     "deal"      "authority"

str_locate(x, pattern) gives the position of the match:

str_locate(x, "[aeiou]")
#>      start end
#> [1,]    NA  NA
#> [2,]     2   2
#> [3,]     3   3
#> [4,]     1   1
#> [5,]     2   2
#> [6,]     1   1

str_extract(x, pattern) extracts the text of the match:

str_extract(x, "[aeiou]")
#> [1] NA  "i" "o" "e" "e" "a"

str_match(x, pattern) extracts parts of the match defined by parentheses:

# extract the characters on either side of the vowel
str_match(x, "(.)[aeiou](.)")
#>      [,1]  [,2] [,3]
#> [1,] NA    NA   NA  
#> [2,] "vid" "v"  "d" 
#> [3,] "ros" "r"  "s" 
#> [4,] NA    NA   NA  
#> [5,] "dea" "d"  "a" 
#> [6,] "aut" "a"  "t"

str_replace(x, pattern, replacement) replaces the matches with new text:

str_replace(x, "[aeiou]", "?")
#> [1] "why"       "v?deo"     "cr?ss"     "?xtra"     "d?al"      "?uthority"

str_split(x, pattern) splits up a string into multiple pieces:

str_split(c("a,b", "c,d,e"), ",")
#> [[1]]
#> [1] "a" "b"
#> 
#> [[2]]
#> [1] "c" "d" "e"

As well as regular expressions (the default), there are three other pattern matching engines:

fixed(): match exact bytes
coll(): match human letters
boundary(): match boundaries
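
As a rough illustration of how these engines differ, here is a minimal sketch. It assumes stringr is installed, and the sample strings are made up for this example:

```r
library(stringr)

x <- "a.b.c"

# Default regex engine: "." is a wildcard, so every character matches.
str_count(x, ".")
#> [1] 5

# fixed(): "." is compared literally, so only the two dots match.
str_count(x, fixed("."))
#> [1] 2

# coll(): comparison follows collation rules; here case is ignored.
str_detect("I", coll("i", ignore_case = TRUE))
#> [1] TRUE

# boundary(): match on boundaries, here the ones that delimit words.
str_count("This is a sentence.", boundary("word"))
#> [1] 4
```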

RStudio Addin

The RegExplain RStudio addin provides a friendly interface for working with regular expressions and functions from stringr. This addin allows you to interactively build your regexp, check the output of common string matching functions, consult the interactive help pages, or use the included resources to learn regular expressions.

This addin can easily be installed with devtools:

# install.packages("devtools")
devtools::install_github("gadenbuie/regexplain")

Compared to base R

R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R.

In contrast, stringr:

Uses consistent function and argument names. The first argument is always the vector of strings to modify, which makes stringr work particularly well in conjunction with the pipe:
```
letters %>%
  .[1:10] %>% 
  str_pad(3, "right") %>%
  str_c(letters[2:11])
#>  [1] "a  b" "b  c" "c  d" "d  e" "e  f" "f  g" "g  h" "h  i" "i  j" "j  k"
```
Simplifies string operations by eliminating options that you don’t need 95% of the time.
Produces outputs that can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero length inputs result in zero length outputs.

Learn more in vignette("from-base")
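
The rule about missing and zero length inputs can be checked with a short sketch, assuming stringr is installed and using str_length as a representative function:

```r
library(stringr)

# A missing input gives a missing output.
str_length(c("a", NA))
#> [1]  1 NA

# A zero length input gives a zero length output.
str_length(character(0))
#> integer(0)
```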


stringr's Issues

unexpected behavior when passing "\" to str_sub

Issue:

I am not sure if this is expected behavior, but I was testing out some possible solutions to a question on stackoverflow and got the following behavior:

Apologies in advance for abusing the function. I hope it helps.

Examples:

> str_length("\123")
[1] 1
> str_sub("\123", -1)
[1] "S"
> str_sub("\123", -20)
[1] "S"
> str_sub("\123123", end = -11)
[1] ""
> str_sub("\123123", end = -1)
[1] "S123"
> str_sub("\001", end = -1)
[1] "\001"
> str_sub("\001", end = -0)
[1] ""
> str_sub("\001", end = -2)
[1] ""
> str_sub("\001", end = 1)
[1] "\001"
> str_sub("\001", end = 6)
[1] "\001"
> str_sub("\00001", end = 6)
Error: embedded nul in string: '\001'

related feature request:

Options to simplify handling escapes would be a great feature; e.g. see these other SO questions:

Can R paste() output “\”?

Replacing escaped double quotes by double quotes in R

How to gsub('%', '%', … in R?
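
For context, much of the output above seems to follow from how R parses string literals before any stringr function sees them. A minimal base R sketch (no stringr calls involved):

```r
# "\123" is an octal escape: octal 123 is code point 83, the letter "S".
identical("\123", "S")
#> [1] TRUE

# So "\123123" is "S" followed by the three characters "123".
nchar("\123123")
#> [1] 4

# A literal backslash has to be written as "\\"; it is one character long.
nchar("\\")
#> [1] 1
writeLines("\\")
#> \
```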

`str_match` does not work well with non-capturing groups

str_match(state.name, "^(?:Ala|Mas).*(.)$")[1:3,]
     [,1]      [,2] [,3]
[1,] "Alabama" "a"  "Alabama"
[2,] "Alaska"  "a"  "Alaska"
[3,] NA        NA   NA
Warning message:
In rbind(c("Alabama", "a"), c("Alaska", "a"), c(NA_character_, NA_character_, :
  number of columns of result is not a multiple of vector length (arg 1)

The problem appears to be that, for non-matching rows, the number of groups is counted in a way that includes the non-capturing group. I think this is the problem line, since it also counts the parenthesis of a non-capturing group:

n <- str_length(str_replace_all(tmp, "[^(]", "")) + 1

A possible fix is to add this line just before to remove the non-capturing paren:

tmp <- str_replace_all(tmp, "(?:", "")

which appears to work, but I have not tested it thoroughly at all.

This is on version 0.6.2

str_match mistakes "(" in a character class for the beginning of a group

The group identification behavior in str_match requires the ( character to be escaped in character classes, in contrast to the group identification behavior in base R.

For example, with gsub, capturing a ( in the group does not require escaping it if it is in a character class:

gsub("([(]...[)])","123", c("(abc)", "xyz"))
 [1] "123" "xyz"

but it does with str_match

str_match(c("(abc)", "xyz"), "([(]...[)])")
      [,1]    [,2]    [,3]
 [1,] "(abc)" "(abc)" "(abc)"
 [2,] NA      NA      NA
 Warning message:
 In rbind(c("(abc)", "(abc)"), c(NA_character_, NA_character_, NA_character_ :
   number of columns of result is not a multiple of vector length (arg 1)

While it is possible to get around this by explicitly escaping \\( like this

str_match(c("(abc)", "xyz"), "([\\(]...[\\)])")

the documentation says that the syntax should be consistent with base R.

Possible unintended behaviour of invert_match

If you use str_locate_all() on a string with consecutive matches, e.g.
str_locate_all(c("hello"), c("l"))
and then pass the result to invert_match(), the resulting matrix contains a row that is potentially problematic:
invert_match(str_locate_all(c("hello"), c("l"))[[1]])
gives

     start end
[1,]     0   2
[2,]     4   3
[3,]     5  -1

Row 2 is odd: the region starts at position 4 and ends at position 3. Perhaps this behaviour is intended but perhaps not. I would have expected the matrix to be 2x2, since there are two regions with non-matched characters, "he" and "o". The zero-length "match" between the "l"s could be unexpected for some users.
I believe that this case should at least be addressed in the help file to manage users' expectations of how the function behaves, or possibly corrected if the function isn't intended to produce that sort of result.
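
One way a caller could work around the zero-length row is to filter it out afterwards. The following base R sketch is only an illustration: the match locations are written out by hand, the way `gaps` is built is an assumption chosen to reproduce the matrix shown above, and `drop_empty_gaps` is a hypothetical helper, not part of stringr.

```r
# Match locations of "l" in "hello", written out by hand so that the
# sketch does not depend on stringr.
loc <- cbind(start = c(3L, 4L), end = c(3L, 4L))

# Gaps between matches, laid out like the matrix shown above:
# 0 marks the start of the string and -1 marks its end.
gaps <- cbind(
  start = c(0L, loc[, "end"] + 1L),
  end   = c(loc[, "start"] - 1L, -1L)
)

# Hypothetical helper: keep only the gaps that contain at least one character.
drop_empty_gaps <- function(gaps, string_length) {
  start <- pmax(gaps[, "start"], 1L)
  end <- ifelse(gaps[, "end"] < 0L, string_length, gaps[, "end"])
  gaps[end >= start, , drop = FALSE]
}

drop_empty_gaps(gaps, nchar("hello"))
#>      start end
#> [1,]     0   2
#> [2,]     5  -1
```

The two remaining rows correspond to "he" and "o", which is the 2x2 result described above.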

R CMD check failed: The requested ICU resource file cannot be found

Travis has been failing, but your Travis script failed to capture the failure for some reason: https://travis-ci.org/hadley/stringr/builds/61152178. I noticed it because my knitr repo started to fail after stringr was upgraded (https://travis-ci.org/yihui/knitr/jobs/61492954):

The requested ICU resource file cannot be found. Possible problem: ICU data has not been downloaded yet. Call stri_install_check(). (U_FILE_ACCESS_ERROR)

stringr has the same error, which I don't completely understand. This may be related to #52.

BTW, your most recent check also failed (for a different reason): https://travis-ci.org/hadley/stringr/builds/61467874 and Travis failed to capture that one, too.

str_sort and str_reverse

I think stringr will be better for having two functions added to it:

str_sort to sort each element of a string, e.g. str_sort(c("cba", "zxy", "fge")) will return c("abc", "xyz", "efg")
str_reverse to reverse the characters in each string, e.g. str_reverse(c("abcde", "fghij")) will return c("edcba", "jihgf")

I am prepared to contribute the functions, documentation and test_that code if you think this is a good idea.
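
A minimal base R sketch of the two proposed behaviours might look like this. The names `sort_chars` and `reverse_chars` are hypothetical and chosen to avoid implying any existing stringr function; missing values are not handled.

```r
# Hypothetical helper: sort the characters within each string.
sort_chars <- function(x) {
  vapply(
    strsplit(x, ""),
    function(chars) paste(sort(chars), collapse = ""),
    character(1)
  )
}

# Hypothetical helper: reverse the characters within each string.
reverse_chars <- function(x) {
  vapply(
    strsplit(x, ""),
    function(chars) paste(rev(chars), collapse = ""),
    character(1)
  )
}

sort_chars(c("cba", "zxy", "fge"))
#> [1] "abc" "xyz" "efg"
reverse_chars(c("abcde", "fghij"))
#> [1] "edcba" "jihgf"
```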

str_match with non-capturing groups is broken in release version

This minimal case demonstrates the problem (a bunch of non-capturing groups have been appended to make the problem obvious). The problem only occurs when at least one string does not match.

library(magrittr)
library(stringr)

x <- c("A_B_C", "THIS DOES NOT MATCH")
matcher <- regexec("(A)_(B)_(?:C)", x)
matches <- regmatches(x, matcher) %>% print
x %>% str_match("(A)_(B+)_(?:C)(?:.)?(?:.)?(?:.)?(?:.)?(?:.)?(?:.)?(?:.)?")

The last line produces this result:

> x %>% str_match("(A)_(B+)_(?:C)(?:.)?(?:.)?(?:.)?(?:.)?(?:.)?(?:.)?(?:.)?")
     [,1]    [,2] [,3] [,4]    [,5] [,6] [,7]    [,8] [,9] [,10]   [,11]
[1,] "A_B_C" "A"  "B"  "A_B_C" "A"  "B"  "A_B_C" "A"  "B"  "A_B_C" "A"  
[2,] NA      NA   NA   NA      NA   NA   NA      NA   NA   NA      NA   
Warning message:
In rbind(c("A_B_C", "A", "B"), c(NA_character_, NA_character_, NA_character_,  :
  number of columns of result is not a multiple of vector length (arg 1)

The bug is in these lines:

    tmp <- str_replace_all(pattern, "\\\\\\(", "")
    n <- str_length(str_replace_all(tmp, "[^(]", "")) + 1

which attempt to count the number of capture groups, but fail to exclude non-capturing groups. (Thinking about it, they probably also fail to include a capture group preceded by an even number of backslashes.)

I realize this code has all been replaced by stringi in the devel version, but if you're still maintaining the release version, it would be good to fix this.
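
To make the counting problem concrete, here is a rough base R sketch of a counter that skips non-capturing groups. `count_capture_groups` is a hypothetical helper, not the code used by any stringr release. It is deliberately simple: it still miscounts a "(" inside a character class such as "[(]", and it ignores named groups written as "(?<name>...)".

```r
# Hypothetical helper: count "(" that open a capturing group.
count_capture_groups <- function(pattern) {
  # Drop every escaped character (backslash plus the next character), so that
  # "\\(" and "\\\\" cannot be mistaken for group openers.
  unescaped <- gsub("\\\\.", "", pattern)
  # Count "(" that is not followed by "?", i.e. not "(?:", "(?=", "(?!" ...
  hits <- gregexpr("\\((?!\\?)", unescaped, perl = TRUE)[[1]]
  sum(hits > 0)
}

count_capture_groups("(A)_(B+)_(?:C)(?:.)?(?:.)?")
#> [1] 2
count_capture_groups("^(?:Ala|Mas).*(.)$")
#> [1] 1
count_capture_groups("\\((a)")
#> [1] 1
```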

Lookaheads might be unsupported?

I'm trying to pull the first 2 characters before an underscore out of a string, if an underscore exists. So, for:

mystr <- "cp_awesome"

I'm just trying to get "cp"

> str_extract(mystr, "[a-z]{2}(?=_)")
Error in regexpr("[a-z]{2}(?=_)", "cp_awesome", fixed = FALSE,  : 
  invalid regular expression '[a-z]{2}(?=_)', reason 'Invalid regexp'

fails, but

> str_extract(mystr, "[a-z]{2}(?:_)")
[1] "cp_"

succeeds, but (?:) acts as a plain grouping construct instead, so the underscore is included in the match. stringr seems to be rejecting the (?=) syntax.
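
For comparison, base R accepts the lookahead when the Perl-compatible engine is requested. This is a sketch in base R only; whether a given stringr version accepts the same pattern depends on the regex engine it is built on.

```r
mystr <- "cp_awesome"

# perl = TRUE switches to PCRE, which supports lookahead assertions.
regmatches(mystr, regexpr("[a-z]{2}(?=_)", mystr, perl = TRUE))
#> [1] "cp"
```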

Vectorisation problems

(From Stavros)

Your email said "all functions now vectorised with respect to string, pattern (and where appropriate) replacement parameters".

The doc for str_extract does not reflect this change; it says: "'pattern' should be a single pattern", though in fact it does vectorize over pattern:

> str_extract(c('abc'),c('.','..'))
[1] "a"  "ab"

On the other hand, str_extract_all is buggy:

> str_extract_all(c('abcd'),c('.','..'))
[[1]]
[1] "a" "b" "c" "d"          <<<<<<<< what happened to the matches for '..'?

But when we duplicate the string part, we get the correct result:

> str_extract_all(rep(c('abcd'),2),c('.','..'))
[[1]]
[1] "a" "b" "c" "d"

[[2]]
[1] "ab" "cd"

In str_match, the doc says that pattern should be a single pattern, and I get an error message if it isn't, but the result seems to use both patterns:

> str_match(c('abc','xy'),c('(.)','(..)'))
     [,1] [,2]
[1,] "a"  "a" 
[2,] "xy" "xy"
Warning messages:
1: In if (n == 0) { :
  the condition has length > 1 and only the first element will be used
2: In seq_len(n) : first element used of 'length.out' argument

check fails

Hi Hadley!

While trying to add my two little functions to stringr (as discussed by mail some time ago), I found the following problem when checking the original version first:
If I build stringr using RStudio, everything works as expected, but the check fails with the following error:

==> roxygenize('.', roclets=c('rd', 'collate', 'namespace'))

* checking for changes ... ERROR

Error in stri_replace_all_regex(string, pattern, replacement, vectorize_all = vec,  : 
  Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET)

From the command line, R CMD check fails with this output:

...
* checking examples ... ERROR
Running examples in ‘stringr-Ex.R’ failed
The error most likely occurred in:

> ### Name: case
> ### Title: Convert case of a string.
> ### Aliases: case str_to_lower str_to_title str_to_upper
>
> ### ** Examples
>
> dog <- "The quick brown dog"
> str_to_upper(dog)
[1] "THE QUICK BROWN DOG"
> str_to_lower(dog)
[1] "the quick brown dog"
> str_to_title(dog)
Error in stri_trans_totitle(string, opts_brkiter = stri_opts_brkiter(locale = locale)) :
  The requested ICU resource cannot be found. Possible problem: ICU data has not been downloaded yet. Call `stri_install_check()`. (U_MISSING_RESOURCE_ERROR)
Calls: str_to_title -> stri_trans_totitle -> .Call
Execution halted

Starting R to check the mentioned possible problem gives

> library(stringi)
> stri_install_check()
stringi_0.5.1; en_US.UTF-8; ICU4C 51.2; Unicode 6.2
All tests completed successfully.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-suse-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] stringi_0.5-1

loaded via a namespace (and not attached):
[1] tools_3.1.1

Cheers,
Gerhard

list in str_c

You may have considered and rejected this idea, but there are a few cases for me where it would be useful to pass a list containing only character vectors of length one. Is this something you want to support? paste currently does handle this.
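
Until lists are supported directly, two workarounds are possible. This is a sketch assuming stringr is installed; `parts` is an illustrative variable name.

```r
library(stringr)

parts <- list("a", "b", "c")

# Spread the list elements over str_c's arguments.
do.call(str_c, parts)
#> [1] "abc"

# Or flatten the list to a character vector first and collapse it.
str_c(unlist(parts), collapse = "")
#> [1] "abc"
```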

Missing stringi dependency

Since stringr now depends on stringi, the latter should be listed in the "Depends:" field of DESCRIPTION and not only in "Imports:".
In my case, because stringi was not installed, update.packages() failed on stringr.

str_replace and str_replace_all should take functions

as third argument instead of fixed text.
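
The requested behaviour can be approximated in base R with regmatches<-. In this sketch, `replace_with` is a hypothetical helper name, and `f` is assumed to be a function that takes the matched strings and returns replacements of the same length.

```r
# Hypothetical helper: replace every match of `pattern` with f(match).
replace_with <- function(x, pattern, f) {
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), f)
  x
}

replace_with(c("abc", "def"), "[aeiou]", toupper)
#> [1] "Abc" "dEf"
```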

Implementing a str_elide function for stringr

I was looking for an elide function that could shorten long strings by replacing the overly long middle part with “…”. Since I couldn’t quickly find one for R (I couldn't find one in the stringi package either), I wrote my own. I think others may also be interested in it, and I would appreciate it if you could incorporate it into your package. Below is my implementation (which is public domain licensed).

str_elide <- function(s, length = 20, elideText = "...") {
  el <- str_length(elideText)
  l <- (length %/% 2) - (el %/% 2)
  s1 <- str_sub(s, 1, l)
  s2 <- str_sub(s, str_length(s) - (length - el - l) + 1, str_length(s))
  s12 <- paste0(s1, elideText, s2)
  ifelse(str_length(s) > length, s12, s)
}

Feature request : adding %.% operator to concatenate strings

Code that uses paste or str_c quickly becomes hard to read. Other languages have some kind of operator for pasting strings together, like '.' or '+'. It would be nice to have such a thing here as well, e.g.:

'%.%' <- function(a, b) paste0(a, b)
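
A short usage sketch of such an operator, defined here in base R only and not part of stringr:

```r
# User-defined infix operator that concatenates without a separator.
`%.%` <- function(a, b) paste0(a, b)

"foo" %.% "bar" %.% "baz"
#> [1] "foobarbaz"

# It is vectorised in the same way as paste0.
c("a", "b") %.% "_x"
#> [1] "a_x" "b_x"
```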

Add tools for non-ASCII charsets

e.g. guess encoding function, and stuff based on charToRaw

str_wrap

Should work similarly to strwrap but should return strings combined with newlines.
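
A minimal sketch of that behaviour in base R, under the assumption that each input string should map to exactly one output string. `wrap_lines` is a hypothetical name used only for this illustration.

```r
# Hypothetical helper: wrap each string with strwrap(), then glue the
# resulting lines back together with newlines.
wrap_lines <- function(x, width = 80) {
  vapply(
    x,
    function(s) paste(strwrap(s, width = width), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
}

out <- wrap_lines("aaa bbb ccc", width = 8)
writeLines(out)
```

Because the lines are collapsed per element, the result has the same length as the input, including for empty strings.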

NA treatment

Often I want str_c (and friends) to treat missing values the way sum does.

That is, sum(NA, 2) yields NA, and I can write sum(NA, 2, na.rm = TRUE) to get 2.
str_c(NA, 2) yields "NA2". The shortest way I've found to get NA instead is
df[, strung_together := ifelse(is.na(col1) | is.na(col2), NA, str_c(col1, col2))]

So it would be cool to have str_c(col1, col2, na.rm = FALSE) return NA.

Verify str_split_fixed behaviour

On empty strings and zero-length character vectors

documentation: ?stringr

version 1.0.0
?stringr gives an almost empty help page:

Fast and friendly string manipulation.

Description

Fast and friendly string manipulation.

I think it would be nice if ?stringr listed the functions of the stringr package and referred the reader to the help pages of the specific functions and to the vignette.

Add support for named capture groups

Base R supports Python-style named capture groups with the perl option.

pat <- '-(?<food>[a-z]+)-'
string <- '-bacon-'
regexpr(pat, string, perl=TRUE)

It would be great to be able to use these patterns with stringr. Right now, a pattern such as this generates an error:

str_match_all(string, regex(pat))
str_match_all(string, perl(pat))
# Error in stri_match_all_regex(string, pattern, cg_missing = "", omit_no_match = TRUE,  : 
#   Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
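
For reference, this is roughly how the named group can be pulled out in base R today. It is a sketch that relies on the capture attributes regexpr returns when perl = TRUE; the variable names are illustrative.

```r
pat <- "-(?<food>[a-z]+)-"
string <- "-bacon-"

m <- regexpr(pat, string, perl = TRUE)

# With perl = TRUE, the match carries per-group start and length matrices
# whose columns are named after the capture groups.
start <- attr(m, "capture.start")[, "food"]
len <- attr(m, "capture.length")[, "food"]

food <- substring(string, start, start + len - 1)
food
```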

str_pad() breaks with NAs as input

The current version of stringr's str_pad() function (stringr_0.6.2 on R_3.1.3 on Win8) behaves unexpectedly for NA inputs:

str_pad(NA, 2, "left", 0)
## Error in rep.int(string[i], times[i]) : invalid 'times' value

... instead of giving back NA.

That being said, the current version on GitHub (via ...

devtools::install_github("Rexamine/stringi")
devtools::install_github("hadley/stringr")

... ) does behave as I would have expected by giving back NA output whenever there is NA input.

str_zpad

str_zpad <- function(string, width = max(str_length(string)), side = "left", pad = "0")
  str_pad(string, width, side, pad)

Idea from Bill Venables

Match locations in str_locate_all

Is this intended behavior?

[[1]]
     start end
[1,]     1   0
[2,]     2   1
[3,]     3   2
[4,]     4   3
[5,]     5   4

Because, to me, this shouldn't be:

[[1]]
[1] ""  "h" "e" "l" "l" "o"

(Note the first value in the vector is an empty string. I would expect "h", "e", "l", "l", "o".)

Incorrect error messages

For dev version: 0.9.0.9000

Error messages ask the user to use a regexp function, but the function appears to be named regex. For example:

> packageVersion('stringr')
[1] "0.9.0.9000"

> perl("test")
perl is deprecated. Please use regexp instead
...

> ignore.case("test")
Please use (fixed|coll|regexp)(x, ignore_case = TRUE) instead of ignore.case(x)
...
>

And this:

type.regexp <- function(x) "regex"

It seems that there is some more general confusion between regex and regexp. Outside of R, I believe regex is more common. I would nominate using regex.

Problem with str_detect and perl pattern

This function is based on the grepl function, but in the str_detect interface I can't find a way to pass the extended options, i.e. perl = FALSE, value = FALSE.

Suppose I need to detect a simple pattern like:

s1 <- paste("(?!NEGATE_TERMS_I_DONT_HAVE_IT) term1 term2", sep="")

When I try to use:

isDetect <- str_detect("string to match", s1)

I get an error:

Error in grepl("(?!NEGATE_TERMS_I_DONT_HAVE_IT) term1 term2", "Sd",
fixed = FALSE, :
invalid regular expression '(?!NEGATE_TERMS_I_DONT_HAVE_IT) term1 term2'
In addition: Warning message:
In grepl("(?!NEGATE_TERMS_I_DONT_HAVE_IT) term1 term2", "Sd", fixed = FALSE, :
regcomp error: 'Invalid regexp'

So I must use the standard grepl function:

isDetect <- grepl(s1, "string to match", TRUE, TRUE)

I think it would be useful for str_detect to accept parameters like perl = FALSE, value = FALSE.

str_wrap Bug with Empty String Input

There was a bug introduced in the latest version of stringr in the str_wrap function. In previous versions (stringr_0.6.2), if an empty string was passed as input, the function worked fine, but stringr_1.0.0 throws an error.

Example Code

> library(stringr)
> str_wrap("",width=5)

Error in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE) : 
  argument `...` should be a character vector (or an object coercible to)

Session Info

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] graphics  grDevices datasets  stats     utils     methods   base     

other attached packages:
[1] stringr_1.0.0

loaded via a namespace (and not attached):
[1] magrittr_1.5  stringi_0.4-1 tools_3.1.1

New feature request:str_between

Per your callout
Excuse me if it is already there under another guise
e.g.
myText <- "1-10 of 1,224 reviews"

res <- str_between(myText,"of "," reviews")

res # 1,224

It would be the cherry on top to have a toInteger parameter available to result in 1224
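
A rough base R sketch of the requested behaviour follows. `between` is a hypothetical helper, both delimiters are treated as fixed strings rather than regular expressions, and the integer conversion is shown as a separate step.

```r
# Hypothetical helper: text between the first `left` and the next `right`.
between <- function(x, left, right) {
  m_left <- regexpr(left, x, fixed = TRUE)
  # Position just after the left delimiter.
  from <- as.integer(m_left) + attr(m_left, "match.length")
  rest <- substring(x, from)
  m_right <- as.integer(regexpr(right, rest, fixed = TRUE))
  found <- as.integer(m_left) > 0L & m_right > 0L
  ifelse(found, substring(rest, 1L, m_right - 1L), NA_character_)
}

myText <- "1-10 of 1,224 reviews"
res <- between(myText, "of ", " reviews")
res
#> [1] "1,224"

# Strip the thousands separator before converting to an integer.
as.integer(gsub(",", "", res, fixed = TRUE))
#> [1] 1224
```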

Not working for $

This could be a totally dumb question, but I am trying to strip out prices from a set of strings using the position of a "$". However, str_detect("Jokesonme", "$") gives me TRUE even though there is no "$" in the string.
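
The likely explanation is that "$" is a regular expression anchor meaning "end of string", which every string has. The base R sketch below shows the same effect with grepl; in stringr, the corresponding options would be to escape the character or to wrap the pattern in fixed().

```r
# As a regex, "$" matches the end of the string, so this is always TRUE.
grepl("$", "Jokesonme")
#> [1] TRUE

# Treating the pattern as a literal string looks for an actual dollar sign.
grepl("$", "Jokesonme", fixed = TRUE)
#> [1] FALSE

# Escaping the dollar sign in the regex has the same effect.
grepl("\\$", c("Jokesonme", "Price: $5"))
#> [1] FALSE  TRUE
```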

Installation failure via install_github()

I wanted the stringr vignette, which didn't seem available on CRAN, so I decided to install from GitHub and request vignette build at install time.

First I tried install_github("hadley/stringr", build_vignettes = TRUE)

> devtools::install_github("hadley/stringr", build_vignettes = TRUE)
Downloading github repo hadley/stringr@master
Installing stringr
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD build  \
  '/private/var/folders/bb/xs02zqls0snbgswbgkjwcbph0000gn/T/Rtmp0I9C3A/devtoolsed3119acff56/hadley-stringr-bd4e71f'  \
  --no-manual --no-resave-data 

* checking for file ‘/private/var/folders/bb/xs02zqls0snbgswbgkjwcbph0000gn/T/Rtmp0I9C3A/devtoolsed3119acff56/hadley-stringr-bd4e71f/DESCRIPTION’ ... OK
* preparing ‘stringr’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
Error: processing vignette 'stringr.Rmd' failed with diagnostics:
unused argument (omit_no_match = TRUE)
Execution halted
Error: Command failed (1)

Then I tried without requesting the vignette:

> devtools::install_github("hadley/stringr")
Downloading github repo hadley/stringr@master
Installing stringr
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL  \
  '/private/var/folders/bb/xs02zqls0snbgswbgkjwcbph0000gn/T/Rtmp0I9C3A/devtoolsed315912cf67/hadley-stringr-bd4e71f'  \
  --library='/Users/jenny/resources/R/libraryCRAN' --install-tests 

* installing *source* package ‘stringr’ ...
** R
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (stringr)
Reloading installed stringr
unloadNamespace("stringr") not successful. Forcing unload.

Then I just grabbed a copy of the Rmd for the vignette, saved as foo.rmd, and tried "Knit":

processing file: foo.rmd
Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE,  : 
  unused argument (omit_no_match = TRUE)
Calls: <Anonymous> ... parse_inline -> str_locate_all -> stri_locate_all_regex
Execution halted

Then I tried walking through the code "by hand" and got my first error here:

> str_detect(strings, phone)
[1] FALSE  TRUE  TRUE  TRUE
> str_subset(strings, phone)
Error in stri_subset_regex(string, pattern, omit_na = TRUE, opts_regex = attr(pattern,  : 
  unused argument (omit_na = TRUE)

At this point, here's what session info looks like:

> devtools::session_info()
Session info---------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.1.2 (2014-10-31)
 system   x86_64, darwin10.8.0        
 ui       RStudio (0.98.1091)         
 language (EN)                        
 collate  en_CA.UTF-8                 
 tz       America/Vancouver           

Packages-------------------------------------------------------------------------------------
 package    * version    date       source                          
 devtools     1.6.0.9000 2014-11-30 Github (hadley/devtools@bd9c252)
 evaluate     0.5.5      2014-04-29 CRAN (R 3.1.0)                  
 formatR      1.0        2014-08-25 CRAN (R 3.1.1)                  
 knitr        1.8.3      2014-11-30 Github (yihui/knitr@21da020)    
 magrittr     1.5        2014-11-22 CRAN (R 3.1.2)                  
 rstudioapi   0.1        2014-03-27 CRAN (R 3.1.0)                  
 stringi      0.3.1      2014-11-06 CRAN (R 3.1.2)                  
 stringr    * 0.9.0.9000 2015-01-08 Github (hadley/stringr@bd4e71f)

bug with new stringr release? wildcard does not match \n

The R-package stringr behaves differently between R-Version 3.1.1 and 3.2.0 (verified on two different machines). Under 3.2.0 the wildcard does not match \n

simplified example:
x <- "abc\n23"

in version R-3.1.1
str_extract_all(x, "a.+?[[:digit:]]{2}")
[1] "abc\n23"

in version R-3.2.0
str_extract_all(x, "a.+?[[:digit:]]{2}")
[[1]]
character(0)
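
A possible workaround, sketched under the assumption that the installed stringr is the stringi-based version and provides regex() with a dotall argument: ask the engine to let "." match line terminators as well.

```r
library(stringr)

x <- "abc\n23"

# dotall = TRUE lets "." match "\n" too.
res <- str_extract_all(x, regex("a.+?[[:digit:]]{2}", dotall = TRUE))
identical(res[[1]], "abc\n23")
#> [1] TRUE
```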

str_c with Dates broken in 1.0.0 cran + github

stringr_1.0.0

str_c("x",Sys.Date())
[1] "x16556"

stringr_1.0.0.9000

str_c("x",Sys.Date())
[1] "x16556"

stringr_0.6.2

str_c("x",Sys.Date())
[1] "x2015-05-01"
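
The number 16556 is the Date's underlying value, i.e. days since 1970-01-01. Until the behaviour is settled, converting the date to character explicitly sidesteps the question. A sketch, assuming stringr is installed and using a fixed date in place of Sys.Date():

```r
library(stringr)

d <- as.Date("2015-05-01")

# The underlying number of days since 1970-01-01.
unclass(d)
#> [1] 16556

# Explicit conversion gives the formatted date regardless of version.
str_c("x", as.character(d))
#> [1] "x2015-05-01"
str_c("x", format(d, "%Y-%m-%d"))
#> [1] "x2015-05-01"
```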

Problem with str_sub<-

mytext <- c("bob","hadley","george")
str_sub(mytext, 1, 1) <- toupper(str_sub(mytext, 1, 1))
mytext

Add Citation to the package.

Hi @hadley ,
Could you please add a citation to the package? Although I could do it myself (both the correct citation and a pull request), I think it would be better to wait for you.

In case I should do it, let me know. Thanks.

str_split creates empty leading "" when splitting on ""

Compare

str_split("abc","")
[[1]]
[1] ""  "a" "b" "c"

with

strsplit("abc","")
[[1]]
[1] "a" "b" "c"

FR: resolve incompatible pattern modification earlier

Please resolve incompatible search modifiers (fixed vs. perl / ignore.case) when these functions are called, rather than deferring the resolution to the str_* calls. Take the following example:

pattern <- 
  "str" %>%
  ignore.case %>%
  perl  # %>%
  # fixed -> pattern


str(pattern)
# Overriding Perl regexp matching
#  atomic [1:1] pattern
#  - attr(*, "ignore.case")= logi TRUE
#  - attr(*, "perl")= logi TRUE
#  - attr(*, "fixed")= logi TRUE

In this case, each of the match modifiers sets an attribute to TRUE, even though they are incompatible. If one examines pattern as in the example, the effects are unclear because they are only resolved later. A better approach would be for successive calls to the modifiers to adjust pattern as appropriate. This can be done by changing the functions. For example, fixed might become:

fixed <-   function(string) {
  if (stringr::is.perl(string)) 
    message("Overriding Perl regexp matching")
  structure(string, fixed = TRUE, perl = NULL, ignore.case = NULL )
}

Or perl = FALSE as another alternative.

str_pad Cannot Deal With NAs

This does not work, although it should return the missing value unchanged.

str_pad(c("hello", NA), 8)
Error in rep.int(string[i], times[i]) : invalid 'times' value

Not sure about whether other stringr functions are affected by this bug as well.

current version of stringr requires R 2.11.0 (for vapply)

The DESCRIPTION file should indicate this dependency.

str_pad doesn't accept NA's.

str_pad(c(120,123), width = 6, pad = '0')
[1] "000120" "000123"
str_pad(c(120,123,NA), width = 6, pad = '0')
Error in rep.int(string[i], times[i]) : invalid 'times' value

Just need to skip the NA's.

str_lower, str_upper, str_capitalise, and str_CamelCase

Please add functions to convert a character string to all lowercase, all UPPERCASE, all First Letters Of Words In Capitalized Case and all camelCase. You could call the functions: str_lower, str_upper, str_capitalise, and str_CamelCase.

The first two are more straightforward and should be modeled on the tolower() and toupper() in base R. The last two are more tricky to get right. One source of inspiration could be the tocamel() function in the development version of the 'rapport' package: https://github.com/Rapporter/rapport/tree/development . The associated issues have been partially discussed on r-help: http://r.789695.n4.nabble.com/how-to-transform-string-to-quot-Camel-Case-quot-td4664222.html

Should you decide to take the 'rapport' approach and merge str_capitalise and str_CamelCase into one function, then you could call it str_camel.

word function

word() grabs words from char strings. For example:
str = 'abc.123.999..'
word(str, 1, delim='.') would return 'abc'
word(str, 2, delim='.') would return '123'
word(str, -1, delim='.') would return '999'

suggested by David Cooper

Check str_match/str_match_all output with no matches

e.g.

str_match_all("abc", "d")
str_match("abc", "d")

Should have one row for each input, and one column for each match + 1.

Bug in str_subset / fixed

> str_subset("I", fixed("i", ignore_case = TRUE))
character(0)

I was expecting to get "I", not an empty character vector.

ignore_case is not working for ICU regex() patterns

It doesn't seem like the ignore_case argument is working for regex patterns:

library(stringr)
x <- c("a", "A")
str_detect(regex("a"), x)

gives

[1]  TRUE FALSE

and

str_detect(regex("a", ignore_case = TRUE), x)

gives

[1]  TRUE FALSE

My system is

> devtools::session_info()
Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.1.3 (2015-03-09)
 system   x86_64, darwin13.4.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            

Packages -----------------------------------------------------------------------
 package   * version  date       source        
 bitops      1.0-6    2013-08-17 CRAN (R 3.1.0)
 devtools    1.8.0    2015-05-09 CRAN (R 3.1.3)
 digest      0.6.8    2014-12-31 CRAN (R 3.1.2)
 git2r       0.10.1   2015-05-07 CRAN (R 3.1.3)
 magrittr    1.5      2014-11-22 CRAN (R 3.1.2)
 memoise     0.2.1    2014-04-22 CRAN (R 3.1.0)
 RCurl       1.95-4.6 2015-04-24 CRAN (R 3.1.3)
 rversions   1.0.0    2015-04-22 CRAN (R 3.1.3)
 stringi     0.4-1    2014-12-14 CRAN (R 3.1.2)
 stringr   * 1.0.0    2015-04-30 CRAN (R 3.1.3)
 XML         3.98-1.1 2013-06-20 CRAN (R 3.1.0)
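
One possible reading of the output above is that the arguments are swapped: stringr functions take the strings first and the pattern second, so in the calls above "a" is the string and x supplies two patterns. A sketch with the conventional argument order, assuming stringr is installed:

```r
library(stringr)

x <- c("a", "A")

# Strings first, pattern second: case-sensitive by default.
str_detect(x, regex("a"))
#> [1]  TRUE FALSE

# With ignore_case = TRUE both elements match.
str_detect(x, regex("a", ignore_case = TRUE))
#> [1] TRUE TRUE
```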

New Feature Request - fuzzywuzzy-style string matching/scoring

For a while now, I've wanted a way to use fuzzywuzzy in R. I've even tried installing R-Python translators, to no avail. If stringr could include any part of this type of functionality, it would make my life much, much easier.

str_trim(character(0)) should return "", not character(0)

Ref: r-lib/pkgdown#49

str_detect() feature suggestion

For example, I want to validate the string argument of a function with a regex and the argument must exactly match the regex.

dummy <- function(x) {
    stopifnot(str_detect(x, "[ABC]{3}"))
}

I want this function to accept only arguments of the form "BBC", "AAA", "CBC" or "AAB". But I don't want it to accept "ABCD" or "AAAA".
One approach is str_extract(x, "[ABC]{3}") == x, but it is not intuitive.

UPDATE: perhaps I should use a better regex. Thanks gagolwes.
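
The "better regex" mentioned in the update is presumably one that anchors the pattern at both ends. A base R sketch of that idea, where `is_exact` is a hypothetical helper name:

```r
# "^" and "$" force the whole string to match, not just a part of it.
is_exact <- function(x) grepl("^[ABC]{3}$", x)

is_exact(c("BBC", "AAA", "ABCD", "AAAA"))
#> [1]  TRUE  TRUE FALSE FALSE
```

The same anchored pattern could be passed to str_detect inside stopifnot.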

Named capture groups

Like in python. e.g.

> str_match(strings,"([2-9][0-9]{2})[- .](?P<area>[0-9]{3})[- .]([0-9]{4})")
                    area
 [1,] "219 733 8965" "219" "733" "8965"
 [2,] "329-293-8753" "329" "293" "8753"
 [3,] NA             NA    NA    NA
 [4,] "595 794 7569" "595" "794" "7569"

documentation typo, "regexp" should be "regex"

In ?perl and the deprecation message that prints when you use perl, regexp is referred to instead of regex.

Recommendation: exact match modifier

I recommend adding an exact match modifier like perl, fixed and ignore.case.

The exact modifier should match only on exact matches unlike fixed which matches on part of the string. Although this can be done using ==, the exact modifier would allow developers a parallel idiom to switch between exact and less-exact matchings.

An alternative, of course, is to use perl with a pattern wrapped between ^ and $, but this solution requires applying a function to the pattern and not the string, thus breaking the parallel construction.

Internally, this could use the perl construct described in the preceding paragraph or use the ==, which should be faster.

This could probably be implemented mostly within the re_call function.