
gagolews / stringi


Fast and portable character string processing in R (with the Unicode ICU)

Home Page: https://stringi.gagolewski.com/

License: Other

R 4.13% Shell 0.08% HTML 9.10% TeX 0.89% C++ 70.97% CSS 0.19% C 14.33% M4 0.13% Makefile 0.02% Smarty 0.01% Python 0.05% Batchfile 0.01% Perl 0.12%
stringi icu icu4c r regex regexp string-manipulation unicode natural-language-processing text-processing text stringr nlp tidy-data

stringi's Issues

stri_paste bug


stri_paste("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","kot",NA)
*** glibc detected *** /usr/lib/rstudio/bin/rsession: free(): invalid next size (fast): 0x00000000027e4460 ***
*** glibc detected *** /usr/lib/rstudio/bin/rsession: malloc(): memory corruption: 0x00000000027e4480 ***

stri_split_newline

Split with any of the following. If EOLs are not consistent, generate a warning.

Input: character vector. Output: List of character vectors.

Newline chars according to Unicode TR:

http://www.unicode.org/reports/tr18/ : (?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])

http://www.unicode.org/standard/reports/tr13/tr13-5.html Unicode Technical Report #13
Unicode Newline Guidelines




Name    Unicode     ASCII   EBCDIC 1  EBCDIC 2
CR      000D        0D      0D        0D
LF      000A        0A      25        15
CRLF    000D,000A   0D,0A   0D,25     0D,15
NEL*    0085        85      15        25
VT      000B        0B      0B        0B
FF      000C        0C      0C        0C
LS      2028        n/a     n/a       n/a
PS      2029        n/a     n/a       n/a
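
A minimal sketch of the behaviour requested above, written in Python only to illustrate the semantics (it is not stringi code). The name `split_newline`, the use of `None` in place of NA, and the wording of the warning are all assumptions made for this example.

```python
import re
import warnings

# Newline sequences from the table above. CRLF comes first so that it is
# matched as one terminator rather than as CR followed by LF.
NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_newline(strings):
    """Split each string on any Unicode newline; warn on mixed EOL styles."""
    result = []
    for s in strings:
        if s is None:                     # None stands in for R's NA (assumption)
            result.append(None)
            continue
        kinds = set(NEWLINE.findall(s))   # distinct terminators seen in s
        if len(kinds) > 1:
            warnings.warn("inconsistent line terminators")
        result.append(NEWLINE.split(s))
    return result

lines = split_newline(["first\r\nsecond\r\nthird"])
```

Whether VT and FF should count as line terminators is a design decision; they are included here only because the table lists them.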

stri_indices_merge [internal]

stri_locate_all_charclass::

if (merge) return(ret)
else return(lapply(ret, function(m) {
    if (is.na(m[1,1])) return(m)
    idx <- unlist(apply(m, 1, function(k) k[1]:k[2]))
    matrix(idx, ncol=2, nrow=length(idx),
           dimnames=list(NULL, c('start', 'end')))
}))
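
The non-merged branch above expands each merged range into one row per position. A rough Python illustration of that step, assuming 1-based inclusive `(start, end)` pairs and `None` for "no match" (the name `expand_ranges` is hypothetical):

```python
def expand_ranges(ranges):
    """Turn merged (start, end) ranges into one (i, i) pair per position."""
    if ranges is None:        # no match: returned unchanged, like the NA check above
        return None
    return [(i, i) for start, end in ranges for i in range(start, end + 1)]

expanded = expand_ranges([(1, 3), (5, 5)])
```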

BUG in stri_detect_regex and/or ICU regex engine

Two of the following tests fail:

   expect_identical(stri_detect_regex("aaaaaaaaaaaaaaaa", "aaaaaaaaaaaaaaaa"), TRUE)
   expect_identical(stri_detect_regex("aaaaaaaaaaaaaaa",  "aaaaaaaaaaaaaaa"), TRUE)
   expect_identical(stri_detect_regex("aaaaaaaaaaaaaaaa", "aaaaaaaaaaaaaaa"), TRUE)

It seems that a long pattern confuses the ICU regex engine.

This is a critical BUG - we need a workaround.

stri_split_fixed

stri_split_fixed(str, split, omitempty) - vectorized over str, split and omitempty

  1. DONE
    stri_split_fixed(c("A B", "AB"), " ") == list(c("A", "B"), "AB")
    stri_split_fixed(c("A B", "A1B"), c(" ", "1")) == list(c("A", "B"), c("A", "B"))
    stri_split_fixed("A B1C", c(" ", "1")) == list(c("A", "B1C"), c("A B", "C"))

  2. DONE
    omitempty handles this

// sequences of splitting characters - ignore multiple
stri_split_fixed(c("ABA", "ABBABBBAA"), "B") == list(c("A", "A"), c("A", "A", "AA"))

stri_split_fixed(c("ABA", "ABBABBBAA"), "BB") == list("ABA", c("A", "A", "AA"))
// or (?):
stri_split_fixed(c("ABA", "ABBABBBAA"), "BB") == list("ABA", c("A", "A", "BAA"))
// maybe add an argument to control overlapping splitting sequences

  3. correct UTF-8
    // char-by-char (assume temporarily that we have ASCII-encoded strings);
    // things will get a little more complicated with UTF-8
    stri_split_fixed(c("A B", "AB"), "") == list(c("A", " ", "B"), c("A", "B"))

http://www.icu-project.org/apiref/icu4c/utf8_8h.html:

if U8_IS_SINGLE(c) is TRUE then we treat c as a single char
U8_NEXT or U8_NEXT_UNSAFE may be used to iterate through the string (char*);
something like (see http://userguide.icu-project.org/strings/utf-8)

for (int i = 0; i < length; ) {
    UChar32 c; // this is a single Unicode code point
    U8_NEXT(s, i, length, c);
    // process c
}

  4. Wrong NA behavior - DONE
    stri_split(NA_character_,"A")
    [[1]]
    [1] "N" ""
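
The points discussed in this issue (omitempty, the empty separator, and NA propagation) can be sketched for a single string as follows. This is a Python illustration under stated assumptions, not the package's implementation: `split_fixed` and `omit_empty` are hypothetical names, `None` plays the role of NA, and vectorization over the arguments is left out.

```python
def split_fixed(s, split, omit_empty=False):
    """Split one string on a fixed separator, matching left to right."""
    if s is None or split is None:
        return None                 # propagate missing values instead of splitting "NA"
    if split == "":
        return list(s)              # a Python str iterates by code point, not by byte
    parts = s.split(split)          # non-overlapping matches, scanned left to right
    if omit_empty:
        parts = [p for p in parts if p != ""]
    return parts

single = split_fixed("ABBABBBAA", "B", omit_empty=True)
double = split_fixed("ABBABBBAA", "BB", omit_empty=True)
```

With non-overlapping left-to-right matching, the "BB" case produces the variant ending in "BAA"; getting "AA" instead would need the extra argument for overlapping separators mentioned above.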

stri_split_charclass()

Like the reverse of stri_locate_all_class() (see #12) combined with stri_sub() (see #14).

This may be done in pure R or, better, via a new (possibly internal) function stri_split_pos(), which assumes that positions are sorted in increasing order.

stri_flatten

stri_flatten - join all strings from one character vector into one string (this has already been done).
Missing: an additional argument (sep? collapse?) that separates the strings (defaults to "").

stri_trim

stri_trim done
stri_ltrim done
stri_rtrim done

stri_trim_all - trim all whitespace - almost done
NA and character(0) as usual.
stri_trim_all(" a a \t a \n a ")=="a a a a"
stri_trim_all(" ") == ""
stri_trim_all("") == ""

this function uses stri_split_fixed and stri_flatten
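
The split-then-flatten idea can be illustrated in a few lines of Python. The name `trim_all` and the use of `None` for NA are assumptions; Python's `str.split()` without arguments stands in for "split on whitespace and drop empty pieces".

```python
def trim_all(s):
    """Trim both ends and collapse inner whitespace runs to a single space."""
    if s is None:                    # None stands in for NA (assumption)
        return None
    return " ".join(s.split())       # split on whitespace runs, then flatten with " "

cleaned = trim_all(" a a \t a \n a ")
```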

stri_width

stri_width - determine the width of a UTF-8-encoded string when printed on screen.

e.g. "\u0061\u0328" (61 cc a8) - a + ogonek (ą) - width=1
"\u0105" (c4 85) - a WITH ogonek (ą) - width=1
"\u30b0" (e3 82 b0) - グ - width=2

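One possible approximation of such a width, sketched in Python with the standard `unicodedata` module: combining marks count as 0 columns, East Asian wide and fullwidth characters as 2, everything else as 1. The name `display_width` is hypothetical, and the rule is a simplification (control characters and zero-width characters with combining class 0 are not handled).

```python
import unicodedata

def display_width(s):
    """Approximate number of terminal columns needed to print s."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch) != 0:
            continue                      # combining marks take no column of their own
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width

widths = [display_width(x) for x in ("a\u0328", "\u0105", "\u30b0")]
```
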
stri_totitle - explicit use of custom BreakIterators?

> stri_totitle("pining for the fjords-yes, i'm brian", "en_US")
[1] "Pining For The Fjords-Yes, I'm Brian"

This is not the correct result for the en_US locale. It should be:

[1] "Pining for the Fjords-Yes, I'm Brian"

Currently I get U_USING_DEFAULT_WARNING when querying BreakIterator::createWordInstance (even though I use the full ICU data lib)

The question is: is correct title casing possible with "raw" ICU?

stri_sub() and stri_sub<-() should accept lists

> x <- "ala"; stri_sub(x, c(2,1), c(2,1)) <- c("Y", "Z")
> x
[1] "aYa" "Zla"

someday we may wish to add a function that is not vectorized w.r.t. the first argument,
so that we obtain "ZYa" (from/to/length should be sorted increasingly)

OR

the above setting will be strange (non-vectorized? OMG!), so we may accept lists of integer vectors as from/to/length
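
A sketch of the non-vectorized variant, in which all replacements are applied to the same string so that the example yields "ZYa". This is Python for illustration only; `replace_ranges` is a hypothetical name, positions are 1-based and inclusive, and the ranges are sorted inside the function (an assumption; the note above suggests requiring sorted input instead).

```python
def replace_ranges(s, starts, ends, values):
    """Replace several 1-based inclusive ranges of s in a single pass."""
    out, pos = [], 1                        # pos: next position of s not yet copied
    for start, end, value in sorted(zip(starts, ends, values)):
        if start < pos:
            raise ValueError("ranges must not overlap")
        out.append(s[pos - 1:start - 1])    # untouched text before this range
        out.append(value)
        pos = end + 1
    out.append(s[pos - 1:])                 # untouched tail
    return "".join(out)

result = replace_ranges("ala", [2, 1], [2, 1], ["Y", "Z"])
```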

stri_read_lines bad encoding

When I read some files with encoding = "auto", the encoding is sometimes detected incorrectly and I end up with bad data in the variable.

? stri_ranges_union(), stri_ranges_intersect(), stri_ranges_diff() ?

These functions may be useful when operating on results of stri_locate_*

e.g.
x <- stri_locate_all_charclass(str, "WHITESPACE")
y <- stri_locate_all_charclass(str, "L")
stri_sub(str, stri_ranges_union(x, y)) # extract whitespaces and letters

some options:

  • operate row by row on a single matrix (stri_locate_first/last)
  • operate on all rows on a matrix or a list of matrices (stri_locate_all)

these may be auto-detected, i.e. whether we have a list of matrices or a single matrix
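
For a single pair of range sets, the union could work roughly as below. This Python sketch assumes 1-based inclusive integer ranges, so adjacent ranges such as (1,2) and (3,4) are merged; `ranges_union` here is an illustrative function, not an existing stringi one.

```python
def ranges_union(x, y):
    """Union of two collections of inclusive (start, end) ranges."""
    merged = []
    for start, end in sorted(list(x) + list(y)):
        if merged and start <= merged[-1][1] + 1:
            # overlaps or touches the previous range: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

union = ranges_union([(1, 2), (6, 6)], [(3, 4), (6, 8)])
```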

stri_chartype

  1. Get the general category value for each code point in each UTF-8 string;
    Same as java.lang.Character.getType()
  2. Char type info (data frame? / results as factors?)
  3. Allow for native, non-UTF-8 encodings
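
To illustrate item 1, Python's `unicodedata.category` returns the two-letter general category of a code point. The wrapper name `char_types` and the use of `None` for NA are assumptions; Java's `Character.getType()` returns integer constants rather than these names.

```python
import unicodedata

def char_types(strings):
    """General category (two-letter name) of every code point in each string."""
    return [
        None if s is None else [unicodedata.category(ch) for ch in s]
        for s in strings
    ]

types = char_types(["aB 1", None])
```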

stri_sub()

stri_sub(s, from, to, length)

  • to and length args are mutually exclusive; (?missing)
  • omitting to and length means 'to the end of the string'; (?missing)
  • the function is vectorized over s, from, and length.
  • ASSERT: length(from)==length(sub)
  • note that from/to/length are indicating Unicode code points, not bytes
  • OPTIONALLY: when from is omitted then either to or length must be integer matrices with 2 columns (what do you think?)
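
A single-string sketch of the first, second, and fifth bullets: mutually exclusive to and length, a default of "to the end of the string", and code-point rather than byte positions. This is Python for illustration; the function name `sub`, the keyword arguments, and the 1-based inclusive positions are assumptions, and negative positions are not handled.

```python
def sub(s, start, to=None, length=None):
    """Substring by 1-based code-point positions; 'to' is inclusive."""
    if to is not None and length is not None:
        raise ValueError("'to' and 'length' are mutually exclusive")
    if length is not None:
        to = start + length - 1
    if to is None:
        return s[start - 1:]          # omitted: take everything to the end
    return s[start - 1:to]

middle = sub("z\u0105b", 2, to=2)     # one code point, two bytes in UTF-8
```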

stri_replace(str, setWhat, setTo)

allow a different vectorization scheme in stri_replace* that outputs a single resulting string;
it is something similar to multiple calls:

for (i in seq_along(setWhat))
    str <- stri_replace(str, setWhat[i], setTo[i])
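
The same loop in Python, to show one consequence of this scheme: later replacements see the output of earlier ones, so the order of setWhat matters. `replace_seq` is a hypothetical name, and fixed (non-regex) patterns are assumed.

```python
def replace_seq(s, set_what, set_to):
    """Apply each (what, to) replacement in turn to the same string."""
    for what, to in zip(set_what, set_to):
        s = s.replace(what, to)       # each step operates on the previous result
    return s

chained = replace_seq("abc", ["a", "b"], ["b", "c"])
```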

stri_length

stri_length - count the number of Unicode code points in a UTF-8-encoded string
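
The distinction between code points and bytes, shown in Python (where `len` of a `str` already counts code points). `code_point_length` is a hypothetical name and `None` stands in for NA.

```python
def code_point_length(s):
    """Number of Unicode code points in s (not bytes, not grapheme clusters)."""
    return None if s is None else len(s)

n_points = code_point_length("a\u0328")        # 'a' followed by combining ogonek
n_bytes = len("a\u0328".encode("utf-8"))       # the same string measured in bytes
```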

stri_read_lines, stri_write_lines

two wrappers for readLines and writeLines - get/set whole content of a text file with possible input encoding auto detection / output reencoding

stri_locate_*_regex add capture_groups arg

allow for capturing subgroups of regexps
RegexMatcher::

virtual int32_t start (int32_t group, UErrorCode &status) const
virtual int32_t end (int32_t group, UErrorCode &status) const
virtual UnicodeString group (UErrorCode &status) const
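
What such an argument might return for the first match is sketched below with Python's `re` module, whose `start(group)` / `end(group)` calls play the same role as the RegexMatcher methods listed above. This is not ICU: `locate_first_groups` is a hypothetical name, positions are converted to 1-based inclusive, and groups that did not take part in the match are reported as `None`.

```python
import re

def locate_first_groups(s, pattern):
    """(start, end) of the first match and of each capture group, 1-based."""
    m = re.search(pattern, s)
    if m is None:
        return None
    spans = []
    for g in range(m.re.groups + 1):      # group 0 is the whole match
        if m.start(g) == -1:
            spans.append(None)            # group did not participate
        else:
            spans.append((m.start(g) + 1, m.end(g)))
    return spans

spans = locate_first_groups("ala ma kota", r"(m)(a)")
```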

stri_charcategories

  1. Get the two-letter category name for each Unicode character category identifier
  2. Allow for native, non-UTF-8 encodings

stri_detect_regex

> library(stringi)
> stri_detect_regex("ala ma kota","a")
Error in stri_detect_regex("ala ma kota", "a") : U_MISSING_RESOURCE_ERROR
> stri_detect_regex("ala ma kota","a")
[1] TRUE

Is this correct behavior? What does this error mean?

stri_encset()

A function to change the default (i.e., current) ICU character encoding.
Also, warn if there will be problems with R after setting this charset (non-ASCII superset, no 1-to-1 UChar conversion, etc.)

stri_sub_index()

stri_subindex("abcde", 1:2) == "ab"
stri_subindex("abcde", c(1,3,5)) == "ace"
stri_subindex("abcde", c(5,5,2)) == "eeb"
stri_subindex("abcde", c(-1,-5)) == "bcd"
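
The four examples suggest R-style indexing: positive indices select code points (repeats allowed) and negative indices drop them. A Python sketch under that reading, with `subindex` as a hypothetical name and mixed signs rejected as in R:

```python
def subindex(s, idx):
    """Pick code points by 1-based index; all-negative indices drop positions."""
    if idx and all(i < 0 for i in idx):
        drop = {-i for i in idx}
        return "".join(ch for pos, ch in enumerate(s, 1) if pos not in drop)
    if any(i < 0 for i in idx):
        raise ValueError("cannot mix positive and negative indices")
    return "".join(s[i - 1] for i in idx)

kept = subindex("abcde", [-1, -5])
```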
