gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)

Home Page: https://stringi.gagolewski.com/

License: Other

R 4.13% Shell 0.08% HTML 9.10% TeX 0.89% C++ 70.97% CSS 0.19% C 14.33% M4 0.13% Makefile 0.02% Smarty 0.01% Python 0.05% Batchfile 0.01% Perl 0.12%

stringi icu icu4c r regex regexp string-manipulation unicode natural-language-processing text-processing text stringr nlp tidy-data

stringi's Issues

stri_*_fixed: add an option to take into account overlapping pattern matches

it'll be easy :)
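
A minimal base-R sketch of what "overlapping" would mean here. The helper name `locate_fixed_overlap` is hypothetical and not part of stringi; it only illustrates the requested semantics for a single string and a single pattern.

```r
# Hypothetical helper (not a stringi function): start positions of all
# occurrences of `pattern` in `str`, including overlapping ones.
locate_fixed_overlap <- function(str, pattern) {
  n <- nchar(str)
  m <- nchar(pattern)
  if (m == 0L || m > n) return(integer(0))
  starts <- seq_len(n - m + 1L)
  # compare every window of length m with the pattern
  starts[substring(str, starts, starts + m - 1L) == pattern]
}

locate_fixed_overlap("aaaa", "aa")         # overlapping matches are reported
gregexpr("aa", "aaaa", fixed = TRUE)[[1]]  # base R skips overlapping matches
```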

stri_paste bug


stri_paste("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","kot",NA)
*** glibc detected *** /usr/lib/rstudio/bin/rsession: free(): invalid next size (fast): 0x00000000027e4460 ***
*** glibc detected *** /usr/lib/rstudio/bin/rsession: malloc(): memory corruption: 0x00000000027e4480 ***

stri_split_newline

Split with any of the following. If EOLs are not consistent, generate a warning.

Input: character vector. Output: List of character vectors.

Newline chars according to Unicode TR:

http://www.unicode.org/reports/tr18/ : (?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])

http://www.unicode.org/standard/reports/tr13/tr13-5.html Unicode Technical Report #13
Unicode Newline Guidelines




Acronym  Unicode    ASCII  EBCDIC 1  EBCDIC 2
CR       000D       0D     0D        0D
LF       000A       0A     25        15
CRLF     000D,000A  0D,0A  0D,25     0D,15
NEL*     0085       85     15        25
VT       000B       0B     0B        0B
FF       000C       0C     0C        0C
LS       2028       n/a    n/a       n/a
PS       2029       n/a    n/a       n/a
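
A rough base-R sketch of the requested behavior, assuming the newline set listed above (CRLF, LF, VT, FF, CR, NEL, LS, PS). The name `split_newline` is made up for illustration, and the wording of the warning is an assumption.

```r
# Hypothetical sketch, not stringi code: split on any Unicode newline
# sequence and warn when a string mixes different end-of-line sequences.
split_newline <- function(str) {
  # CRLF must come first so that it is treated as a single separator
  eol <- "\r\n|[\n\v\f\r\u0085\u2028\u2029]"
  seps <- regmatches(str, gregexpr(eol, str, perl = TRUE))
  mixed <- vapply(seps, function(s) length(unique(s)) > 1L, logical(1))
  if (any(mixed)) warning("inconsistent end-of-line sequences")
  strsplit(str, eol, perl = TRUE)
}

split_newline("a\r\nb\r\nc")
```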

stri_order, stri_cmp - byte compare, no collation on collate_opts = NA

missing functionality - collate_opts=NA

if (merge) return(ret)
else return(lapply(ret, function(m) {
    if (is.na(m[1,1])) return(m)
    idx <- unlist(apply(m, 1, function(k) k[1]:k[2]))
    matrix(idx, ncol=2, nrow=length(idx),
        dimnames=list(NULL,c('start', 'end')))
}))

stri_locate_all_fixed(), stri_locate_first_fixed(), stri_locate_last_fixed()

Find all/first/last positions of an occurrence of substr in str (vectorized over str and substr)

match, pmatch

BUG in stri_detect_regex and/or ICU regex engine

Two of the following tests fail:

   expect_identical(stri_detect_regex("aaaaaaaaaaaaaaaa", "aaaaaaaaaaaaaaaa"), TRUE)
   expect_identical(stri_detect_regex("aaaaaaaaaaaaaaa",  "aaaaaaaaaaaaaaa"), TRUE)
   expect_identical(stri_detect_regex("aaaaaaaaaaaaaaaa", "aaaaaaaaaaaaaaa"), TRUE)

It seems that a long pattern confuses the ICU regex engine.

This is a critical BUG - we need a workaround.

stri_startswith & stri_endswith

Logical functions, like Java's startsWith and endsWith.
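
A base-R sketch of the intended semantics. The names `starts_with` and `ends_with` are placeholders, not stringi functions. Base R itself ships `startsWith()` and `endsWith()` since R 3.3.0.

```r
# Hypothetical helpers: TRUE when `str` begins/ends with the given string.
# substring() recycles its arguments, so both are vectorized.
starts_with <- function(str, prefix) {
  substring(str, 1L, nchar(prefix)) == prefix
}
ends_with <- function(str, suffix) {
  n <- nchar(str)
  substring(str, n - nchar(suffix) + 1L, n) == suffix
}

starts_with(c("kotek", "pies"), "kot")
ends_with("kotek", c("tek", "kot"))
```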

stri_enc_detect

Detect string encoding, see http://userguide.icu-project.org/conversion/detection
http://www.icu-project.org/apiref/icu4c/ucsdet_8h.html#details

stri_split_fixed

stri_split_fixed(str, split, omitempty) - vectorized over str, split and omitempty

DONE
stri_split_fixed(c("A B", "AB"), " ") == list(c("A", "B"), "AB")
stri_split_fixed(c("A B", "A1B"), c(" ", "1")) == list(c("A", "B"), c("A", "B"))
stri_split_fixed("A B1C", c(" ", "1")) == list(c("A", "B1C"), c("A B", "C"))
DONE
omitempty handles this

// sequences of splitting characters - ignore multiple
stri_split_fixed(c("ABA", "ABBABBBAA"), "B") == list(c("A", "A"), c("A", "A", "AA"))

stri_split_fixed(c("ABA", "ABBABBBAA"), "BB") == list("ABA", c("A", "A", "AA"))
// or (?):
stri_split_fixed(c("ABA", "ABBABBBAA"), "BB") == list("ABA", c("A", "A", "BAA"))
// maybe add an argument to control overlapping splitting sequences
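
For comparison, a base-R sketch of the vectorization described above. `split_fixed` and `omit_empty` are illustrative names only; the sketch relies on `strsplit(fixed = TRUE)`, which scans left to right without overlaps, so it yields the "BAA" variant for the "BB" case.

```r
# Hypothetical sketch, not the stringi implementation: vectorized over
# str, split and omit_empty via mapply(), which recycles all three.
split_fixed <- function(str, split, omit_empty = FALSE) {
  one <- function(s, p, oe) {
    if (is.na(s) || is.na(p)) return(NA_character_)
    parts <- strsplit(s, p, fixed = TRUE)[[1]]
    if (oe) parts <- parts[nzchar(parts)]  # drop empty fields
    parts
  }
  mapply(one, str, split, omit_empty, SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

split_fixed("A B1C", c(" ", "1"))
split_fixed(c("ABA", "ABBABBBAA"), "B", omit_empty = TRUE)
split_fixed("ABBABBBAA", "BB")
```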

correct UTF-8
// char-by-char (assume temporarily that we have ASCII-encoded strings;
// things get a little more complicated with UTF-8)
stri_split_fixed(c("A B", "AB"), "") == list(c("A", " ", "B"), c("A", "B"))

http://www.icu-project.org/apiref/icu4c/utf8_8h.html:

if U8_IS_SINGLE(c) is TRUE then we treat c as a single char
U8_NEXT or U8_NEXT_UNSAFE may be used to iterate through the string (char*);
something like (see http://userguide.icu-project.org/strings/utf-8)

for(int i=0; i<length;) {
    UChar32 c; // this is a single UNICODE code point
    U8_NEXT(s, i, length, c);
    // process c
}
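
On the R side, the same code-point-wise view can be checked with base R alone. This is only an illustration of what "one element per code point" should look like, not the C-level implementation; the sample string is arbitrary.

```r
# A string with 1-, 2- and 3-byte UTF-8 sequences: a, a-with-ogonek, katakana GU
s <- "a\u0105\u30b0"

utf8ToInt(s)          # one integer per Unicode code point
strsplit(s, "")[[1]]  # char-by-char split, one element per code point
```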

Wrong NA behavior - DONE
stri_split(NA_character_,"A")
[[1]]
[1] "N" ""

stri_detect_fixed - use collator_opts

now we have only byte comparison, collator code is half-ready

stri_enc_tonative - new fun

like enc2native

stri_(un)escape_basic

like stri_escape_unicode, but without escaping > 127 codes in UTF

search_*_fixed: use Knuth-Morris-Pratt search algorithm

Implement Knuth–Morris–Pratt algorithm for every search_byte function.

http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
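
A compact R sketch of Knuth-Morris-Pratt, working on code points rather than bytes. It is meant to pin down the algorithm, not to suggest how the C++ byte-search routines should look. The name `kmp_search` is hypothetical.

```r
# Hypothetical sketch: 1-based start positions of all (possibly overlapping)
# occurrences of `pattern` in `text`, for single non-NA strings.
kmp_search <- function(text, pattern) {
  t <- utf8ToInt(text)
  p <- utf8ToInt(pattern)
  n <- length(t)
  m <- length(p)
  if (m == 0L || m > n) return(integer(0))

  # fail[i]: length of the longest proper prefix of p[1:i]
  # that is also a suffix of p[1:i]
  fail <- integer(m)
  k <- 0L
  for (i in seq_len(m)[-1L]) {
    while (k > 0L && p[k + 1L] != p[i]) k <- fail[k]
    if (p[k + 1L] == p[i]) k <- k + 1L
    fail[i] <- k
  }

  hits <- integer(0)
  k <- 0L  # number of pattern code points matched so far
  for (i in seq_len(n)) {
    while (k > 0L && p[k + 1L] != t[i]) k <- fail[k]
    if (p[k + 1L] == t[i]) k <- k + 1L
    if (k == m) {
      hits <- c(hits, i - m + 1L)
      k <- fail[k]  # fall back so that overlapping matches are found too
    }
  }
  hits
}

kmp_search("abababa", "aba")
```

Resetting `k` to 0 instead of `fail[k]` after a hit would give non-overlapping matches only.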

stri_split_charclass()

Like REVERSE(stri_locate_all_class() see #12) and stri_sub() (see #14) combined

This may be done in pure R; OR (better) via new (maybe internal) function stri_split_pos() (which assumes that positions are ordered increasingly)

stri_generate_pswd / generate random strings ?

Write a function that generates a password. Add options such as digits, letters, capital, and special to determine which symbols may be used.

Inspiration:
http://stackoverflow.com/questions/22219035/function-to-generate-a-random-password
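
One possible shape for such a function, sketched in base R. The function name, the argument `n`, and the set of special characters are assumptions; only the four option names come from the proposal above. `sample()` is not a cryptographically secure generator, so this is for illustration only.

```r
# Hypothetical sketch: draw n symbols at random from the selected classes.
generate_password <- function(n = 12L, digits = TRUE, letters = TRUE,
                              capital = TRUE, special = FALSE) {
  pool <- c(
    if (digits)  as.character(0:9),
    if (letters) base::letters,   # the argument shadows the base constant
    if (capital) base::LETTERS,
    if (special) strsplit("!@#$%^&*", "")[[1]]
  )
  if (length(pool) == 0L) stop("no symbol class selected")
  paste(sample(pool, n, replace = TRUE), collapse = "")
}

generate_password(16)
generate_password(8, letters = FALSE, capital = FALSE)  # digits only
```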

stri_flatten

stri_flatten - join all strings from one character vector into one string (this has already been done)
missing - additional argument (sep? collapse?) which separates the strings (defaults to "")

stri_trans_char (implementation of chartr)

> chartr("ABC", "abc", "BARA BACA")
[1] "baRa baca"

stri_trim

stri_trim done
stri_ltrim done
stri_rtrim done

stri_trim_all - trim all whitespace - almost done
NA and character(0) as usual.
stri_trim_all(" a a \t a \n a ")=="a a a a"
stri_trim_all(" ") == ""
stri_trim_all("") == ""

this function uses stri_split_fixed and stri_flatten
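
The expected results above can be reproduced in base R, which may be handy as a reference for tests. This sketch uses `trimws()` and a regex rather than stri_split_fixed and stri_flatten; the name `trim_all` is a stand-in.

```r
# Hypothetical reference implementation: strip leading/trailing whitespace,
# then collapse every inner run of whitespace to a single space.
trim_all <- function(str) {
  gsub("[[:space:]]+", " ", trimws(str))
}

trim_all(" a a \t a \n a ")
trim_all(c(" ", "", NA))
```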

stri_*_regex: allow for case-insensitive matches

TODO: add ignore_case argument (stri_prepare_arg_string_1) to all regex functions

Activates UREGEX_CASE_INSENSITIVE flag for ICU RegexMatcher, see http://www.icu-project.org/apiref/icu4c/uregex_8h.html#a874989dfec4cbeb6baf4d1a51cb529aea909d2ed2c61e34cb62dc13e29f6923ec

stri_pad

like str_pad

stri_width

stri_width - determine the width of a UTF-8-encoded string when printed on screen.

e.g. "\u0061\u0328" (61 cc a8) - a + ogonek (ą) - width=1
"\u0105" (c4 85) - a WITH ogonek (ą) - width=1
"\u30b0" (e3 82 b0) - グ - width=2

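For reference, base R's `nchar()` already distinguishes bytes, code points, and display width, so it can serve as a rough cross-check for stri_width. The widths it reports come from R's own width tables and may not agree with ICU in every case. The third string below is the code point encoded by the bytes e3 82 b0.

```r
x <- c("a\u0328",  # a + combining ogonek
       "\u0105",   # a with ogonek, precomposed
       "\u30b0")   # katakana GU (bytes e3 82 b0)

nchar(x, type = "bytes")  # size of the UTF-8 encoding
nchar(x, type = "chars")  # number of code points
nchar(x, type = "width")  # columns used when printed
```
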
stri_totitle - explicit use of custom BreakIterators?

> stri_totitle("pining for the fjords-yes, i'm brian", "en_US")
[1] "Pining For The Fjords-Yes, I'm Brian"

this is not the correct result for en_US locale. it should be:

[1] "Pining for the Fjords-Yes, I'm Brian"

Currently I get U_USING_DEFAULT_WARNING when querying BreakIterator::createWordInstance (even though I use full ICU data lib)

The question is: is correct title casing possible with "raw" ICU?

stri_trim_double [ -> stri_replace_all_charclass, merge=TRUE]

replace repeated characters of a given charclass (e.g., whitespace) with a single one

stri_locate_last_regex

HTML Entities - encode & decode

Something like: http://php.net/manual/en/function.html-entity-decode.php and http://php.net/manual/en/function.htmlentities.php and also http://www.php.net/manual/en/function.htmlspecialchars.php

stri_sub() and stri_sub<-() should accept lists

> x <- "ala"; stri_sub(x, c(2,1), c(2,1)) <- c("Y", "Z")
> x
[1] "aYa" "Zla"

someday we may wish to add a function that is not vectorized w.r.t. the first element,
so that we obtain "ZYa" (from/to/length should be sorted increasingly)

the above setting will be strange (non-vectorized? OMG!), so we may accept lists of integer vectors as from/to/length
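
A sketch of the non-vectorized variant for a single string, in base R. `sub_replace_all` is a made-up name, and the ranges are assumed not to overlap. Replacements are applied from right to left so that earlier positions stay valid even when a replacement changes the string length.

```r
# Hypothetical sketch: apply several from/to replacements to ONE string.
sub_replace_all <- function(str, from, to, value) {
  for (i in order(from, decreasing = TRUE)) {
    str <- paste0(
      substr(str, 1L, from[i] - 1L),      # part before the range
      value[i],                           # replacement
      substr(str, to[i] + 1L, nchar(str)) # part after the range
    )
  }
  str
}

sub_replace_all("ala", c(2, 1), c(2, 1), c("Y", "Z"))
```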

ICU Collator settings via stri_collator_options()

allow stri_compare, stri_order, and stri_*_fixed to use all options described in http://www.icu-project.org/apiref/icu4c/ucol_8h.html#a583fbe7fc4a850e2fcc692e766d2826c

StriContainerUTF8: detect BOMs

If BOM is found in UTF8, ~~warn and~~ remove it

stri_localeset()

a function to change the default (i.e., current) ICU locale

stri_read_lines bad encoding

When I read some files with encoding = "auto", the detected encoding is sometimes wrong, and I end up with corrupted data in the variable.

? stri_ranges_union(), stri_ranges_intersect(), stri_ranges_diff() ?

These functions may be useful when operating on results of stri_locate_*

e.g.
x <- stri_locate_all_charclass(str, "WHITESPACE")
y <- stri_locate_all_charclass(str, "L")
stri_sub(str, stri_ranges_union(x, y)) # extract whitespaces and letters

some options:

operate row by row on a single matrix (stri_locate_first/last)
operate on all rows on a matrix or a list of matrices (stri_locate_all)

these may be auto-detected, i.e. whether we have a list of matrices or a single matrix
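
To make the union case concrete, here is a base-R sketch for two start/end matrices. `ranges_union` follows the name proposed above, but the body is only an assumption about the semantics: overlapping ranges are merged, and so are directly adjacent ones, since positions are integers.

```r
# Hypothetical sketch: union of two sets of closed integer ranges,
# each given as a 2-column (start, end) matrix.
ranges_union <- function(x, y) {
  r <- rbind(x, y)
  if (nrow(r) == 0L) return(r)
  r <- r[order(r[, 1L], r[, 2L]), , drop = FALSE]  # sort by start, then end
  out <- r[1L, , drop = FALSE]
  for (i in seq_len(nrow(r))[-1L]) {
    last <- nrow(out)
    if (r[i, 1L] <= out[last, 2L] + 1L) {
      # overlaps or touches the previous range: extend it
      out[last, 2L] <- max(out[last, 2L], r[i, 2L])
    } else {
      out <- rbind(out, r[i, , drop = FALSE])
    }
  }
  dimnames(out) <- list(NULL, c("start", "end"))
  out
}

x <- matrix(c(1, 3, 7, 8), ncol = 2, byrow = TRUE)
y <- matrix(c(2, 5, 9, 9), ncol = 2, byrow = TRUE)
ranges_union(x, y)
```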

stri_chartype

Get the general category value for each code point in each UTF-8 string;
Same as java.lang.Character.getType()
Char type info (data frame? / results as factors?)
Allow for native, non-UTF-8 encodings

escape all non ASCII characters in a string for R use (\uXXXX, \UXXXXXXXX)

http://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html
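
A base-R sketch of the escaping rule: code points below 128 are kept, those up to U+FFFF become \uXXXX, and the rest become \UXXXXXXXX. `escape_non_ascii` is a placeholder name. The sketch ignores NA handling and does not escape backslashes, quotes, or control characters.

```r
# Hypothetical sketch: escape every non-ASCII code point for use in R source.
escape_non_ascii <- function(str) {
  vapply(str, function(s) {
    cp  <- utf8ToInt(s)                   # code points of one string
    ch  <- intToUtf8(cp, multiple = TRUE) # the same, as single characters
    esc <- ifelse(cp <= 0xFFFF,
                  sprintf("\\u%04x", cp),
                  sprintf("\\U%08x", cp))
    paste(ifelse(cp < 128L, ch, esc), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

cat(escape_non_ascii("za\u017c\u00f3\u0142\u0107"), "\n")
```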

stri_replace_fixed, stri_locate_fixed_* - use collator & StriContainerUTF8 for byte search

see stri_count_fixed / stri_count_fixed_bytes for inspiration
http://www.icu-project.org/apiref/icu4c/usearch_8h.html

int32_t usearch_getMatchedStart (const UStringSearch *strsrch)

int32_t usearch_getMatchedLength (const UStringSearch *strsrch)

int32_t usearch_getMatchedText

stri_sub()

stri_sub(s, from, to, length)

to and length args are mutually exclusive; (?missing)
omitting to and length means 'to the end of the string'; (?missing)
the function is vectorized over s, from, and length.
ASSERT: length(from)==length(sub)
note that from/to/length are indicating Unicode code points, not bytes
OPTIONALLY: when from is omitted then either to or length must be integer matrices with 2 columns (what do you think?)
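
The argument rules above can be prototyped with `substring()`, which already counts characters rather than bytes. This is a sketch of the interface only; `sub_str` is a stand-in name and the optional matrix form is not covered.

```r
# Hypothetical prototype of the proposed argument handling.
sub_str <- function(s, from = 1L, to = NULL, length = NULL) {
  if (!is.null(to) && !is.null(length))
    stop("'to' and 'length' are mutually exclusive")
  if (!is.null(length)) to <- from + length - 1L
  if (is.null(to)) to <- nchar(s, type = "chars")  # up to the end of the string
  substring(s, from, to)  # recycles s, from and to
}

sub_str("abcde", 2, 4)
sub_str("abcde", 1:2, length = 2)
sub_str("\u0105bc", 2)
```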

stri_replace(str, setWhat, setTo)

allow a different vectorization scheme in stri_replace*: output a single string,
the result of something similar to multiple calls

for (i in seq_along(setWhat))
    str <- stri_replace(str, setWhat[i], setTo[i])

stri_length

stri_length - count the number of Unicode code points in a UTF-8-encoded string

stri_replace_*_charclass()

stri_read_lines, stri_write_lines

two wrappers for readLines and writeLines - get/set whole content of a text file with possible input encoding auto detection / output reencoding

stri_locate_*_regex add capture_groups arg

allow for capturing subgroups of regexps
RegexMatcher::

virtual int32_t start (int32_t group, UErrorCode &status) const
virtual int32_t end (int32_t group, UErrorCode &status) const
virtual UnicodeString group (UErrorCode &status) const

MIME base64 - encode & decode

see e.g. http://php.net/manual/en/function.base64-encode.php and http://www.php.net/manual/en/function.base64-decode.php
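
For orientation, a small base-R sketch of the encoding direction for a single string, with the standard alphabet and "=" padding. `base64_encode` is a hypothetical name; a real implementation would live in C++ and be vectorized.

```r
# Hypothetical sketch: MIME base64 encoding of the bytes of one string.
base64_encode <- function(str) {
  alphabet <- c(LETTERS, letters, 0:9, "+", "/")
  bytes <- as.integer(charToRaw(str))
  if (length(bytes) == 0L) return("")
  pad <- (3L - length(bytes) %% 3L) %% 3L   # bytes missing in the last group
  m <- matrix(c(bytes, integer(pad)), nrow = 3L)  # one column per 3-byte group
  # split each 24-bit group into four 6-bit indices
  idx <- rbind(
    m[1L, ] %/% 4L,
    (m[1L, ] %% 4L) * 16L + m[2L, ] %/% 16L,
    (m[2L, ] %% 16L) * 4L + m[3L, ] %/% 64L,
    m[3L, ] %% 64L
  )
  out <- alphabet[idx + 1L]
  if (pad > 0L) out[(length(out) - pad + 1L):length(out)] <- "="
  paste(out, collapse = "")
}

base64_encode("Man")
base64_encode("Ma")
```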

stri_charcategories

Get the two-letter category name for each Unicode character category identifier
Allow for native, non-UTF-8 encodings

stri_detect_regex

> library(stringi)
> stri_detect_regex("ala ma kota","a")
Error in stri_detect_regex("ala ma kota", "a") : U_MISSING_RESOURCE_ERROR
> stri_detect_regex("ala ma kota","a")
[1] TRUE

Is this the correct behavior? What does this error mean?

stri_subindex("abcde", 1:2) == "ab"
stri_subindex("abcde", c(1,3,5)) == "ace"
stri_subindex("abcde", c(5,5,2)) == "eeb"
stri_subindex("abcde", c(-1,-5)) == "bcd"
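
These four expectations match ordinary R vector indexing applied to the code points of the string, so a base-R reference version is short. `subindex` is a stand-in name and handles a single string only.

```r
# Hypothetical reference version: index the characters of one string the
# same way R indexes a vector (repeats allowed, negative indices drop).
subindex <- function(str, idx) {
  paste(strsplit(str, "")[[1]][idx], collapse = "")
}

subindex("abcde", c(5, 5, 2))
subindex("abcde", c(-1, -5))
```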
