gagolews / stringi
Fast and portable character string processing in R (with the Unicode ICU)
Home Page: https://stringi.gagolewski.com/
License: Other
like str_wrap
it'll be easy :)
stri_paste("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","kot",NA)
*** glibc detected *** /usr/lib/rstudio/bin/rsession: free(): invalid next size (fast): 0x00000000027e4460 ***
*** glibc detected *** /usr/lib/rstudio/bin/rsession: malloc(): memory corruption: 0x00000000027e4480 ***
Split with any of the following. If EOLs are not consistent, generate a warning.
Input: character vector. Output: List of character vectors.
Newline chars according to Unicode TR:
http://www.unicode.org/reports/tr18/ : (?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])
http://www.unicode.org/standard/reports/tr13/tr13-5.html Unicode Technical Report #13, Unicode Newline Guidelines:
Acronym  Unicode    ASCII  EBCDIC 1  EBCDIC 2
CR       000D       0D     0D        0D
LF       000A       0A     25        15
CRLF     000D,000A  0D,0A  0D,25     0D,15
NEL*     0085       85     15        25
VT       000B       0B     0B        0B
FF       000C       0C     0C        0C
LS       2028       n/a    n/a       n/a
PS       2029       n/a    n/a       n/a
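The splitting rule above can be sketched in Python, purely to pin down the intended semantics; this is not stringi code. The name `split_lines` is hypothetical, and it is an assumption that "inconsistent" is judged per input string.

```python
import re
import warnings

# Newline sequences from the table above. CRLF is listed first so that it
# counts as one separator rather than two.
NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_lines(strings):
    """Split each string on any Unicode newline; warn on mixed EOL styles."""
    result = []
    for s in strings:
        # Assumption: consistency is checked within each string separately.
        if len(set(NEWLINE.findall(s))) > 1:
            warnings.warn("inconsistent end-of-line markers")
        result.append(NEWLINE.split(s))
    return result
```

For example, `split_lines(["a\r\nb\r\nc"])` gives `[["a", "b", "c"]]` without a warning, while `"a\r\nb\nc"` is split the same way but triggers one.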
missing functionality - collate_opts=NA
stri_locate_all_charclass::
if (merge) return(ret)
else return(lapply(ret, function(m) {
   if (is.na(m[1,1])) return(m)
   idx <- unlist(apply(m, 1, function(k) k[1]:k[2]))
   matrix(idx, ncol=2, nrow=length(idx),
      dimnames=list(NULL,c('start', 'end')))
}))
Find all/first/last positions of occurrences of substr in str (vectorized over str and substr)
Two of the following tests fail:
expect_identical(stri_detect_regex("aaaaaaaaaaaaaaaa", "aaaaaaaaaaaaaaaa"), TRUE)
expect_identical(stri_detect_regex("aaaaaaaaaaaaaaa", "aaaaaaaaaaaaaaa"), TRUE)
expect_identical(stri_detect_regex("aaaaaaaaaaaaaaaa", "aaaaaaaaaaaaaaa"), TRUE)
It seems that a long pattern makes the ICU regex engine misbehave.
This is a critical BUG - we need a workaround.
stri_startswith & stri_endswith
logical functions, like Java's startsWith and endsWith
Detect string encoding, see http://userguide.icu-project.org/conversion/detection
http://www.icu-project.org/apiref/icu4c/ucsdet_8h.html#details
stri_split_fixed(str, split, omitempty)
- vectorized over str, split and omitempty
DONE
stri_split_fixed(c("A B", "AB"), " ") == list(c("A", "B"), "AB")
stri_split_fixed(c("A B", "A1B"), c(" ", "1")) == list(c("A", "B"), c("A", "B"))
stri_split_fixed("A B1C", c(" ", "1")) == list(c("A", "B1C"), c("A B", "C"))
DONE
omitempty handles this
// sequences of splitting characters - ignore multiple
stri_split_fixed(c("ABA", "ABBABBBAA"), "B") == list(c("A", "A"), c("A", "A", "AA"))
stri_split_fixed(c("ABA", "ABBABBBAA"), "BB") == list("ABA", c("A", "A", "AA"))
// or (?):
stri_split_fixed(c("ABA", "ABBABBBAA"), "BB") == list("ABA", c("A", "A", "BAA"))
// maybe add an argument to control overlapping splitting sequences
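A minimal Python sketch of the semantics listed above, assuming R-style recycling of the shorter argument and a single `omit_empty` flag (the spec also wants `omitempty` vectorized, which is left out here). `split_fixed` is a hypothetical name, and Python's `str.split` stands in for the real search.

```python
def split_fixed(strings, patterns, omit_empty=False):
    """Split strings[i] on patterns[i], recycling the shorter argument."""
    if not strings or not patterns:
        return []
    n = max(len(strings), len(patterns))
    out = []
    for i in range(n):
        s = strings[i % len(strings)]
        p = patterns[i % len(patterns)]
        parts = s.split(p)  # non-overlapping matches, scanned left to right
        if omit_empty:
            # Runs of separators produce empty pieces; drop them on request.
            parts = [piece for piece in parts if piece != ""]
        out.append(parts)
    return out
```

With left-to-right, non-overlapping matching, splitting "ABBABBBAA" on "BB" yields "A", "A", "BAA", which is the second of the two candidate results discussed above.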
http://www.icu-project.org/apiref/icu4c/utf8_8h.html:
if U8_IS_SINGLE(c) is TRUE then we treat c as a single char
U8_NEXT or U8_NEXT_UNSAFE may be used to iterate through the string (char*);
something like (see http://userguide.icu-project.org/strings/utf-8)
for(int i=0; i<length;) {
   UChar32 c; // this is a single UNICODE code point
   U8_NEXT(s, i, length, c);
   // process c
}
now we have only byte comparison, collator code is half-ready
like enc2native
like stri_escape_unicode, but without escaping > 127 codes in UTF
Implement the Knuth–Morris–Pratt algorithm for every search_byte function.
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
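For reference, a compact Python sketch of Knuth–Morris–Pratt over raw bytes. The function names are hypothetical, and this only illustrates the algorithm, not how the search_byte functions are or will be written.

```python
def kmp_failure(pattern: bytes) -> list:
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_find_all(text: bytes, pattern: bytes) -> list:
    """Return 0-based byte offsets of all (possibly overlapping) matches."""
    if not pattern:
        return []
    fail = kmp_failure(pattern)
    hits = []
    k = 0  # number of pattern bytes matched so far
    for i, b in enumerate(text):
        while k > 0 and b != pattern[k]:
            k = fail[k - 1]
        if b == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going so overlapping matches are found
    return hits
```

Because the text index never moves backwards, the search runs in time linear in `len(text) + len(pattern)`. Offsets are in bytes, so for UTF-8 input they would still need converting to code point positions.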
Write a function that generates a password. Add options such as digits, letters, capital, and special to determine which symbols may be used in the generated password.
Inspiration:
http://stackoverflow.com/questions/22219035/function-to-generate-a-random-password
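One possible shape for such a generator, sketched in Python. The option names follow the list above, while `generate_password`, the `length` parameter, and the exact symbol sets are assumptions made for illustration.

```python
import secrets
import string

def generate_password(length=12, digits=True, letters=True,
                      capital=True, special=True):
    """Draw `length` symbols uniformly from the enabled symbol classes."""
    alphabet = ""
    if digits:
        alphabet += string.digits
    if letters:
        alphabet += string.ascii_lowercase
    if capital:
        alphabet += string.ascii_uppercase
    if special:
        alphabet += string.punctuation
    if not alphabet:
        raise ValueError("at least one symbol class must be enabled")
    # secrets uses the OS random source, which suits passwords better
    # than a seeded pseudo-random generator.
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

This version does not guarantee that every enabled class appears in the result; if that is wanted, one symbol per class could be drawn first and the rest filled in and shuffled.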
stri_flatten - join all strings from one character vector into one string (this has already been done)
missing - additional argument (sep? collapse?) which separates the strings (defaults to "")
> chartr("ABC", "abc", "BARA BACA")
[1] "baRa baca"
stri_trim
done
stri_ltrim
done
stri_rtrim
done
stri_trim_all
- trim all whitespace - almost done
NA and character(0) as usual.
stri_trim_all(" a a \t a \n a ")=="a a a a"
stri_trim_all(" ") == ""
stri_trim_all("") == ""
this function uses stri_split_fixed and stri_flatten
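The expected results above amount to "split on whitespace, drop empty pieces, join with a single space". A Python sketch of that reading, with `None` standing in for NA and `trim_all` as a hypothetical name:

```python
def trim_all(strings):
    """Strip leading/trailing whitespace and collapse inner runs to one space."""
    # str.split() with no argument splits on any whitespace run and
    # never returns empty pieces, so joining gives the collapsed string.
    return [None if s is None else " ".join(s.split()) for s in strings]
```

An empty input list gives an empty list back, and missing values pass through unchanged.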
TODO: add ignore_case argument (stri_prepare_arg_string_1) to all regex functions
Activates UREGEX_CASE_INSENSITIVE flag for ICU RegexMatcher, see http://www.icu-project.org/apiref/icu4c/uregex_8h.html#a874989dfec4cbeb6baf4d1a51cb529aea909d2ed2c61e34cb62dc13e29f6923ec
like str_pad
stri_width - determine the width of a UTF-8-encoded string when printed on screen.
e.g. "\u0061\u0328" (61 cc a8) - a + ogonek (ą) - width=1
"\u0105" (c4 85) - a WITH ogonek (ą) - width=1
"\u30b0" (e3 82 b0) - グ - width=2
> stri_totitle("pining for the fjords-yes, i'm brian", "en_US")
[1] "Pining For The Fjords-Yes, I'm Brian"
This is not the correct result for the en_US locale. It should be:
[1] "Pining for the Fjords-Yes, I'm Brian"
Currently I get U_USING_DEFAULT_WARNING when querying BreakIterator::createWordInstance
(even though I use the full ICU data lib)
The question is: is correct title casing possible with "raw" ICU?
Collapse repeated characters of a given charclass (e.g., whitespace) into a single one
> x <- "ala"; stri_sub(x, c(2,1), c(2,1)) <- c("Y", "Z")
> x
[1] "aYa" "Zla"
someday we may wish to add a function that is not vectorized w.r.t. the first element,
so that we obtain "ZYa"
(from/to/length should be sorted increasingly)
OR
the above setting will be strange (non-vectorized? OMG!), so we may accept lists of integer vectors as from/to/length
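The non-vectorized variant could behave roughly as in this Python sketch: all replacements are applied to one string, with ranges given as 1-based inclusive positions as in R. `replace_ranges` is a hypothetical name, and rejecting overlapping ranges is an assumption.

```python
def replace_ranges(s, starts, ends, values):
    """Replace several 1-based inclusive ranges of one string at once."""
    spans = sorted(zip(starts, ends, values))
    for prev, cur in zip(spans, spans[1:]):
        if cur[0] <= prev[1]:
            raise ValueError("ranges must not overlap")
    # Apply from right to left so earlier positions stay valid even when
    # a replacement has a different length than the range it replaces.
    for start, end, value in reversed(spans):
        s = s[:start - 1] + value + s[end:]
    return s
```

With the example above, `replace_ranges("ala", [2, 1], [2, 1], ["Y", "Z"])` returns `"ZYa"`. Sorting inside the function means callers do not have to pass from/to in increasing order.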
allow stri_compare, stri_order, and stri_*_fixed to use all options described in http://www.icu-project.org/apiref/icu4c/ucol_8h.html#a583fbe7fc4a850e2fcc692e766d2826c
If BOM is found in UTF8, warn and remove it
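A short Python sketch of that behaviour on raw bytes; `strip_utf8_bom` is a hypothetical name. The UTF-8 BOM is the byte sequence EF BB BF.

```python
import codecs
import warnings

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte order mark (EF BB BF), warning if found."""
    if data.startswith(codecs.BOM_UTF8):
        warnings.warn("UTF-8 byte order mark found and removed")
        return data[len(codecs.BOM_UTF8):]
    return data
```

Only a BOM at the very start is touched; U+FEFF elsewhere in the data is left alone.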
a function to change default (ie. current) ICU locale
When I read some files with encoding = "auto", the encoding is sometimes detected incorrectly and the resulting variable contains corrupted data.
These functions may be useful when operating on results of stri_locate_*
e.g.
x <- stri_locate_all_charclass(str, "WHITESPACE")
y <- stri_locate_all_charclass(str, "L")
stri_sub(str, stri_ranges_union(x, y)) # extract whitespaces and letters
some options:
these may be auto-detected, i.e. whether we have a list of matrices or a single matrix
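What a union of two sets of ranges might compute, sketched in Python for a single pair of start/end lists (one matrix each, in R terms). The name `ranges_union` mirrors `stri_ranges_union` from the example, but the semantics here are assumed: ranges are inclusive integer positions, and touching or overlapping ranges are merged.

```python
def ranges_union(x, y):
    """Merge two lists of inclusive (start, end) ranges into disjoint ranges."""
    merged = []
    for start, end in sorted(list(x) + list(y)):
        # Positions are integers, so (1, 2) and (3, 3) are adjacent and
        # are merged into (1, 3).
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]
```

A list-of-matrices input would simply apply the same merge element by element.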
see stri_count_fixed / stri_count_fixed_bytes for inspiration
http://www.icu-project.org/apiref/icu4c/usearch_8h.html
int32_t usearch_getMatchedStart (const UStringSearch *strsrch)
int32_t usearch_getMatchedLength (const UStringSearch *strsrch)
int32_t usearch_getMatchedText
stri_sub(s, from, to, length)
- to and length args are mutually exclusive; (?missing)
- to and length means 'to the end of the string'; (?missing)
- s, from, and length. length(from)==length(sub)
- from/to/length indicate Unicode code points, not bytes
- if from is omitted then either to or length must be integer matrices with 2 columns (what do you think?)
allow a different vectorization scheme in stri_replace* that outputs 1 string;
the result is something similar to multiple calls:
for (i in seq_along(setWhat))
str <- stri_replace(str, setWhat[i], setTo[i])
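The same loop in Python, to make the order dependence visible. `replace_sequentially` is a hypothetical name, and fixed (non-regex) patterns are assumed.

```python
def replace_sequentially(s, patterns, replacements):
    """Apply each (pattern, replacement) pair in turn to a single string."""
    if len(patterns) != len(replacements):
        raise ValueError("patterns and replacements must have equal length")
    for pattern, replacement in zip(patterns, replacements):
        # Later pairs see the output of earlier ones.
        s = s.replace(pattern, replacement)
    return s
```

Note that `replace_sequentially("abc", ["a", "b"], ["b", "c"])` gives `"ccc"`, not `"bcc"`: the "b" produced by the first replacement is rewritten by the second. Whether that chaining is desired is part of the design question.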
stri_length - count the number of Unicode code points in a UTF-8-encoded string
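For valid UTF-8, the number of code points equals the number of bytes that are not continuation bytes (those of the form 10xxxxxx). A Python sketch, with `utf8_length` as a hypothetical name and no validation of the input:

```python
def utf8_length(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)
```

This counts code points, not user-perceived characters: "a" followed by a combining ogonek (61 cc a8) has length 2.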
two wrappers for readLines and writeLines - get/set whole content of a text file with possible input encoding auto detection / output reencoding
allow for capturing subgroups of regexps
RegexMatcher::
virtual int32_t start (int32_t group, UErrorCode &status) const
virtual int32_t end (int32_t group, UErrorCode &status) const
virtual UnicodeString group (UErrorCode &status) const
> library(stringi)
> stri_detect_regex("ala ma kota","a")
Error in stri_detect_regex("ala ma kota", "a") : U_MISSING_RESOURCE_ERROR
> stri_detect_regex("ala ma kota","a")
[1] TRUE
Is this correct behavior? What does this error mean?
a function to change default (ie. current) ICU character encoding used.
Also, warn if there will be problems with R after setting this charset (non-ASCII superset, no 1-to-1 UChar conversion, etc.)
Get all positions (bytes) at which we find a Unicode character belonging to a particular General Category or fulfilling a Binary Property
UPDATE: get also the first and last position
stri_subindex("abcde", 1:2) == "ab"
stri_subindex("abcde", c(1,3,5)) == "ace"
stri_subindex("abcde", c(5,5,2)) == "eeb"
stri_subindex("abcde", c(-1,-5)) == "bcd"
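These examples follow R's vector indexing rules applied to the characters of a string: positive indices select, negative indices exclude. A Python sketch of that reading (`subindex` is a hypothetical name; rejecting mixed signs is an assumption borrowed from R):

```python
def subindex(s, idx):
    """Index the characters of s with 1-based, R-style positions."""
    if all(i > 0 for i in idx):
        # Positive indices pick characters in the given order, repeats allowed.
        return "".join(s[i - 1] for i in idx)
    if all(i < 0 for i in idx):
        # Negative indices drop the named positions and keep the rest in order.
        drop = {-i for i in idx}
        return "".join(ch for pos, ch in enumerate(s, start=1) if pos not in drop)
    raise ValueError("indices must be all positive or all negative")
```

Here characters are Python code points; a real implementation would have to iterate over UTF-8 code points rather than bytes.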