Comments (3)

gagolews commented on June 21, 2024

I am afraid this is too specific to be included in stringi.

Perhaps an easier solution?

x <- "test 123 ↓ęœß→óęœ©œ©ęπœęπœ©œπą"
x <- unique(unlist(stringi::stri_enc_toutf32(x)))
x <- x[x>127]
stringi::stri_enc_fromutf32(x)
## [1] "↓ęœß→ó©πą"

from stringi.

discoleo commented on June 21, 2024
  1. Function stri_enc_toutf32 does indeed do the conversion directly. Unfortunately, I am not an expert in the package stringi.
    • it would break if a UTF-64 were ever introduced, but then again this should be an internal implementation detail inside another function and would therefore be invisible to the end user;
  2. Documentation/Examples with stringi::stri_escape_unicode
    • for the last solution: I would include an extra example showing how to convert back to Unicode code points or how to escape the characters with stri_escape_unicode;
    • alternatively, the function could have an option to return either the code points or the characters;
  3. Utility
    I have frequently encountered this situation, both when trying to extract information from articles on PubMed and from various reports (e.g. lab reports). There are probably sufficiently large communities involved in both types of work whose members have only a rudimentary understanding of string encodings. If the (manual) cleaning is too time-consuming, then the most common options are:
    • to exclude those inputs, or to delete any non-ASCII characters (if the user understands a little bit more about ASCII);
    • both approaches are quite sub-optimal: better tools would come in handy for many ordinary users;

gagolews commented on June 21, 2024

So, you probably mean:

x <- "test 123 ↓ęœß→óęœ©œ©ęπœęπœ©œπą"
x <- unique(unlist(stringi::stri_enc_toutf32(x)))
x <- as.list(x[x>127])
stringi::stri_escape_unicode(stringi::stri_enc_fromutf32(x))
## [1] "\\u2193" "\\u0119" "\\u0153" "\\u00df" "\\u2192" "\\u00f3" "\\u00a9" "\\u03c0" "\\u0105"
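
The option suggested earlier in the thread (returning either the code points or the characters) could be sketched as a small wrapper around the same stringi calls. This is only an illustration: `non_ascii_chars` and its `output` argument are hypothetical names and are not part of stringi.

```r
library(stringi)

# Hypothetical helper (not part of stringi): collect the distinct
# non-ASCII characters found in a character vector.
non_ascii_chars <- function(x, output = c("chars", "codepoints", "escaped")) {
  output <- match.arg(output)

  # Decode every string to UTF-32 code points and keep each one once
  cp <- as.integer(unique(unlist(stri_enc_toutf32(x))))

  # ASCII covers code points 0..127; everything above is non-ASCII
  cp <- cp[cp > 127L]
  if (output == "codepoints") return(cp)

  # One list element per code point yields one string per character
  chars <- stri_enc_fromutf32(as.list(cp))
  if (output == "escaped") stri_escape_unicode(chars) else chars
}

x <- "test 123 ↓ęœß→óęœ©œ©ęπœęπœ©œπą"
non_ascii_chars(x)                # the distinct characters themselves
non_ascii_chars(x, "codepoints")  # integer code points
non_ascii_chars(x, "escaped")     # \uXXXX escapes, as in the output above
```

Keeping the code points as the intermediate result means the three output forms share one filtering step, so a different cut-off (for example, keeping only characters outside Latin-1) would need a change in a single place.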
