
Comments (5)

paoloambrosio commented on July 19, 2024

To me, "D\u00C4\u0085\u00C5\u009Bl\u00C4\u0099\u00C5\u00BCyn\u00C3\u00B3w,Oslo,Stockholm" looks like the correct JSON encoding of "Dąślężynów,Oslo,Stockholm". I don't understand why Cucumber is sending rubbish back. I'll look into it, possibly later today.

from cucumber-cpp.

PiotrNakonczy-TomTom commented on July 19, 2024

Actually, I think this encoded string is already wrong: look at the number of encoded characters.
For example, the letter 'ą' is 'LATIN SMALL LETTER A WITH OGONEK' according to http://www.utf8-chartable.de/unicode-utf8-table.pl?start=256 and its code point should be U+0105, but it gets interpreted as two separate characters, u00C4 and u0085 (the two bytes of its UTF-8 encoding, 0xC4 0x85).
This leads me to think that the problem may appear when reading Unicode text from an iostream or something similar.
Basically, the iswprint() function call does not recognize 'ą' and similar letters as printable characters.
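
To make the byte-versus-code-point mismatch concrete, here is a minimal sketch of a UTF-8 decoder. It is not code from cucumber-cpp or JSON Spirit; the name decodeUtf8 is hypothetical, and the decoder deliberately skips checks for overlong encodings and surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points.
// Sketch only: rejects bad lead bytes, truncated sequences and bad
// continuation bytes, but does not reject overlong forms or surrogates.
std::vector<unsigned long> decodeUtf8(const std::string& bytes) {
    std::vector<unsigned long> codePoints;
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        codePoints.push_back(cp);
        i += extra + 1;
    }
    return codePoints;
}
```

Decoding the two bytes 0xC4 0x85 this way yields the single code point 0x105, whereas treating each char as its own character yields the two values 0xC4 and 0x85 seen in the escaped string.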


paoloambrosio commented on July 19, 2024

You are right: it should have been "D\u0105..."


paoloambrosio commented on July 19, 2024

The bug looks like a JSON Spirit issue in handling Unicode characters when using 8-bit characters. This test shows the library's behavior:

TEST(JsonSpiritTest, handlesUnicodeOnlyIfWideChars) {
    EXPECT_EQ(L"\"\\u9EC4\\u74DC\"", json_spirit::write_string(wmValue(L"\u9EC4\u74DC"), false));
    EXPECT_NE("\"\\u9EC4\\u74DC\"", json_spirit::write_string(mValue("\u9EC4\u74DC"), false));
}

Unfortunately the JSON serialization code in CukeBins is ugly, so I'll try to refactor it while fixing the bug.


paoloambrosio commented on July 19, 2024

I have done a few tests. The components that might have problems are the wire protocol codec (currently using JSON Spirit) and the regular expression matcher. C++ support for Unicode and regular expressions has been standardized only in C++0x, which is still not an option. In C++03 the source code encoding is ASCII only, so even CukeBins regular expressions should be encoded using \u escapes and wide strings. Please note that MSVC is an example of a compiler where wchar_t is 16 bits, so using wide strings would not solve the problem. My proposal for the moment is to treat every char (8-bit) sequence as UTF-8, which both MSVC and GCC handle, and...

JSON Spirit

  • convert every string to wchar_t before decoding or encoding
  • if unicode support is disabled, fail on non-ASCII codes

Boost 1.48+ comes with the new Locale library, which handles UTF quite well. I still haven't come to a conclusion on how to deal with the conversion without ICU or Boost Locale. I might introduce a new dependency on ICU for full Unicode support (with any Boost version), or on Boost 1.48+ without ICU for partial support.

Edit: since JSON \u escapes are UTF-16/UCS-2 code units, as in JavaScript, and since we don't care about counting characters, surrogate pairs should not be a problem, so I removed the case where wchar_t can't hold UTF-32 code points.
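
As an illustration of why surrogate pairs fit naturally into JSON escapes, here is a minimal sketch that turns one code point into its \uXXXX form, splitting code points above U+FFFF into a high and a low surrogate. This is not the JSON Spirit API; the names appendEscape and jsonEscapeCodePoint are hypothetical, and escaping every non-ASCII code point is an assumption made for the example.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Append one \uXXXX escape for a single 16-bit code unit.
void appendEscape(std::string& out, unsigned long unit) {
    char buf[7];  // backslash, 'u', four hex digits, terminating NUL
    std::snprintf(buf, sizeof buf, "\\u%04X",
                  static_cast<unsigned int>(unit & 0xFFFFu));
    out += buf;
}

// Escape one Unicode code point for use inside a JSON string.
// Printable ASCII (except '"' and '\\') is copied through unchanged;
// everything else becomes one or two \uXXXX escapes.
std::string jsonEscapeCodePoint(unsigned long cp) {
    if (cp > 0x10FFFF)
        throw std::out_of_range("not a Unicode code point");
    std::string out;
    if (cp >= 0x20 && cp < 0x7F && cp != '"' && cp != '\\') {
        out += static_cast<char>(cp);
    } else if (cp < 0x10000) {
        appendEscape(out, cp);
    } else {
        const unsigned long v = cp - 0x10000;  // 20 bits, split 10/10
        appendEscape(out, 0xD800 + (v >> 10));     // high surrogate
        appendEscape(out, 0xDC00 + (v & 0x3FF));   // low surrogate
    }
    return out;
}
```

With this scheme 'ą' (0x105) becomes \u0105, and a code point outside the Basic Multilingual Plane such as 0x1F600 becomes the pair \uD83D\uDE00, so a 16-bit code unit is enough on the encoding side.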

Regex

  • use boost::u32regex if Boost is compiled with ICU support
  • use boost::wregex if wchar_t is 16 bits and fail on surrogate pairs
  • use boost::regex and fail on non-ASCII codes

Here is a brief explanation of Boost Regex Unicode support.
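
The two "fail on" fallbacks in the list above come down to simple input checks before handing a string to the matcher. A minimal sketch follows, using only the standard library; the helper names isAscii and containsSurrogate are hypothetical and not part of Boost or cucumber-cpp.

```cpp
#include <cstddef>
#include <string>

// True if every byte is 7-bit ASCII, i.e. a plain byte-oriented regex
// cannot split a multi-byte UTF-8 sequence.
bool isAscii(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (static_cast<unsigned char>(s[i]) >= 0x80)
            return false;
    }
    return true;
}

// True if the wide string contains a UTF-16 surrogate code unit
// (0xD800..0xDFFF), which a regex working on 16-bit wchar_t would
// treat as two independent characters.
bool containsSurrogate(const std::wstring& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned long unit = static_cast<unsigned long>(s[i]);
        if (unit >= 0xD800 && unit <= 0xDFFF)
            return true;
    }
    return false;
}
```

A caller could reject the step text with an error when isAscii returns false in the plain-regex configuration, or when containsSurrogate returns true in the 16-bit wide-regex configuration.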

