Hi, this little iteration would be nice to always have at hand when

Encoding unicode within nested structures (like from json.loads) about boltons HOT 6 CLOSED

from boltons.

Comments (6)

mahmoud commented on May 18, 2024

Sorry for the delay, I've been traveling lately.

Thanks for the submission! I certainly feel the struggle with the bytestring and unicode split. That said, I think JSON is uncontroversially not a byte-oriented serialization format. The lack of native support for bytes is actually one criticism levelled against JSON. All strings are text, and unicode is Python 2's text-representing object.

I am, however, curious where there are specific areas where the bytes are more useful than the text.

AXGKl commented on May 18, 2024

Hey Mahmoud,

I understand your refusal to put this in, since it is in line with the 'official' line of communication, pointing to Ned Batchelder's talk and the Py3 text handling, which is unicode only.

> I am, however, curious where there are specific areas where the bytes are more useful than the text.

The problem with that is that for many, many applications outside of human text processing, the Py2 string type is far better suited, not to say the only feasible way of working. I'm talking specifically about processing standardised payloads, where unicode is pretty much not present, e.g. for inter-system communication (covering megatrends like Internet of Things, M2M, ...).

I'm planning to write a bigger blog post on this sometime. For now I can only offer Armin Ronacher's many blog posts about the problem with Unicode and a little mini-rant of mine recently here: https://lwn.net/Articles/641808/ as a justification of this claim. Like Armin, we see Py2 as the better language, because of its far superior possibilities to work with human text AND standardized text. 'Modern' languages like Go or Rust are also built on that paradigm. Check also the UTF-8 Manifesto regarding Python 3...

So: IF applications prefer, for those reasons, to keep internal text as byte strings, then a JSON deserializer into bytes would be pretty nifty to always have at hand.

JSON is everywhere for microservices and other communication, and due to its origins in JS, which has its roots as an enhancer for human text (HTML) output, it is unicode only. It became a de facto standard NOT because identifiers could be unicode but because of its arbitrary key / nested structures.

Let me add another point: IMHO, if they had not refused so violently to move the default encoding away from ASCII in Py2, then all the infamous complaints about Py2's "broken" unicode handling, especially from the English-speaking world, would not exist. But that's just a side note ;-)

mahmoud commented on May 18, 2024

I'm a longtime user and fan of Python 2 myself, Gunther. In fact, the initial public release of boltons didn't even have Python 3 compatibility out of the box.

Which quickly brings me to my first and only point: this is not a Python 2 vs Python 3 issue. JSON simply does not ever store bytes. It only supports ints, floats, lists, maps, nulls, and text.

Furthermore, JSON can be encoded as UTF-8, UTF-16, or UTF-32, meaning blanket re-encoding of the text objects may not only be an undue performance hit, it may also not give you back the same bytes you received.
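
To make the re-encoding point concrete, here is a minimal sketch. It is written for Python 3 (an assumption; the thread itself is about Python 2), and the payload and variable names are made up for illustration. The same JSON text is put on the wire as UTF-16, parsed, and then blanket-encoded to UTF-8:

```python
import json

# Hypothetical payload: the same JSON text as a sender might put it on the
# wire in two different encodings.
text = u'{"name": "caf\u00e9"}'
wire_utf16 = text.encode("utf-16-le")
wire_utf8 = text.encode("utf-8")

# The receiver decodes the UTF-16 wire bytes and parses the resulting text.
parsed = json.loads(wire_utf16.decode("utf-16-le"))

# Blanket re-encoding to UTF-8 produces bytes that never appeared on the
# UTF-16 wire; they only match what a UTF-8 sender would have sent.
reencoded = parsed["name"].encode("utf-8")
print(reencoded in wire_utf16)  # False
print(reencoded in wire_utf8)   # True
```

So the bytes you end up with depend on the encoding you pick afterwards, not on what was actually received.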

I feel your pain. The str type is a good one, and I've fought battles trying to protect its legacy too, but this isn't one of them. If this were a generic recursive map()-like helper for dicts that would be one thing, but specific to the context it's just not correct.

markrwilliams commented on May 18, 2024

No matter what the default Unicode policy is in Python, there's a strict one in JSON.

You can't pretend that JSON can represent arbitrary sequences of bytes:

>>> import json
>>> json.dumps({'a': '\xc3'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: invalid continuation byte

If you do want to put strings that have no defined encoding (binary data) into a JSON object, you must first encode them into something that can be encoded as UTF-8 (or UTF-16 or UTF-32).

It turns out sending binary data through mechanisms that can't understand it is an old problem with an old solution:

>>> import base64
>>> json.dumps({'a_base64': base64.b64encode('\xc3')})
'{"a_base64": "ww=="}'

But to get your data back, you have to know which fields were base64 encoded. Unfortunately you can't really do this with as simple a recursive function as you linked; you need something a little more complicated.
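
One way that "a little more complicated" could look is sketched below. It is written for Python 3 (an assumption) and relies on a hypothetical naming convention, borrowed from the `a_base64` key above: any dict key ending in `_base64` holds base64 text. The names `pack`, `unpack`, and `B64_SUFFIX` are invented for this sketch and are not part of boltons or the json module.

```python
import base64
import json

B64_SUFFIX = "_base64"  # hypothetical convention marking base64 fields


def pack(obj):
    """Replace bytes values in dicts with base64 text under a suffixed key."""
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            if isinstance(value, bytes):
                out[key + B64_SUFFIX] = base64.b64encode(value).decode("ascii")
            else:
                out[key] = pack(value)
        return out
    if isinstance(obj, list):
        return [pack(item) for item in obj]
    return obj


def unpack(obj):
    """Reverse pack(): decode every string stored under a suffixed key."""
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            if key.endswith(B64_SUFFIX) and isinstance(value, str):
                out[key[:-len(B64_SUFFIX)]] = base64.b64decode(value)
            else:
                out[key] = unpack(value)
        return out
    if isinstance(obj, list):
        return [unpack(item) for item in obj]
    return obj


payload = {"id": 7, "frames": [{"raw": b"\xc3"}]}
wire = json.dumps(pack(payload))
restored = unpack(json.loads(wire))
print(restored == payload)  # True
```

The sketch only handles bytes stored as dict values; bytes sitting directly inside a list have no key to carry the marker, which is exactly the kind of schema knowledge a blind recursive function cannot supply.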

By now this comment probably reads a lot like Armin Ronacher's "Unicode in Python 3" blog posts, which you mentioned. That's not a coincidence! It's the same problem but with JSON instead of Python 3. And implicit decoding of binary data as it crosses system boundaries, which this issue would add, would make the situation more like what Armin complained about. That's the real problem addressed here -- Python 3 attempts to guess the encoding of, say, stdin when there may not even be one!

It's best to not commit the same error again, but with JSON.

AXGKl commented on May 18, 2024

Hey Mahmoud, although you dismissed it I have a smile on my face: so that issue on bugs.python was from you. That was great!!

I'm planning to mount yet another "attack" on their (IMHO) crazy decision to decode all input at ingress without need and without knowing the application and the industry standard of the payload.

The more I read about the history of that decision, the more I realise that it was based on problems with a historic engineering f*up from the Windows world: applying IBM's idea of using byte values > 127 for drawing fancy menus to the domain of human text, encoding symbols with semantic meaning, yet not putting the encoding inline but storing it in localised versions of the OS itself. Crazy, because such text is usually stored long term and hits other programs in the future. I could write for three hours about that, and I'll set up a blog once I find the time ;-)
In any case that problem is meanwhile nearly overcome, thank god, but still we find Py3's core architecture severely fubared because of it. Which is a nightmare, since they don't even seem to understand that fact, judging by the replies to your issue.

Regarding JSON: yes, all the red ones here (http://en.wikipedia.org/wiki/UTF-8#Codepage_layout) can't be encoded, due to the fact that JSON is not a real standardised transfer protocol but a de facto standard from a language originally created to enhance human texts: JavaScript. But it's OK, we use base64 whenever we intend to transmit raw binary via JSON.

I think my request was misunderstood, due to a misleading title:
I was just asking for a boltons recursive encoder function for nested structures full of unicode (which json.loads returns). Something like recursive_encode(m, encoding='utf-8'). I'll try to change the title; I just learned that a SOAP lib we use also delivers unicode structures, so it's not only for JSON.
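
A pure-Python sketch of what such a helper might look like is below. It uses Python 3 syntax, where `str` is the text type (under Python 2 the check would be against `unicode` instead). Only the name `recursive_encode` and its `encoding` parameter come from the request above; the treatment of dict keys, lists, and tuples is an assumption, not the behaviour of any existing boltons function.

```python
import json


def recursive_encode(obj, encoding="utf-8"):
    """Return a copy of obj with every text string encoded to bytes."""
    if isinstance(obj, str):
        return obj.encode(encoding)
    if isinstance(obj, dict):
        # Keys are encoded as well, since json.loads returns text keys.
        return {
            recursive_encode(key, encoding): recursive_encode(value, encoding)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [recursive_encode(item, encoding) for item in obj]
    if isinstance(obj, tuple):
        return tuple(recursive_encode(item, encoding) for item in obj)
    # Numbers, booleans, None and existing bytes pass through unchanged.
    return obj


decoded = json.loads('{"a": ["b", 1, {"c": "\\u00e9"}], "d": null}')
encoded = recursive_encode(decoded)
print(encoded)
```

Note that this is a plain transformation of text objects after parsing; it makes no claim about which bytes were originally on the wire.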

PS: Sorry for writing PEP when I meant Issue ;-) Just learned that in my organization a nested_encode function implemented in C exists, with just a 5% performance overhead compared to json.loads for smaller structures. We are building a pip install package from that, will let you know when ready.

AXGKl commented on May 18, 2024

Hi, here is our nested encoder:

https://github.com/axiros/nested_encode

Performance overhead is around 10%; here is a stupid test run:

http://stackoverflow.com/a/30030529/4583360
