Dealing with Python3 string-like types zoo about micropython HOT 20 CLOSED

micropython commented on July 22, 2024

Dealing with Python3 string-like types zoo

from micropython.

Comments (20)

chipaca commented on July 22, 2024

As I understand it, micropython is “a lean and fast implementation of the Python 3 programming language”. If that is the case (and if it weren't the case I wouldn't've backed the project), then making it behave differently from Python 3 WRT unicode strings goes directly against that. If unicode strings are really too much, then don't have strings at all -- just bytes would be enough. If I write Python 3 code that uses all the features in micropython, and it produces different output on micropython and on Python 3, then it's no good.

pfalcon commented on July 22, 2024

and it produces different output on micropython and on Python 3, then it's no good.

Well, everything discussed here is implementation details of how to make it possible for micropython to have exactly same output as Python 3, while not using so much memory.

If unicode strings are really too much, then don't have strings at all -- just bytes would be enough.

That indeed makes sense - you don't need unicode to start blinking LEDs. And I kinda hint that it makes sense to recast what's currently implemented as byte strings (we just need to support the "b" prefix).

On the other hand, it would be nice to consider object layouts that support further string types down the road w/o obtrusive redesigns, which is exactly the subject of this ticket.

chipaca commented on July 22, 2024

Well, everything discussed here is implementation details of how to make it possible for micropython to have exactly same output as Python 3, while not using so much memory.

Ah. It reads a bit like you're ranting against Python 3's strings and
proposing micropython's strings work like Python 2's. If that's not
the case, then I'm fine with it.

dpgeorge commented on July 22, 2024

To be clear, my intentions were, are, and will be, to have Micro Python as compatible with Python 3 as possible. At the worst, uPy will be a subset of Python 3, such that if it runs on uPy, it runs on Python 3.

Bytearrays (the mutable one) are very, very useful for microcontrollers (you can use them as a buffer, for example). They have one straightforward implementation (a byte array).

Strings will be unicode, stored as UTF-8 I think is best. Perhaps you could have an option to store them as 32bit wide characters. Note that UTF-8 storage has zero overhead on RAM for ASCII-only strings, compared with just 8bit storage of an ASCII string. If you were really pressed for speed, then you could restrict what you accept as a unicode code-point to lie in the ASCII range (1-127) and implement your unicode_next() function as simply a pointer increment.
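
A minimal Python sketch of the two stepping strategies described above. The names `utf8_next` and `ascii_next` are hypothetical stand-ins for the `unicode_next()` function mentioned in the comment, not MicroPython's actual API, and the input is assumed to be valid UTF-8.

```python
def utf8_next(buf: bytes, i: int) -> int:
    """Return the index of the code point that follows the one starting at i."""
    i += 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip over them.
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i


def ascii_next(buf: bytes, i: int) -> int:
    """ASCII-only build: every character is one byte, so just step by one."""
    return i + 1


ascii_text = "blink"
mixed_text = "na\u00efve"  # contains one non-ASCII character (U+00EF)

# ASCII-only text occupies exactly one byte per character in UTF-8.
ascii_bytes = ascii_text.encode("utf-8")
mixed_bytes = mixed_text.encode("utf-8")

# Walk the mixed string one code point at a time, recording start offsets.
starts = []
pos = 0
while pos < len(mixed_bytes):
    starts.append(pos)
    pos = utf8_next(mixed_bytes, pos)
```

For the ASCII string, the UTF-8 byte length equals the character count, which is the "zero overhead" case; the mixed string needs the continuation-byte check to find character boundaries.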

piranna commented on July 22, 2024

If you were really pressed for speed, then you could restrict what you
accept as a unicode code-point to lie in the ASCII range (1-127) and
implement your unicode_next() function as simply a pointer increment.

Would it make sense to have this as a compile-time option, as a particular
optimization? It seems easy to do and doesn't break compatibility (up to some
degree...) with CPython 3.3... 99% of string operations in MicroPython could
be done in the ASCII range, so they could take advantage of this
optimization, and if fully compliant Unicode strings are required, just
disable the flag and you are good to go.

"If you want to travel around the world and be invited to speak at a lot
of different places, just write a Unix operating system."
– Linus Torvalds, creator of the Linux operating system

dpgeorge commented on July 22, 2024

Yes, it would be a compile time optimisation, enabling you to disable unicode without changing any of the string handling framework.

pfalcon commented on July 22, 2024

@dpgeorge: Sure, that's all more or less clear. This ticket goes further and contemplates how to implement all that. So, do you agree that the 8 string-like types listed above need to be represented? Or more? Or fewer? Do you agree that it makes no sense to try to fit the distinction among them into the tag bits of mp_obj_t? Then, do you agree that it makes sense to take a few bits away from the var-length string size encoding to store those bits in the same byte? (Well, a recursive question: do you agree that it makes sense to use var-length encoding for string size? - I don't remember your ack for that in #8, but as you said, you already use varlen for qstr handle encoding, so I wouldn't think you would object to it.)

And if you just skimmed thru the description text and don't have time for this so far, no worries - I just wanted to record my ideas before holiday time is over and I need to go back to work stuff and thus risk losing some details.
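
One possible reading of "take a few bits away from the var-length size encoding", sketched in Python. The bit layout here is purely an assumption for illustration (it is not specified in the thread): in the first byte, bit 7 is a continuation flag, bits 6-4 hold a 3-bit type tag (enough for 8 string-like types), and bits 3-0 hold the low 4 bits of the size; each following byte carries 7 more size bits plus a continuation flag. `encode_header` and `decode_header` are hypothetical names.

```python
TAG_BITS = 3         # assumed: 3 bits distinguish 8 string-like types
FIRST_SIZE_BITS = 4  # size bits that remain in the first byte


def encode_header(tag: int, size: int) -> bytes:
    """Pack a type tag and a variable-length size into a byte header."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag out of range")
    if size < 0:
        raise ValueError("size must be non-negative")
    first = (tag << FIRST_SIZE_BITS) | (size & 0x0F)
    size >>= FIRST_SIZE_BITS
    if size:
        first |= 0x80  # more size bytes follow
    out = bytearray([first])
    while size:
        byte = size & 0x7F
        size >>= 7
        if size:
            byte |= 0x80
        out.append(byte)
    return bytes(out)


def decode_header(buf: bytes):
    """Return (tag, size, header_length) for a header made by encode_header."""
    first = buf[0]
    tag = (first >> FIRST_SIZE_BITS) & 0x07
    size = first & 0x0F
    shift = FIRST_SIZE_BITS
    pos = 1
    more = first & 0x80
    while more:
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
        more = byte & 0x80
    return tag, size, pos
```

Under this assumed layout, strings shorter than 16 bytes still need only a single header byte for both the tag and the size.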

pfalcon commented on July 22, 2024

For the 99% of string operations with MicroPython they could be done on the ASCII range

I don't know where that 99% comes from. Anyone whose language is not based on Latin script will tell you that's not true. Heck, even my HD7780 LCD has chargen with 256 chars, with Cyrillic or Chinese symbols.

On the other hand, it's possible to do 80% of all operations on any encoding (including Unicode) with just 8bit-clean strings. If you add strlen operation for a particular encoding - as a separate function - you can cover 90%. Add substr for a particular encoding and that covers 95% of all operations. And that all without forcing a particular encoding on everyone (even if it's Unicode). But Python3 forced Unicode on everyone, and now for MicroPython that choice needs to be worked around, and I personally don't find workaround of forcing ASCII as "optimization" to be good at all.
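
A rough Python sketch of that approach, assuming the data is valid UTF-8: the string itself stays an 8-bit-clean byte sequence, and the encoding-aware length and substring operations are separate functions. `utf8_strlen` and `utf8_substr` are hypothetical names chosen for illustration.

```python
def utf8_strlen(data: bytes) -> int:
    """Count code points by counting bytes that are not continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_substr(data: bytes, start: int, count: int) -> bytes:
    """Return `count` code points beginning at code point index `start`.

    Assumes start >= 0 and count >= 0; out-of-range values are clamped.
    """
    # Byte offsets at which each code point begins, plus an end sentinel.
    offsets = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    offsets.append(len(data))
    last = len(offsets) - 1
    begin = min(start, last)
    end = min(begin + count, last)
    return data[offsets[begin]:offsets[end]]


# "Privet" (Russian for "hello") in Cyrillic, written with escapes.
word = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")

# Encoding-agnostic operations work directly on the bytes.
greeting = word + b"!"

char_count = utf8_strlen(word)
middle = utf8_substr(word, 1, 3)
```

Concatenation, comparison and storage never look inside the bytes; only the two helper functions need to know the encoding.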

piranna commented on July 22, 2024

For the 99% of string operations with MicroPython they could be done on
the ASCII range

I don't know where that 99% comes from. Anyone whose language is not based
on Latin script will tell you that's not true. Heck, even my HD7780 LCD has
chargen with 256 chars, with Cyrillic or Chinese symbols.

I admit 99% is an arbitrary number, but your 80% is totally acceptable.
English is the current lingua franca, and it can be represented with just
ASCII characters. Except when dealing directly with localized UIs, it's
perfectly acceptable to use only English (ASCII) for the strings used
inside MicroPython. If full Unicode is needed, you could check whether
the speed is acceptable, or whether it makes more sense to generate the
localized strings outside of the microcontroller (for example, dispatch JSON
objects and generate the localized text on a desktop, or dispatch the English
strings and translate them later with gettext).

dpgeorge commented on July 22, 2024

@pfalcon yes, I agree to using varlen encoding in qstrs.

Can't qstrs be used to just store an array of bytes? And not care about the encoding? A qstr would be: (bytelen,hash,data), with bytelen encoded as a variable length integer. The data is just bytes.

Different string representations (ASCII, UTF-8, 16-bit, 32-bit) would then be represented by the Python object. You would probably just compile with 1 particular representation.

bytes are always represented as a string of bytes, and can use the same qstr API, since qstr just stores data.

bytearray will be a mutable array of bytes, and not use qstr.
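
A minimal Python sketch of the (bytelen, hash, data) layout described above. Several details are assumptions, not taken from the MicroPython source: the length uses a little-endian base-128 encoding, the hash is truncated to one byte, and `hash8` is an arbitrary djb2-style function chosen for illustration. All function names are hypothetical.

```python
def encode_varlen(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varlen(buf: bytes, pos: int = 0):
    """Return (value, position just past the encoded integer)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def hash8(data: bytes) -> int:
    """Illustrative 8-bit hash over raw bytes (djb2-style, truncated)."""
    h = 5381
    for b in data:
        h = ((h * 33) ^ b) & 0xFFFFFFFF
    return h & 0xFF


def pack_qstr(data: bytes) -> bytes:
    """Lay out a record as varlen(bytelen) + 1-byte hash + raw data."""
    return encode_varlen(len(data)) + bytes([hash8(data)]) + data


def unpack_qstr(buf: bytes):
    """Return (bytelen, stored_hash, data) from a packed record."""
    length, pos = decode_varlen(buf)
    stored_hash = buf[pos]
    data = buf[pos + 1:pos + 1 + length]
    return length, stored_hash, data
```

Because the record carries an explicit length, the data may hold any bytes, including embedded zero bytes, and the same packing serves both str and bytes objects.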

pfalcon commented on July 22, 2024

Can't qstrs be used to just store an array of bytes? Different string representations (ASCII, UTF-8, 16-bit, 32-bit) would then be represented by the Python object.

Bingo! You know, one of the first headers I peered into was parse.h and its tagged pointers, and it seems I had a strange mix-up in my head about how things were represented. Of course, at the Python level all objects start with mp_obj_base_t and that's how they're differentiated and interpreted, so there's no need (likely ;-) ) to fit tag bits into qstr.

Then, should the hash be part of the qstr (and not part of the Python object)? Well, the reason I fitted it there is that qstr is used at a level deeper than Python objects, where quick search and comparison are still required. The problem here is that the hash should depend on the semantics of the object: for Unicode, for example, the individual (32-bit) chars should be hashed, so that the hash of a UTF-8 string and of a UTF-16 string with the same chars is the same. Well, we can accommodate that by passing the hash value to store from the upper levels. With all that, there's still a benefit to storing the hash in qstr, because there we allocate 1 byte, whereas anything added to mp_obj_*_t would likely take 4 bytes due to struct alignment.
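
To illustrate the point about hashing by semantics rather than by storage, here is a small Python sketch. The function `codepoint_hash` and its djb2-style mixing are assumptions made for illustration; the thread does not specify a hash function.

```python
def codepoint_hash(raw: bytes, encoding: str) -> int:
    """Hash the decoded code points, so the storage encoding does not matter."""
    h = 5381
    for ch in raw.decode(encoding):
        h = ((h * 33) ^ ord(ch)) & 0xFFFFFFFF
    return h


text = "\u043c\u0438\u0440"  # three Cyrillic letters ("mir")

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

hash_from_utf8 = codepoint_hash(as_utf8, "utf-8")
hash_from_utf16 = codepoint_hash(as_utf16, "utf-16-le")
```

The two byte sequences differ, but since the hash is computed over code points, both representations yield the same value; an upper layer could compute it this way and pass it down for storage.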

bytearray will be a mutable array of bytes, and not use qstr.

Well, I kind of thought about making a generic storage infrastructure for string-like types. A bytearray is exactly like bytes, except that it's mutable (and I use qstr in a somewhat looser manner than it is used now; in particular, qstr doesn't imply "interned", per your idea of supporting both interned and non-interned).

pfalcon commented on July 22, 2024

>>> b = bytearray(b"123")
>>> hash(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'bytearray'

Crazy. So, there's not much use in storing bytearray in qstr - I thought we could cache the hash in it (and support lazy hashing), but hashing isn't supported for it at the language core level. And given:

>>> b.append(1)
>>> b
bytearray(b'123\x01')

, bytearray would rather be a type with both size and allocsize fields (and varlen for size would just complicate stuff).
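
A minimal Python sketch of a buffer with separate size and allocsize fields. The class `GrowableBytes`, its doubling growth policy and its minimum capacity of 4 are assumptions made for illustration; the internal Python `bytearray` merely stands in for a raw allocated block.

```python
class GrowableBytes:
    """Mutable byte buffer that tracks used size and allocated size."""

    def __init__(self, initial: bytes = b""):
        self.size = len(initial)               # bytes in use
        self.allocsize = max(self.size, 4)     # bytes allocated (assumed minimum)
        self._buf = bytearray(self.allocsize)  # stand-in for a raw allocation
        self._buf[:self.size] = initial

    def append(self, value: int) -> None:
        if not 0 <= value <= 255:
            raise ValueError("byte must be in range(0, 256)")
        if self.size == self.allocsize:
            # Out of room: double the allocation and copy the used part.
            self.allocsize *= 2
            grown = bytearray(self.allocsize)
            grown[:self.size] = self._buf[:self.size]
            self._buf = grown
        self._buf[self.size] = value
        self.size += 1

    def to_bytes(self) -> bytes:
        return bytes(self._buf[:self.size])


buf = GrowableBytes(b"123")
buf.append(1)
```

With spare capacity tracked in allocsize, most appends only write one byte and bump size, which is why a fixed-width size field is simpler here than a variable-length one.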

dpgeorge commented on July 22, 2024

Yeah, bytearray is really more like list than str.

dpgeorge commented on July 22, 2024

How about this CPython behaviour for a null character in a string:

>>> t = type('a\x00b', (object,), {})
>>> t
<class '__main__.ab'>
>>> t()
<__main__.ab object at 0x7f7022e13810>
>>> t.member
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'a' has no attribute 'member'

Thus, while CPython allows you to create a type whose name has a null character in it, and it prints such a name correctly in some cases (the first 2 outputs with __main__.ab) it prints it as an ASCIIZ string in other cases (the error it throws, it just prints a).

If we allow qstrs to have null characters, then everywhere we convert qstr to its representation we will need to account for the fact it's a pointer and length.

A good option is to make a custom printf format specifier for qstrs that handles them correctly (or just use %.*s).
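
The difference between the two views of the same data can be emulated in Python. This is only a sketch: `as_c_string` imitates what a NUL-terminated (ASCIIZ) consumer would see, `as_counted_string` imitates a pointer-plus-length consumer, and both names are hypothetical.

```python
def as_c_string(data: bytes) -> bytes:
    """What a consumer that stops at the first zero byte would see."""
    end = data.find(b"\x00")
    return data if end < 0 else data[:end]


def as_counted_string(data: bytes, length: int) -> bytes:
    """What a consumer that trusts an explicit length would see."""
    return data[:length]


name = b"a\x00b"

truncated = as_c_string(name)                # stops at the embedded zero byte
complete = as_counted_string(name, len(name))  # keeps all three bytes
```

This mirrors the CPython session above, where the same type name appears both in full and cut off at the null character depending on which code path prints it.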

Arachnid commented on July 22, 2024

On Thu, Jan 2, 2014 at 10:48 PM, Paul Sokolovsky wrote:

For the 99% of string operations with MicroPython they could be done on
the ASCII range

I don't know where that 99% comes from. Anyone whose language is not based
on Latin script will tell you that's not true. Heck, even my HD7780 LCD has
chargen with 256 chars, with Cyrillic or Chinese symbols.

On the other hand, it's possible to do 80% of all operations on any
encoding (including Unicode) with just 8bit-clean strings. If you add
strlen operation for a particular encoding - as a separate function - you
can cover 90%. Add substr for a particular encoding and that covers 95% of
all operations. And that all without forcing a particular encoding on
everyone (even if it's Unicode). But Python3 forced Unicode on everyone,
and now for MicroPython that choice needs to be worked around, and I
personally don't find workaround of forcing ASCII as "optimization" to be
good at all.

Unicode isn't an encoding; UTF-8 is an encoding of Unicode codepoints, and
(retrospectively) ASCII is too, of a small subset of unicode codepoints.

There's a lot of English-centric discussion going on here. It's important
to recognise that not 100% (not even 50%) of the world speaks English.
Unicode support should be a basic feature of a modern programming language,
and personally I'd want to see some pretty firm figures on overhead from
unicode string support before any "size matters" argument holds sway - so
far there's nothing but a lot of hand waving.

-Nick Johnson


pfalcon commented on July 22, 2024

How about this CPython behaviour for a null character in a

Well, that's pretty edge case ;-). I guess it comes from dichotomy that qstr (and their CPython analog) is used to represent not just arbitrary strings, but also identifiers as used in language syntax. Of course, identifiers have other constituent character requirements.

So, well, we can cheat/overlook how we print qstrs representing identifiers, but of course not user data. I had the idea of using the "%.*s" syntax too, but it seems it only allows limiting the length, not extending it beyond \0:

printf("%.*s|\n", 5, "ab\0cd");

gives me:

ab|

So, a custom printf formatter may be an interesting idea. (Though I'm still not sure I understand how you handle the repr() vs str() difference.)

dpgeorge commented on July 22, 2024

Saw this, regarding putting back % operator for bytes and bytearray:

https://mail.python.org/pipermail/python-dev/2014-March/133621.html

Thought immediately of @pfalcon :)

pfalcon commented on July 22, 2024

@dpgeorge: Great news! ;-) Opened #403 to cover that.

pfalcon commented on July 22, 2024

Btw, I wanted to mention that I kinda feel that we should keep trailing null byte for str/bytes around ~forever. Motivation: interoperability with native C APIs. Ref: https://mail.python.org/pipermail/python-dev/2014-April/134398.html

pfalcon commented on July 22, 2024

We now have proper support for bytes and unicode strings, closing this.
