
Comments (39)

pfalcon commented on July 22, 2024

Btw, did you consider design where some strings are interned and some not? I can say from my experiments (with other language) that non-interning design definitely spends more memory (there are lot of short strings which are used again). On the other hand, common sense says that wasting time to hash random strings destined just for printing makes no sense...

from micropython.

dpgeorge commented on July 22, 2024

The implementation of qstr (unique strings) and its use needs thought.

Using qstr's to store the string of an actual Python string object is a quick hack, really. We can change it so that Python string objects are not interned (turned into a qstr).

I wanted to have it so that the initial qstr's can be put in read-only memory. That is, have part of the qstr table fixed at compile time. This basically implements enums but with a string representation.

I'm brave enough to write a custom preprocessor if needed!

pfalcon commented on July 22, 2024

Using qstr's to store the string of an actual Python string object is a quick hack, really.

Well, string interning has been a common design pattern since the early LISP interpreters, and to the present day - as http://en.wikipedia.org/wiki/String_interning shows, a lot of modern languages keep using it. More importantly for our case, unbloated languages like Lua and Squirrel also use it. So, interning in MicroPython is definitely not a quick hack, but adherence to best practices ;-).

I did some testing on Squirrel (I spent the entire summer hacking on it, and that's after ~2 years of deciding whether I should bend myself into Stockholm syndrome for Lua, try to continue with TinyPy or PyMite, or something else - yeah, I wish MicroPython had come up a bit earlier ;-) ), and found that not interning strings leads to higher memory usage just at startup (i.e. just for initing method tables of objects and such) - something I didn't expect to be so noticeable even for interpreter startup. So, common sense says that interning is not always useful, but getting rid of it completely is unlikely to be a good idea.

I wanted to have it so that the initial qstr's can be put in read-only memory.

Great, then you understand my worry about supporting both hashability of strings and supporting it for static strings too. I wish there were a language which could sanely represent static strings, arrays, and dictionaries (static support for dictionaries is definitely needed, because that's how object methods/fields are supposed to be stored - though I didn't review that part of the uPy code yet).

dpgeorge commented on July 22, 2024

I did always plan to have qstr's hashed, but left that exercise for "some time down the road". Maybe that time is now :)

I would want to have the global qstr table part static (for known keywords/names like "len" and "range") and part dynamic (for method names, etc that the user has in their script).

Therefore, we need a way to have static hashing. Probably that means a custom preprocessor, or maybe just 1 file that is auto-generated and has all the static qstr's in it.

Finally, I think it's a good idea to have 2 different kinds of Python string objects: one represented as a qstr, and one represented as a normal dynamically allocated array of chars. Of course, the end user does not see any difference between the 2, as they both have the same methods and type signature. The qstr objects are the ones written directly in the script (eg x = 'abc'); the array-of-char ones are made dynamically in the script, eg x = 'abc' + 'def', or y = '{} {}'.format(1, 2).

pfalcon commented on July 22, 2024

Maybe that time is now :)

Well, I didn't want to imply that everything should be dropped and this issue fixed ASAP ;-), just wanted to record issues I see as I go thru the code. It all should be ordered by priority, for example #14 is definitely more important.

I'm glad that we agree on the approach, and thanks for sharing other points like when to use interned vs raw strings - indeed, seems to be viable!

pfalcon commented on July 22, 2024

I'd suggest we take the chance and store the length with the string.

Actually, on second thought (d'oh!) we must store the length together with the string, because Python strings can contain '\0', so the current implementation using strcmp() and friends isn't really compliant.

dpgeorge commented on July 22, 2024

I didn't realise Python strings can contain '\0'... bytearray yes, but strings no. The language reference says strings can only contain "source characters", ie unicode characters in UTF-8 encoding. I didn't think '\0' was a unicode character.

But, a simple test in Python ('\000') shows indeed that you can have '\0' in a string.

dpgeorge commented on July 22, 2024

If we want interned strings to live along side non-interned strings (in Python, in eg a dict), then, eg, interned "abc" must hash to the same value as non-interned "abc". Thus we can't use the qstr ptr or qstr index of the interned string as the hash value. We must compute the hash of the actual data. And so, for efficiency, we need to store the hash of an interned string.

Proposal: the first 4 bytes of the qstr are the hash. For compact memory storage it can be done the following way: first store the variably-encoded length, then optional hash bytes (the number of hash bytes depends on the length of the string), then the data. More explicitly (L=len byte, h=hash byte, d=data byte):

0 0 0 0 // empty string
1 h h d // 1 byte data
2 h d d // 2 bytes data
3 d d d // 3 bytes data
L h d d d d ... // 4-7 bytes data
L h h d d d ... // 8-127 bytes data
L L h h d d d ... // when length needs 2 bytes to store it (eg 128 <= len <= 16383)
L L L h d d d ... // when length needs 3 bytes to store it

Such data could be 4-byte aligned, or not.

pfalcon commented on July 22, 2024

Well, since the beginning of the discussion I had a different layout in my head - sorry if it didn't come out clear, I indeed mixed up a few things. Let me try to summarize what I had in mind, with the requirements behind it.

Requirements:

  1. All identifiers (keywords, builtin object, function, and method names) in the uPy core are interned - meaning comparison by pointer gives a 100% reliable match/mismatch.
  2. It should be possible to intern any string, or check whether it is interned, quickly. This means we should match by hash and strcmp() only if the hashes match.
  3. The minimal practical size of a hash is one byte. That would theoretically give a 256x speedup on string comparisons, but let's be conservative and (arbitrarily) say we'll get only a 100x speedup due to imperfection of the hash function. Still good enough for _Micro_Python IMHO. (If you're concerned that 8 bits of hash are too few, see below.)
  4. We should support non-interned strings. If they're used as keys in a dictionary, we can't rely on the pointer, so we would need to use the hash value. That means the storage layouts of interned and non-interned strings should both include the hash (the storage layouts can be the same).
  5. The hash function is expected to be quick, which means that if you loop over a string for some reason, you can compute the hash in parallel. But there's still an important usecase where: a) we get a string from outside (I/O) as is, i.e. won't necessarily iterate over it; b) the string can be long (consider read(64K) or read(1M)); c) the hash is needed only for a few operations (dictionary lookup being the only 100% case). That calls for hashing of non-interned strings being lazy, and we need a way to signify the "hash is not computed" case.

This leads to the following layout:

hash (1 byte) - as the hash has a fixed size (1 byte), it can go first (also, the hash is the most discriminating string param, and *p is more efficient than *(p+N) on some CPUs)

flags and len lsbyte (1 byte): bit 7: hash_valid, bit 6: more_length, bits 5-0: length lsbits (potentially, store a few more flags here)

more length: if bit 6 of the previous byte is set, another len byte follows here; if that byte's bit 7 is set, yet another len byte follows, etc.

data: len bytes
C-compatible \0 terminator: 1 byte - we would need to go a bit out of our way to get rid of it, so let's NOT do that.

That means a minimal-length non-empty string would be 4 bytes, which matches 32-bit CPU alignment.

Extra features:

  1. Concern: 8 bits of hash is too few - for example, consider a hash table with > 256 slots. Well, we can use the following length byte as an additional discriminator to get 16 bits of "hash". For _micro_python, that's definitely enough.
  2. Actually, when doing a string search, as an optimization we can match on 32-bit values at the beginning of the string. Beginnings of strings ideally should be 32-bit aligned, but note that's a strict requirement only for Cortex-M0 & (classical) MIPS (dunno about Sparc, but I'd assume so too), while Cortex-M3 and higher and x86 can handle unaligned access transparently.

piranna commented on July 22, 2024

Nice argumentation, I like it :-) Only one point: since we already have the string length, I would not store C-style null-terminated strings, but instead use Pascal-style (length-prefixed) strings. That would solve the problem of storing null characters in the middle of a string and also improve security by requiring us to use length-taking functions (strncmp instead of strcmp, for example).

http://en.wikipedia.org/wiki/String_%28computer_science%29#Representations
http://stackoverflow.com/questions/7648947/declaring-pascal-style-strings-in-c


"Si quieres viajar alrededor del mundo y ser invitado a hablar en un monton
de sitios diferentes, simplemente escribe un sistema operativo Unix."
– Linus Tordvals, creador del sistema operativo Linux

pfalcon commented on July 22, 2024

Only one point: since we already have the string length, I would not store C-style null-terminated strings

Well, what I meant is that the C compiler will store a \0 for any literal string anyway. The only way to get around that is to use array init syntax: char foo[4] = "\x03foo" will get you a "Pascal-style" string of exactly 4 bytes. But some_func("\x03foo") and char *s = "\x03foo" will actually store a 5-byte string in flash. For the purpose of our discussion, that trailing \0 is just a padding byte, brought into the picture to justify accepting the fact that the minimal size of a string would be 4 bytes, which in turn enables the potential optimization of comparing string headers as int32_t values.

We are of course not going to store \0 for dynamic strings (ones not coming from string literals). For example, a dynamic string of 6 bytes will take exactly 8 bytes (6 bytes content + 1 byte hash + 1 byte (small) length).

piranna commented on July 22, 2024

I see. So, you are saying that it shouldn't cause any problems besides adding the terminating null character, right?


dpgeorge commented on July 22, 2024

One use of storing the null byte at the end is so you can parse strings without having to always check that you are within the string data boundaries. Eg, so long as I'm not looking for the null character, I can do a look-ahead check for some character, knowing that if I get a match then that match is before the end of the string (else I would hit the null character and not get a match).

Anyway, adding the null byte is a small issue.

In the end, I think @pfalcon and I agree on most things; it's mostly just the way we are saying it.

  1. I think we should use 2 hash bytes, and use the length as the second byte (ie H L ...).
  2. We need hash for 2 things: to quickly lookup a string in the intern pool, to check if it's there or not; and as a hash value so it can be inserted into a dictionary.
  3. The ability to have interned and non-interned strings in the same dictionary makes it a bit ugly/inefficient. If it was all interned, you could just check pointers for equality (and even use pointers as the hash value). But with a mix of interned/non-interned, you now need to check the actual values for equality (ie the string bytes). Using a hash value speeds this up, but if the hashes match you still need to check the bytes (eg using strcmp). In the current implementation of mp_map_t it actually has a flag to tell when the map has all interned strings, or not. So this is not really an issue, but something to be aware of.
  4. Is it really worth doing lazy hash evaluation for non-interned strings? Just compute the hash when you create the string and that's it. If hashing is quick, O(N), this won't be too bad (since creation is already O(N)).

pfalcon commented on July 22, 2024

We need hash for 2 things: to quickly lookup a string in the intern pool, to check if it's there or not; and as a hash value so it can be inserted into a dictionary.

Yeah, I can't immediately come up with other uses for hash (for example, comparing 2 non-interned strings, we don't need to compute hash - we can use it if it's there, but otherwise, computing hash will take ~ same amount of time as comparing chars directly). But those 2 uses are rather important to warrant having hash in qstr.

The ability to have interned and non-interned strings in the same dictionary makes it a bit ugly/inefficient.

Well, we're not JavaScript here - python dicts can contain arbitrary type of keys, so we won't get around using object hash for that. And with all-interned optimization that you have, we're golden IMHO.

Is it really worth doing lazy hash evaluation for non-interned strings? Just compute the hash when you create the string and that's it. If hash is quick, O(N), this won't be too bad (since O(N) already for creation).

Well, I gave an example where hash calculation is superfluous and expensive - file.read() a big string, then write it somewhere. On our side that's less than O(N) (only allocation time, which is O(1) with an ideal allocator). Then the OS reads in the string for us. It won't compute the hash in parallel with reading, so we'd need to do that afterwards, spending O(N) time, which is wasted if we never use that string as a key in a dict. But I agree this case is an optimization, not giving (too) much for the MCU case, so we can leave it for later.

Neon22 commented on July 22, 2024

Another advantage of storing the byte length as part of the string: the compare can check the lengths first and fail fast if they're not the same.
Question - this is just for qstr, right, so we never get unicode in here? My question relates to the possibility of swelling the struct by one more byte for the type of the string, i.e. (H L T ...).

pfalcon commented on July 22, 2024

this is just for Qstr right so never get unicode in here

Per the previous discussion, qstr is a generic container type for any immutable string. It can be bytes, utf-8, utf-16, utf-32, etc.

My question is related to possibility of swelling struct by one more byte for type of string

We definitely don't want to swell it by a whole byte, but above I already proposed adding a few more bits to the first varlen byte (still keeping 2 bytes of overhead for short strings).

So, it would be: bit 7 - "hash not valid", bits 6-5: content type (00 - bytes, 01 - utf-8, 10, 11 - reserved for custom implementations), bit 4 - "more length bytes", bits 3-0 - length lsbits.

dpgeorge commented on July 22, 2024

Yes, we can add bits to indicate the type, if needed.

I think I just want to get the framework for this up and running first (ie a custom "preprocessor" that works out the static const hashes, etc), and then we can work out exactly what bits of information we need to store.

pfalcon commented on July 22, 2024

custom "preprocessor"

I had an idea how to implement that much more automagically than I first thought. For reference, I thought we'd need something like:

Q("\x12", "\x03", "foo")

each time we need a static hash-len qstr in the source, with the preprocessor going through each source file and updating it in place to make sure the hash and length values stay current.

Well, no need for that: Q("foo") in the source plus a generated header will be enough for 99% of cases.

pfalcon commented on July 22, 2024

But first let's consider how we can get automagical interned static strings. As I said, having all keywords/identifiers used in the source be interned strings should be one of the top goals of the refactor, and here's how to get it for free (i.e. at compile time, not runtime).

The keyword is string pooling, which is on by default in gcc (and apparently can no longer even be turned off). Demo of the idea:

#include <stdio.h>

void foo(const char *s)
{
    printf("%p ", s);
    if (s == "foo")             /* pointer comparison - relies on string pooling */
        printf("matched!");
    printf("\n");
}

struct entry {
    const char *key;
    const char *val;
};

const char arr[] = "foo";       /* array init: gets its own copy of the bytes */
const char *ptr = "foo";        /* pointer init: points into the pooled literal */
const struct entry table[2] = {
    {"foo", "sth"},
    {NULL}
};

int main(void)
{
    printf("#1 literal\n");
    foo("foo");
    printf("#2 pointer\n");
    foo(ptr);
    printf("#3 pointer in struct\n");
    foo(table[0].key);
    printf("#4 array\n");
    foo(arr);
    return 0;
}

So, as you can see, only the array syntax produces a non-pooled string value; the other string syntaxes are automagically "interned" in the C app's string table.

pfalcon commented on July 22, 2024

And now an implementation of hash-len strings:

#include <stdio.h>

#define Q(str) QH_##str QL_##str #str

#define QH_foo "\x55"
#define QL_foo "\x03"

#define QH_1 "\xaa"
#define QL_1 "\x01"

void foo(const char *s)
{
    printf("%p ", s);
    if (s == Q(foo))
        printf("matched!");
    printf("\n");
}

const char *ptr = Q(foo);

int main()
{
    printf("#1 literal\n");
    foo(Q(foo));
    printf("#2 pointer\n");
    foo(ptr);
    Q(1);
}

pfalcon commented on July 22, 2024

So, the preprocessor would just need to go through the source and generate a separate header with

#define QH_foo "\x55"
#define QL_foo "\x03"

defines. This will work for any string which consists of identifier-clean chars (it may even start with digits). Again, my guess is that covers 99% of our needs. If we ever need to hash-len "+" or something, we can precompute the hash by hand and just put "\xNN" "\x01" "+" in the source for such an exceptional case ;-)

dpgeorge commented on July 22, 2024

Nice.

Problem: we are relying heavily on the way gcc optimises strings. What if a particular C compiler for another architecture behaves differently? Then some qstrs match, and some don't.

What about:

#include <stdio.h>

#define Q_DEF(s) const char qstr_##s[] = "\x12\x34" #s;

#define Q(s) qstr_##s

Q_DEF(foo)
Q_DEF(1)

void foo(const char *s) {
    printf("%p ", s);
    if (s == Q(foo))
        printf("matched!");
    printf("\n");
}

const char *ptr = Q(foo);

int main(void) {
    printf("#1 literal\n");
    foo(Q(foo));
    printf("#2 pointer\n");
    foo(ptr);
    Q(1);
}

The idea would be to generate the Q_DEF.

dr2chase commented on July 22, 2024

Perhaps late to the party:

Would you consider
byte 1 = hash (if the hash is not valid, this is zero; note that some valid hashes can also be zero here)
byte 2, bit 7 = more_length (short strings always have hash computed, they are short)
byte 2, bit 6 = is UTF8
byte 2, bits 5-0 = len lsbits

For longer strings, record hash validity in bit 6 of byte 3:
byte 3, bit 7 = more_length
byte 3, bit 6 = hash_is_valid

A minor detail, but yes, Sparc likes 32-bit aligned 32-bit words.

One difficulty is the possibility that a non-Unicode string and a Unicode string might encode the same
letters; if there are 8-bit characters, those require multiple bytes in Unicode (UTF-8), but only one byte in an 8-bit encoding.
Do we care about this when interning?

Does the len describe the length in characters or the data length in bytes?
Not necessarily the same for Unicode.

No matter what:
Use a mathematically good hash function; I'll see if I can find a cheap, fast one in the public domain.
"Universal" hash functions based on GF(2^8) are better than a poke in the eye with a sharp stick.
It seems perfectly fine to define that the zero-length string has a hash of zero, so it can be represented in as few as 2 bytes (if not doing the zero-termination thing) or 3 (if zero-terminating).
It might be worthwhile to consider aligning strings only to a 16-bit boundary, depending on how we feel about "micro".


dhylands commented on July 22, 2024

The way gcc optimizes strings also changes from time to time (as newer ways of doing things are discovered). So you might find that micropython starts to behave differently on a newer version of the toolchain.

pfalcon commented on July 22, 2024

#define Q_DEF(s) const char qstr_##s[] = "\x12\x34" #s;

Yes, a natural progression of the idea. And by using arrays, we'll even be able to get rid of the trailing \0 if needed.

pfalcon commented on July 22, 2024

For longer strings, record hash_is_set in bit 6 of byte 3

Probably not in default source, but you can make your fork, profile it, and share results of how much speed up you get while using such highly conditional encoding ;-).

One difficulty is the possibility that a non-Unicode string and a Unicode string might encode the same letters

Python3 doesn't have non-unicode strings or 8-bit characters. Sadly. When hashing unicode strings (in any encoding), we'd rather hash chars (aka codepoints), not bytes. Note that currently uPy doesn't support unicode at all, so any suggestions are wishful thinking so far.

Does the len describe the length in characters or the data length in bytes?

Supposedly always in bytes, so we know what to pass to m_free() when garbage-collecting.

Use a mathematically good hash function

You're welcome to criticize the one which will be used at the beginning (I vote for xor :-D ) and submit patches for alternatives (profiling data is a plus) ;-).

It might be worthwhile to consider aligning strings only to a 16-bit boundary, depending on how we feel about "micro".

Per the above discussion, a "performance-optimized" build would align on a 32-bit boundary. But ideally we'd support 8-bit too. Also note that the GC implementation used has its own alignment requirements.

piranna commented on July 22, 2024

Use a mathematically good hash function

You're welcome to criticize the one which will be used at the beginning (I vote for xor :-D ) and submit patches for alternatives (profiling data is a plus) ;-).

I was thinking about CRC-8...

dr2chase commented on July 22, 2024


Use a mathematically good hash function

You're welcome to criticize one which will be used at the beginning (I vote for xor :-D ) and submit patches for alternatives (profiling data is plus) ;-).

I don't have scads of time, or I would - I have my own hacking and benchmarking to do for work. What would be nice is to avoid standardizing on a poor hash function, which is common practice. XOR is not so awesome.

I agree with Jesus that CRC-8 is a fine candidate. It can be implemented with a 256-byte lookup table. See http://www.miscel.dk/MiscEl/CRCcalculations.html#CRC8bi for examples.

There are also good "bit shufflers" that I know less about; a friend of mine knows more and I can ask him for pointers.

Given that this is microPython, what are the processors that should be targeted for good performance? Some of these processors (e.g., Intel Haswell) have instructions that can be used to provide a nice acceleration for CRC calculations on large strings. But that might be silly.

pfalcon commented on July 22, 2024

XOR is not so awesome.

XOR is horrible. Only ADD is worse. But XOR+ROL would give a good baseline to compare more "advanced" algos against.

It can be implemented with a 256-byte lookup table.

That's huge, surely?

Given that this is microPython, what are processors that should be targeted for good performance?

For me it's simple: target not processors, but algorithms (large-scale ones, not just a few features out of hundreds). Performance should be better than Lua and better than Node.js, or we won't take over the world ;-).

Some of these processors (e.g., Intel Haswell) have instructions

Who cares? ;-) Nobody will waste their tiny free time left from their work and real life to support somebody's magic instructions from press releases ;-).

Note: everything is just IMHO ;-).

from micropython.

dr2chase avatar dr2chase commented on July 22, 2024

On 2014-01-11, at 9:47 PM, Paul Sokolovsky [email protected] wrote:

XOR is not so awesome.

XOR is horrible. Only ADD is worse. But XOR+ROL would give good baseline to compare more "advanced" algos against.

It can be implemented with a 256-byte lookup table.

That's huge surely?

I was not sure; I do not know how micro micropython intends to be.
You might compromise and do 4 bits at a time with a 16-byte table.
You might also check the Bob Jenkins hash or the Murmur hash.

One reason to kinda like CRC-8 is that if you're working in a Galois field,
you can built other interesting stuff (i.e., it might have other uses besides
just hashing) in error correcting codes and things like that.

Given that this is microPython, what are processors that should be targeted for good performance?

For me it's simple: target not processors, but algorithms (large-scale, not just few features out of hundreds). Performance should be better than Lua and better than Node.js, or we won't take over the world ;-).

I think it depends what you want microPython to be. I signed on because I was particularly jazzed at being able to write useful fun stuff easily that would run with very low power; not necessarily minimum power (I lightly hack on Atmel 8-bit processors) but low enough to try interesting logging on a bicycle, stuff like that. I don't expect to see Lua or JS there -- IMO, one watt is a whole lot of power.

Some of these processors (e.g., Intel Haswell) have instructions

Who cares? ;-) Nobody will waste their tiny free time left from their work and real life to support somebody's magic instructions from press releases ;-).

Again, not sure where it's going. I know that these fancy instructions have their real-world uses -- computing CRC-32 is apparently important to some Big Data applications. They also help with AES de/encryption; I don't know the details there. I didn't imagine micropython running on such a beast, but it was hard to tell from some of what I was reading (Sparc alignment, e.g. -- not much Sparc in the embedded space that I know of).

David

from micropython.

pfalcon avatar pfalcon commented on July 22, 2024

@dpgeorge: So, the current situation is that we have 2 string types in existence: MP_OBJ_QSTR (i.e. one fitting into mp_obj_t) and mp_obj_str_t, and in a lot of places MP_OBJ_QSTR is not expected. I've been plugging holes with it here and there, but that's firefighting; we really should switch to a single one. But that's a big enough refactor anyway, and then it makes sense to do the qstr refactor we discussed here at the same time. If you're busy with other stuff (I hope you are), would you be ok with me doing this refactor?

from micropython.

dpgeorge avatar dpgeorge commented on July 22, 2024

I implemented a lot of the qstr stuff. It uses brain-dead length and hashing (fixed size length, addition for hash).

Take a look. What do you think of the framework? qstrs now allow null bytes and have hashes for faster lookup.

Do you still agree that we need 2 types of string internal implementations: qstr (stuffed in mp_obj_t) and non-interned proper object? They can both use the same methods in objstr.c, you just extract the data in different ways depending on the implementation.

from micropython.

pfalcon avatar pfalcon commented on July 22, 2024

Thanks for going forward with this! Supporting \0 and comparison optimization with hash is what we definitely needed right away.

Do you still agree that we need 2 types of string internal implementations: qstr (stuffed in mp_obj_t) and non-interned proper object?

Yes, sure, I agree that we should have both interned and non-interned strings. But I kinda missed that you intended qstr-in-mp_obj_t to be used as the interned string, and mp_obj_str_t as the non-interned one. I kinda thought that we'd switch to qstr-in-mp_obj_t altogether, and thought that we could handle interning implicitly. I now see that wouldn't work for dynamically created interned strings, so what you say is a good plan.

But with that in mind, that's only a partial implementation of what we discussed above. My idea was to make any static string used in uPy (except those used just for output, like printf) have size/hash embedded. Then all such strings would get implicit compile-time interning, which would simplify builtin methods handling, etc. All in all, that's a good start - it's cool that all tests pass, and there's enough work anyway to refactor all string methods to work on length-strings.

from micropython.

dpgeorge avatar dpgeorge commented on July 22, 2024

I kinda thought that we'd switch to qstr-in-mp_obj_t altogether, and thought that we could handle interning implicitly.

Okay, now I understand fully what you meant! You wanted a string object to always be a raw pointer stuffed into an mp_obj_t, with bit 1 set to indicate this. The only difference between static/interned and dynamic is just where it points: ROM or RAM. The representation is the same. And that's why you were concerned about specifying the encoding (UTF-8, UTF-16, ASCII) in the qstr data.

I now see that wouldn't work for dynamically created interned strings, so what you say is good plan.

Okay, now I don't understand this statement. Why won't your original idea work? (One thing I can say is that it doesn't play nice with the GC, since the pointer to a dynamic string is not a proper pointer that the GC will recognise.)

Note that with my way of doing it, the only performance hit is that a dynamic string needs only an extra word (or 2) in RAM, and that you need to distinguish between qstrs and dynamic strings. The storage for the dynamic string would look like mp_obj_tuple_t's: a uPy header then the actual data (plus any extra info you wanted, like len and alloc). So there's no additional ptr dereference to get the data from the uPy object.

But with that in mind, that's only partial implementation of what we discussed above. My idea was to make any static string used in uPy (except those used just for output, like printf) to have size/hash embedded. Then for all such string we'd get implicit compile-time internship, which would simplify builtin methods handling, etc.

This would only work with help from the compiler (and its constant pooling optimisation), something I don't want to rely on. But actually even this is not enough: to dynamically intern a string I need a list of all static interned strings so I can see if the string already exists in the intern pool. If we have implicit interning in, eg, method definitions, there's no way we can tell which strings have been statically interned (without looking into the .rodata section at runtime, which is going to be very compiler dependent...).

The only (?) solution to this (to stick with your idea) is to no longer require that qstrs are unique. A qstr is just a pointer to some data with a hash, len, and bytes. To check for equality, the hash and the data must match, ie we need to do a full strcmp for (at least) each successful lookup.

For efficient equality comparison of qstrs (eg method & attr lookup) we need to guarantee that qstrs are unique, so that 2 qstrs are equal if and only if their pointers (or indexes, as is done now) match.

At the moment we can actually have method names as qstrs: just add the string to the qstrdefs.h file and use the constant MP_QSTR_xxx where xxx is your string value. I think this is a neat solution.

from micropython.

pfalcon avatar pfalcon commented on July 22, 2024

Okay, now I understand fully what you meant! You wanted a string object to always be a raw pointer stuffed into an mp_obj_t, with bit 1 set to indicate this.

Not exactly, per #22, I wanted to come up with a grand unified scheme of storing string-like data. Then when you introduced qstr-in-mp_obj_t, I thus automagically thought about switching to qstr-in-mp_obj_t, which was a logical mistake on my side.

The only difference between static/interned and dynamic is just where it points: ROM or RAM.

Okay, now I don't understand this statement. Why won't your original idea work?

Well, we also have dynamically interned strings. My idea was that we intern any dictionary-key-like string, so if we, for example, take a string from a dictionary key, we know it's interned. But here's where it breaks: if we pass such a string to another function, that function no longer knows whether it's interned or not (in the general case; it may happen that we'd pass interned strings only to functions where non-interned ones can never be passed, but that's too many if's for now).

Note that with my way of doing it, the only performance hit is that a dynamic string needs only an extra word (or 2) in RAM

Well, pointer to type + pointer to data is minimal overhead we can achieve, and if it goes beyond that (for example, to store length in object structure), then my original IMHO was that it's inefficient use of mem (that's why I wanted to use same encoding (which includes var-len len encoding) for all string objects).

The storage for the dynamic string would look like mp_obj_tuple_t's: a uPy header then the actual data (plus any extra info you wanted, like len and alloc).

And this is not ok per my previous thought: any non-interned string can become interned whenever it is used as a key in a dictionary. If the operation of turning a non-interned string into an interned one requires memory allocation (to build a hash+len header), then it's not efficient. That's why I wanted to have the same underlying encoding for all strings.

My mistake is that I called such an encoding [new] qstr, because I thought we could replace the old qstr with this new encoding in one go. It's better to keep "qstr" for what it is now (an index into the interning table), and use for example "hstr" for a string with a hash+varlen header.

This would only work with help from the compiler (and its constant pooling optimisation), something I don't want to rely on.

Above we found a way to do our own constant string pooling without relying on the compiler ;-).

But actually even this is not enough: to dynamically intern a string I need a list of all static interned strings so I can see if the string already exists in the intern pool.

Sure, but now that we have custom preprocessor, we can do any magic, like building any structures we want automatically!

The only (?) solution to this (to stick with your idea) is to no longer require that qstrs are unique. A qstr is just a pointer to some data with a hash, len, and bytes.

Yes, that's the idea I had stuck in my head ;-). Except that all static (i.e. compile-time) strings are automagically interned using a constant pool (maintained by custom preprocessor to avoid relying on compiler behavior).

To check for equality, the hash and the data must match, ie we need to do a full strcmp for (at least) each successful lookup.

Yes, that's the operation we'd do to intern a dynamic string.

At the moment we can actually have method names as qstrs: just add the string to the qstrdefs.h file and use the constant MP_QSTR_xxx where xxx is your string value. I think this is a neat solution.

Yes, but my whole idea was to make that all automatic. You just write Q(write) straight in the source code and you get static interned string object which is always ==-equal to any other Q(write) object written in the source code elsewhere. All MP_QSTR_xxx, etc. magic is handled by preprocessor behind the scenes. (Note: Q() above has nothing to do with Q() as used in current source, sorry again for confusion. Maybe it should be H() ;-) )


All in all: what we have now (especially with today's patch) is good enough, and we can pause and start leveraging binary-safe strings (and do other things). I'll follow up with my ideas and try to implement them, paying attention to details, and if everything actually falls into place, will submit patches for review.

from micropython.

dpgeorge avatar dpgeorge commented on July 22, 2024

Note that we cannot intern a non-interned string just to do a dictionary lookup. Dict lookups (a pure lookup, fail if not found) must not allocate memory (else it becomes another banned operation in interrupts, etc). If you do a dict lookup of a non-interned string and you try to intern it, then, even if it is stored with hash/len/etc, you might still need to allocate a new qstr pool to point to this new interned string.

from micropython.

dpgeorge avatar dpgeorge commented on July 22, 2024

This can be closed: qstrs now have a hash and a length stored with them, this data is generated automatically, and strings don't need to be interned (and are usually not interned if their length is above a threshold).

from micropython.

curlyz avatar curlyz commented on July 22, 2024

@dpgeorge May I ask what the threshold is?

#ifndef MICROPY_ALLOC_PARSE_INTERN_STRING_LEN
#define MICROPY_ALLOC_PARSE_INTERN_STRING_LEN (10)
#endif

I thought that strings with len > 10 would not be interned, but apparently not.
Also, every random variable name (even one that is not defined) typed in the REPL will be interned.
Although the variable is not created, the interned string is there already.

Should we implement some special character to indicate that a string should not be interned?
I expect that it is not Pythonic, but considering the limited RAM of embedded devices, should we?

from micropython.

dpgeorge avatar dpgeorge commented on July 22, 2024

I thought that strings with len > 10 would not be interned, but apparently not.

That's right, strings with len>10 are not interned.

Also, every random variable name (even one that is not defined) typed in the REPL will be interned.
Although the variable is not created, the interned string is there already.

Yes, all variable names are interned. But that's not the same as strings.

Eg:

$ micropython
MicroPython v1.9.4-474-g2e3e5707b-dirty on 2018-08-17; linux version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> import micropython
>>> micropython.qstr_info(1)
qstr pool: n_pool=1, n_qstr=1, n_str_data_bytes=24, n_total_bytes=216
Q(/usr/lib/micropython)
>>> short_var = 1
>>> long_variable_name = 2
>>> 'short str'
'short str'
>>> 'longer string'
'longer string'
>>> micropython.qstr_info(1)
qstr pool: n_pool=1, n_qstr=4, n_str_data_bytes=72, n_total_bytes=264
Q(/usr/lib/micropython)
Q(short_var)
Q(long_variable_name)
Q(short str)

You can see that 'longer string' is not interned.

from micropython.
