
msgspec's Introduction

msgspec

msgspec is a fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML. It features:

  • πŸš€ High performance encoders/decoders for common protocols. The JSON and MessagePack implementations regularly benchmark as the fastest options for Python.

  • πŸŽ‰ Support for a wide variety of Python types. Additional types may be supported through extensions.

  • πŸ” Zero-cost schema validation using familiar Python type annotations. In benchmarks msgspec decodes and validates JSON faster than orjson can decode it alone.

  • ✨ A speedy Struct type for representing structured data. If you already use dataclasses or attrs, structs should feel familiar. However, they're 5-60x faster for common operations.

All of this is included in a lightweight library with no required dependencies.


msgspec may be used for serialization alone, as a faster JSON or MessagePack library. For the greatest benefit though, we recommend using msgspec to handle the full serialization & validation workflow:

Define your message schemas using standard Python type annotations.

>>> import msgspec

>>> class User(msgspec.Struct):
...     """A new type describing a User"""
...     name: str
...     groups: set[str] = set()
...     email: str | None = None

Encode messages as JSON, or one of the many other supported protocols.

>>> alice = User("alice", groups={"admin", "engineering"})

>>> alice
User(name='alice', groups={"admin", "engineering"}, email=None)

>>> msg = msgspec.json.encode(alice)

>>> msg
b'{"name":"alice","groups":["admin","engineering"],"email":null}'

Decode messages back into Python objects, with optional schema validation.

>>> msgspec.json.decode(msg, type=User)
User(name='alice', groups={"admin", "engineering"}, email=None)

>>> msgspec.json.decode(b'{"name":"bob","groups":[123]}', type=User)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
msgspec.ValidationError: Expected `str`, got `int` - at `$.groups[0]`

msgspec is designed to be as performant as possible, while retaining some of the niceties of validation libraries like pydantic. For supported types, encoding/decoding a message with msgspec can be ~10-80x faster than alternative libraries.

See the documentation for more information.

LICENSE

New BSD. See the License File.


msgspec's Issues

Improve type validation error messages

Currently the messages that msgspec raises on validation error can be a bit cryptic:

In [1]: import msgspec

In [2]: class User(msgspec.Struct):
   ...:     name: str
   ...:     email: str
   ...:

In [3]: dec = msgspec.Decoder(User)

In [4]: enc = msgspec.Encoder()

In [5]: msg = enc.encode({"name": 1})

In [6]: dec.decode(msg)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-e19f0160a9aa> in <module>
----> 1 dec.decode(msg)

TypeError: Error decoding: expected `str`, got `int`

The message isn't incorrect, but it doesn't provide the context needed to determine what part of the message was mistyped. For highly nested messages I'm not sure what this would look like. Right now I'm leaning towards something like:

"Error while decoding {context}: {details}"

The idea is to use either the most recent struct field in the type tree before the error, or the top-level compound type, as context, then let the error to the right of the `:` speak for itself.

>>> Decoder(User).decode(...)  # For structs, include struct name, current field, and type of that field
DecodeError("Error while decoding `User.name` (`str`): expected `str`, got `int`")

>>> Decoder(List[User]).decode(...)  # For collections, include collection type
DecodeError("Error while decoding `List[User]`: expected `User`, got `int`")

>>> Decoder(Dict[Tuple[str, str], str]).decode(...)  # For dicts, specify if in `key` or `value`
DecodeError("Error while decoding `key` of `Dict[Tuple[str, str], str]`: expected `tuple`, got `str`")

>>> Decoder(List[User]).decode(...)  # if error is inside a struct, even if top-level is a collection, use struct as context
DecodeError("Error while decoding `User.name` (`str`): expected `str`, got `int`")

The hope is that this will be easy enough to implement in a performant way, while still providing decent error messages for debugging.
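The path-tracking idea above can be illustrated in pure Python. This is not msgspec's implementation, just a toy sketch (the `validate` helper is hypothetical) showing how threading a path string through the type tree yields the kind of error messages proposed:

```python
# Toy validator (not msgspec internals): thread a path through the type
# tree so errors can report *where* the mismatch occurred.
def validate(value, expected, path="$"):
    if isinstance(expected, dict):        # struct-like: field name -> type
        for field, ftype in expected.items():
            if field in value:
                validate(value[field], ftype, f"{path}.{field}")
    elif isinstance(expected, list):      # homogeneous list: [item_type]
        for i, item in enumerate(value):
            validate(item, expected[0], f"{path}[{i}]")
    elif not isinstance(value, expected):
        raise TypeError(
            f"Expected `{expected.__name__}`, got `{type(value).__name__}`"
            f" - at `{path}`"
        )

try:
    validate({"name": "bob", "groups": [123]}, {"name": str, "groups": [str]})
except TypeError as exc:
    err = str(exc)
```

The resulting message matches the style msgspec eventually shipped: `Expected `str`, got `int` - at `$.groups[0]``.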

What other examples should we add?

In #147 we added a new Examples section to the docs, and added an example of using msgspec to write a concise (and performant) GeoJSON implementation.

What other examples should we add? This might be other common schemas (JSON-RPC perhaps?) or integrations with other tools (ASGI/Starlette, requests, sqlite, ...)?

Support TypedDict types

We should support specifying a decode type as a TypedDict (https://docs.python.org/3/library/typing.html#typing.TypedDict) type. Many use cases for using TypedDict are better satisfied by using msgspec.Struct types, but there are a few good reasons one might prefer a TypedDict instead:

  • If the consuming code needs an actual dict type and a class won't suffice
  • If the keys in the dict aren't valid Python identifiers. Structs will at some point support changing the serialized names of fields, which would remove this distinction, but I think supporting TypedDict types will require less work and may still be useful after that feature is added.

xref #76 (comment).
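As a rough stdlib-only sketch of what TypedDict validation involves, the hypothetical `check_typed_dict` helper below checks a plain dict against a TypedDict's declared annotations (no `NotRequired`, nesting, or totality handling):

```python
from typing import TypedDict, get_type_hints

class User(TypedDict):
    name: str
    id: int

def check_typed_dict(data, td_cls):
    # Verify a plain dict against the TypedDict's declared keys and types.
    hints = get_type_hints(td_cls)
    for key, typ in hints.items():
        if key not in data:
            raise TypeError(f"missing required key {key!r}")
        if not isinstance(data[key], typ):
            raise TypeError(
                f"{key!r}: expected {typ.__name__}, got {type(data[key]).__name__}"
            )
    return data

user = check_typed_dict({"name": "bob", "id": 1}, User)

rejected = False
try:
    check_typed_dict({"name": "bob"}, User)   # missing "id"
except TypeError:
    rejected = True
```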

Support annotated-types for metadata and constraint specification

My understanding is that msgspec currently only supports basic type/schema validation, but not more complex validations like a regex pattern or arbitrary user functions, which Pydantic and other validators do support.

I think it would be really interesting if msgspec adopted annotated-types (maybe as an optional dependency), which would not only enable an API for these sorts of constraints but also:

  • Improve interoperability with other libraries (Pydantic V2 will support annotated-types, so users could go back and forth or share types between the two libraries)
  • Get automatic support for other parts of the ecosystem like Hypothesis
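A stdlib-only sketch of the mechanism annotated-types relies on: constraint metadata rides along inside `typing.Annotated` and can be read back via `typing.get_args`. The `Ge` marker and `validate_age` helper are hypothetical stand-ins, not msgspec or annotated-types APIs:

```python
from dataclasses import dataclass
from typing import Annotated, get_args

@dataclass(frozen=True)
class Ge:
    # Hypothetical constraint marker (annotated-types spells its version `Ge` too).
    bound: int

Age = Annotated[int, Ge(0)]   # the base type plus constraint metadata

def validate_age(value):
    # Read the metadata back out of the Annotated type and enforce it.
    for meta in get_args(Age)[1:]:
        if isinstance(meta, Ge) and value < meta.bound:
            raise ValueError(f"expected value >= {meta.bound}, got {value}")
    return value

age = validate_age(42)

rejected = False
try:
    validate_age(-1)
except ValueError:
    rejected = True
```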

Fully support pattern matching on structs

Python 3.10 adds support for pattern matching. I believe this should work already for structs with purely keyword arguments, but will need to generate a __match_args__ attribute to match on positional arguments as well.

Add support for extension types

Msgpack supports custom extension types. We should at minimum provide a nice error for unsupported extension types, but it might be nice to add an extension mechanism here as well.

Using super class in decoder with tags

πŸ‘‹ What are your thoughts on the following? I may be misunderstanding something, too.

I'd like to use tags in the following manner (this modifies an example from the docs):

import msgspec

class TaggedBase(msgspec.Struct, tag_field="kind", tag=True):
    pass

class Get(TaggedBase):
    key: str

class Put(TaggedBase):
    key: str
    val: str

encoded = b'{"kind": "Put", "key": "my key", "val": "my val"}'
msgspec.json.decode(encoded, type=TaggedBase)  # docs use Union[Get, Put])

The above fails with

DecodeError: Invalid value 'Put' - at `$.kind`

This is surprising, as I figured TaggedBase was a more general type than the subclasses and this would work.

The following does work, however:

msgspec.json.decode(encoded, type=Union[Get, Put])

But I may have many subtypes, and would like to use the super class instead of a big Union syntax.
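One hedged workaround, assuming all subtypes directly subclass the base: build the Union programmatically from `__subclasses__()`, so adding a subtype doesn't require editing a hand-written Union. The classes below are plain stand-ins for the structs in the example:

```python
from typing import Union, get_args

class TaggedBase:            # stand-in for the tagged struct base class
    pass

class Get(TaggedBase):
    pass

class Put(TaggedBase):
    pass

def union_of_subclasses(base):
    # Build Union[Get, Put, ...] from the base's *direct* subclasses.
    return Union[tuple(base.__subclasses__())]

TaggedUnion = union_of_subclasses(TaggedBase)
# e.g. msgspec.json.decode(encoded, type=TaggedUnion)
```

Note this only collects direct subclasses; deeper hierarchies would need a recursive walk.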

Build wheels for Linux/Mac ARM

With the Apple M1 using ARM, it would be valuable to support Linux and Mac ARM builds in the CI. I can take a look at this when I get a chance. I suspect it should be a small change to add the builds to the GitHub Actions workflow.

optional doesn't work with a custom class

I'm trying to decode a JSON-RPC response which has an optional value that I need to process in dec_hook.

To do this I subclass an `address` class from `str`, but I get a decoding error when the value is null. I tried both `Optional[address]` and `address | None`. It works if I change it to `Optional[str]`, but then I can't do the processing.

The docs say "Unions with custom types are unsupported beyond optionality (i.e. Optional[CustomType])", which implies this should be possible.

I extracted a small sample here which includes one working and one failing case. For the failing case I get:

msgspec.DecodeError: Expected `address`, got `NoneType` - at `$.to`

A more complete spec of what I'm building is here, if it's useful.

PyPy compatibility

This was raised on twitter, and may be important for some users. Currently msgspec is written as a C extension using some private CPython APIs, making it incompatible with PyPy. It would be good to have a PyPy-compatible build, whether through changes to the C extension or through a separate pure-Python release. Either seems fine to me, provided the result doesn't noticeably decrease CPython performance.

Freezing a `Struct` containing a sub-nonhashable

I think I've more or less had the exact same issue with frozen=True in pydantic as well but this is the issue:

[ins] In [1]: import msgspec

[ins] In [2]: class Msg(msgspec.Struct, frozen=True):
         ...:     l: list[str]
         ...:

[ins] In [3]: msg = Msg([str(i) for i in range(10)])

[ins] In [4]: {msg: 10}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 {msg: 10}

TypeError: unhashable type: 'list'

I realize that there might not be an easy way around this (without some kind of nested hashable-object scanner), but I figured I'd report it since there's no mention of this restriction in the docs.
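The restriction can be illustrated with stdlib frozen dataclasses, which generate `__hash__` the same way: hashability must hold recursively, so a frozen wrapper around a list is still unhashable, while a tuple field works (class names here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WithList:
    items: list      # mutable field: hashing an instance raises TypeError

@dataclass(frozen=True)
class WithTuple:
    items: tuple     # immutable field: instances work as dict keys

list_hashable = True
try:
    hash(WithList(["a", "b"]))
except TypeError:
    list_hashable = False

table = {WithTuple(("a", "b")): 10}
```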

Attrs Support?

Is any consideration being given to supporting attrs (https://www.attrs.org/en/stable/index.html) as a mechanism for Structs? It would be nice to leverage existing attrs classes instead of refactoring them completely into Structs. There is a workaround: use the attr.asdict method to serialize into msgpack via encode, then deserialize into a from_dict classmethod; but that precludes taking advantage of the serialization/deserialization speedup documented for structs over dicts.

Validation on initialization

Hi @jcrist,

Firstly - thank you for this exceptional library - might I recommend adding a buy me a coffee button to your repo? :)

I would like to get your recommendation for validating data on initialization - I ran into this when trying to reuse my msgspec definitions to also send (encode) some data, which then fell over on the receiving end due to a validation issue. I'm thinking along the lines of an optional flag?

I've read and understand your motivations (mypy and performance) and I suppose I could use something like pydantic or roll my own validation, or I could also do something like

class MyType(msgspec.Struct):
    def validate(self):
        return msgspec.json.decode(msgspec.json.encode(self), type=type(self))

But I'd be keen to hear your (and others?) thoughts.

Async support

Hey, is it planned/in the scope of the project to add support for async buffers/streams for decoding?

Nested Structs how to.

Sorry if I missed this in the docs (or the tests, which I tried to go through as well), but is it possible to create and decode a Struct with child structs that can be recursively decoded?

It seems the encoding works just fine but decoding will only work by decoding to a dict (the default):

[ins] In [2]: import msgspec

[nav] In [3]: class c(msgspec.Struct):
         ...:     s: msgspec.Struct

[nav] In [4]: class b(msgspec.Struct):
         ...:     val: int

[ins] In [5]: msgspec.encode(c(b(10)))
Out[5]: b'\x81\xa1s\x81\xa3val\n'

[ins] In [6]: msg = msgspec.encode(c(b(10)))

[ins] In [7]: msgspec.decode(msg)
Out[7]: {'s': {'val': 10}}

[ins] In [8]: msgspec.decode(msg, type=c)
Out[8]: c(s=Struct())

[ins] In [10]: msgspec.decode(msg, type=msgspec.Struct)
Out[10]: Struct()

If there is no plan to support nested structs natively, might there be a recommended recipe for this kind of thing?

For my purposes it would be lovely to have an IPC message format where payloads in certain types of messages could also be structs (whose schemas could potentially be introspected from the responding function's annotations).

Cheers and thanks for the super sweet lib!

`Return` data type

Again, this is an issue to help me with the scenario of streaming without really implementing streaming.

I want a type, that when encountered, will stop parsing completely, and will give me the parts of the object already parsed, and the offset to the next point in the buffer.

For encoding, you can specify the type it will encode as.

This is to avoid saving a really big type, accepted by streaming for example, into memory completely.

Unlike #27, there doesn't need to be resume logic - let me as the client handle it.

What I ask for is something like this:

class Get(msgspec.Struct, tag=True):
    key: str

class Put(msgspec.Struct, tag=True):
    key: str
    val: str

class Huge(msgspec.Struct, tag=True):
    key: str
    some: int
    some2: str
    huge: Return[str]
    some_other: int

msgspec.json.encode([Get("1"), Put("1","2"), Huge("1", 2, "3", "4567", 5)])
# b'[{"type":"Get","key":"1"},{"type":"Put","key":"1","val":"2"},{"type":"Huge","key":"1","some":2,"some2":"3","huge":"4567", "some_other": 5}]'


msgspec.json.decode(d, type=List[Union[Get,Put,Huge]])
# [Get(key='1'), Put(key='1', val='2'), Return(key='1', some=2, some2='3', missing=["some_other"], offset: 114 )]

Support NamedTuple types

For describing a JSON-RPC schema, supporting NamedTuple types would be useful. These would encode to/decode from arrays, and would be treated similarly to fixed-length tuples (e.g. tuple[int, str, float]), except:

  • They support optional arguments/default values
  • They decode as a custom type rather than a tuple directly

[FEA] A fallback mechanism

Feature request: make it possible to specify a fallback function that is called when the encoder encounters an unsupported object.
The fallback function can then handle the unsupported object using only msgspec-compatible types. When decoding, a corresponding fallback function is then called to handle decoding of the object.

Motivation

In Dask/Distributed we are discussing the replacement of msgpack. The library should be:

  • fast
  • secure (no arbitrary code execution when decoding)
  • support basic Python types
  • not convert lists to tuples (this can be disabled in msgpack, but with a performance penalty)
  • not convert numpy scalars to python scalars
  • support a fallback mechanism

List(Msgspec.Struct)

I have not managed to have a field that is a `List[msgspec.Struct]` - is that possible? If not, is it undesirable?

Dispatching channels with multiple message types

Hi @jcrist

I'm using msgspec to decode some JSON from a channel that sends multiple message types, and am wondering how best to dispatch these messages to msgspec (please close if this is out of scope or you don't have any opinion).

I observe (roughly) the following times for a couple of methods I've tried, and they're pretty much ordered from fastest to most correct:

raw = b'{"Type":"Trade", "Symbol":"AAPL", "LastTradedPrice":100}'

# startswith - fastest, but requires Type to be at the beginning of JSON
%timeit raw.startswith(b'{"Type":"Trade"')
178 ns Β± 0.0241 ns per loop (mean Β± std. dev. of 7 runs, 10,000,000 loops each)

# in - slightly slower but more correct
%timeit b'"Type":"Trade"' in raw
430 ns Β± 2.78 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

# msgspec - most correct, parsing explicit Literal
class MessageType(msgspec.Struct):
    Type: Literal["Trade", "Quote", "Heartbeat"]

%timeit _ = msgspec.json.decode(raw, type=MessageType)
1.2 Β΅s Β± 0.707 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

# JSON decoding - Another option?
decoded = msgspec.json.decode(raw)
2.89 Β΅s Β± 15.7 ns per loop (mean Β± std. dev. of 7 runs, 100,000 loops each)

Obviously there are standard performance trade-offs here; but I wonder if you've thought about handling this in msgspec, or have any other more correct but still performant suggestions?

Thank you again for this wonderful library!

Add `default` callback to `Encoder`/`encode`

To support serializing arbitrary objects, it'd be useful to add a default callback (mirroring the option in json.dumps, orjson.dumps, ...). If provided, this would be called on unsupported objects, and would be expected to return a supported object.

def default(obj: Any) -> Any:  # the returned Any must be a supported type
    ...

This would allow for converting unsupported types into supported types lazily, as well as fancier things like serializing unsupported large objects out-of-band.
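Since the proposal explicitly mirrors `json.dumps(default=...)`, the intended behavior can be sketched with the stdlib `json` module (the set-to-sorted-list conversion is just an example policy):

```python
import json

def default(obj):
    # Called only for objects json can't encode natively; must return a
    # supported type (here: sets become sorted lists).
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Objects of type {type(obj).__name__} are not supported")

msg = json.dumps({"groups": {"admin", "engineering"}}, default=default)
```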

Add support for datetime objects

It'd be useful to add support for datetime objects. I'm not sure how we'd want to go about serializing these; there are a few different options:

  • datetime.datetime: Could use the Timestamp extension, or as a string according to e.g. ISO 8601.
  • datetime.timedelta/datetime.time/datetime.date: Custom extension format? Or as a string per ISO 8601?

We likely need to support the extension type for timestamps, the string format may(?) be useful as well, and may provide nicer parity with other types if we decide to implement those as string formats.
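The ISO 8601 string option can be sketched with the stdlib alone; `datetime.isoformat` / `datetime.fromisoformat` round-trip timezone-aware datetimes losslessly:

```python
from datetime import datetime, timezone

dt = datetime(2022, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
encoded = dt.isoformat()                     # ISO 8601 string form
decoded = datetime.fromisoformat(encoded)    # lossless round-trip
```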

mypy plugin

Out of the box mypy will properly catch errors when using

  • msgspec.json.encode / msgspec.msgpack.encode
  • msgspec.json.decode / msgspec.msgpack.decode
  • msgspec.json.Decoder / msgspec.msgpack.Decoder
  • msgspec.json.Encoder / msgspec.msgpack.Encoder

This includes inferring the output type of the decode methods. It will also properly infer the type and presence/absence of attributes on msgspec.Struct objects. But it won't properly catch errors in the struct constructors.

import msgspec

class Point(msgspec.Struct):
    x: int
    y: int

p = Point(1, 2)
p.x + "string"   # mypy will catch this, since it knows `Point.x` is an int

p = Point("oops", "bad")  # this won't be caught by mypy

IIUC we'd need a mypy plugin to support this, similar to what pydantic does.

Testing 3.11

πŸ‘‹ I've been testing 3.11b4 on an RPi. I'm not sure, but msgspec may need some changes? I'm not experienced enough here to know if it's an error in my installation / with RPi, or an issue with msgspec.

$ python3
Python 3.11.0b4 (main, Jul 12 2022, 12:12:30) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from msgspec.json import encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/msgspec/__init__.py", line 1, in <module>
    from ._core import (
ImportError: /usr/local/lib/python3.11/site-packages/msgspec/_core.cpython-311-arm-linux-gnueabihf.so: undefined symbol: _PyObject_GC_Malloc
>>>

Support unions of custom types

Hi there, I was trying out msgspec as an alternative for attrs with cattrs, but I ran into a roadblock with a union type that I have not had a problem with before.
Specifically, I have a message type with a field for IP addresses. Since Python differentiates between IPv4 and IPv6 addresses but I do not, I need to use a union like so:

import ipaddress
import msgspec

IPAddress = ipaddress.IPv4Address | ipaddress.IPv6Address

class Message(msgspec.Struct):
    ip: IPAddress

msg = Message(ipaddress.ip_address("127.0.0.1"))

def enc_hook(obj):
    match obj:
        case ipaddress.IPv4Address() as value:
            return {"__ipaddress__": str(value)}
        case ipaddress.IPv6Address() as value:
            return {"__ipaddress__": str(value)}
        case _:
            raise TypeError(f"Cannot encode objects of type {type(obj)}")

def dec_hook(type, obj):
    # not quite done, I think msgspec raises before reaching this.
    if type in (ipaddress.IPv4Address, ipaddress.IPv6Address):
        return ipaddress.ip_address(obj)

When I then use this code, I can encode:

>>> d = msgspec.msgpack.encode(msg, enc_hook=enc_hook)
>>> d
b'\x81\xa2ip\x81\xad__ipaddress__\xa9127.0.0.1'

But trying to decode the encoded result back again results in the following error message:

>>> msgspec.msgpack.decode(d, type=Message)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Type unions may not contain more than one custom type - type `ipaddress.IPv4Address | ipaddress.IPv6Address` is not supported

Having read the docs about how to create tagged unions, I can only see it working for subclasses using msgspec, and not for classes that already exist.

In this specific case, both types of the union can share the same logic for serialization and deserialization:

  • Serialization: str
  • Deserialization: ipaddress.ip_address

As an aside: I also have the same need for dealing with IP networks, where deserialization becomes ipaddress.ip_network:

IPNetwork = ipaddress.IPv4Network | ipaddress.IPv6Network

It could be that the shared logic makes it a little easier to handle unions like these.
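The shared encode/decode logic described above can be sketched with the stdlib only; `encode_ip` / `decode_ip` and the `__ipaddress__` key follow the issue's convention and are not a msgspec API:

```python
import ipaddress

def encode_ip(obj):
    # Both address types share one serialization: their string form.
    return {"__ipaddress__": str(obj)}

def decode_ip(data):
    # ...and one deserialization: ip_address() picks v4 or v6 itself.
    return ipaddress.ip_address(data["__ipaddress__"])

v4 = decode_ip(encode_ip(ipaddress.ip_address("127.0.0.1")))
v6 = decode_ip(encode_ip(ipaddress.ip_address("::1")))
```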

json.Decoder memory leak?

Hello,

I've got a service written in Python that reads data from ElasticSearch frequently 24/7.

Recently I've migrated it from orjson to msgspec.json. Since then, the service runs out of memory pretty quickly.

Following https://docs.python.org/3/library/tracemalloc.html#pretty-top I'm able to capture the top 10 lines that contribute large memory usage, and it turns out to be the decode(...) method from msgspec.json.Decoder:

Top 10 lines
#1: /app/elasticsearch_repository.py:69: 26821.3 KiB
    self.__decoder.decode(
#2: /app/prepare_data_service.py:280: 3649.2 KiB
    cache_info_rt = decoder.decode(cache_info_data)
#3: /usr/local/lib/python3.10/multiprocessing/reduction.py:51: 1568.1 KiB
    cls(buf, protocol).dump(obj)
#4: /usr/local/lib/python3.10/linecache.py:137: 628.2 KiB
    lines = fp.readlines()
#5: /usr/local/lib/python3.10/json/decoder.py:353: 82.7 KiB
    obj, end = self.scan_once(s, idx)
#6: /usr/local/lib/python3.10/multiprocessing/queues.py:122: 40.2 KiB
    return _ForkingPickler.loads(res)
#7: /usr/local/lib/python3.10/tracemalloc.py:67: 11.9 KiB
    return (self.size, self.count, self.traceback)
#8: /app/elasticsearch_repository.py:68: 3.7 KiB
    return [
#9: /usr/local/lib/python3.10/http/client.py:1293: 3.6 KiB
    self.putrequest(method, url, **skips)
#10: /app/venv/lib/python3.10/site-packages/elasticsearch/client/utils.py:347: 1.8 KiB
    return func(*args, params=params, headers=headers, **kwargs)
193 other: 70.2 KiB
Total allocated size: 32880.9 KiB

Here are the structs for decoding:

'''
structs for msgspec
'''
from typing import Dict, List, Optional

from msgspec import Struct


class Query(Struct):
    """
    Struct for Query
    """
    date: str
    depAirport: str
    arrAirport: str


class Request(Struct):
    """
    Struct for Request
    """
    supplier: str
    tripType: str
    fareClass: str
    adultAmount: int
    childAmount: int
    infantAmount: int
    queries: List[Query]
    timestamp: int


class Segment(Struct):
    """
    Struct for Segment
    """
    fareClass: str
    depDate: str
    depTime: str
    flightNo: str
    carrier: str
    orgAirport: str
    arriveDate: str
    arriveTime: str
    dstAirport: str


class Flight(Struct):
    """
    Struct for Flight
    """
    segments: List[Segment]


class Price(Struct):
    """
    Struct for Price
    """
    price: float
    tax: float
    totalPrice: float
    seatsStatus: Optional[str] = None
    currencyCode: Optional[str] = None


class Trip(Struct):
    """
    Struct for Trip
    """
    flights: List[Flight]
    prices: Dict[str, Price]
    extended: Dict[str, str]


class Result(Struct):
    """
    Struct of Result
    """
    trips: List[Trip]


class CacheInfo(Struct):
    """
    Struct of CacheInfo
    """
    request: Request
    result: Result

I read from https://jcristharif.com/msgspec/structs.html#struct-nogc that

structs referencing only scalar values (ints, strings, bools, …) won’t contribute to GC load, but structs referencing containers (lists, dicts, structs, …) will.

Is this related? What's the recommendation to resolve this issue?

Thanks!

Can't use enc_hook to fix naive datetimes

The code raises an exception when it encounters a datetime without tzinfo. It would be much better if I could fix it in enc_hook, but, alas, datetime seems to be a "supported" type, therefore enc_hook never gets called on it.

Extend dict keys to allow for string literals?

Currently, string literals can't be used as dict keys in a Struct. See example below:

from typing import Literal
from msgspec import Struct, json

AorB = Literal["a", "b"]

class Test(Struct):
    d: dict[AorB, float]

data = Test(d={'a': 1.0, 'b': 2.0})
encoded = json.encode(data)
print(encoded)

decoded = json.decode(encoded, type=Test) # TypeError: JSON doesn't support dicts with non-string keys - type `dict[typing.Literal['a', 'b'], float]` is not supported

I would think it should be allowed - but I'd like to hear your thoughts.

Improve "Input data was truncated" error message

Consider:

msgspec.json.decode(b"[[1,2,3,4,5], [3,4,5,6,7], [8,")

Will throw:

DecodeError: Input data was truncated

Which is pretty lacking in information.

By the point it's thrown (correct me if I'm wrong), we've already parsed some of the data.

It would be useful to know:

  1. Which object failed to parse? What was its type, and where did it begin? (in the example: the location of the `[` before the 8, and then the first `[`)
  2. What are the objects that have been parsed so far? (in the example: [1,2,3,4,5], [3,4,5,6,7], and 8)

The scenario I'm looking for is implementing streaming-like parsing for a very specific case, and I know streaming isn't planned (#27), so I'm trying to find other small features that might be possible and will help me.

In this case, resuming to parse after more data arrives, without re-parsing everything.
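The resume-after-more-data workflow can be sketched today with the stdlib: `json.JSONDecoder.raw_decode` returns `(value, end_offset)`, so complete documents can be consumed from the front of a buffer and the truncated tail kept for later (the `drain` helper is hypothetical):

```python
import json

_decoder = json.JSONDecoder()

def drain(buffer):
    # Consume complete JSON documents from the front of `buffer`; return
    # (values, leftover) where leftover is the still-truncated tail.
    values, pos = [], 0
    while True:
        while pos < len(buffer) and buffer[pos].isspace():
            pos += 1
        if pos == len(buffer):
            break
        try:
            value, pos = _decoder.raw_decode(buffer, pos)
        except json.JSONDecodeError:
            break            # truncated: keep the tail, resume later
        values.append(value)
    return values, buffer[pos:]

values, leftover = drain("[1,2,3,4,5] [3,4,5,6,7] [8,")
more, leftover = drain(leftover + "9]")   # more data arrives
```

This only handles whitespace-separated top-level documents, not resuming inside a partially parsed array, which is the harder case the issue asks about.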

Doc suggestions

Overall I thought the docs were great. Clear, motivated, and succinct. Here are a few disorganized thoughts/suggestions I had when reading them:

  • I love the benchmarking note
  • Typo: encode -> decode:
    create a ``Decoder`` once and use the ``Decoder.encode`` method instead.
  • I found this sentence confusing. It seemed like there was a "but" clause coming, then there was none:
    While commonly used with `msgspec.Struct` types, most combinations of the
    following types are supported (with a few restrictions, see
    :ref:`supported-types` for more information):
    . Is there some caveat I'm not understanding?
  • In a lot of the sample code snippets I found myself wanting to see more PEP 604 and PEP 585 syntax for type annotations. This would remove some imports, at the cost of a __future__ import, I suppose. I am not sure what the current state of the art is for the places where a type is used in a value position, as is the case for your Decoder class. Probably it's not backwards compatible enough for you.
  • πŸ‘ Invalid enum value 'GRAPE'
  • I found the initial asarray discussion a bit confusing in Usage. Mostly I was initially confused disambiguating between C arrays, numpy arrays, and the serialization format. It became clearer as I read further, but my initial contact was a little muddled.
  • Does msgspec support implicit Optional (not recommended, but supported by most type checkers)?
  • Ooh, a tantalizing reference to pattern matching. It would be cool to see an example of that.
  • Does the "Avoid Decoding Unused Fields" trick work with asarray=True?
  • Seems to be an escaping issue with "\n" in the docs here:
    payload contains a JSON message followed by `b"\n"`.
  • Can enc_hook also be given to add custom handling to a type the msgspec does natively support? Like, if I wanted to use it to truncate a float or something, could I do it? I don't know if it's ever a good idea, but based on my reading of the docs, that is not currently possible.

It's not a problem with the docs, but I also found myself wondering if it was in scope to have any translation code to things like openapi/jsonschema.

Consideration for a streaming `Decoder`

I was able to get basic functionality integrated into a project, but got hung up on not having an easy way to decode streamed msgpack data into objects incrementally.

msgpack-python offers the streaming Unpacker api which is implemented in cython.

It would be great to get something similar in this project for convenient stream processing, without having to do manual parsing for object message delimiters.

I would be interested in providing a patch for this support. Would you accept one in Cython, or would you insist on C?

conda-forge package

In addition to wheels, it would be good to have a conda-forge package of this as well to aid with Conda installations

Question: Are nested properties supported?

Hi,
First, thank you for your amazing work on this library. It is unbelievable how much faster msgspec is compared to other JSON parsers for Python!
Is it possible to define nested properties in msgspec without defining a new Struct for each nesting level?
E.g., given the JSON:

{
  "a": { "b": { "c": 1 } },
  ... many more properties ...
}

I want to only map "c". Can this be done without creating a Struct for "a" and "b" as well?

Best regards,
Christian

Investigate switching to binary search for struct field lookup

Currently we use an (augmented) linear search when matching struct fields to struct keys. This is performant if the number of fields is small (for some value of small) or the fields are serialized in roughly the same order as they're defined. If msgspec is used with the same schema for both serializer and deserializer, this is as efficient as it can get (matching is a single memcmp call).

For large numbers of fields or fields in different serialization orders, linear search likely won't perform as well. We should investigate modifying the search algorithm to maybe fallback to binary search if the first attempt fails (so we're still performant in the common case, but not terrible in the pathological case).
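A Python sketch of the proposed strategy (illustrative only; the real implementation would be C): try the expected slot first, then fall back to binary search over the sorted field names via `bisect`:

```python
from bisect import bisect_left

fields = ["name", "email", "groups", "created_at"]       # definition order
sorted_fields = sorted(fields)                           # for the fallback
sorted_slots = [fields.index(f) for f in sorted_fields]  # map back to slots

def lookup(key, expected):
    # Fast path: the key arrives in definition order (one comparison)...
    if expected < len(fields) and fields[expected] == key:
        return expected
    # ...otherwise fall back to O(log n) binary search.
    i = bisect_left(sorted_fields, key)
    if i < len(sorted_fields) and sorted_fields[i] == key:
        return sorted_slots[i]
    return -1

slot = lookup("groups", expected=2)    # fast-path hit
slot2 = lookup("groups", expected=0)   # out of order: binary search
```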

JSON support?

The schema handling logic and struct objects aren't MessagePack specific, and could be adapted for other protocols. This would allow an application to switch between JSON and MessagePack while making use of the same schemas and objects throughout. I don't plan on working on this anytime soon, but adding support for JSON here isn't unreasonable. Unlike MessagePack, though, parsing JSON is nontrivial, so adding support would require some care.

support arbitrary-sized integers

I often deal with 256-bit integers, but msgspec only appears to support up to 64-bit integers.

It would be great to still be able to encode larger integers, even if it's slower.

>>> json.dumps({'big': 2**256-1})
'{"big": 115792089237316195423570985008687907853269984665640564039457584007913129639935}'

>>> msgspec.json.encode({'big': 2**256-1})
OverflowError: int too big to convert

Extending the built-in type set with (a) tagged union(s) isn't supported?

My use case: handling an IPC stream of arbitrary object messages, specifically with msgpack. I desire to use Structs for custom object serializations that can be passed between memory boundaries.


My presumptions were originally:

  • a top-level decoder is used to process msgpack data over an IPC channel
  • by default, I'd expect that decoder to decode using the Python type set, so it can accept arbitrary msgpack bytes as well as tagged msgspec.Structs
  • if a custom tagged struct was placed inside some std Python type (i.e. embedded), I'd expect this decoder (if enabled as such) to detect the tagged object field (say {"type": "CustomStruct", "field0": "blah"}) and automatically know that the embedded msgpack object is one of our custom tagged structs and should be decoded as a CustomStruct.

Conclusions

Based on below thread:

  • you can't easily define the std type set and a custom tagged struct using Union
  • Decoder(Any | Struct) won't work even for top level Structs in the msgpack frame

This took me a little while to figure out because the docs don't have an example for this use case. If you want a Decoder that handles a Union of tagged structs while still processing the standard built-in type set, you need to spell out the subset of std types that don't conflict with Struct, per @jcrist's comment in the section starting with:

This is not possible, for the same reason as presented above. msgspec forbids ambiguity.

So Decoder(Any | MyStructType) will not work.

I had to dig into the source code to figure this out and it's probably worth documenting this case for users?


Alternative solutions:

It seems there is no built-in way to handle an arbitrary encoded stream that you wish to decode into the default type set while also decoding embedded tagged Struct types.

But, you can do a couple other things inside custom codec routines to try and accomplish this:

  • create a custom boxed Any struct type, as per @jcrist's comment under the section starting with:

    Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any.

  • consider creating a top-level boxing Msg type and then using msgspec.Raw and a custom decoder table to decode payload msgpack data as in my example below

Non-string tag values

I have an existing encoding that almost-but-not-quite fits msgspec. The "almost" is that tags are integer rather than string. Would it make sense to accept non-string tags? I think the only non-string case that would be valuable is for integers.

Add support for `typing.ByteString`/`collections.abc.ByteString`

When parsing type signatures for structs, it'd be useful to support a more abstract bytes-like object. For encoding this would result in no major changes (encoding only deals with concrete types). For decoding, this would decode the type as bytes (unless an extension type was used).

Allow encoding and decoding using array_like True and False

I'd like to use the package to minimize a payload. Data is received from an endpoint in JSON and I would like to encode it array-like in messagepack. However this currently seems hard/impossible to do. Say there is some data coming in:

import json
import msgspec

class User(msgspec.Struct, array_like=False):
    name: str

d = {"name": "John"}
u = msgspec.json.decode(json.dumps(d).encode("utf-8"), type=User)

I want to encode it for storage/IO in the array-like structure, however I cannot use the already defined Structs for this.
e.g. something like:

msgspec.msgpack.encode(u, array_like=True)

After doing operations/storing it or whatever, I'd like to be able to also reverse this process. I.e. turn the array-like messagepack encoded payload back into a human readable json payload.

I'm also having a hard time finding a DIY workaround to make this work. Any tips would be appreciated!

Segfault when creating a `Decoder` inside a struct's `__init_subclass__` method

When creating a decoder object inside a struct's __init_subclass__ method, a segfault occurs.

import msgspec
from typing import Any

class SerializationBase(msgspec.Struct):
    __encoder = msgspec.msgpack.Encoder()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.__decoder = msgspec.msgpack.Decoder(cls)

    def dumps(self):
        return self.__encoder.encode(self)

    def loads(self, data: bytes) -> Any:
        return self.__decoder.decode(data)

class Param(SerializationBase, msgspec.Struct):
    param: int

This is because the base tp_new method handles __init_subclass__ internally, which means that __init_subclass__ gets called before tp_new returns to let StructMeta_new finish initializing the struct class. CPython really does not want you to define a type object with a custom struct layout.

To work around this, we probably need to split StructMeta_new into two parts.

  • StructMeta_new will look much the same up until the tp_new call. Here we pass in the constructed struct_fields/struct_defaults/... parameters as kwargs to tp_new, which will get forwarded to __init_subclass__.
  • We then handle finishing the type initialization in a new Struct.__init_subclass__ method, extracting the information forwarded as kwargs above. Provided the user calls super().__init_subclass__(**kwargs) before touching any of the things we should be ok.

To avoid name collisions, we should probably pack all the internal details into a single kwarg with an unlikely name (e.g. __msgspec_internal__=(fields, defaults, offset_lk)).

We'll also want to harden StructMeta_prep_types to error nicely in the presence of a half-initialized type (rather than segfaulting). We'll also want to document using __init_subclass__ for this pattern (it seems genuinely useful), showing calling the super method first before initializing the base class.

Originally posted by @dhirschfeld in #70 (comment)
