microsoft / yardl

Tooling for streaming instrument data

Home Page: https://microsoft.github.io/yardl/

License: MIT License

Dockerfile 0.13% Shell 0.06% CMake 0.15% C++ 58.42% Go 11.72% PowerShell 0.03% Smarty 0.01% Just 0.07% Python 15.88% MATLAB 13.55%

yardl's People

Contributors

johnstairs sarvex dchansen kohnnness strogo

yardl's Issues

Error reading empty stream if it is the last step in a Protocol

If the last step in a Protocol is a stream, the user is batch reading the stream, and the stream is empty, then the Reader.Close() method throws an incorrect error.

Using 28aa4af and the following model:

EmptyTest: !protocol
  sequence:
    strings: !stream
      items: string

and the following demonstration program:

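#include <cassert>
#include <string>
#include <vector>
// Also include the generated binary protocol header that declares
// EmptyTestWriter and EmptyTestReader (path depends on the project layout).

int main(void) {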
  ::binary::EmptyTestWriter w("test.bin");

  std::vector<std::string> strings;
  w.WriteStrings(strings);
  w.EndStrings();
  w.Close();

  ::binary::EmptyTestReader r("test.bin");

  int count = 0;
  strings.reserve(10);
  while (r.ReadStrings(strings)) {
    for (auto const& s : strings) {
      (void)(s);
      count++;
    }
  }
  assert(count == 0);

  r.Close();

  return 0;
}

the call to r.Close() throws the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Expected call to ReadStrings() but received call to Close() instead.

Wrong error message in Python get_dtype for generic type missing type arguments

This is a very minor bug, but it led to a necessary review of how the generated Python get_dtype function should work.

Given a model with an aliased, generic type:

GenericRecord<T>: !record
  fields:
    v: T

AliasedRecord<T>: GenericRecord<T>

If I call get_dtype on GenericRecord without specifying type arguments, I get a useful error message:

m.get_dtype(m.GenericRecord)
...
RuntimeError: Generic type arguments not provided for <class 'm.types.GenericRecord'>

But if I do the same for the aliased type, I do not get the same, expected error message:

m.get_dtype(m.AliasedRecord)
...
RuntimeError: Cannot find dtype for ~T

The user does not know what ~T is.
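One way to get the same message in both cases would be to check for open type parameters before looking anything up. The following is only a sketch: `require_type_args` is a hypothetical helper, and it assumes the generated alias is a subscripted generic over the same TypeVar, which may not match what yardl actually emits.

```python
import typing

T = typing.TypeVar("T")


class GenericRecord(typing.Generic[T]):
    """Stand-in for a generated generic record class."""


# Assumption: the generated alias is a subscripted generic over the same TypeVar.
AliasedRecord = GenericRecord[T]


def require_type_args(t):
    """Return (origin, args) for t, or raise if type parameters are still open."""
    origin = typing.get_origin(t) or t
    # Both the bare class and the open alias report their unbound TypeVars here.
    if getattr(t, "__parameters__", ()):
        raise RuntimeError(f"Generic type arguments not provided for {origin}")
    return origin, typing.get_args(t)
```

With this check, `require_type_args(AliasedRecord)` names `GenericRecord` in the error instead of `~T`, while `require_type_args(AliasedRecord[int])` passes through.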

`yardl generate --watch` exits if model initially has errors

If the model has any errors, yardl generate --watch will exit instead of entering the watch loop. This appears to have been introduced in #94.

User-defined alias types are not checked for compatibility with existing types in a union

When defining a union type field within a record, validation ensures that all type cases in the union are distinct. User-defined aliases are not checked, however. This breaks compilation of the generated code, because variants end up containing the same underlying type more than once. There also appears to be a similar conflict between 'size' and 'uint64'.

MyIntType: uint64 
MyRecord: !record
  fields:
    one: [uint64, MyIntType]

MyRecord: !record
  fields:
    one: [uint64, size]

Records defined as such succeed in code generation, but that code cannot be compiled.
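The missing check could resolve every case to its underlying type before comparing. This is a rough sketch in Python rather than the tool's own code: the alias table and both function names are hypothetical, and mapping `size` to `uint64` is an assumption based on the conflict reported above.

```python
# Hypothetical alias table. Treating "size" as uint64 is an assumption
# based on the conflict described in this issue.
ALIASES = {"MyIntType": "uint64", "size": "uint64"}


def resolve(type_name):
    """Follow alias links until an underlying type name is reached."""
    seen = set()
    while type_name in ALIASES and type_name not in seen:
        seen.add(type_name)
        type_name = ALIASES[type_name]
    return type_name


def duplicate_union_cases(cases):
    """Return pairs of union cases that resolve to the same underlying type."""
    first_seen = {}
    duplicates = []
    for case in cases:
        underlying = resolve(case)
        if underlying in first_seen:
            duplicates.append((first_seen[underlying], case))
        else:
            first_seen[underlying] = case
    return duplicates
```

Both models above would then be rejected at validation time, since `duplicate_union_cases(["uint64", "MyIntType"])` and `duplicate_union_cases(["uint64", "size"])` each report one clashing pair.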

Add RelativeTime?

It seems useful to have a RelativeTime, i.e. offset w.r.t. some defined DateTime, such as a scan start. This would be quite useful in de-identifying some data. In some cases, the time of scan needs to be removed from the data, but it'd be painful to have to adjust all times in the file.
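To illustrate the idea, here is a small sketch using Python's standard datetime module. The helper names and the example dates are made up for illustration; nothing here reflects an existing yardl type.

```python
from datetime import datetime, timedelta, timezone


def to_relative(reference, timestamps):
    """Express absolute timestamps as offsets from a reference DateTime."""
    return [t - reference for t in timestamps]


def to_absolute(reference, offsets):
    """Recover absolute timestamps from offsets and a reference DateTime."""
    return [reference + offset for offset in offsets]


# Hypothetical scan: only scan_start carries identifying information.
scan_start = datetime(2023, 5, 17, 9, 30, tzinfo=timezone.utc)
events = [scan_start, scan_start + timedelta(seconds=1.5)]
offsets = to_relative(scan_start, events)

# De-identify by replacing the single reference value; offsets stay untouched.
anonymous_start = datetime(1970, 1, 1, tzinfo=timezone.utc)
shifted = to_absolute(anonymous_start, offsets)
```

Only the one reference value has to change; every stored offset remains valid.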

generate `setup.py` or similar

It'd make sense to be able to install generated python.

Add MATLAB support

Implement MATLAB codegen.

container-independent access

At present, I believe the user has to know if the stored data is binary or HDF5, and instantiate the corresponding class. That's efficient but also very inconvenient. It would certainly be nice to be able to write some client-code that does not depend on the container-type. (Edit: I see that there are abstract classes in protocols.h already, so possibly the only thing that's necessary is a factory that determines the container-type given a filename)
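Such a factory could look at the leading bytes of the file rather than the filename. The sketch below is in Python for brevity. It relies on the HDF5 format signature at offset 0 and simply assumes everything else is the yardl binary format. `detect_container`, `open_reader` and the `reader_classes` mapping are hypothetical names, not part of the generated code.

```python
# The 8-byte HDF5 format signature. HDF5 also allows it at offsets 512, 1024, ...
# when a user block is present; this sketch only checks offset 0.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def detect_container(path):
    """Guess the container type of a file from its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(len(HDF5_SIGNATURE))
    # Assumption: anything that is not HDF5 is treated as the binary format.
    return "hdf5" if head == HDF5_SIGNATURE else "binary"


def open_reader(path, reader_classes):
    """Instantiate the reader registered for the detected container type."""
    return reader_classes[detect_container(path)](path)
```

Client code would then call `open_reader(filename, {"hdf5": SomeHdf5Reader, "binary": SomeBinaryReader})` and work against the shared abstract interface.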

Implement `switch` expression in expression language, not YAML

Computed fields are currently an embedded expression language within a YAML file. switch expressions (to work with unions and optional types) are not expressed in this language, but rather as YAML nodes:

optionalNamedArrayLength: # YAML
  !switch optionalNamedArray: # YAML
    NamedNDArray arr: size(arr) # YAML-type-expression hybrid
    null: 0 # YAML-type-expression hybrid

This does not allow switch expressions to be used as part of larger expressions (type conversions, a function call argument, etc).

Instead, we should consider making switch part of the expression language. The example above might then look like:

optionalNamedArrayLength: |
  switch(optionalNamedArray) {
    NamedNDArray arr: size(arr)
    null: 0
  }

On the other hand, this syntax introduces curly braces within a YAML document, where indentation is usually favoured.

Best practices for HDF5 version management?

Does anyone have any suggestions for how to handle HDF5 versions in CMake? I have built STIR with a particular version of HDF5, and my yardl stuff accidentally with another version. Result: crash at start-up time.

Add JSON serialization target

The ability to write a protocol out as JSON could be useful for debugging, even if it is not well-suited for large streams of scientific data.

exported CMakeLists.txt problems

should set CMAKE_CXX_STANDARD (what's the current minimum?)
should set target_include_directories
do we really need to depend on the C HDF5 libraries ?

why is this code here, and if it's needed, why is it before the find_package(HDF5)? (it'll be overwritten, no?)

if(VCPKG_TARGET_TRIPLET)
  set(HDF5_CXX_LIBRARIES hdf5::hdf5_cpp-shared)
else()
  set(HDF5_CXX_LIBRARIES hdf5::hdf5_cpp)
endif()

Invalid Python codegen when record contains aliased nullable union type

Given the following model on commit 42be458:

X: [null, int, float]

MyRec: !record
  fields:
    a: X

We get the following exception when importing the generated code:

Traceback (most recent call last):
  File "/workspaces/yardl/python/run_sandbox.py", line 5, in <module>
    import sandbox
  File "/workspaces/yardl/python/sandbox/__init__.py", line 21, in <module>
    from .types import (
  File "/workspaces/yardl/python/sandbox/types.py", line 121, in <module>
    get_dtype = _mk_get_dtype()
                ^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/types.py", line 117, in _mk_get_dtype
    dtype_map.setdefault(MyRec, np.dtype([('a', get_dtype(typing.Optional[X]))], align=True))
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/_dtypes.py", line 87, in <lambda>
    return lambda t: get_dtype_impl(dtype_map, t)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/_dtypes.py", line 60, in get_dtype_impl
    return _get_union_dtype(get_args(t))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/_dtypes.py", line 81, in _get_union_dtype
    inner_type = get_dtype_impl(dtype_map, args[0])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/python/sandbox/_dtypes.py", line 76, in get_dtype_impl
    raise RuntimeError(f"Cannot find dtype for {t}")
RuntimeError: Cannot find dtype for <class 'sandbox.types.X'>

Another problem is that the dtype for [null, int, float] should be np.object_, instead of {"has_value": np.bool, "value": np.object_}

Another problem is that the generated union classes that do not have a named type (e.g. Int32OrString) are not recognized by get_dtype() and throw.

confusing naming of functions/members

There is some renaming of members going on in the generated code, but it is not consistent

ScannerInformation: !record
  fields:
    tofBinEdges: !array
  computedFields:
    numberOfTOFBins: size(tofBinEdges)-1

leads to tof_bin_edges member in both C++ and Python, but NumberOfTOFBins() (note capital N) in C++ while number_of_tof_bins() in Python

Personally I'd try to avoid any renaming, but maybe that is difficult when covering multiple languages. We could enforce naming in the yardl model?
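For reference, the two conventions seen above can be reproduced with a couple of small helpers. This is only an illustration of the naming rules, not the generator's actual implementation; both function names are made up.

```python
import re


def to_snake_case(name):
    """Convert a camelCase model name to snake_case, keeping acronyms together."""
    # Split an acronym from a following capitalized word: "TOFBins" -> "TOF_Bins".
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from a following capital: "fB" -> "f_B".
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    return s.lower()


def to_pascal_case(name):
    """Capitalize only the first character, leaving the rest unchanged."""
    return name[0].upper() + name[1:]
```

Here `to_snake_case("numberOfTOFBins")` gives `number_of_tof_bins` and `to_pascal_case("numberOfTOFBins")` gives `NumberOfTOFBins`, which are the two spellings reported for Python and C++.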

binary format doc on alignment

Because the format uses varints, strings, etc., it's possible that data is not aligned to a 32-bit (or any other) boundary. It doesn't seem to be documented whether the binary format fills in the gaps or not. This certainly needs to be documented for Records and Streams.
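To make the concern concrete, here is a sketch of an unsigned varint encoder. It assumes the LEB128-style encoding used by Protocol Buffers (7 payload bits per byte, high bit as continuation flag) and no padding between values, which is exactly the part that is undocumented.

```python
def encode_unsigned_varint(value):
    """Encode a non-negative integer with 7 payload bits per byte."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# Consecutive fields written back to back, with no padding (assumption).
fields = [1, 300, 70000]
encoded = [encode_unsigned_varint(v) for v in fields]
offsets = [sum(len(e) for e in encoded[:i]) for i in range(len(encoded))]
```

The three values take 1, 2 and 3 bytes, so they start at offsets 0, 1 and 3 and anything following them has no natural alignment.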

environment.yml is overcomplete (and Linux specific)

On Windows from Powershell.

mamba env create  --file environment.yml

Looking for: ['bash-completion=2.11', 'ccache=4.5.1', 'clang-format=14.0.4', 'cmake=3.21.3', 'fmt=8.1.1', 'gcc_linux-64]

Could not solve for environment specs
Encountered problems while solving:
  - nothing provides requested bash-completion 2.11**
  - nothing provides requested gcc_linux-64 11.2.0**
  - nothing provides requested gdb 11.2**
  - nothing provides requested gxx_linux-64 11.2.0**
  - nothing provides requested valgrind 3.18.1**

The environment can't be solved, aborting the operation

I guess we should remove valgrind and gdb? Even bash_completion and ccache. Maybe even clang-format.

Of course, the justfile is bash/Linux specific as well and I guess Windows support is for later.

PS: Is pinning the compiler version etc best practice? Maybe some of these could be >=?

Unify type and expression parsers

Vectors of bools broken

Vectors of booleans:

V: !vector
  items: bool

are not handled because the .data() method is deleted from the std::vector<bool> specialization. Additionally, binary serialization should write out a bitstream where each value is a bit rather than a byte.
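The bit-per-value layout could look like the sketch below, written in Python for brevity. The bit order (least significant bit first) and the function names are assumptions; the issue does not specify them.

```python
def pack_bools(values):
    """Pack booleans into bytes, least significant bit first (assumed order)."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_bools(data, count):
    """Inverse of pack_bools; count is needed because the last byte may be padded."""
    return [bool((data[i // 8] >> (i % 8)) & 1) for i in range(count)]
```

Ten booleans then occupy two bytes instead of ten, at the cost of having to store the element count separately.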

Raise error or warning when type parameters are unused

When a generic type's type parameter is unused, we should raise an error or warning.

MyUnion<T, U>: [T, int]

U is unused.
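A rough sketch of the check, working on type expressions as plain strings. A real implementation would presumably walk the parsed type tree instead; the function name is hypothetical.

```python
import re


def unused_type_parameters(type_parameters, type_expressions):
    """Return the type parameters that never appear in any type expression."""
    used = set()
    for expr in type_expressions:
        # Collect every identifier-like token, e.g. "T->U" yields T and U.
        used.update(re.findall(r"[A-Za-z_]\w*", expr))
    return [p for p in type_parameters if p not in used]
```

For the example above, `unused_type_parameters(["T", "U"], ["T", "int"])` returns `["U"]`.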

Serialized stream lacks terminating byte if you don't use a protocol Writer as a context manager in Python

In Python, the binary protocol Writer is meant to be used as a context manager, e.g.

with MyProtocolWriter(filename) as w:
   w.write...

It is also possible to use the class directly and manually call its .close() method when finished, e.g.

w = MyProtocolWriter(filename)
w.write...
w.close()

However, when using it in this form, the zero byte normally written to terminate the stream is not written at all. This causes an unexpected error when reading the stream later (either an early EOF, or unexpected call to read a different protocol step).

Model:

MyProtocol: !protocol
  sequence:
    xs: !stream
      items: int

Example:

from issue.binary import BinaryMyProtocolWriter, BinaryMyProtocolReader

w = BinaryMyProtocolWriter("test.bin")
w.write_xs(list(range(42)))
w.close()

r = BinaryMyProtocolReader("test.bin")
xs = r.read_xs()
assert len(list(xs)) == 42
r.close()

Run it:

Traceback (most recent call last):
  File "/workspaces/yardl/joe/issue-#137/python/test.py", line 9, in <module>
    assert len(list(xs)) == 42
               ^^^^^^^^
  File "/workspaces/yardl/joe/issue-#137/python/issue/protocols.py", line 118, in _wrap_iterable
    yield from iterable
  File "/workspaces/yardl/joe/issue-#137/python/issue/_binary.py", line 971, in read
    while (i := stream.read_unsigned_varint()) > 0:
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/joe/issue-#137/python/issue/_binary.py", line 228, in read_unsigned_varint
    self._fill_buffer(1)
  File "/workspaces/yardl/joe/issue-#137/python/issue/_binary.py", line 299, in _fill_buffer
    raise EOFError("Unexpected EOF")
EOFError: Unexpected EOF
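One way to make both usages equivalent is to have the context manager exit delegate to close(), so the terminator is written on a single code path. The class below is a simplified, hypothetical stand-in with an invented byte layout, not the generated Writer.

```python
import io


class SketchStreamWriter:
    """Hypothetical writer: close() and the context manager share one code path."""

    def __init__(self, stream):
        self._stream = stream
        self._closed = False

    def write_items(self, items):
        # Invented layout: a nonzero marker byte followed by one payload byte.
        for item in items:
            self._stream.write(bytes([1, item]))

    def close(self):
        # Write the zero terminator exactly once, however close() is reached.
        if not self._closed:
            self._stream.write(b"\x00")
            self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
```

With this shape, `with SketchStreamWriter(buf) as w: ...` and an explicit `w.close()` produce identical bytes, and calling close() twice is harmless.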

Python ndjson error reading aliased nullable union with value None

Using yardl v0.4.0, if you add an aliased nullable union to a Protocol sequence, the NDJSON reader will crash if the value of that step is None.

Example:

GenericNullableUnion2<T1, T2>: [null, T1, T2]

RecordWithUnions: !record
  fields:
    value: [null, int, string]
    aliasedValue: GenericNullableUnion2<int, string>

Then, using the following code to convert an instance of RecordWithUnions to json and back again:

import yay

converter = yay.ndjson.RecordWithUnionsConverter()

json = converter.to_json(yay.RecordWithUnions())

r = converter.from_json(json)

The last line throws:

Traceback (most recent call last):
  File "/workspaces/yardl/joe/issue-#113/python/test.py", line 7, in <module>
    r = converter.from_json(json)
        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/joe/issue-#113/python/yay/ndjson.py", line 58, in from_json
    aliased_value=self._aliased_value_converter.from_json(json_object["aliasedValue"],),
                                                          ~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 'aliasedValue'
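The KeyError suggests that the key is omitted on write when the value is None but required on read. That is an inference from the traceback, not something confirmed here. Under that assumption, a tolerant reader could treat a missing key as None, as in this sketch with plain dicts and hypothetical function names:

```python
def record_to_json(record):
    """Serialize a record dict, omitting None-valued fields (assumed behavior)."""
    return {key: value for key, value in record.items() if value is not None}


def record_from_json(json_object, field_names):
    """Rebuild a record dict, treating missing keys as None instead of raising."""
    return {name: json_object.get(name) for name in field_names}
```

A record whose `value` and `aliasedValue` are both None then survives the round trip through an empty JSON object.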

Python `get_dtype` does not work for vectors, arrays, and maps

Using the current test model, yardl throws RuntimeError: Cannot find dtype for each of the following assertions, all of which should hold:

import numpy as np
import test_model as tm

assert tm.get_dtype(tm.AliasedGenericVector[int]) == np.object_

assert tm.get_dtype(tm.AliasedGenericFixedVector[int]) == np.int32

assert tm.get_dtype(tm.AliasedGenericDynamicArray[int]) == np.object_

assert tm.get_dtype(tm.AliasedGenericFixedArray[int]) == np.int32

assert tm.get_dtype(tm.basic_types.AliasedMap[str, int]) == np.object_

Invalid Python generated for records containing optional generic fields

Using yardl 2d61ba3 with the following minimal model:

MyRecord: !record
  fields:
    myField: RecordWithGenericOptional<string>

RecordWithGenericOptional<T>: !record
  fields:
    value: T?

The generated types.py does not properly initialize the inner record class. See generated classes below:

class RecordWithGenericOptional(typing.Generic[T]):
    value: typing.Optional[T]

    def __init__(self, *,
        value: typing.Optional[T],
    ):
        self.value = value

    ...


class MyRecord:
    my_field: RecordWithGenericOptional[str]

    def __init__(self, *,
        my_field: typing.Optional[RecordWithGenericOptional[str]] = None,
    ):
        self.my_field = my_field if my_field is not None else RecordWithGenericOptional()

    ...

Python throws a TypeError when creating an instance of MyRecord:

In [1]: import combined

In [2]: combined.MyRecord()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 combined.MyRecord()

File /workspaces/yardl/joe/issue-switch-over-union/python/combined/types.py:45, in MyRecord.__init__(self, my_field)
     42 def __init__(self, *,
     43     my_field: typing.Optional[RecordWithGenericOptional[str]] = None,
     44 ):
---> 45     self.my_field = my_field if my_field is not None else RecordWithGenericOptional()

TypeError: RecordWithGenericOptional.__init__() missing 1 required keyword-only argument: 'value'

In RecordWithGenericOptional.__init__, value should be instantiated with a default value of None because it is Optional.

However, yardl currently omits default values for all generic types:

yardl/tooling/internal/python/types/types.go

Lines 181 to 187 in 2d61ba3

if dsl.ContainsGenericTypeParameter(f.Type) {
	// cannot default generic type parameters
	// because they don't really exist at runtime
	defaultExpressionKind = defaultValueKindNone
} else {
	defaultExpression, defaultExpressionKind = typeDefault(f.Type, rec.Namespace, "", st)
}
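For comparison, this is roughly what the generated classes would need to look like for MyRecord() to work. It is hand-written for illustration, not actual yardl output: the only change from the generated code shown earlier is the `= None` default on `value`.

```python
import typing

T = typing.TypeVar("T")


class RecordWithGenericOptional(typing.Generic[T]):
    value: typing.Optional[T]

    def __init__(self, *,
        value: typing.Optional[T] = None,  # Optional fields can safely default to None
    ):
        self.value = value


class MyRecord:
    my_field: RecordWithGenericOptional[str]

    def __init__(self, *,
        my_field: typing.Optional[RecordWithGenericOptional[str]] = None,
    ):
        self.my_field = my_field if my_field is not None else RecordWithGenericOptional()
```

The default is safe even though T is unknown at runtime, because None is a valid value for Optional[T] whatever T turns out to be.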

Implement Python code generation

We will look into wrapping the C++ codegen with pybind11, or whether the Python implementation will be completely separate.

time zone handling in documentation

https://microsoft.github.io/yardl/reference/binary.html#dates-times-and-datetimes isn't clear on time zone handling. The description of DateTimes is clear enough (although a reference to what "since epoch" means would be good), but the doc on Dates and Times should presumably say that these are in UTC. That might be undesirable or confusing though, so maybe it's better to support specifying time zones.

Also, I'm assuming the types are actually singular, e.g. DateTime. I think a different font needs to be used for the actual type name (like you use for float etc).

Python RecursionError with aliased generics

Using yardl commit ae9b826 with the following model:

GenericRecord<T>: !record
  fields:
    v: T

AliasedRecord<T>: GenericRecord<T>

AliasedOpenGeneric<T>: AliasedRecord<T>
AliasedClosedGeneric: AliasedRecord<string>

To reproduce, generate Python for this model, then import the generated Python module.
The module won't import, failing with the following error:

Traceback (most recent call last):
  File "/workspaces/yardl/joe/models/bug/python/test.py", line 1, in <module>
    import bug
  File "/workspaces/yardl/joe/models/bug/python/bug/__init__.py", line 21, in <module>
    from .types import (
  File "/workspaces/yardl/joe/models/bug/python/bug/types.py", line 56, in <module>
    get_dtype = _mk_get_dtype()
                ^^^^^^^^^^^^^^^
  File "/workspaces/yardl/joe/models/bug/python/bug/types.py", line 52, in _mk_get_dtype
    dtype_map[AliasedClosedGeneric] = get_dtype(types.GenericAlias(AliasedRecord, (str,)))
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/joe/models/bug/python/bug/_dtypes.py", line 107, in <lambda>
    return lambda t: get_dtype_impl(dtype_map, t)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/joe/models/bug/python/bug/_dtypes.py", line 90, in get_dtype_impl
    return res(get_args(t))
           ^^^^^^^^^^^^^^^^
  File "/workspaces/yardl/joe/models/bug/python/bug/types.py", line 51, in <lambda>
    dtype_map[AliasedOpenGeneric] = lambda type_args: get_dtype(types.GenericAlias(AliasedRecord, (type_args[0],)))
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
RecursionError: maximum recursion depth exceeded in comparison

For this particular model, reordering the aliases as shown below can eliminate the error, but this is a confusing limitation for the user.

AliasedClosedGeneric: AliasedRecord<string>
AliasedOpenGeneric<T>: AliasedRecord<T>

yardl version not yet defined or set during build

Soon we'll need to actually version yardl, which may include:

Defining version in the repo
Configuring build to set version/commit in cmd/yardl.go
Tag and release

yardl/tooling/cmd/yardl/main.go

Lines 10 to 13 in 76ea21d

var (
	// set during build
	version = ""
	commit = ""

Union of aliased type compiler error for C++ NDJSON

Using 28aa4af and the following model:

UnionOfAlias: !protocol
  sequence:
    variant: [int, string]
    variantAlias: [AliasedInt, string]

produces the following compiler error for the C++ NDJSON serialization:

/workspaces/yardl/joe/quickcheck/cpp/generated/ndjson/protocols.cc:33:8: error: redefinition of 'struct nlohmann::json_abi_v3_11_2::adl_serializer<std::variant<int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >'
   33 | struct adl_serializer<std::variant<check::AliasedInt, std::string>> {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/workspaces/yardl/joe/quickcheck/cpp/generated/ndjson/protocols.cc:14:8: note: previous definition of 'struct nlohmann::json_abi_v3_11_2::adl_serializer<std::variant<int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >'
   14 | struct adl_serializer<std::variant<int32_t, std::string>> {
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

licensing of generated code

From @hansenms 's email

The yardl tool is MIT license.
During code generation, there are some hand crafted files that are copied with the generated code, they are MIT too, we should stick that license on them.
The yardl specification that the user writes can be whichever license they want. We would recommend MIT.
The code that is truly generated can have any license that the user wants. We recommend MIT.
We should add a feature to yardl, that allows you to put a license file in the _package directory, it should be emitted with the generated code. We may have to think about how that is transparent in the generated directory which license applies to what. There could be a mix of licenses in there.

Support maps

We should support maps/dictionaries as a first-class datatype. Syntax could be something like:

x: !map
  keys: string
  values: int

Keys can only be primitive scalar types.

Shorthand syntax could look like:

string->int

We should also make sure maps can be used in computed fields.

use / support for python / numpy array api

Dear Yardl developers,

would it be (in principle) possible to use numpy.array_api instead of numpy as array backend for the generated python code?
Note that numpy.array_api is a reference implementation of the array API standard.

By doing so, Yardl would be compliant with the python array api and as such more agnostic to the specific array backend which would potentially allow using other compliant array backends (e.g. cupy or pytorch) in the future.

Georg

PS: @johnstairs @hansenms thanks for the support of the 1st ETSI Hackathon

xtensor required version undocumented

For whatever reason, my conda install got xtensor=0.21.10. The generated code fails to compile, though, as xtensor_container doesn't have the flat member. xtensor-stack/xtensor@50e3d42 says this means at least 0.23.10 is required.

Ideally, this minimum version should be added to the generated CMakeLists.txt.

Of course, the same holds for other dependencies.

Extra \0 byte written in Python binary.StreamSerializer when value is empty

If the user writes an empty iterable to a binary stream, the underlying StreamSerializer should not write a 0 byte. The 0 byte used to terminate a serialized stream is written elsewhere. Currently, yardl does this:

yardl/tooling/internal/python/static_files/_binary.py

Lines 958 to 959 in 7a0ab26

if isinstance(value, list):
    stream.write_unsigned_varint(len(value))

How to reproduce:

Change the Simple protocol round trip test to write a mixture of empty and non-empty streams. It currently writes values to each stream in the first part of the test, then writes only "empty" iterables in the second part:

yardl/python/tests/test_protocol_roundtrip.py

Lines 584 to 608 in 7a0ab26

def test_simple_streams(format: Format):
    c = create_validating_writer_class(format, tm.StreamsWriterBase)
    with c() as w:
        w.write_int_data(range(10))
        w.write_int_data(range(20))
        w.write_optional_int_data([1, 2, None, 4, 5, None, 7, 8, 9, 10])
        w.write_record_with_optional_vector_data(
            [
                tm.RecordWithOptionalVector(),
                tm.RecordWithOptionalVector(optional_vector=[1, 2, 3]),
                tm.RecordWithOptionalVector(
                    optional_vector=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
                ),
            ]
        )
        w.write_fixed_vector(([1, 2, 3] for _ in range(4)))

    # empty streams
    with c() as w:
        w.write_int_data(range(0))
        w.write_optional_int_data([])
        w.write_record_with_optional_vector_data([])
        w.write_fixed_vector([])

Add the following validation:

    # mixed empty and non-empty streams
    with c() as w:
        w.write_int_data(range(0))
        w.write_optional_int_data([1, 2, None, 4, 5, None, 7, 8, 9, 10])
        w.write_record_with_optional_vector_data([])
        w.write_fixed_vector(([1, 2, 3] for _ in range(4)))

The test will fail.

Note: Adding this validation to test_simple_streams uncovered another unrelated bug in NDJsonProtocolReader._read_json_line. Separate issue.
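The failure mode can be shown with a stripped-down serializer. The sketch assumes the layout implied by the snippets above: each batch is a varint count followed by its items, and a zero count terminates the stream. All function names are hypothetical, and items are restricted to non-negative integers to keep it short.

```python
import io


def write_unsigned_varint(stream, value):
    """Write a non-negative integer with 7 payload bits per byte."""
    while value >= 0x80:
        stream.write(bytes([(value & 0x7F) | 0x80]))
        value >>= 7
    stream.write(bytes([value]))


def read_unsigned_varint(stream):
    """Read one varint written by write_unsigned_varint."""
    result = shift = 0
    while True:
        (byte,) = stream.read(1)
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_batch(stream, items):
    """Write a length-prefixed batch, skipping empty batches entirely."""
    items = list(items)
    if not items:
        return  # a zero count would be read back as the stream terminator
    write_unsigned_varint(stream, len(items))
    for item in items:
        write_unsigned_varint(stream, item)


def end_stream(stream):
    """Write the zero byte that terminates the stream."""
    stream.write(b"\x00")


def read_stream(stream):
    """Read batches until the zero terminator and return all items."""
    result = []
    while (count := read_unsigned_varint(stream)) > 0:
        result.extend(read_unsigned_varint(stream) for _ in range(count))
    return result
```

Without the early return in `write_batch`, the first empty batch would emit a zero count and the reader would stop there, leaving the remaining bytes to be misread as the next protocol step.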

Support shorthand for vectors and arrays

We could support some syntactic sugar for !vector and !array. Perhaps something like:

int* # a vector of int of unknown length
int*3 # a vector of ints of length 3

int[] # an array of ints with an unknown number of dimensions
int[,] # an array of ints with two dimensions
int[x,y] # an array of ints with two named dimensions
int[3,4] # an array of ints with two fixed dimensions
int[x:3, y:4] # an array of ints with two named and fixed dimensions

Choice of C++ ndarray type

Creating a separate issue based on #20 opened by @KrisThielemans

Also, somewhere in the doc we'll need a description of mappings between yardl types and C++ and other target languages. In particular, I believe you generate your own multi-dim array type as there still doesn't seem to be an std container sadly.

It could be useful to support a few existing multi-dim arrays to avoid copies in client-code (Boost.MultiArray and https://amypad.github.io/CuVec/ come to mind), but I can see that becoming very difficult. (If a mapping to a flat array is exposed somewhere, it'd need to be stated if row-major or column-major order is used).

These are good points. We currently use xtensor types for multidimensional arrays that we alias here. These have a .data() method that exposes the raw flat array.

I think we have some choices for this problem:

1. We implement our own ndarray types that provide the minimum API surface and aim to make interop with other libraries "easy".
2. We support a number of different libraries and generate different code depending on a setting in the _package.yaml.
3. Implement both of the above, since they are not mutually exclusive, with (1) being the default.

Related problem: in some instances, perhaps the memory should be allocated on the GPU. Should this be a property on the !array in yardl?

Python syntax/type errors when default union type is list or dict

Given the following model:

GenericUnionsRecord<T, U>: !record
  fields:
    a: !union
      tv: T*
      t: T
    b: !union
      tm: T->U
      t: T

yardl v0.4.0 generates invalid constructor code for both inner unions:

class GenericUnionsRecord(typing.Generic[T, T_NP, U]):
    ...

    def __init__(self, *, ...):
        self.a = a if a is not None else TvOrT.Tv([]())
        self.b = b if b is not None else TmOrT.Tm({}())

Warnings on import (these are TypeErrors at runtime):

/workspaces/yardl/joe/issue-#112/python/odd/types.py:58: SyntaxWarning: 'list' object is not callable; perhaps you missed a comma?
  self.a = a if a is not None else TvOrT.Tv([]())
/workspaces/yardl/joe/issue-#112/python/odd/types.py:59: SyntaxWarning: 'dict' object is not callable; perhaps you missed a comma?
  self.b = b if b is not None else TmOrT.Tm({}())

This issue occurs when the first type in the union resolves to a Python list or dict.

C++ optional build of HDF5/NDJSON support

I suggest adding

option(${prefix}_HDF5_SUPPORT "Add HDF5 protocol" ON)

or similar, and also for NDJSON. These could be advanced options. This would allow advanced users to switch off something that they don't need.

Python TypeError instantiating record that contains aliased generic field

Using yardl commit ab1e2b with the following model:

GenericRecord<T>: !record
  fields:
    v: T

AliasedRecord<T>: GenericRecord<T>

MyRecord: !record
  fields:
    myField: AliasedRecord<int>

To reproduce, generate Python for this model, then import the generated Python module and create an instance of MyRecord with no arguments.
Python will complain that MyRecord.__init__() is missing the keyword argument for my_field:

Python 3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:57:19) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import issue_082
>>> r = issue_082.MyRecord()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: MyRecord.__init__() missing 1 required keyword-only argument: 'my_field'

The generated code for MyRecord looks like this:

class MyRecord:
    my_field: AliasedRecord[yardl.Int32]

    def __init__(self, *,
        my_field: AliasedRecord[yardl.Int32],
    ):
        self.my_field = my_field

If I remove the AliasedRecord from the model and use GenericRecord directly, I get the expected class definition for MyRecord, and it works:

class MyRecord:
    my_field: GenericRecord[yardl.Int32]

    def __init__(self, *,
        my_field: typing.Optional[GenericRecord[yardl.Int32]] = None,
    ):
        self.my_field = my_field if my_field is not None else GenericRecord(v=0)

The relevant code is

yardl/tooling/internal/python/types/types.go

Lines 179 to 195 in ae9b826

var defaultExpression string
var defaultExpressionKind defaultValueKind
if dsl.ContainsGenericTypeParameter(f.Type) {
	// cannot default generic type parameters
	// because they don't really exist at runtime
	defaultExpressionKind = defaultValueKindNone
} else {
	defaultExpression, defaultExpressionKind = typeDefault(f.Type, rec.Namespace, st)
}
switch defaultExpressionKind {
case defaultValueKindNone:
	w.WriteString(fieldTypeSyntax)
case defaultValueKindImmutable:
	fmt.Fprintf(w, "%s = %s", fieldTypeSyntax, defaultExpression)
case defaultValueKindMutable:
	fmt.Fprintf(w, "typing.Optional[%s] = None", fieldTypeSyntax)
}

Support for column-major layout in binary format?

The C++ types for multi-dimensional arrays (yardl::FixedNDArray, yardl::NDArray, and yardl::DynamicNDArray) are all locked to row-major layout at compile-time.

We will eventually support languages that default to column-major ordering. HDF5 requires data to be written in row-major order, so we will need to convert. For the binary format, we could do the same, or we could prefix each array with a byte indicating the layout. This could avoid expensive permutations if readers and writers are both working with column-major ordering.

Python union classes generated with duplicate type parameters

Using the following model:

GenericUnion<T>: !union
  t: T
  tv: T*
  tvf: T[]

yardl v0.4.0 generates invalid Python:

class GenericUnion(typing.Generic[T, T_NP, T, T_NP, T, T_NP]):

Error message on import:

TypeError: Parameters to Generic[...] must all be unique
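The type parameters collected from each union case would need de-duplicating, in first-seen order, before the class header is emitted. A minimal sketch, with a hypothetical helper name and TypeVars declared by hand:

```python
import typing

T = typing.TypeVar("T")
T_NP = typing.TypeVar("T_NP")


def unique_type_parameters(parameters):
    """Drop repeated type parameters while preserving first-seen order."""
    return list(dict.fromkeys(parameters))


# Parameters as they appear to be collected today, once per union case.
collected = [T, T_NP, T, T_NP, T, T_NP]
unique = unique_type_parameters(collected)


class GenericUnion(typing.Generic[tuple(unique)]):
    """Stand-in for the generated union class."""
```

The class definition now succeeds because Generic[...] receives each TypeVar only once.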

mypy generates lots of warnings on generated code

For instance, on https://github.com/ETSInitiative/PRDdefinition/tree/main/python

$ mypy prd_generator.py 
prd/yardl_types.py:270: error: No overload variant of "zip" matches argument types "void", "void"  [call-overload]
prd/yardl_types.py:270: note: Possible overload variants:
prd/yardl_types.py:270: note:     def [_T_co, _T1] __new__(cls, Iterable[_T1], /, *, strict: bool = ...) -> zip[tuple[_T1]]
prd/yardl_types.py:270: note:     def [_T_co, _T1, _T2] __new__(cls, Iterable[_T1], Iterable[_T2], /, *, strict: bool = ...) -> zip[tuple[_T1, _T2]]
prd/yardl_types.py:270: note:     def [_T_co, _T1, _T2, _T3] __new__(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], /, *, strict: bool = ...) -> zip[tuple[_T1, _T2, _T3]]
prd/yardl_types.py:270: note:     def [_T_co, _T1, _T2, _T3, _T4] __new__(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4], /, *, strict: bool = ...) -> zip[tuple[_T1, _T2, _T3, _T4]]
prd/yardl_types.py:270: note:     def [_T_co, _T1, _T2, _T3, _T4, _T5] __new__(cls, Iterable[_T1], Iterable[_T2], Iterable[_T3], Iterable[_T4], Iterable[_T5], /, *, strict: bool = ...) -> zip[tuple[_T1, _T2, _T3, _T4, _T5]]
prd/yardl_types.py:270: note:     def [_T_co] __new__(cls, Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], Iterable[Any], /, *iterables: Iterable[Any], strict: bool = ...) -> zip[tuple[Any, ...]]
prd/yardl_types.py:299: error: "object" has no attribute "value"  [attr-defined]
prd/_ndjson.py:48: error: Incompatible types in assignment (expression has type "TextIO", variable has type "TextIOWrapper")  [assignment]
prd/_ndjson.py:86: error: Incompatible types in assignment (expression has type "BufferedReader | TextIO", variable has type "TextIOWrapper")  [assignment]
prd/_ndjson.py:940: error: <nothing> has no attribute "to_json"  [attr-defined]
prd/_ndjson.py:958: error: <nothing> has no attribute "from_json"  [attr-defined]
prd/_ndjson.py:993: error: Incompatible types in assignment (expression has type "None", variable has type "tuple[int, ...]")  [assignment]
prd/_ndjson.py:1024: error: Need type annotation for "result"  [var-annotated]
prd/_binary.py:1071: error: Incompatible types in assignment (expression has type "None", variable has type "tuple[int, ...]")  [assignment]
prd/_binary.py:1076: error: <nothing> has no attribute "_element_serializer"  [attr-defined]
prd/_binary.py:1115: error: Need type annotation for "result"  [var-annotated]

basic algebra on computed fields

It'd be nice to be able to do some basic manipulations for a computed field, e.g. subtracting 1

ScannerInformation: !record
  fields:
    # edge information for TOF bins in mm (e.g. start,edge1, ... end)
    tofBinEdges: float*
  computedFields:
    numberOfTOFBins: size(tofBinEdges)-1
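
Until arithmetic like this is accepted in computed-field expressions, the subtraction can live in client code. The sketch below is a hand-written Python stand-in for the record above, not yardl-generated code; the class and property names are assumptions made for illustration, and clamping at zero for an empty edge list is a design choice rather than something the model specifies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScannerInformation:
    # Hand-written stand-in for the record above, not yardl output.
    tof_bin_edges: List[float] = field(default_factory=list)

    @property
    def number_of_tof_bins(self) -> int:
        # N edges delimit N - 1 bins; an empty edge list yields 0 bins.
        return max(len(self.tof_bin_edges) - 1, 0)


info = ScannerInformation(tof_bin_edges=[0.0, 10.0, 20.0, 30.0])
print(info.number_of_tof_bins)
```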

multi-dim arrays

https://github.com/microsoft/yardl/blob/main/docs/docs.md#computed-fields states

MyRec: !record
  fields:
    arrayField: !array
        items: int
        dimensions: [x, y]
  computedFields:
    accessArrayElementByName: arrayField[y:1, x:0]

This swaps the order between the dimensions declaration ([x, y]) and the access ([y:1, x:0]). Is this intentional? It'd be very confusing!
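
If the intent of the docs example is that named indices may be written in any order, the lookup could work roughly as in this Python sketch. The helper `resolve_named_index` is hypothetical and only illustrates the idea; it is not part of yardl.

```python
from typing import Dict, Sequence, Tuple


def resolve_named_index(dimensions: Sequence[str], named: Dict[str, int]) -> Tuple[int, ...]:
    """Reorder name -> index pairs into the declared dimension order."""
    if set(named) != set(dimensions):
        raise KeyError(f"expected indices for {list(dimensions)}, got {sorted(named)}")
    return tuple(named[d] for d in dimensions)


# dimensions: [x, y]; access written as arrayField[y:1, x:0]
print(resolve_named_index(["x", "y"], {"y": 1, "x": 0}))
```

Under that reading, the order in which the names are written has no effect on which element is selected.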

Also, somewhere in the docs we'll need a description of the mappings between yardl types and C++ and other target languages. In particular, I believe you generate your own multi-dim array type, as there still doesn't seem to be a std container, sadly.

It could be useful to support a few existing multi-dim arrays to avoid copies in client-code (Boost.MultiArray and https://amypad.github.io/CuVec/ come to mind), but I can see that becoming very difficult. (If a mapping to a flat array is exposed somewhere, it'd need to be stated if row-major or column-major order is used).
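
To show why the storage order needs to be documented, here is a small Python sketch that computes the flat offset of the same index under both conventions. The function `flat_offset` and its `order` parameter are made up for this illustration and say nothing about which order yardl actually uses.

```python
from typing import Sequence


def flat_offset(index: Sequence[int], shape: Sequence[int], order: str = "C") -> int:
    """Offset of `index` in a flat buffer; 'C' = row-major, 'F' = column-major."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same rank")
    pairs = list(zip(index, shape))
    if order == "C":
        pairs.reverse()  # in row-major order the last dimension varies fastest
    elif order != "F":
        raise ValueError("order must be 'C' or 'F'")
    offset, stride = 0, 1
    for i, n in pairs:
        if not 0 <= i < n:
            raise IndexError("index out of range")
        offset += i * stride
        stride *= n
    return offset


print(flat_offset((1, 0), (2, 3), "C"))  # 3
print(flat_offset((1, 0), (2, 3), "F"))  # 1
```

The same element lands at different offsets, so client code that wraps the flat buffer in another array type would read wrong values if it guessed the order.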

Empty protocol results in an invalid C++ CopyTo method

Currently, yardl permits an empty Protocol (one with zero steps in its sequence).

In C++ codegen, an empty Protocol results in an unused writer parameter in the protocol reader's CopyTo method.

Model:

MyProtocol: !protocol
  sequence:

Generated CopyTo method:

void MyProtocolReaderBase::CopyTo(MyProtocolWriterBase& writer) {
}

Compiler error:

[2/7] Building CXX object generated/CMakeFiles/issue_generated.dir/protocols.cc.o
FAILED: generated/CMakeFiles/issue_generated.dir/protocols.cc.o 
...
.../cpp/generated/protocols.cc: In member function 'void issue::MyProtocolReaderBase::CopyTo(issue::MyProtocolWriterBase&)':
/workspaces/yardl/joe/issue-#ddd/cpp/generated/protocols.cc:73:57: error: unused parameter 'writer' [-Werror=unused-parameter]
   73 | void MyProtocolReaderBase::CopyTo(MyProtocolWriterBase& writer) {
      |                                   ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
cc1plus: all warnings being treated as errors

Either empty Protocols should not be permitted, or I'll add a cast to void to silence the compiler.

Unable to create switch expression of type string

Using yardl v0.4.0, I expect to be able to use a computed field switch expression to produce a string value:

RecordWithComputedFields: !record
  fields:
    myField: [null, string, float]
  computedFields:
    myResult:
      !switch myField:
        null: "null"
        string: "string"
        float: "float"

But yardl complains with:

❌ /workspaces/yardl/joe/switch-case-string/model/model.yml:6:14: there is no variable in scope with the name 'null' nor does the record 'RecordWithComputedFields' does not have a field or computed field named 'null'
❌ /workspaces/yardl/joe/switch-case-string/model/model.yml:7:18: there is no variable in scope with the name 'string' nor does the record 'RecordWithComputedFields' does not have a field or computed field named 'string'
❌ /workspaces/yardl/joe/switch-case-string/model/model.yml:8:17: there is no variable in scope with the name 'float' nor does the record 'RecordWithComputedFields' does not have a field or computed field named 'float'
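
For reference, this is the behavior I expect from the computed field, written as plain Python. It is a hand-written sketch, not generated code, and it assumes the union is represented by bare `None`, `str`, and `float` values, which may not match how generated code represents unions.

```python
from typing import Union


def my_result(my_field: Union[None, str, float]) -> str:
    """Name the active case of the [null, string, float] union."""
    if my_field is None:
        return "null"
    if isinstance(my_field, str):
        return "string"
    if isinstance(my_field, float):
        return "float"
    raise TypeError(f"unexpected type: {type(my_field).__name__}")


print(my_result(None), my_result("abc"), my_result(1.5))
```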

Support flags enums

We should have a special kind of enum for flags that are meant to be bitwise ORed together:

!flags
  values:
    - none
    - red
    - green
    - blue

The first value will always be 0.

As with enums, you can specify the base type and integer values:

!flags
  values:
    none: 0
    red: 1
    green: 2
    blue: 4
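
As a point of comparison, Python's standard `enum.IntFlag` already behaves the way such a type would be used. The sketch below mirrors the explicit values above; the class name `Color` is an assumption, since the snippet does not name the type, and nothing here claims to be what the Python codegen would emit.

```python
import enum


class Color(enum.IntFlag):
    NONE = 0
    RED = 1
    GREEN = 2
    BLUE = 4


mixed = Color.RED | Color.BLUE
print(int(mixed))            # 5
print(Color.RED in mixed)    # True
print(Color.GREEN in mixed)  # False
```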

Allow specifying HDF5 group path

When writing a protocol to an HDF5 file, we create a group with the protocol's name. If you wanted to store multiple experiments with the same protocol in the same file, we could have an optional path parameter that specifies the group to put the protocol in.

doc on enums unclear

https://microsoft.github.io/yardl/reference/binary.html#enums-and-flags has a typo ("properly"), and it is unclear how the base type would be defined. Could it link to somewhere else?

Python codegen does not default generic vector fields

There is another issue.

MyRec<T>: !record
  fields:
    a: T*

Does not default the field, whereas this does:

MyRec: !record
  fields:
    a: int*

(An existing issue, not a regression)

Originally posted by @johnstairs in #96 (comment)

Size is a valid enum type for code generation but fails compilation

When you define an enum with a size base, code generation succeeds, but the generated code fails to compile:

MyEnum: !enum
  base: size 
  values:
    a: 1 
    b: 2
    c: 3

	if dsl.ContainsGenericTypeParameter(f.Type) {
		// cannot default generic type parameters
		// because they don't really exist at runtime
		defaultExpressionKind = defaultValueKindNone
	} else {
		defaultExpression, defaultExpressionKind = typeDefault(f.Type, rec.Namespace, "", st)
	}

	if isinstance(value, list):
	    stream.write_unsigned_varint(len(value))
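
The length prefix in the snippet above is written as an unsigned varint. A minimal sketch of such an encoder follows, assuming a Protocol Buffers style base-128 encoding (7 payload bits per byte, high bit as continuation marker). The free function here is a stand-in written for illustration; it is not the stream method from the snippet.

```python
import io
from typing import BinaryIO


def write_unsigned_varint(stream: BinaryIO, value: int) -> None:
    """Write `value` as a little-endian base-128 varint (assumed encoding)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            stream.write(bytes([byte | 0x80]))  # more bytes follow
        else:
            stream.write(bytes([byte]))  # final byte, continuation bit clear
            return


buf = io.BytesIO()
items = ["a", "b", "c"]
write_unsigned_varint(buf, len(items))  # length prefix for a 3-element list
print(buf.getvalue())
```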

	def test_simple_streams(format: Format):
	    c = create_validating_writer_class(format, tm.StreamsWriterBase)

	    with c() as w:
	        w.write_int_data(range(10))
	        w.write_int_data(range(20))

	        w.write_optional_int_data([1, 2, None, 4, 5, None, 7, 8, 9, 10])
	        w.write_record_with_optional_vector_data(
	            [
	                tm.RecordWithOptionalVector(),
	                tm.RecordWithOptionalVector(optional_vector=[1, 2, 3]),
	                tm.RecordWithOptionalVector(
	                    optional_vector=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
	                ),
	            ]
	        )
	        w.write_fixed_vector(([1, 2, 3] for _ in range(4)))

	    # empty streams
	    with c() as w:
	        w.write_int_data(range(0))
	        w.write_optional_int_data([])
	        w.write_record_with_optional_vector_data([])
	        w.write_fixed_vector([])

	var defaultExpression string
	var defaultExpressionKind defaultValueKind
	if dsl.ContainsGenericTypeParameter(f.Type) {
		// cannot default generic type parameters
		// because they don't really exist at runtime
		defaultExpressionKind = defaultValueKindNone
	} else {
		defaultExpression, defaultExpressionKind = typeDefault(f.Type, rec.Namespace, st)
	}
	switch defaultExpressionKind {
	case defaultValueKindNone:
		w.WriteString(fieldTypeSyntax)
	case defaultValueKindImmutable:
		fmt.Fprintf(w, "%s = %s", fieldTypeSyntax, defaultExpression)
	case defaultValueKindMutable:
		fmt.Fprintf(w, "typing.Optional[%s] = None", fieldTypeSyntax)
	}
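
The three cases in the switch above correspond to three familiar Python parameter patterns: no default, an inline immutable default, and `Optional[...] = None` for mutable values. The sketch below is hand-written to illustrate them; the class, its field names, and the way `None` is replaced inside `__init__` are assumptions, not actual yardl output.

```python
import typing


class MyRec:
    """Hand-written illustration of the three default kinds; not yardl output."""

    def __init__(
        self,
        required: int,                                     # no default
        count: int = 0,                                    # immutable default
        items: typing.Optional[typing.List[int]] = None,   # mutable default
    ):
        self.required = required
        self.count = count
        # Build a fresh list per instance so instances never share state.
        self.items = items if items is not None else []


a, b = MyRec(1), MyRec(2)
a.items.append(42)
print(a.items, b.items)
```

A field whose type is a generic type parameter falls into the first pattern, which is why `a: T*` ends up without a default while `a: int*` gets one.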