
lson's Introduction

Here's my stuff. Feel free to ask questions or log issues.

Books

Site Description
raytracing.github.io The Ray Tracing in One Weekend book series. Three online books (with source) on how to write your own raytracer

Tools

Repo Description
calendar Windows command-line tool to print specific months or years
csub Command substitution for the Windows command line
drives Windows command-line tool to display all current drive information
eol Windows command-line filter to convert End-Of-Line character sequences
ftimecomp Windows command-line tool for the comparison of file timestamps
hex Windows command-line hexadecimal dump utility
pathmatch Windows command-line tool to print directories and files matching wildcard specifications of arbitrary depth.
timeprint Nice Windows command-line tool to very flexibly print elapsed, relative or absolute times and dates
win-scripts A collection of useful Windows CMD scripts and fragments

Experimental / Miscellaneous

Repo Description
fpWorkbench A set of C++ programs for experimenting with floating point numbers
gibber I have a personal Perl script to generate English-like gibberish (for password inspiration). Someday I hope to port this to multiple platforms as a C++ tool.
hollasch Magical repository that allows me to have a profile README
jumpdir Sigh. Someday I'll get to this V2 port of a very awesome tool I wrote years ago.
LSON Needs a better name. Yet another rethinking of JSON, with tables, graphs, and much better freedom for expressing arbitrary data types.
pi-lab Archive of some Raspberry Pi Python scripts I wrote a while ago.
ray4 Source + doc of my master's thesis Four-Space Visualization of 4D Objects. 4D raytracer.
srhlab Random fragments and notes to document various programming concepts I've learned.

lson's People

Contributors: hollasch

lson's Issues

Add clarification about case sensitivity

While it is possible to properly perform case-insensitive comparison, the process is complicated and must be explicit in the face of multiple locales. We could specify that all identifiers (keys, element types, and other IDs) are transformed to lowercase in the English locale, or processed with a standard foldcase implementation, but this becomes onerous when behavior must be identical across any number of implementations, software languages, and libraries.

In the interest of keeping the specification simple and easy to implement, simply dictate that all identifiers are case-sensitive, and perhaps advocate a convention of using lowercase.

Another option—though messy—would be to specify that all ASCII characters are compared case-insensitively in the English locale. It's an exception that would get a lot of use, but it's also an ugly inconsistency.
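As an illustration of why locale- and Unicode-aware case-insensitive comparison gets messy, here's a small Python sketch (the helper names are made up for this example) contrasting English-style lowercasing with full Unicode case folding:

```python
# Illustration: why case-insensitive identifier comparison is complicated.
# Simple lowercasing and full Unicode case folding can disagree.

def naive_equal(a: str, b: str) -> bool:
    """Compare identifiers by simple lowercasing (English-locale style)."""
    return a.lower() == b.lower()

def folded_equal(a: str, b: str) -> bool:
    """Compare identifiers using full Unicode case folding."""
    return a.casefold() == b.casefold()

# ASCII identifiers behave identically under both schemes...
assert naive_equal("MaxWeight", "maxweight")
assert folded_equal("MaxWeight", "maxweight")

# ...but the German sharp s folds to "ss", so the two schemes disagree:
assert not naive_equal("STRASSE", "straße")
assert folded_equal("STRASSE", "straße")
```

Requiring every implementation, in every language and library, to produce identical results for cases like these is exactly the burden the case-sensitive option avoids.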

Add identifiers and reference syntax — grammar

Identifiers

Form 1

[ @foo ... ]                       // Array
{ @bar ... }                       // Dictionary
[# @baz [ @qux ... ] ... #]        // Can reuse whole table or just schema (not body only)
[% @gor [ @hoz ... ] [ ... ] %]    // Can reuse whole graph or just nodes (not edges only)
[% @lof { @oog ... } { ... } %]    // ^

Form 2

zuz=#ae32fd                         // Can be used with any value
foo=[ ... ]                         // Array
bar={ ... }                         // Dictionary
baz=[# qux=[ ... ] ... #]           // Can reuse whole table or just schema (not body only)
gor=[% hoz=[ ... ] [ ... ] %]       // Can reuse whole graph or just nodes (not edges only)
lof=[% oog={ ... } { ... } %]       // ^

References

^foo                // Value expected, cannot use schema or nodes(?)
[# ^grotz ... #]    // Schema expected, must use schema
[% ^quonk ... %]    // Nodes expected, must use nodes

Suggest value reference syntax

This isn't really LSON official syntax per se, but it's a way that value references could be expressed in something like an LSON query tool, or some element extraction API.

Syntax sketch:

Object

{
    thing: { a:x b:y c:z }
}

thing.a
thing[a]
thing{1}   // Second property name?

Array

{
    thing: [ a b c ]
}

thing.0
thing[0]

Table

{
    thing: [#
        a b c :
        1 2 3
        4 5 6
        7 8 9
    #]
}

thing[2].a
thing.2.a

Graph

{
    thing: [%
        3                      // Unnamed nodes without data
        [ 0>1 1>2 2-0 ]
    %]
}

No meaningful node reference.

{
    thing: [%
        [ a b c ]              // Unnamed nodes with data
        [ 0>1 1>2 2-0 ]
    %]
}

thing[0][0] // Node 0 data
thing.0.0   // Node 0 data

{
    thing: [%
        { [ a b c ]: null }   // Named nodes with dummy data
        [ a>b b>c c-a ]
    %]
}

How to enumerate property names?

thing[0]{0} ?

{
    thing: [%
        { a:1 b:7 c:10 }      // Named nodes with data
        [ a>b b>c c-a ]       // Edges without data
    %]
}

thing[0][b]   // Node 'b' data
thing[1]{1}   // Edge 1 "name"

{
    thing: [%
        { a:1 b:7 c:10 }                  // Named nodes with data
        { a>b:red b>c:green c-a:blue }    // Edges with data
    %]
}

thing[1][b>c]   // Edge b>c data

Update table schema syntax

Get rid of the separating semicolon, enclose the feature list in square brackets, and switch to an optional colon for defining default values.

[# [ feature1  feature2:foo  feature3:(float:1) ]    // Dictionary style
   // Rows
#]

Optional table form - empty header reads column names from first row

Add an optional table form. If no strings occur between the table opener (<) and the fields separator (:), then read the table column names from the first row, which must contain only string values. That row is either delimited with optional [ ] brackets, or runs from the colon to the first of:

  • Newline / Line Feed (U+0a)
  • Carriage Return (U+0d)
  • Next Line (U+85)
  • Line Separator (U+2028)
  • Paragraph Separator (U+2029)

Example:

<:
"Date", "Source", "Destination", "Mass", "Transit Time"
...
>

Revise scalar value design

To force scalar (non-string) interpretation, surround with parentheses. Optionally, scalars may be prefixed with type name, as in (int32:0xff000102) or (csscolor:#e88040).

Unquoted strings are tested for first match against a context-sensitive set of scalar recognizers. If unrecognized by the current set of types, then treat as a string (but preserve unquoted nature when re-encoding).
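A minimal Python sketch of such a recognizer waterfall (the recognizer set and ordering here are purely illustrative, not part of any spec):

```python
# Sketch of a context-sensitive scalar-recognizer waterfall: each recognizer
# tries to interpret an unquoted token; the first match wins, and unmatched
# tokens fall through to plain strings.

def recognize_bool(tok):
    return {"true": True, "false": False}.get(tok)

def recognize_number(tok):
    try:
        return float(tok)
    except ValueError:
        return None

RECOGNIZERS = [recognize_bool, recognize_number]   # ordered waterfall

def interpret(token: str):
    for recognizer in RECOGNIZERS:
        value = recognizer(token)
        if value is not None:
            return value
    return token   # unrecognized: treat as a string
```

A real implementation would also need to remember that the token was unquoted, so re-encoding can preserve that.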

Abandon node set using array of names

Currently graph nodes may be named and without data using the convention [ a b c ]. This precludes using node indices for the graph edges, and is confusing when node names are an arbitrary set of numeric values (not starting at zero and increasing by one).

Ideally, though, it would be nice to allow named nodes that are referenced by index, to prevent file size explosion for some cases.

Consider solving this in two steps:

  1. Allow for the specification of dictionaries without data, like so:

    { violet; indigo; blue; green; yellow; orange; red }

In this example, the names are immediately terminated without specifying an associated value. In this case, what happens outside of a graph context? Perhaps this is just a set of string values?

  2. Promote commas and semicolons to formal tokens. Currently, they're just treated as whitespace. Instead, specify where they can optionally occur. In this case, they become syntactically meaningful.

One primitive type: value, plus one natively supported value type: string

This simplifies things tremendously. LSON has only two unstructured value types: string and scalar.

  • String values are string only, and must be quoted as such.
  • Scalars have two values: their string representation and their native value
  • Scalar native values may or may not be supported by the application
  • Scalars may omit type. In this case, it is up to the application to infer type

The following LSON fragments all yield an untyped scalar:

  • true
  • (true)
  • (:true)

When encoding scalars, it is up to the application to choose whether to specify or omit their type. If types are omitted, the application may choose to encode scalars as bare words. Thus, the following transforms are possible:

  • true → (boolean:true)
  • true → (:true)
  • true → (true)
  • true → true

Define default feature value syntax for tables

Consider a table syntax that allows you to define default values for table features.

Trailing Undefined

[# name color weight = null black 0.00 :
    [ Alexander   blue   1.37 ]    // = "Alexander",  "blue",  "1.37"
    [ Brigitte                ]    // = "Brigitte",   "black", "0.00"
    [ Clodthorpe  green       ]    // = "Clodthorpe", "green", "0.00"
#]

Alternatively,

[# name=null color=black weight=0.00 :

I think I like this better. This allows for sparse defaults, and makes the association clear in the
case of many features.

Interior Undefined

Oops. I didn't formally document the , and ; terminators (see issue #16). We might want to get more formal about how we use these. For example, consider:

Alexander, , 1.37;   // Use default value for color
Brigitte;            // Use default value for color, weight
Clodthorpe green;    // Use default value for weight

Document commas and semi-colons

This one slipped through the cracks. Though , and ; characters are used throughout the LSON document, they're not explicitly documented anywhere.

So far, they've been used as syntactic sugar to optionally terminate values. Now that we might tackle default values for tabular data, we need to actually nail down their syntax.

Note that they're currently defined in the grammar section, as <terminator> lexemes.

Tables with dictionary rows

The table structure provides a way to ensure consistent data features, along with a way to default missing values. Given this, it would be useful to have tables with dictionary rows, for cases where it's useful to specify sparse information using the standard defaulting mechanism of tables. This offers several benefits over an array of objects:

  1. Explicit field names increase readability, and can be order independent.
  2. Fields can have default types and values.
  3. Fields (with optional type information) can be required.
  4. Unrecognized fields can be reported as errors.

For example:

[#
    a=true, b=1.0, c="foo", d=none, e=normal, f=100%, g=[[1 0][0 1]]
    :
    { b:22.3, d:all }
    { g:rotate(30), f:50%, c:"bar" }
    { a:false, e:heavy }
#]

The result is a data table with three rows, where each row has all values defined, either explicitly or via defaults.
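A quick Python sketch of this defaulting and error behavior (using the simpler feature values from the example above; `fill_row` is a hypothetical helper, not spec language):

```python
# Sketch: each dictionary row is merged over the schema's defaults, so every
# resulting row has all features defined; unknown feature names are errors.

defaults = {"a": True, "b": 1.0, "c": "foo"}

def fill_row(row, defaults):
    """Sparse dictionary row + schema defaults -> fully populated row."""
    unknown = set(row) - set(defaults)
    if unknown:
        raise ValueError(f"unrecognized features: {sorted(unknown)}")
    return {**defaults, **row}

rows = [fill_row(r, defaults) for r in [{"b": 22.3}, {"a": False}]]
# rows[0] == {"a": True, "b": 22.3, "c": "foo"}
# rows[1] == {"a": False, "b": 1.0, "c": "foo"}
```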

In addition to the above advantages, tables can catch errors based on violations of the schema declaration:

[#
    a=(boolean:), b, c="foo", d=none, e=normal, f=100%, g=[[1 0][0 1]]
    :
    { a:null ... }
    //  ^ Error: type mismatch (detected in clients that care)

    { b:3.7 }
    //      ^ Error: required feature 'a' not defined (it has no default)

    { a:true, b:4.6, x:red }
    //               ^ Error: unrecognized feature 'x'
#]

Elevate schema to a first-class construct, deprecate table?

Or put another way, should a table just be an array with a schema? Note that graphs also employ schemas; a schema is essentially just a vector of name-type information. Current approach:

[# // Table
    [ id:(integer32:0) name:(text:) rank:(double:0) ]  // First array is interpreted as a schema
    [ ... ] // Data row 1
    [ ... ] // Data row 2
    [ ... ] // Data row 3
    ...
#]

Instead, if a schema were enclosed in [# ... #] delimiters (to use a stand-in for now), a table would just be an array with a schema:

[ // Regular array (so far)
    [#[ ... ]#]  // Schema must be first element. A schema is an array of name-type information.
    [ ... ]   // Data row 1
    [ ... ]   // Data row 2
    [ ... ]   // Data row 3
    ...
]

It is still an open question whether we should allow rows without explicit brackets. That is, given a schema with N elements, the first N elements would define row 1, the next N elements row 2, and so on. It would be an error for an array with schema of size N to have M elements, where M mod N ≠ 0.
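The bracket-free row rule can be sketched in Python (assuming a simple flat-list representation of the array body; `rows_from_flat` is a hypothetical helper):

```python
# Sketch: given a schema of N features, split a flat element list into rows
# of N elements each; a leftover partial row (M mod N != 0) is an error.

def rows_from_flat(elements, schema):
    n = len(schema)
    if len(elements) % n != 0:
        raise ValueError(
            f"element count {len(elements)} is not a multiple of schema size {n}")
    return [elements[i:i + n] for i in range(0, len(elements), n)]

# rows_from_flat(["a", 1, "b", 2, "c", 3], ["name", "rank"])
#   → [["a", 1], ["b", 2], ["c", 3]]
```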

Note that a typed array is then just a table with a schema of a single component:

[[#(double:0)#] 1, 3.4, -8.6, 10.2, 0, 0, 1.8 ]

There's a possible problem with this approach, since graph nodes or edges can be tables. If we eliminate a special table signifier (something in [# ... #] delimiters), then the "array with a schema is a table" rule has to be well understood. I think that's ok.

Surrender to the dark side: formalize metadata structure?

My prior plan was to rely on an ad hoc, use-case defined waterfall of element string "recognizers". However, that's an unformalized, error-prone approach, and can lead to erroneous assignment of types.

This recently was in the news, as some DNA sequences look like dates to Excel when importing CSVs: https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates

One could easily imagine similar cases in a world of unlimited types.

I have been very resistant to the idea of formalizing metadata logic in LSON, but the above example forces me to throw in the towel.

I'll need to think about this much more, but here's a representative sketch:

{!
    aliases: {
        foo: bar
        baz: qux
        number: (real:)
        infinity: (real:infinity)
        empty: ()
    }
    auto-type: [ null, bool, real, css-color ]
    ...
    body: ...
!}

Of course, this "solution" may well be insufficient. For example, if the auto-type waterfall includes date before dna, then all we've done is formalize the weakness that led to the problem referenced above.

Further, how could we avoid the bloat associated with all this formalism up front? This is reminiscent of the old Direct3D .x file format, or the XML boilerplate bloat associated with many applications.

An alternative is to eliminate the concept of a waterfall altogether (or perhaps that's a recommendation?), and instead, require explicit types everywhere, and alleviate the pain with aliases.

Formalize ID/key syntax

The documentation for type IDs presents a very flexible way of handling types containing spaces. The current documentation for dictionaries implicitly suggests that keys with spaces should be either quoted or escaped. However, the same mechanism described for element types can apply to dictionary keys.

With element types, the lexer is greedy for colons, and the first bare colon encountered is considered a lexical element. In general, dictionary keys and element types should be handled in the same way.

One concern is that by not requiring quoting of keys/IDs that contain whitespace, there are potential issues with the lexer, which would need to preserve whitespace exactly. That may be necessary regardless, but it could make things messy. For example, line-ending transforms could alter type names that contain such characters.

Jettison references and instancing?

The syntax and rules feel forced and unnatural. For example, parsing is definitely complicated by the fact that you don't know whether a reference is being defined until you encounter (or don't) an equals sign.

Secondly, the lexical scoping simplifies things, but is likely not intuitive.

The primary problem with introducing references is that we can only define these in the location of standard value definitions.

I think the added value isn't worth the additional cost.

Nodes with tabular data

Support nodes with tabular data? For example:

[%
    [#
        [ id label="" weight=(number:0) ] :
        [ 0 red   10 ]
        [ 1 blue  25 ]
        [ 2 green 37 ]
    #]
    ...
%]

The challenge in this scenario is figuring out the node ID. Two options come to mind: either look for a table feature with the special name "id" (or "node_id" or "_node_id"), or select the first feature as the node id.

In both cases, the parser must ensure that the node IDs are all unique. It should be a syntax error if two rows specify the same node id.

Another option would be to refer to nodes via their row index, starting at zero. This has the advantage that it's already similar to arrays of node data, doesn't suffer from the possibility of ID collision, is more compact when representing edges, and doesn't require a unique identifier in the node data (which in some cases might have to be generated for just this purpose).

Perhaps these two approaches could be combined? One could allow either the row index or the values of the first feature. I don't think that would work, though: the first feature could itself contain counting values, making node references ambiguous.

Now leaning toward using the first feature. It's easy to inject serial numbers into the table data, and it makes the result much more readable for humans.

Conclusion: First feature = row ID.
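The resulting parser obligation can be sketched in Python (hypothetical helper; rows are plain lists, with the first feature taken as the node ID):

```python
# Sketch: take the first table feature as the node ID, and reject duplicate
# IDs, as the parser would.

def node_ids(rows):
    seen = set()
    for row in rows:
        node_id = row[0]          # first feature = row ID
        if node_id in seen:
            raise ValueError(f"duplicate node id: {node_id!r}")
        seen.add(node_id)
    return seen

# node_ids([[0, "red", 10], [1, "blue", 25]])  → {0, 1}
```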

Allow paths in graph edge declarations

For example a - b - c - d - e or a → b → c ← e ← f.

Note that parsing becomes a matter of caching the prior node. A sequence of two nodes is fine (starting a new base node), but a sequence of three nodes is an error. For example:

a → b → c d → e             // a→b, b→c, d→e
a → b → c f d → e           // Error: expected edge type but found 'd'
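The prior-node caching described above can be sketched in Python (using `>` as a stand-in for all edge markers; `parse_path` is a hypothetical helper):

```python
# Sketch: parse edge paths by caching the prior node. A bare node after a
# completed edge starts a new base; a second consecutive bare node is an error.

def parse_path(tokens):
    edges = []
    prior = None      # cached prior node
    pending = False   # a '>' marker awaits its target node
    bare = False      # prior is a fresh base node, not yet used in an edge
    for tok in tokens:
        if tok == ">":
            if prior is None:
                raise ValueError("edge marker with no source node")
            pending, bare = True, False
        elif pending:
            edges.append((prior, tok))   # complete edge; target is new prior
            prior, pending = tok, False
        elif bare:
            raise ValueError(f"expected edge marker but found {tok!r}")
        else:
            prior, bare = tok, True      # start a new base node
    return edges

# parse_path("a > b > c d > e".split())
#   → [("a", "b"), ("b", "c"), ("d", "e")]
# parse_path("a > b > c f d > e".split())  raises: found 'd' where an edge
#   marker was expected, matching the error case above.
```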

Elements need rich delimiters

Consider an element of type "lambda" with value "(x,a,b) => a <= x && x <= b", or value "(a,b) => a < b ? a : b". These and other complex values can always be string escaped (since LSON supports six different string delimiters), but might benefit from a more flexible delimiter syntax. Another complex example would be an element whose value is 100 lines of YAML.

Consider something like one of the following:

(_ type : ... _)
(foo( type : value )foo)
((begin type : value end))
((delim type : value delim))

New graph type: adjacency matrix

We need a way to express adjacency matrices, as these are often the most efficient graph representation. Here's what we'd have to do today (for a Markov chain adjacency matrix, where every edge has a real number value):

[%
    4
    {
        0 > 0: 0.3
        0 > 1: 0.3
        0 > 2: 0.3
        0 > 3: 0.1

        1 > 0: 0.3
        1 > 1: 0.5
        1 > 2: 0.1
        1 > 3: 0.1

        2 > 0: 0.4
        2 > 1: 0.4
        2 > 2: 0.1
        2 > 3: 0.1

        3 > 0: 0.1
        3 > 1: 0.8
        3 > 2: 0.05
        3 > 3: 0.05
    }
%]

Here's what it would look like if the parser switched to an adjacency representation on seeing [ instead of an edge definition for the set of edges:

[%
    4
    [
        [0.30  0.30  0.30  0.10]
        [0.30  0.50  0.10  0.10]
        [0.40  0.40  0.10  0.10]
        [0.10  0.80  0.05  0.05]
    ]
%]

One candidate for additional special syntax:

[%#
    [0.30  0.30  0.30  0.10]
    [0.30  0.50  0.10  0.10]
    [0.40  0.40  0.10  0.10]
    [0.10  0.80  0.05  0.05]
#%]

Here's a thought. Consider this:

[%
    [
        [0.30  0.30  0.30  0.10]
        [0.30  0.50  0.10  0.10]
        [0.40  0.40  0.10  0.10]
        [0.10  0.80  0.05  0.05]
    ]
%]

Here, the parser hits the node data, so we end up with four unnamed nodes, each of which has a four-element array. If the closing %] is immediately encountered (no edge data given), then because each node is unnamed and each node has an N-element array (where N is the number of nodes), the graph is an adjacency matrix, and each vector is considered a row of that matrix.

For a bonus, if there are N nodes and each node M has an array of length N-M, then the adjacency matrix is upper-triangular.
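This inference rule can be sketched in Python (hypothetical helper; node data is a list of per-node arrays):

```python
# Sketch: classify a graph's unnamed, edge-less node data. All rows of
# length N (for N nodes) means a full adjacency matrix; row M of length
# N - M means an upper-triangular one; otherwise it's plain node data.

def classify(node_data):
    n = len(node_data)
    if all(len(row) == n for row in node_data):
        return "adjacency-matrix"
    if all(len(row) == n - m for m, row in enumerate(node_data)):
        return "upper-triangular"
    return "plain-nodes"

# classify([[0.30, 0.30, 0.30, 0.10]] * 4)  → "adjacency-matrix"
# classify([[1, 2, 3], [4, 5], [6]])        → "upper-triangular"
```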

Relent on `null` special value?

Null would be useful in a number of scenarios:

  • Table data, where missing data is formally specified
  • Graphs, where nodes/edges may not have associated data
  • General values

JSON uses null as a keyword. Unfortunately, this would be the first keyword introduced into LSON. We might also denote a single character (perhaps _) as shorthand for null. If so, then LSON should also recognize the Unicode character ∅.

Multi-edge support problematic?

Graph paths, one-to-many and many-to-one specifications — do they have a single data blob, or do they replicate the data blob for each individual edge? Similarly, with two paths a → b → c and a → b → d, are there two edges a → b?

I'm wondering if it's problematic and of little profit to be able to define more than one edge at a time.

Formalize JSON equivalent notation

Primary targets are elements, tables and graphs. Some ideas:

Elements

(elementType:thing)

{ "type": "elementType", "value": "thing" }

Tables

[# a=(typeA:defaultA) b=(typeB:defaultB) : [ r1a r1b ] [ ~ ~ ] #]

{
    "type": "lson-table",
    "columns": [
        { "name": "a", "type": "typeA", "default": "defaultA" },
        { "name": "b", "type": "typeB", "default": "defaultB" }
    ],
    "rows": [
        [ "r1a", "r1b" ],
        [ "~",   "~" ]
    ]
}

Graphs

[%
...
%]

Prefer the term "text" over "string"?

"string" was originally chosen for a data structure that holds a string of characters. It's common vernacular, so perhaps should continue to be used. However, a more accurate alternative would be "text".

There are two places where this is important: in the body and terminology of the spec, and also as the formal standard name of the data type, as in (text: a bunch of words here).

Graph feature: representing multiple children/parents

Consider an edge list like this:

[
    0 -> [4 27 51]
    1 -> [5 7]
    2 -> [3, 6]
     ...
]

This would be useful for things like DAGs, abstract syntax trees, scene graphs, and so forth.

This would allow many sources to one target, or many targets from one source. It would not support many sources to many targets.
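Expanding such one-to-many entries into individual edges is straightforward; a Python sketch (the (source, targets) pair representation is hypothetical):

```python
# Sketch: expand a one-to-many edge entry like `0 -> [4 27 51]` into
# individual (source, target) pairs.

def expand_edges(entries):
    edges = []
    for source, targets in entries:
        targets = targets if isinstance(targets, list) else [targets]
        edges.extend((source, t) for t in targets)
    return edges

# expand_edges([(0, [4, 27, 51]), (1, [5, 7])])
#   → [(0, 4), (0, 27), (0, 51), (1, 5), (1, 7)]
```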

Add new multi-key syntax for dictionaries

Refer to issue #4.

This provides a nice mechanism for graph nodes with names but no (or trivial) data:

[%
    { [ red orange yellow green blue indigo violet ] : null }
    [ red → yellow;  orange → green;  blue ← violet;  indigo ↔ indigo ]
%]

It gets rid of the ambiguity of an array-ish specification of node names, where you could apparently refer to nodes by either their name or their array index. This is especially confusing when the node names are numbers. This way, indices refer into an array of nodes, and names refer into a dictionary of nodes.

For standard dictionaries, a little bit of syntactic sugar.

Improve tables with row delimiters

Currently, if a list of table features begins with an open bracket ([), then all rows must use bracket delimiters. It seems simpler to never use brackets for the feature list (just terminate at :), while allowing optional row delimiters per row.

Thus, if a row does not begin with [ or { (see issue #20), then all features are collected, and the row terminates at the Nth feature.

If a row begins with [, then collect features in order until ]. Any remaining unspecified features are set according to their defaults.

If a row begins with {, features are assigned by name, as described in #20.

Make graph auto nodes default where sensible

Currently, the nodes specification may take the value auto, indicating that the set of nodes should be implicitly derived from the edge specifications.

Consider making this the default behavior, such that a graph's nodes are the union of all nodes defined either implicitly (via edge references) or explicitly (in the nodes definition). Nodes defined implicitly get a data value of null (()).

This may mean that some structures, like node arrays, may need to robustly handle sparse arrays so we don't blow up allocation.
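The proposed union behavior can be sketched in Python (hypothetical helper; declared nodes are (name, data) pairs and implicit nodes get null data):

```python
# Sketch: the final node set is the union of explicitly declared nodes and
# any node mentioned by an edge; implicitly defined nodes get null data.

def resolve_nodes(declared, edges):
    nodes = dict(declared)                 # explicit nodes with their data
    for src, dst in edges:
        for n in (src, dst):
            nodes.setdefault(n, None)      # implicit node: null data
    return nodes

# resolve_nodes({"a": 1}, [("a", "b"), ("b", "c")])
#   → {"a": 1, "b": None, "c": None}
```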

Edges with tabular data

Similar to issue #18 (Nodes with tabular data). Allow edges to use tables to hold their data.

[%
    ... // Node data

    [#
        [ edge      name=(string:"")  color=(color:black) ]:
        [  0 >  2   ~                 red                 ]
        [  0 >  2   "snork"           violet              ]
        [  0 >  2   "gronk"           purple              ]
        [  0 >  2   "burnk"           indigo              ]
    #]
%]

Similar to the case for node tables, LSON could either look for a specially-named feature ("edge" here), or could require the edge to be the first feature of the table. I'm leaning toward the latter.
