mozilla / jsonschema-transpiler
Compile JSON Schema into Avro and BigQuery schemas
License: Mozilla Public License 2.0
As per mozilla/mozilla-schema-generator#25 (comment):
{
"metrics": {
"type": "object",
"additionalProperties": False
}
}
is cast to a JSON blob type in BigQuery:
[
{
"mode": "NULLABLE",
"name": "metrics",
"type": "STRING"
}
]
Instead, the output could be an empty struct; in general, an empty object could be represented as an empty struct. This option would work in BigQuery, since maps are an extension of the struct type, but it could cause ambiguity in the Avro/Parquet representation.
See https://docs.rs/crate/jsonschema-transpiler/1.8.0 for details:
The test suite currently has a hand-curated set of expected outputs for the schemas. However, the resulting schemas may not be valid against certain edge cases. For each of the test cases under tests/resources/, there should be a JSON document that passes validation. This can be used to verify that transformations are working correctly.
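A minimal sketch of what that validation step could look like, assuming the third-party jsonschema and serde_json crates (the test suite might use a different validator):

// Validate a sample document against a test-case schema.
use jsonschema::JSONSchema;
use serde_json::json;

fn main() {
    let schema = json!({
        "type": "object",
        "properties": {"foo": {"type": "string"}}
    });
    let document = json!({"foo": "bar"});

    let compiled = JSONSchema::compile(&schema).expect("schema should be valid");
    assert!(compiled.is_valid(&document));
}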
This testing script was useful for validating sampled data from the AWS pipeline against mozilla-pipeline-schemas: jsonschema-transpiler/scripts/mps-generate-avro-data-helper.py (lines 35 to 77 at a7ab358).
cargo fmt
cargo clippy
The grammar can be specified by following the Parquet Logical Type Definitions document.
The Parquet format is implemented in Java in apache/parquet-mr. This schema is used throughout the mozilla-pipeline-schemas repo, which means a large number of documents are available for testing.
With support for data in integration tests in #62, we can start to validate BigQuery schemas against BigQuery itself. It should be done in the following way: load sample data with the transpiled schema, then read back the schema that BigQuery actually applied:
bq load --source_format=NEWLINE_DELIMITED_JSON <dataset.table> <data.ndjson> <transpiled-schema.json>
bq show --format=prettyjson <dataset.table>
The spec for protobuf v3 can be found here. It may be useful to derive the expected .proto file from a JSON schema.
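For illustration only (the transpiler does not emit protobuf today), a simple object schema like {"type": "object", "properties": {"foo": {"type": "string"}}} might derive a proto3 file along these lines:

syntax = "proto3";

message Root {
  string foo = 1;
}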
A schema using tuple validation (where items is an array of schemas) currently generates the following error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("invalid type: sequence, expected struct Tag", line: 0, column: 0)', src/libcore/result.rs:1009:5
See: https://json-schema.org/understanding-json-schema/reference/array.html#tuple-validation
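A minimal sketch reproducing this class of serde error outside the transpiler, assuming the serde (with derive) and serde_json crates; Tag here is a stand-in for the transpiler's internal type of the same name:

use serde::Deserialize;

// Stand-in for the transpiler's internal `Tag`, which expects `items`
// to be a single schema object rather than a list of schemas.
#[derive(Debug, Deserialize)]
struct Tag {
    #[serde(rename = "type")]
    type_: String,
}

#[derive(Debug, Deserialize)]
struct Schema {
    items: Tag,
}

fn main() {
    // Tuple validation: `items` holds a sequence of schemas, so
    // deserializing into a single `Tag` fails.
    let doc = r#"{"items": [{"type": "integer"}, {"type": "string"}]}"#;
    let err = serde_json::from_str::<Schema>(doc).unwrap_err();
    println!("{err}"); // invalid type: sequence, expected struct Tag ...
}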
See discussion in mozilla/mozilla-schema-generator#63. For our use case this is probably the best route forward to making schema updates (in particular from schema-less to schema'd) as easy as possible.
The format-tests script is responsible for formatting and sorting the test cases under tests/resources. The build script is always run as part of testing, while the format script can be ignored.
The logic is straightforward. In a closely related task, it may be useful to add sorting capabilities to the tests. The ordering matters in the presentation of the unit tests.
We have a few remaining ambiguous types in JSON schemas that are preventing some important ping fields from appearing as fields in BigQuery. Currently, these ambiguous values end up as part of the additional_properties JSON blob in BigQuery ping tables, so they are available but awkward and potentially expensive to query.
This issue lays out the proposed interface for presenting these as fields. We want to gather feedback now from users of the data, because deploying these changes will be to some extent irreversible; once we coerce these fields to a certain BQ type, we cannot change our minds and choose a different type for an existing field.
For both of these schema issues, we're going to use the event ping to demonstrate.
tl;dr Please play with the query given below in #88 (comment) and leave feedback in this issue about any potential gotchas or improvements you'd like to see with the proposed transformations.
Here's an example:
fbertsch-23817:sandbox frankbertsch$ echo '{"type": "date-time"}' > datetime.schema.json
fbertsch-23817:sandbox frankbertsch$ jsonschema-transpiler --type bigquery datetime.schema.json
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("unknown variant `date-time`, expected one of `null`, `boolean`, `number`, `integer`, `string`, `object`, `array`", line: 0, column: 0)', src/libcore/result.rs:1009:5
The relevant code is in jsonschema-transpiler/src/ast.rs (lines 137 to 145 at 536c6fd).
The union of a string and bytes should probably be bytes; for now it will be dropped or cast into a string.
{"oneOf": [{"type": "string"}, {"type": "string", "format": "bytes"}]}
Originally posted by @acmiyaguchi in #82 (comment)
PubSub has SQL capabilities via Apache Calcite and Beam: https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-ui-walkthrough#assign-pubsub-schema
The format is YAML and fits within the scope of this tool. Here's the example taken from the docs page.
- column: event_timestamp
  description: Pub/Sub event timestamp
  mode: REQUIRED
  type: TIMESTAMP
- column: attributes
  description: Pub/Sub message attributes
  mode: NULLABLE
  type: MAP<STRING,STRING>
- column: payload
  description: Pub/Sub message payload
  mode: NULLABLE
  type: STRUCT
  subcolumns:
  - column: tr_time_str
    description: Transaction time string
    mode: NULLABLE
    type: STRING
  - column: first_name
    description: First name
    mode: NULLABLE
    type: STRING
  - column: last_name
    description: Last name
    mode: NULLABLE
    type: STRING
  - column: city
    description: City
    mode: NULLABLE
    type: STRING
  - column: state
    description: State
    mode: NULLABLE
    type: STRING
  - column: product
    description: Product
    mode: NULLABLE
    type: STRING
  - column: amount
    description: Amount of transaction
    mode: NULLABLE
    type: FLOAT
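As a starting point for such a target, the column format could be modeled with serde (a sketch assuming the serde and serde_yaml crates; field names mirror the YAML above):

use serde::{Deserialize, Serialize};

// One column in the Dataflow SQL schema format; `subcolumns` nests for STRUCT.
#[derive(Debug, Serialize, Deserialize)]
struct Column {
    column: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    description: Option<String>,
    mode: String,
    #[serde(rename = "type")]
    type_: String,
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    subcolumns: Vec<Column>,
}

fn main() {
    let yaml = "- column: event_timestamp\n  mode: REQUIRED\n  type: TIMESTAMP\n";
    let columns: Vec<Column> = serde_yaml::from_str(yaml).unwrap();
    assert_eq!(columns[0].column, "event_timestamp");
}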
For example:
{
  "type": "object",
  "properties": {
    "slices": {
      "type": ["array", "number"],
      "items": {"type": "string"}
    }
  }
}
fails with:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "__unknown__ - empty object"', src/libcore/result.rs:1009:5
The direct-to-parquet datasets coerce camelCase keys to snake_case, but right now our pipeline of pings into BigQuery does not.
I think this consistent naming would be desirable, and it would be best/simplest to handle it in the pipeline rather than deferring to views. This would require a coordinated change in the schema transpiler and in the BigQuery sink dataflow jobs.
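A naive sketch of the coercion (the real implementation would need to handle acronyms and digits consistently across the transpiler and the sink jobs):

// Convert a camelCase field name to snake_case, e.g. "payloadInfo" -> "payload_info".
fn to_snake_case(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for (i, ch) in name.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    assert_eq!(to_snake_case("submissionTimestamp"), "submission_timestamp");
}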
Hello,
I thought it would be nice to be able to call this tool from Python, so I wrote a binding using PyO3.
https://github.com/kitagawa-hr/jsonschema-transpiler/tree/main/bindings/python
May I send a PR for it?
Thank you.
The two formats are identical, except only the former can be partitioned on.
There's some use of the bytes SQL type for storing arbitrary binary data. bytes are generally not well formed in JSON, but supported in Avro and BigQuery. A low-impact solution is to create a custom bytes format under the string type, as follows:
{
"type": "string",
"format": "bytes"
}
This schema isn't used to validate the payload, because documents containing binary data may also contain control characters that invalidate the JSON documents. Instead, the schema is descriptive and is used to generate Avro/BigQuery schemas.
See:
https://json-schema.org/understanding-json-schema/reference/string.html#format
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#bytes-type
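Under that convention, an object property like {"blob": {"type": "string", "format": "bytes"}} would be expected to transpile to a BYTES column, roughly (illustrative, not current transpiler output):

[
  {
    "mode": "NULLABLE",
    "name": "blob",
    "type": "BYTES"
  }
]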
The API for this application is simple. To implement a new schema target, a conversion is implemented from the AST into the target format.
The Into trait is limited because it assumes that the conversion will either succeed or panic. A panic within the into function fails quickly, but does not provide much context. In #34, we would like to have proper error handling using the Result type.
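A minimal sketch of the direction #34 points at, swapping Into for TryFrom so failures surface as values (types here are illustrative placeholders, not the crate's actual AST):

use std::convert::TryFrom;

struct JsonSchema {
    type_name: String,
}

struct Ast {
    tag: String,
}

impl TryFrom<JsonSchema> for Ast {
    type Error = String;

    // Unsupported variants become an Err with context instead of a panic.
    fn try_from(schema: JsonSchema) -> Result<Self, Self::Error> {
        match schema.type_name.as_str() {
            "null" | "boolean" | "number" | "integer" | "string" | "object" | "array" => {
                Ok(Ast { tag: schema.type_name })
            }
            other => Err(format!("unknown variant `{other}`")),
        }
    }
}

fn main() {
    let err = Ast::try_from(JsonSchema { type_name: "date-time".into() }).unwrap_err();
    eprintln!("{err}"); // unknown variant `date-time`
}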
A finite state machine (FSM) would provide a good interface for error handling, with a goto function implemented for each state transition. A single main function can then apply the goto functions until it reaches a successful state, or collect (all possible) failures and surface them to the user.
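A rough sketch of that shape (all names hypothetical):

// Each `goto_*` function advances one state transition, returning Err with
// context instead of panicking; the driver chains them and reports failures.
struct JsonSchema(String);
struct Ast(String);
struct BigQuerySchema(String);

fn goto_ast(schema: JsonSchema) -> Result<Ast, String> {
    // Real logic would translate the schema; this just threads data through.
    Ok(Ast(schema.0))
}

fn goto_bigquery(ast: Ast) -> Result<BigQuerySchema, String> {
    Ok(BigQuerySchema(ast.0))
}

fn main() {
    match goto_ast(JsonSchema("{}".into())).and_then(goto_bigquery) {
        Ok(schema) => println!("success: {}", schema.0),
        Err(failure) => eprintln!("conversion failed: {failure}"),
    }
}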
The Avro IDL format is much easier for humans to read than JSON. However, it can be thought of as syntactic sugar for the JSON format.
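For instance, this hand-written IDL record:

protocol Example {
  record Point {
    double x;
    double y;
  }
}

corresponds to the JSON schema {"type": "record", "name": "Point", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}]}.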
jsonschema can allow array elements to be null, but BigQuery can only make fields REPEATED or NULLABLE. BigQuery parquet imports solve this by wrapping both the array and its elements in structs: an array is transformed into a struct with one repeated field called list containing structs with one field element, so both the outer struct and the element field can be NULLABLE while list is REPEATED.
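Illustratively, a nullable INT64 array handled that way would land in a BigQuery schema roughly as follows (the list/element names follow the parquet convention; this is not current transpiler output):

[
  {
    "fields": [
      {
        "fields": [
          {
            "mode": "NULLABLE",
            "name": "element",
            "type": "INT64"
          }
        ],
        "mode": "REPEATED",
        "name": "list",
        "type": "RECORD"
      }
    ],
    "mode": "NULLABLE",
    "name": "mylist",
    "type": "RECORD"
  }
]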
The transpiler currently converts an array of nullable elements in a jsonschema to a REPEATED field in a BigQuery schema, which cannot contain NULL. For example, {"properties":{"mylist":{"items":{"type":["integer","null"]},"type":"array"}},"type":"object"} -> [{"mode":"REPEATED","name":"mylist","type":"INT64"}]. This causes issues: if a jsonschema allows a message that BigQuery rejects during a file load operation, the whole file is rejected.
This was discussed in the GCP Technical check-in on 2019-09-30, where it was determined that, at this time, due to backwards-compatibility constraints, the transpiler should error if schemas allow nullable array elements, and mozilla-pipeline-schemas CI should fail if the transpiler can't transform schemas.
The schemas can be loaded from a JSON file.
See mozilla-services/mozilla-pipeline-schemas#565 (comment)
echo '{"properties": {"payload": {"properties": {"<unknown>": {"type": "string"}, "foo": {"type": "string"}}}}}' | jsonschema-transpiler -t bigquery
[
  {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "foo",
        "type": "STRING"
      }
    ],
    "mode": "NULLABLE",
    "name": "payload",
    "type": "RECORD"
  }
]
This should map $.properties.payload.properties.<unknown> to a field like __unknown__.
error: failed to verify package tarball
Caused by:
Source directory was modified by build.rs during cargo publish. Build scripts should not modify anything outside of OUT_DIR.
The build.rs script is used to format testing resources and to generate tests. This causes issues with cargo publish, which must be run with the --no-verify flag.
The transformation from jsonschema into the AST should provide better error handling for schemas that are not valid. There are a few places where the code will panic instead of propagating useful information up. A Result<T, E> type is appropriate, since error messages can help users figure out why their schemas are bad.
Currently, we let fields go to additional_properties if they are a union like [string, int]. We would like a way to include such fields in the table structure.
One option is to default to string in this case, so 4 and "4" in the JSON both become the string "4". In this case, we lose information about what the original type was, but that doesn't seem terribly important.
A variant on coercing to strings is that we could have it be a "JSON-formatted string field", such that 4 in JSON becomes string 4 and "4" in JSON becomes string "4" with the quotes retained. That would allow us to maintain original type information from the JSON, and maybe we'd be able to use JSON_EXTRACT functions on the field. The extra effort here doesn't seem worth it for the original type information.
Another option is to turn this into a STRUCT<int INT64, string STRING> where only one of the values will be non-null.
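A sketch contrasting the first two options above, using serde_json (illustrative only):

use serde_json::Value;

// Option 1: plain string coercion; 4 and "4" both become "4".
fn coerce_to_string(value: &Value) -> String {
    match value {
        Value::String(s) => s.clone(),
        other => other.to_string(),
    }
}

// Option 2: JSON-formatted string; 4 becomes "4" but "4" becomes "\"4\"",
// retaining the original type for JSON_EXTRACT-style functions.
fn coerce_to_json_string(value: &Value) -> String {
    value.to_string()
}

fn main() {
    let int = serde_json::json!(4);
    let string = serde_json::json!("4");
    assert_eq!(coerce_to_string(&int), coerce_to_string(&string)); // both "4"
    assert_ne!(coerce_to_json_string(&int), coerce_to_json_string(&string));
}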
I'm not sure if we've previously discussed the idea of potentially modifying the field name as a way to give ourselves flexibility to change how we want to encode in the future without having to change the type of a field.
For example, if we decided to use a struct, we could change field field_name to field_name_struct in the output BQ schema, so that if we decide that some other representation works better, we could add it with a different name rather than having to change the type of field_name, necessitating recreating tables.
cc @acmiyaguchi
For example, we use BOOL but the canonical form is BOOLEAN. This causes some headache when testing whether there has been a change in the file.
For JSON schemas that include "description" attributes for some fields, we should include those descriptions in the produced BigQuery schemas so they are present when browsing schemas in the BQ console.
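BigQuery's JSON schema files accept a description key on each field, so a documented property could carry through roughly like this (illustrative):

[
  {
    "description": "A description carried over from the JSON schema.",
    "mode": "NULLABLE",
    "name": "foo",
    "type": "STRING"
  }
]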