mozilla / jsonschema-transpiler
Compile JSON Schema into Avro and BigQuery schemas
License: Mozilla Public License 2.0
As per mozilla/mozilla-schema-generator#25 (comment):
{
"metrics": {
"type": "object",
"additionalProperties": False
}
}
is cast to a JSON blob type in BigQuery:
[
{
"mode": "NULLABLE",
"name": "metrics",
"type": "STRING"
}
]
Instead, the output could be an empty struct; in general, an empty object could be represented as an empty struct. This option would work in BigQuery, since maps are an extension of the struct type, but it could cause ambiguity in the Avro/Parquet representation.
See https://docs.rs/crate/jsonschema-transpiler/1.8.0 for details:
The test suite currently has a hand-curated set of expected outputs for the schemas. However, the resulting schemas may not be valid against certain edge cases. For each of the test cases under tests/resources/, there should be a JSON document that passes validation. This can be used to verify that transformations are working correctly.
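A minimal sketch of what that validation step could look like, assuming the third-party jsonschema and serde_json crates (the test suite might use a different validator):

// Validate a sample document against a test-case schema.
use jsonschema::JSONSchema;
use serde_json::json;

fn main() {
    let schema = json!({
        "type": "object",
        "properties": {"foo": {"type": "string"}}
    });
    let document = json!({"foo": "bar"});

    let compiled = JSONSchema::compile(&schema).expect("schema should be valid");
    assert!(compiled.is_valid(&document));
}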
This testing script was useful for validating sampled data from the AWS pipeline against mozilla-pipeline-schemas: jsonschema-transpiler/scripts/mps-generate-avro-data-helper.py (lines 35 to 77 at a7ab358).
cargo fmt
cargo clippy
The grammar can be specified by following the Parquet Logical Type Definitions document.
The Parquet format is implemented in Java in apache/parquet-mr. This schema is used throughout the mozilla-pipeline-schemas repo, which means a large number of documents are available for testing.
With support for data in integration tests in #62, we can start to validate BigQuery schemas against BigQuery itself. It should be done in the following way: load sample data with the transpiled schema, then read back the schema that BigQuery actually applied:
bq load --source_format=NEWLINE_DELIMITED_JSON <dataset.table> <data.ndjson> <transpiled-schema.json>
bq show --format=prettyjson <dataset.table>
The spec for protobuf v3 can be found here. It may be useful to derive the expected .proto file from a JSON schema.
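For illustration only (the transpiler does not emit protobuf today), a simple object schema like {"type": "object", "properties": {"foo": {"type": "string"}}} might derive a proto3 file along these lines:

syntax = "proto3";

message Root {
  string foo = 1;
}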
A schema using tuple validation (where items is an array of schemas) currently generates the following error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("invalid type: sequence, expected struct Tag", line: 0, column: 0)', src/libcore/result.rs:1009:5
See: https://json-schema.org/understanding-json-schema/reference/array.html#tuple-validation
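A minimal sketch reproducing this class of serde error outside the transpiler, assuming the serde (with derive) and serde_json crates; Tag here is a stand-in for the transpiler's internal type of the same name:

use serde::Deserialize;

// Stand-in for the transpiler's internal `Tag`, which expects `items`
// to be a single schema object rather than a list of schemas.
#[derive(Debug, Deserialize)]
struct Tag {
    #[serde(rename = "type")]
    type_: String,
}

#[derive(Debug, Deserialize)]
struct Schema {
    items: Tag,
}

fn main() {
    // Tuple validation: `items` holds a sequence of schemas, so
    // deserializing into a single `Tag` fails.
    let doc = r#"{"items": [{"type": "integer"}, {"type": "string"}]}"#;
    let err = serde_json::from_str::<Schema>(doc).unwrap_err();
    println!("{err}"); // invalid type: sequence, expected struct Tag ...
}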
See discussion in mozilla/mozilla-schema-generator#63. For our use case this is probably the best route forward to making schema updates (in particular from schema-less to schema'd) as easy as possible.
The format-tests script is responsible for formatting and sorting the test cases under tests/resources. The build script is always run as part of testing, while the format script can be ignored.
The logic is straightforward. In a closely related task, it may be useful to add sorting capabilities to the tests. The ordering matters in the presentation of the unit tests.
We have a few remaining ambiguous types in JSON schemas that are preventing some important ping fields from appearing as fields in BigQuery. Currently, these ambiguous values end up as part of the additional_properties JSON blob in BigQuery ping tables, so they are available but awkward and potentially expensive to query.
This issue lays out the proposed interface for presenting these as fields. We want to gather feedback now from users of the data, because deploying these changes will be to some extent irreversible; once we coerce these fields to a certain BQ type, we cannot change our minds and choose a different type for an existing field.
For both of these schema issues, we're going to use the event ping to demonstrate.
tl;dr Please play with the query given below in #88 (comment) and leave feedback in this issue about any potential gotchas or improvements you'd like to see with the proposed transformations.
Here's an example:
fbertsch-23817:sandbox frankbertsch$ echo '{"type": "date-time"}' > datetime.schema.json
fbertsch-23817:sandbox frankbertsch$ jsonschema-transpiler --type bigquery datetime.schema.json
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("unknown variant `date-time`, expected one of `null`, `boolean`, `number`, `integer`, `string`, `object`, `array`", line: 0, column: 0)', src/libcore/result.rs:1009:5
The relevant code is in jsonschema-transpiler/src/ast.rs (lines 137 to 145 at 536c6fd).
The union of a string and bytes should probably be bytes; for now it will be dropped or cast into a string.
{"oneOf": [{"type": "string"}, {"type": "string", "format": "bytes"}]}
Originally posted by @acmiyaguchi in #82 (comment)
PubSub has SQL capabilities via Apache Calcite and Beam: https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-ui-walkthrough#assign-pubsub-schema
The format is YAML and fits within the scope of this tool. Here's the example taken from the docs page.
- column: event_timestamp
  description: Pub/Sub event timestamp
  mode: REQUIRED
  type: TIMESTAMP
- column: attributes
  description: Pub/Sub message attributes
  mode: NULLABLE
  type: MAP<STRING,STRING>
- column: payload
  description: Pub/Sub message payload
  mode: NULLABLE
  type: STRUCT
  subcolumns:
  - column: tr_time_str
    description: Transaction time string
    mode: NULLABLE
    type: STRING
  - column: first_name
    description: First name
    mode: NULLABLE
    type: STRING
  - column: last_name
    description: Last name
    mode: NULLABLE
    type: STRING
  - column: city
    description: City
    mode: NULLABLE
    type: STRING
  - column: state
    description: State
    mode: NULLABLE
    type: STRING
  - column: product
    description: Product
    mode: NULLABLE
    type: STRING
  - column: amount
    description: Amount of transaction
    mode: NULLABLE
    type: FLOAT
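As a starting point for such a target, the column format could be modeled with serde (a sketch assuming the serde and serde_yaml crates; field names mirror the YAML above):

use serde::{Deserialize, Serialize};

// One column in the Dataflow SQL schema format; `subcolumns` nests for STRUCT.
#[derive(Debug, Serialize, Deserialize)]
struct Column {
    column: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    description: Option<String>,
    mode: String,
    #[serde(rename = "type")]
    type_: String,
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    subcolumns: Vec<Column>,
}

fn main() {
    let yaml = "- column: event_timestamp\n  mode: REQUIRED\n  type: TIMESTAMP\n";
    let columns: Vec<Column> = serde_yaml::from_str(yaml).unwrap();
    assert_eq!(columns[0].column, "event_timestamp");
}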
For example:
{
  "type": "object",
  "properties": {
    "slices": {
      "type": ["array", "number"],
      "items": {"type": "string"}
    }
  }
}
fails with:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "__unknown__ - empty object"', src/libcore/result.rs:1009:5
The direct-to-parquet datasets coerce camelCase keys to snake_case, but right now our pipeline of pings into BigQuery does not.
I think this consistent naming would be desirable, and it would be best/simplest to handle it in the pipeline rather than deferring to views. This would require a coordinated change in the schema transpiler and in the BigQuery sink dataflow jobs.
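A naive sketch of the coercion (the real implementation would need to handle acronyms and digits consistently across the transpiler and the sink jobs):

// Convert a camelCase field name to snake_case, e.g. "payloadInfo" -> "payload_info".
fn to_snake_case(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for (i, ch) in name.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    assert_eq!(to_snake_case("submissionTimestamp"), "submission_timestamp");
}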
Hello,
I thought it would be nice to be able to call this tool from Python, so I wrote a binding using PyO3.
https://github.com/kitagawa-hr/jsonschema-transpiler/tree/main/bindings/python
May I send a PR for it?
Thank you.
The two formats are identical, except only the former can be partitioned on.
There's some use of the bytes SQL type for storing arbitrary binary data. bytes are generally not well formed in JSON, but supported in Avro and BigQuery. A low-impact solution is to create a custom bytes format under the string type, as follows:
{
"type": "string",
"format": "bytes"
}
This schema isn't used to validate the payload, because documents containing binary data may also contain control characters that invalidate the JSON documents. Instead, the schema is descriptive and is used to generate Avro/BigQuery schemas.
See:
https://json-schema.org/understanding-json-schema/reference/string.html#format
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#bytes-type
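Under that convention, an object property like {"blob": {"type": "string", "format": "bytes"}} would be expected to transpile to a BYTES column, roughly (illustrative, not current transpiler output):

[
  {
    "mode": "NULLABLE",
    "name": "blob",
    "type": "BYTES"
  }
]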
The API for this application is simple. To implement a new schema target, a conversion is implemented from the AST into the target format.
The Into trait is limited because it assumes that the conversion will either succeed or panic. A panic within the into function fails quickly, but does not provide much context. In #34, we would like to have proper error handling using the Result type.
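A minimal sketch of the direction #34 points at, swapping Into for TryFrom so failures surface as values (types here are illustrative placeholders, not the crate's actual AST):

use std::convert::TryFrom;

struct JsonSchema {
    type_name: String,
}

struct Ast {
    tag: String,
}

impl TryFrom<JsonSchema> for Ast {
    type Error = String;

    // Unsupported variants become an Err with context instead of a panic.
    fn try_from(schema: JsonSchema) -> Result<Self, Self::Error> {
        match schema.type_name.as_str() {
            "null" | "boolean" | "number" | "integer" | "string" | "object" | "array" => {
                Ok(Ast { tag: schema.type_name })
            }
            other => Err(format!("unknown variant `{other}`")),
        }
    }
}

fn main() {
    let err = Ast::try_from(JsonSchema { type_name: "date-time".into() }).unwrap_err();
    eprintln!("{err}"); // unknown variant `date-time`
}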
A finite state machine (FSM) would provide a good interface for error handling, with a goto function implemented for each state transition. A single main function can then apply the goto functions until it reaches a successful state, or collect (all possible) failures and surface them to the user.
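A rough sketch of that shape (all names hypothetical):

// Each `goto_*` function advances one state transition, returning Err with
// context instead of panicking; the driver chains them and reports failures.
struct JsonSchema(String);
struct Ast(String);
struct BigQuerySchema(String);

fn goto_ast(schema: JsonSchema) -> Result<Ast, String> {
    // Real logic would translate the schema; this just threads data through.
    Ok(Ast(schema.0))
}

fn goto_bigquery(ast: Ast) -> Result<BigQuerySchema, String> {
    Ok(BigQuerySchema(ast.0))
}

fn main() {
    match goto_ast(JsonSchema("{}".into())).and_then(goto_bigquery) {
        Ok(schema) => println!("success: {}", schema.0),
        Err(failure) => eprintln!("conversion failed: {failure}"),
    }
}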
The Avro IDL format is much easier for humans to read than JSON. However, it can be thought of as syntactic sugar for the JSON format.
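For instance, this hand-written IDL record:

protocol Example {
  record Point {
    double x;
    double y;
  }
}

corresponds to the JSON schema {"type": "record", "name": "Point", "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}]}.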
jsonschema can allow array elements to be null, but BigQuery can only make fields REPEATED or NULLABLE. BigQuery parquet imports solve this by wrapping both the array and its elements in structs: an array is transformed into a struct with one repeated field called list containing structs with one field element, so both the outer struct and the element field can be NULLABLE while list is REPEATED.
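Illustratively, a nullable INT64 array handled that way would land in a BigQuery schema roughly as follows (the list/element names follow the parquet convention; this is not current transpiler output):

[
  {
    "fields": [
      {
        "fields": [
          {
            "mode": "NULLABLE",
            "name": "element",
            "type": "INT64"
          }
        ],
        "mode": "REPEATED",
        "name": "list",
        "type": "RECORD"
      }
    ],
    "mode": "NULLABLE",
    "name": "mylist",
    "type": "RECORD"
  }
]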
The transpiler currently converts an array of nullable elements in a jsonschema to a REPEATED field in a BigQuery schema, which cannot contain NULL. For example, {"properties":{"mylist":{"items":{"type":["integer","null"]},"type":"array"}},"type":"object"} -> [{"mode":"REPEATED","name":"mylist","type":"INT64"}]. This causes issues: if a jsonschema allows a message that BigQuery rejects during a file load operation, the whole file is rejected.
This was discussed in the GCP Technical check-in on 2019-09-30, where it was determined that, at this time, due to backwards-compatibility constraints, the transpiler should error if schemas allow nullable array elements, and mozilla-pipeline-schemas CI should fail if the transpiler can't transform schemas.
The schemas can be loaded from a JSON file.
See mozilla-services/mozilla-pipeline-schemas#565 (comment)
echo '{"properties": {"payload": {"properties": {"<unknown>": {"type": "string"}, "foo": {"type": "string"}}}}}' | jsonschema-transpiler -t bigquery
[
  {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "foo",
        "type": "STRING"
      }
    ],
    "mode": "NULLABLE",
    "name": "payload",
    "type": "RECORD"
  }
]
This should map $.properties.payload.properties.<unknown> to a field like __unknown__.
error: failed to verify package tarball
Caused by:
Source directory was modified by build.rs during cargo publish. Build scripts should not modify anything outside of OUT_DIR.
The build.rs script is used to format testing resources and to generate tests. This causes issues with cargo publish, which must be run with the --no-verify flag.
The transformation from jsonschema into the AST should provide better error handling for schemas that are not valid. There are a few places where the code will panic instead of propagating useful information up. A Result<T, E> type is appropriate, since error messages can help users figure out why their schemas are bad.
Currently, we let fields go to additional_properties if they are a union like [string, int]. We would like a way to include such fields in the table structure.
One option is to default to string in this case, so 4 and "4" in the JSON both become the string "4". In this case, we lose information about what the original type was, but that doesn't seem terribly important.
A variant on coercing to strings is that we could have it be a "JSON-formatted string field", such that 4 in JSON becomes string 4 and "4" in JSON becomes string "4" with the quotes retained. That would allow us to maintain original type information from the JSON, and maybe we'd be able to use JSON_EXTRACT functions on the field. The extra effort here doesn't seem worth it for the original type information.
Another option is to turn this into a STRUCT<int INT64, string STRING> where only one of the values will be non-null.
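A sketch contrasting the first two options above, using serde_json (illustrative only):

use serde_json::Value;

// Option 1: plain string coercion; 4 and "4" both become "4".
fn coerce_to_string(value: &Value) -> String {
    match value {
        Value::String(s) => s.clone(),
        other => other.to_string(),
    }
}

// Option 2: JSON-formatted string; 4 becomes "4" but "4" becomes "\"4\"",
// retaining the original type for JSON_EXTRACT-style functions.
fn coerce_to_json_string(value: &Value) -> String {
    value.to_string()
}

fn main() {
    let int = serde_json::json!(4);
    let string = serde_json::json!("4");
    assert_eq!(coerce_to_string(&int), coerce_to_string(&string)); // both "4"
    assert_ne!(coerce_to_json_string(&int), coerce_to_json_string(&string));
}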
I'm not sure if we've previously discussed the idea of potentially modifying the field name as a way to give ourselves flexibility to change how we want to encode in the future without having to change the type of a field.
For example, if we decided to use a struct, we could change field field_name to field_name_struct in the output BQ schema, so that if we decide that some other representation works better, we could add it with a different name rather than having to change the type of field_name, necessitating recreating tables.
cc @acmiyaguchi
For example, we use BOOL but the canonical form is BOOLEAN. This causes some headache when testing whether there has been a change in the file.
For JSON schemas that include "description" attributes for some fields, we should include those descriptions in the produced BigQuery schemas so they are present when browsing schemas in the BQ console.
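BigQuery's JSON schema files accept a description key on each field, so a documented property could carry through roughly like this (illustrative):

[
  {
    "description": "A description carried over from the JSON schema.",
    "mode": "NULLABLE",
    "name": "foo",
    "type": "STRING"
  }
]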