
jsonschema-transpiler's Introduction

jsonschema-transpiler

CircleCI

A tool for transpiling JSON Schema into schemas for Avro and BigQuery.

JSON Schema is primarily used to validate incoming data, but contains enough information to describe the structure of the data. The transpiler encodes the schema for use with data serialization and processing frameworks. The main use-case is to enable ingestion of JSON documents into BigQuery through an Avro intermediary.

This tool can handle many of the composite types seen in modern data processing tools that support a SQL interface, such as lists, structures, key-value maps, and type variants.
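
For example, an object whose keys are not known ahead of time (expressed in JSON Schema with additionalProperties) is treated as a key-value map. A minimal sketch of the invocation; the output is omitted here, and the test cases under tests/resources show the exact expected schemas:

$ echo '{"type": "object", "additionalProperties": {"type": "integer"}}' | jsonschema-transpiler --type avro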

This tool is designed for generating new schemas from mozilla-pipeline-schemas, the canonical source of truth for JSON schemas in the Firefox Data Platform.

Installation

cargo install jsonschema-transpiler

Usage

A tool to transpile JSON Schema into schemas for data processing

USAGE:
    jsonschema-transpiler [FLAGS] [OPTIONS] [file]

FLAGS:
    -w, --allow-maps-without-value    Produces maps without a value field for incompatible or under-specified value
                                      schema
    -n, --force-nullable              Treats all columns as NULLABLE, ignoring the required section in the JSON Schema
                                      object
    -h, --help                        Prints help information
    -c, --normalize-case              snake_case column-names for consistent behavior between SQL engines
        --tuple-struct                Treats tuple validation as an anonymous struct
    -V, --version                     Prints version information

OPTIONS:
    -r, --resolve <resolve>    The resolution strategy for incompatible or under-specified schema [default: cast]
                               [possible values: cast, panic, drop]
    -t, --type <type>          The output schema format [default: avro]  [possible values: avro, bigquery]

ARGS:
    <file>    Sets the input file to use

JSON Schemas can be read from stdin or from a file.

Example usage

# An object with a single, optional boolean field
$ schema='{"type": "object", "properties": {"foo": {"type": "boolean"}}}'

$ echo $schema | jq
{
  "type": "object",
  "properties": {
    "foo": {
      "type": "boolean"
    }
  }
}

$ echo $schema | jsonschema-transpiler --type avro
{
  "fields": [
    {
      "default": null,
      "name": "foo",
      "type": [
        {
          "type": "null"
        },
        {
          "type": "boolean"
        }
      ]
    }
  ],
  "name": "root",
  "type": "record"
}

$ echo $schema | jsonschema-transpiler --type bigquery
[
  {
    "mode": "NULLABLE",
    "name": "foo",
    "type": "BOOL"
  }
]
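
Column names can also be normalized to snake_case for consistency between SQL engines. A hedged sketch of the output, assuming the transpiler's BigQuery type for JSON Schema integers is INT64:

$ echo '{"type": "object", "properties": {"fooBar": {"type": "integer"}}}' | jsonschema-transpiler --type bigquery --normalize-case
[
  {
    "mode": "NULLABLE",
    "name": "foo_bar",
    "type": "INT64"
  }
]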

Building

To build and test the package:

cargo build
cargo test

Older versions of the package (<= 1.9) relied on oniguruma for the snake-casing logic. To enable this module, add a feature flag:

cargo test --features oniguruma
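
During development, the CLI can also be invoked through Cargo by forwarding arguments after --. A minimal sketch, where schema.json is a hypothetical input file:

cargo run -- --type bigquery schema.json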

Contributing

Contributions are welcome. The API may change significantly, but the transformation between various source formats should remain consistent. To aid in the development of the transpiler, test cases are generated from a language-agnostic format under tests/resources.

{
    "name": "test-suite",
    "tests": [
        {
            "name": "test-case",
            "description": [
                "A short description of the test case."
            ],
            "tests": {
                "avro": {...},
                "bigquery": {...},
                "json": {...}
            }
        },
        ...
    ]
}
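
As an illustration, a filled-in test case for the optional boolean example above might look like the following. This is a sketch assembled from the outputs shown in the usage section; the names and exact layout of the actual resources under tests/resources may differ:

{
    "name": "example-suite",
    "tests": [
        {
            "name": "test_object_with_optional_boolean",
            "description": [
                "An object with a single, optional boolean field."
            ],
            "tests": {
                "avro": {
                    "fields": [
                        {
                            "default": null,
                            "name": "foo",
                            "type": [{"type": "null"}, {"type": "boolean"}]
                        }
                    ],
                    "name": "root",
                    "type": "record"
                },
                "bigquery": [
                    {"mode": "NULLABLE", "name": "foo", "type": "BOOL"}
                ],
                "json": {
                    "properties": {"foo": {"type": "boolean"}},
                    "type": "object"
                }
            }
        }
    ]
}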

Schemas provide a type system for data-structures. Most schema languages support a similar set of primitives. There are atomic data types like booleans, integers, and floats. These atomic data types can form compound units of structure, such as objects, arrays, and maps. The absence of a value is usually denoted by a null type. There are type modifiers, like the union of two types.

The following schemas are currently supported:

  • JSON Schema
  • Avro
  • BigQuery

In the future, it may be possible to support schemas from similar systems like Parquet and Spark, or to emit various interface description languages (IDLs) like Avro IDL.

Publishing

The jsonschema-transpiler is distributed as a crate via Cargo. Follow this checklist for deploying to crates.io.

  1. Bump the version number in the Cargo.toml, as per Semantic Versioning.
  2. Double check that cargo test and CI succeeds.
  3. Run cargo publish. It must be run with the --no-verify flag due to issue #59 (see the commands after this list).
  4. Draft a new release in GitHub corresponding with the version bump.
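
Concretely, steps 2 and 3 amount to the following commands (a minimal sketch):

cargo test
cargo publish --no-verify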

jsonschema-transpiler's People

Contributors

acmiyaguchi, badboy, fbertsch, kik-kik, relud, scholtzan


jsonschema-transpiler's Issues

Support PubSub schemas

PubSub has SQL capabilities via Apache Calcite and Beam: https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-ui-walkthrough#assign-pubsub-schema

The format is in YAML, and fits in the scope of this tool. Here's the example taken from the docs page.

  - column: event_timestamp
    description: Pub/Sub event timestamp
    mode: REQUIRED
    type: TIMESTAMP
  - column: attributes
    description: Pub/Sub message attributes
    mode: NULLABLE
    type: MAP<STRING,STRING>
  - column: payload
    description: Pub/Sub message payload
    mode: NULLABLE
    type: STRUCT
    subcolumns:
    - column: tr_time_str
      description: Transaction time string
      mode: NULLABLE
      type: STRING
    - column: first_name
      description: First name
      mode: NULLABLE
      type: STRING
    - column: last_name
      description: Last name
      mode: NULLABLE
      type: STRING
    - column: city
      description: City
      mode: NULLABLE
      type: STRING
    - column: state
      description: State
      mode: NULLABLE
      type: STRING
    - column: product
      description: Product
      mode: NULLABLE
      type: STRING
    - column: amount
      description: Amount of transaction
      mode: NULLABLE
      type: FLOAT

Support `date-time` format in JSON schemas

Here's an example:

fbertsch-23817:sandbox frankbertsch$ echo '{"type": "date-time"}' > datetime.schema.json
fbertsch-23817:sandbox frankbertsch$ jsonschema-transpiler --type bigquery datetime.schema.json
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("unknown variant `date-time`, expected one of `null`, `boolean`, `number`, `integer`, `string`, `object`, `array`", line: 0, column: 0)', src/libcore/result.rs:1009:5
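
For reference, date-time is defined as a format on the string type in JSON Schema rather than a type of its own, so a supported input would presumably be expressed as follows (an assumption about how the feature could look, analogous to the bytes format proposal in a later issue; the natural BigQuery mapping would be TIMESTAMP):

{
  "type": "string",
  "format": "date-time"
}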

Refactor `scripts/format-tests.py` into `build.rs`

The format-tests script is responsible for formatting and sorting the test cases under tests/resources. A build script is always run when the tests are built, whereas the standalone format script can be skipped by accident.

The logic is straightforward. As a closely related task, it may be useful to add sorting capabilities to the tests, since ordering matters in the presentation of the unit tests.

Support tuple validation in jsonschema arrays

Provide an error handling mechanism to drop, cast, or panic on under-specified fields

As per mozilla/mozilla-schema-generator#25 (comment):

{
    "metrics": {
        "type": "object",
        "additionalProperties": False
    }
}

is cast into a JSON blob type in BigQuery.

[
  {
    "mode": "NULLABLE",
    "name": "metrics",
    "type": "STRING"
  }
]

Instead of this, the output could be an empty struct. In general, an empty object could be represented as an empty struct. This option would work in BigQuery since maps are an extension of the struct type, but this could cause ambiguity in the avro/parquet representation.
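
A hedged sketch of how the --resolve strategies interact with such an under-specified field, assuming schema.json is a hypothetical file containing the schema above; the cast behavior is the one quoted, while the drop and panic behavior is inferred from the CLI help:

$ jsonschema-transpiler --type bigquery --resolve cast schema.json   # JSON blob as a STRING column, as shown above
$ jsonschema-transpiler --type bigquery --resolve drop schema.json   # the under-specified field is omitted (assumption)
$ jsonschema-transpiler --type bigquery --resolve panic schema.json  # fails with an error instead (assumption)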

Gather Data Science feedback on union and tuple types

Wanted: Data science review of query interface

We have a few remaining ambiguous types in JSON schemas that are preventing some important ping fields from appearing as fields in BigQuery. Currently, these ambiguous values end up as part of the additional_properties JSON blob in BigQuery ping tables, so they are available but awkward and potentially expensive to query.

This issue lays out the proposed interface for presenting these as fields. We want to gather feedback now from users of the data, because deploying these changes will be to some extent irreversible; once we coerce these fields to a certain BQ type, we cannot change our minds and choose a different type for an existing field.

For both of these schema issues, we're going to use the event ping to demonstrate.

tl;dr Please play with the query given below in #88 (comment) and leave feedback in this issue about any potential gotchas or improvements you'd like to see with the proposed transformations.

Support bytes as a data type

There's some use of the bytes SQL type for storing arbitrary binary data. Bytes are generally not well formed in JSON, but they are supported in Avro and BigQuery. A low-impact solution is to create a custom bytes format under the string type, as follows:

{
  "type": "string",
  "format": "bytes"
}

This schema isn't used to validate the payload, because documents containing binary data may also contain control characters that invalidate the JSON. Instead, the schema is descriptive and is used only to generate the Avro/BigQuery schemas.

See:
https://json-schema.org/understanding-json-schema/reference/string.html#format
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#bytes-type

Decide how to handle union types (like [string, int]) in BQ

Currently, we let fields go to additional_properties if they are a union like [string, int]. We would like a way to include such fields in the table structure.

Options for how to express in BQ

One option is to default to string in this case, so 4 and "4" in the JSON both become the string 4. In this case, we lose information about what the original type was, but that doesn't seem terribly important.

A variant on coercing to strings is that we could have it be a "JSON-formatted string field", such that 4 in JSON becomes the string 4 and "4" in JSON becomes the string "4" with the quotes retained. That would allow us to maintain original type information from the JSON, and maybe we'd be able to use JSON_EXTRACT functions on the field. The extra effort here doesn't seem worth it just for the original type information.

Another option is to turn this into a STRUCT<int INT64, string STRING> where only one of the values will be non-null.
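
A hedged sketch of that option as a BigQuery schema, for a hypothetical field named my_union_field; the sub-field names int and string mirror the STRUCT above and are not taken from any existing transpiler output:

[
  {
    "mode": "NULLABLE",
    "name": "my_union_field",
    "type": "RECORD",
    "fields": [
      {"mode": "NULLABLE", "name": "int", "type": "INT64"},
      {"mode": "NULLABLE", "name": "string", "type": "STRING"}
    ]
  }
]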

Options for naming the fields

I'm not sure if we've previously discussed the idea of potentially modifying the field name as a way to give ourselves flexibility to change how we want to encode in the future without having to change the type of a field.

For example, if we decided to use a struct, we could change field field_name to field_name_struct in the output BQ schema so that if we decide that some other representation works better, we could add it with a different name rather than having to change the type of field_name, necessitating recreating tables.

cc @acmiyaguchi

Union of string and bytes should be bytes

(Type::Atom(left), Type::Atom(right)) => {
    let atom = match (left, right) {
        (Atom::Boolean, Atom::Boolean) => Atom::Boolean,
        (Atom::Integer, Atom::Integer) => Atom::Integer,
        (Atom::Number, Atom::Number)
        | (Atom::Integer, Atom::Number)
        | (Atom::Number, Atom::Integer) => Atom::Number,
        (Atom::String, Atom::String) => Atom::String,
        (lhs, rhs) => {

The union of a string and bytes should probably be bytes; for now it will be dropped or cast into a string.

{"oneOf": [{"type": "string"}, {"type": "string", "format": "bytes"}]}

Originally posted by @acmiyaguchi in #82 (comment)

`build.rs` causes `cargo publish` to fail

error: failed to verify package tarball

Caused by:
  Source directory was modified by build.rs during cargo publish. Build scripts should not modify anything outside of OUT_DIR.

The build.rs script is used to format testing resources and to generate tests. This causes issues with cargo publish, which must be run with the --no-verify flag.

Refactor transpilation into a state-machine

The API for this application is simple. To implement a new schema target, the process is as follows:

  1. Define a type (i.e. a serde annotated data-structure) that describes the schema format
  2. Implement Into from ast::Tag into the target type

The Into trait is limited because it assumes that the conversion will succeed or panic. A panic within the into function fails quickly, but does not provide very much context. In #34, we would like to have proper error handling using the Result type.

A finite state machine (FSM) would provide a good interface for error handling. We should implement the following functions:

  1. A goto function to determine the next state - effectively an iterator defaulting to a depth-first traversal of the schema
  2. A failure function to generate an error result and indicate a possibility for backtracking

A single main function can then apply the goto functions until it reaches a successful state or collect (all possible) failures and surface them to the user.

Add integration tests against reference data

The test suite currently has a hand-curated set of expected outputs for the schemas. However, the resulting schema may not be valid against certain edge cases. For each of the test cases under tests/resources/, there should be a JSON document that passes validation. This can be used to verify that the transformations are working correctly.

This testing script was useful for validating sampled data from the AWS pipeline against mozilla-pipeline-schemas.

import json

def convert(data, schema):
    if schema.type == "string":
        if not isinstance(data, str):
            return json.dumps(data)
    if schema.type == "record":
        # iterate over all keys
        out = {}
        if not data:
            return out
        for key, value in data.items():
            # apply the appropriate transformations on the key
            key = format_key(key)
            field = schema.field_map.get(key)
            if not field:
                continue
            out[key] = convert(value, field.type)
        return out
    if schema.type == "union":
        for sub in schema.schemas:
            if sub.type == "null":
                continue
            out = convert(data, sub)
            return out
    if schema.type == "array":
        out = []
        if not data:
            return out
        for item in data:
            out.append(convert(item, schema.items))
        return out
    if schema.type == "map":
        out = {}
        for key, value in data.items():
            out[key] = convert(value, schema.values)
        return out
    # terminal node, do nothing
    return data

Convert camel case bigquery column names to snake case

See mozilla/gcp-ingestion#671

The direct-to-Parquet datasets coerce camelCase keys to snake_case, but right now our pipeline of pings into BigQuery does not.

I think this consistent naming would be desirable, and it would be best/simplest to handle it in the pipeline rather than deferring to views. This would require a coordinated change in the schema transpiler and in the BigQuery sink Dataflow jobs.

Prepare readme for release

Related: #35

IMO we should have the following information in the readme:

  1. What is it, where is it used
  2. installation
  3. Usage examples
  4. How to contribute
  5. License & CoC

The new docs from #44 could move into their own document (docs/development.md) so as not to blow up the readme too much.

What do you think?

Fail mozilla-pipeline-schemas CI if schemas contain nullable array elements

jsonschema can allow array elements to be null, but BigQuery can only make fields REPEATED or NULLABLE.

BigQuery Parquet imports solve this by wrapping both the array and its elements in structs: an array is transformed into a struct with one REPEATED field called list, which contains structs with a single field element. Both the outer struct and the element field can be NULLABLE, while list is REPEATED.

The transpiler currently converts an array of nullable elements in a jsonschema to a REPEATED field in a BigQuery schema, which cannot contain NULL. For example, {"properties":{"mylist":{"items":{"type":["integer","null"]},"type":"array"}},"type":"object"} -> [{"mode":"REPEATED","name":"mylist","type":"INT64"}]. This causes issues: if a jsonschema allows a message that BigQuery rejects during a file load operation, the whole file is rejected.
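
A hedged sketch of the Parquet-style wrapping described above, applied to the mylist example (the list and element names follow that description; this is not what the transpiler currently emits):

[
  {
    "mode": "NULLABLE",
    "name": "mylist",
    "type": "RECORD",
    "fields": [
      {
        "mode": "REPEATED",
        "name": "list",
        "type": "RECORD",
        "fields": [
          {"mode": "NULLABLE", "name": "element", "type": "INT64"}
        ]
      }
    ]
  }
]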

This was discussed in the GCP technical check-in on 2019-09-30, where it was determined that, due to backwards-compatibility constraints, the transpiler should error if schemas allow nullable array elements, and mozilla-pipeline-schemas CI should fail if the transpiler can't transform the schemas.
