
jsonschema-transpiler's Introduction

jsonschema-transpiler

CircleCI

A tool for transpiling JSON Schema into schemas for Avro and BigQuery.

JSON Schema is primarily used to validate incoming data, but contains enough information to describe the structure of the data. The transpiler encodes the schema for use with data serialization and processing frameworks. The main use-case is to enable ingestion of JSON documents into BigQuery through an Avro intermediary.

This tool can handle many of the composite types seen in modern data processing tools that support a SQL interface, such as lists, structures, key-value maps, and type variants.
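
For example, an object whose keys are not known ahead of time (expressed in JSON Schema with additionalProperties) is treated as a key-value map. A minimal sketch of the invocation; the output is omitted here, and the test cases under tests/resources show the exact expected schemas:

$ echo '{"type": "object", "additionalProperties": {"type": "integer"}}' | jsonschema-transpiler --type avro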

This tool is designed for generating new schemas from mozilla-pipeline-schemas, the canonical source of truth for JSON schemas in the Firefox Data Platform.

Installation

cargo install jsonschema-transpiler

Usage

A tool to transpile JSON Schema into schemas for data processing

USAGE:
    jsonschema-transpiler [FLAGS] [OPTIONS] [file]

FLAGS:
    -w, --allow-maps-without-value    Produces maps without a value field for incompatible or under-specified value
                                      schema
    -n, --force-nullable              Treats all columns as NULLABLE, ignoring the required section in the JSON Schema
                                      object
    -h, --help                        Prints help information
    -c, --normalize-case              snake_case column-names for consistent behavior between SQL engines
        --tuple-struct                Treats tuple validation as an anonymous struct
    -V, --version                     Prints version information

OPTIONS:
    -r, --resolve <resolve>    The resolution strategy for incompatible or under-specified schema [default: cast]
                               [possible values: cast, panic, drop]
    -t, --type <type>          The output schema format [default: avro]  [possible values: avro, bigquery]

ARGS:
    <file>    Sets the input file to use

JSON Schemas can be read from stdin or from a file.

Example usage

# An object with a single, optional boolean field
$ schema='{"type": "object", "properties": {"foo": {"type": "boolean"}}}'

$ echo $schema | jq
{
  "type": "object",
  "properties": {
    "foo": {
      "type": "boolean"
    }
  }
}

$ echo $schema | jsonschema-transpiler --type avro
{
  "fields": [
    {
      "default": null,
      "name": "foo",
      "type": [
        {
          "type": "null"
        },
        {
          "type": "boolean"
        }
      ]
    }
  ],
  "name": "root",
  "type": "record"
}

$ echo $schema | jsonschema-transpiler --type bigquery
[
  {
    "mode": "NULLABLE",
    "name": "foo",
    "type": "BOOL"
  }
]
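
Column names can also be normalized to snake_case for consistency between SQL engines. A hedged sketch of the output, assuming the transpiler's BigQuery type for JSON Schema integers is INT64:

$ echo '{"type": "object", "properties": {"fooBar": {"type": "integer"}}}' | jsonschema-transpiler --type bigquery --normalize-case
[
  {
    "mode": "NULLABLE",
    "name": "foo_bar",
    "type": "INT64"
  }
]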

Building

To build and test the package:

cargo build
cargo test

Older versions of the package (<= 1.9) relied on oniguruma for the snake-casing logic. To enable this module, add a feature flag:

cargo test --features oniguruma
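
During development, the CLI can also be invoked through Cargo by forwarding arguments after --. A minimal sketch, where schema.json is a hypothetical input file:

cargo run -- --type bigquery schema.json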

Contributing

Contributions are welcome. The API may change significantly, but the transformation between various source formats should remain consistent. To aid in the development of the transpiler, test cases are generated from a language-agnostic format under tests/resources.

{
    "name": "test-suite",
    "tests": [
        {
            "name": "test-case",
            "description": [
                "A short description of the test case."
            ],
            "tests": {
                "avro": {...},
                "bigquery": {...},
                "json": {...}
            }
        },
        ...
    ]
}
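
As an illustration, a filled-in test case for the optional boolean example above might look like the following. This is a sketch assembled from the outputs shown in the usage section; the names and exact layout of the actual resources under tests/resources may differ:

{
    "name": "example-suite",
    "tests": [
        {
            "name": "test_object_with_optional_boolean",
            "description": [
                "An object with a single, optional boolean field."
            ],
            "tests": {
                "avro": {
                    "fields": [
                        {
                            "default": null,
                            "name": "foo",
                            "type": [{"type": "null"}, {"type": "boolean"}]
                        }
                    ],
                    "name": "root",
                    "type": "record"
                },
                "bigquery": [
                    {"mode": "NULLABLE", "name": "foo", "type": "BOOL"}
                ],
                "json": {
                    "properties": {"foo": {"type": "boolean"}},
                    "type": "object"
                }
            }
        }
    ]
}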

Schemas provide a type system for data-structures. Most schema languages support a similar set of primitives. There are atomic data types like booleans, integers, and floats. These atomic data types can form compound units of structure, such as objects, arrays, and maps. The absence of a value is usually denoted by a null type. There are type modifiers, like the union of two types.

The following schemas are currently supported:

  • JSON Schema
  • Avro
  • BigQuery

In the future, it may be possible to support schemas from similar systems like Parquet and Spark, or to emit various interface description languages (IDLs) like Avro IDL.

Publishing

The jsonschema-transpiler is distributed as a crate via Cargo. Follow this checklist for deploying to crates.io.

  1. Bump the version number in the Cargo.toml, as per Semantic Versioning.
  2. Double check that cargo test and CI succeeds.
  3. Run cargo publish. It must be run with the --no-verify flag due to issue #59 (see the commands after this list).
  4. Draft a new release in GitHub corresponding with the version bump.
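
Concretely, steps 2 and 3 amount to the following commands (a minimal sketch):

cargo test
cargo publish --no-verify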

jsonschema-transpiler's People

Contributors

acmiyaguchi, badboy, fbertsch, kik-kik, relud, scholtzan


jsonschema-transpiler's Issues

Support PubSub schemas

PubSub has SQL capabilities via Apache Calcite and Beam: https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-ui-walkthrough#assign-pubsub-schema

The format is in YAML, and fits in the scope of this tool. Here's the example taken from the docs page.

  - column: event_timestamp
    description: Pub/Sub event timestamp
    mode: REQUIRED
    type: TIMESTAMP
  - column: attributes
    description: Pub/Sub message attributes
    mode: NULLABLE
    type: MAP<STRING,STRING>
  - column: payload
    description: Pub/Sub message payload
    mode: NULLABLE
    type: STRUCT
    subcolumns:
    - column: tr_time_str
      description: Transaction time string
      mode: NULLABLE
      type: STRING
    - column: first_name
      description: First name
      mode: NULLABLE
      type: STRING
    - column: last_name
      description: Last name
      mode: NULLABLE
      type: STRING
    - column: city
      description: City
      mode: NULLABLE
      type: STRING
    - column: state
      description: State
      mode: NULLABLE
      type: STRING
    - column: product
      description: Product
      mode: NULLABLE
      type: STRING
    - column: amount
      description: Amount of transaction
      mode: NULLABLE
      type: FLOAT

Support `date-time` format in JSON schemas

Here's an example:

fbertsch-23817:sandbox frankbertsch$ echo '{"type": "date-time"}' > datetime.schema.json
fbertsch-23817:sandbox frankbertsch$ jsonschema-transpiler --type bigquery datetime.schema.json
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("unknown variant `date-time`, expected one of `null`, `boolean`, `number`, `integer`, `string`, `object`, `array`", line: 0, column: 0)', src/libcore/result.rs:1009:5
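
For reference, date-time is defined as a format on the string type in JSON Schema rather than a type of its own, so a supported input would presumably be expressed as follows (an assumption about how the feature could look, analogous to the bytes format proposal in a later issue; the natural BigQuery mapping would be TIMESTAMP):

{
  "type": "string",
  "format": "date-time"
}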

Refactor `scripts/format-tests.py` into `build.rs`

The format-tests script is responsible for formatting and sorting the test cases under tests/resources. A build script is always run when the tests are built, whereas the standalone format script can be skipped by accident.

The logic is straightforward. As a closely related task, it may be useful to add sorting capabilities to the tests, since ordering matters in the presentation of the unit tests.

Support tuple validation in jsonschema arrays

Provide an error handling mechanism to drop, cast, or panic on under-specified fields

As per mozilla/mozilla-schema-generator#25 (comment):

{
    "metrics": {
        "type": "object",
        "additionalProperties": False
    }
}

is cast into a JSON blob type in BigQuery.

[
  {
    "mode": "NULLABLE",
    "name": "metrics",
    "type": "STRING"
  }
]

Instead of this, the output could be an empty struct. In general, an empty object could be represented as an empty struct. This option would work in BigQuery since maps are an extension of the struct type, but this could cause ambiguity in the avro/parquet representation.
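
A hedged sketch of how the --resolve strategies interact with such an under-specified field, assuming schema.json is a hypothetical file containing the schema above; the cast behavior is the one quoted, while the drop and panic behavior is inferred from the CLI help:

$ jsonschema-transpiler --type bigquery --resolve cast schema.json   # JSON blob as a STRING column, as shown above
$ jsonschema-transpiler --type bigquery --resolve drop schema.json   # the under-specified field is omitted (assumption)
$ jsonschema-transpiler --type bigquery --resolve panic schema.json  # fails with an error instead (assumption)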

Gather Data Science feedback on union and tuple types

Wanted: Data science review of query interface

We have a few remaining ambiguous types in JSON schemas that are preventing some important ping fields from appearing as fields in BigQuery. Currently, these ambiguous values end up as part of the additional_properties JSON blob in BigQuery ping tables, so they are available but awkward and potentially expensive to query.

This issue lays out the proposed interface for presenting these as fields. We want to gather feedback now from users of the data, because deploying these changes will be to some extent irreversible; once we coerce these fields to a certain BQ type, we cannot change our minds and choose a different type for an existing field.

For both of these schema issues, we're going to use the event ping to demonstrate.

tl;dr Please play with the query given below in #88 (comment) and leave feedback in this issue about any potential gotchas or improvements you'd like to see with the proposed transformations.

Support bytes as a data type

There's some use of the bytes SQL type for storing arbitrary binary data. Bytes are generally not well formed in JSON, but they are supported in Avro and BigQuery. A low-impact solution is to create a custom bytes format under the string type, as follows:

{
  "type": "string",
  "format": "bytes"
}

This schema isn't used to validate the payload, because documents containing binary data may also contain control characters that invalidate the JSON. Instead, the schema is descriptive and is used only to generate the Avro/BigQuery schemas.

See:
https://json-schema.org/understanding-json-schema/reference/string.html#format
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#bytes-type

Decide how to handle union types (like [string, int]) in BQ

Currently, we let fields go to additional_properties if they are a union like [string, int]. We would like a way to include such fields in the table structure.

Options for how to express in BQ

One option is to default to string in this case, so 4 and "4" in the JSON both become the string 4. In this case, we lose information about what the original type was, but that doesn't seem terribly important.

A variant on coercing to strings is that we could have it be a "JSON-formatted string field", such that 4 in JSON becomes the string 4 and "4" in JSON becomes the string "4" with the quotes retained. That would allow us to maintain original type information from the JSON, and maybe we'd be able to use JSON_EXTRACT functions on the field. The extra effort here doesn't seem worth it just for the original type information.

Another option is to turn this into a STRUCT<int INT64, string STRING> where only one of the values will be non-null.
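
A hedged sketch of that option as a BigQuery schema, for a hypothetical field named my_union_field; the sub-field names int and string mirror the STRUCT above and are not taken from any existing transpiler output:

[
  {
    "mode": "NULLABLE",
    "name": "my_union_field",
    "type": "RECORD",
    "fields": [
      {"mode": "NULLABLE", "name": "int", "type": "INT64"},
      {"mode": "NULLABLE", "name": "string", "type": "STRING"}
    ]
  }
]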

Options for naming the fields

I'm not sure if we've previously discussed the idea of potentially modifying the field name as a way to give ourselves flexibility to change how we want to encode in the future without having to change the type of a field.

For example, if we decided to use a struct, we could change field field_name to field_name_struct in the output BQ schema so that if we decide that some other representation works better, we could add it with a different name rather than having to change the type of field_name, necessitating recreating tables.

cc @acmiyaguchi

Union of string and bytes should be bytes

(Type::Atom(left), Type::Atom(right)) => {
    let atom = match (left, right) {
        (Atom::Boolean, Atom::Boolean) => Atom::Boolean,
        (Atom::Integer, Atom::Integer) => Atom::Integer,
        (Atom::Number, Atom::Number)
        | (Atom::Integer, Atom::Number)
        | (Atom::Number, Atom::Integer) => Atom::Number,
        (Atom::String, Atom::String) => Atom::String,
        (lhs, rhs) => {

The union of a string and bytes should probably be bytes; for now it will be dropped or cast into a string.

{"oneOf": [{"type": "string"}, {"type": "string", "format": "bytes"}]}

Originally posted by @acmiyaguchi in #82 (comment)

`build.rs` causes `cargo publish` to fail

error: failed to verify package tarball

Caused by:
  Source directory was modified by build.rs during cargo publish. Build scripts should not modify anything outside of OUT_DIR.

The build.rs script is used to format testing resources and to generate tests. This causes issues with cargo publish, which must be run with the --no-verify flag.

Refactor transpilation into a state-machine

The API for this application is simple. To implement a new schema target, the process is as follows:

  1. Define a type (i.e. a serde annotated data-structure) that describes the schema format
  2. Implement Into from ast::Tag into the target type

The Into trait is limited because it assumes that the conversion will succeed or panic. A panic within the into function fails quickly, but does not provide very much context. In #34, we would like to have proper error handling using the Result type.

A finite state machine (FSM) would provide a good interface for error handling. We should implement the following functions:

  1. A goto function to determine the next state - effectively an iterator defaulting to a depth-first traversal of the schema
  2. A failure function to generate an error result and indicate a possibility for backtracking

A single main function can then apply the goto functions until it reaches a successful state or collect (all possible) failures and surface them to the user.

Add integration tests against reference data

The test suite currently has a hand-curated set of expected outputs for the schemas. However, the resulting schema may not be valid against certain edge cases. For each of the test cases under tests/resources/, there should be a JSON document that passes validation. This can be used to verify that the transformations are working correctly.

This testing script was useful for validating sampled data from the AWS pipeline against mozilla-pipeline-schemas.

import json

def convert(data, schema):
    if schema.type == "string":
        if not isinstance(data, str):
            return json.dumps(data)
    if schema.type == "record":
        # iterate over all keys
        out = {}
        if not data:
            return out
        for key, value in data.items():
            # apply the appropriate transformations on the key
            key = format_key(key)
            field = schema.field_map.get(key)
            if not field:
                continue
            out[key] = convert(value, field.type)
        return out
    if schema.type == "union":
        for sub in schema.schemas:
            if sub.type == "null":
                continue
            out = convert(data, sub)
            return out
    if schema.type == "array":
        out = []
        if not data:
            return out
        for item in data:
            out.append(convert(item, schema.items))
        return out
    if schema.type == "map":
        out = {}
        for key, value in data.items():
            out[key] = convert(value, schema.values)
        return out
    # terminal node, do nothing
    return data

Convert camel case bigquery column names to snake case

See mozilla/gcp-ingestion#671

The direct-to-Parquet datasets coerce camelCase keys to snake_case, but right now our pipeline of pings into BigQuery does not.

I think this consistent naming would be desirable, and it would be best/simplest to handle it in the pipeline rather than deferring to views. This would require a coordinated change in the schema transpiler and in the BigQuery sink Dataflow jobs.

Prepare readme for release

Related: #35

IMO we should have the following information in the readme:

  1. What is it, where is it used
  2. installation
  3. Usage examples
  4. How to contribute
  5. License & CoC

The new docs from #44 could move into their own document (docs/development.md) so as not to blow up the readme too much.

What do you think?

Fail mozilla-pipeline-schemas CI if schemas contain nullable array elements

jsonschema can allow array elements to be null, but BigQuery can only make fields REPEATED or NULLABLE.

BigQuery Parquet imports solve this by wrapping both the array and its elements in structs: an array is transformed into a struct with one REPEATED field called list, which contains structs with a single field element. Both the outer struct and the element field can be NULLABLE, while list is REPEATED.

The transpiler currently converts an array of nullable elements in a jsonschema to a REPEATED field in a BigQuery schema, which cannot contain NULL. For example, {"properties":{"mylist":{"items":{"type":["integer","null"]},"type":"array"}},"type":"object"} -> [{"mode":"REPEATED","name":"mylist","type":"INT64"}]. This causes issues: if a jsonschema allows a message that BigQuery rejects during a file load operation, the whole file is rejected.
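
A hedged sketch of the Parquet-style wrapping described above, applied to the mylist example (the list and element names follow that description; this is not what the transpiler currently emits):

[
  {
    "mode": "NULLABLE",
    "name": "mylist",
    "type": "RECORD",
    "fields": [
      {
        "mode": "REPEATED",
        "name": "list",
        "type": "RECORD",
        "fields": [
          {"mode": "NULLABLE", "name": "element", "type": "INT64"}
        ]
      }
    ]
  }
]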

This was discussed in the GCP technical check-in on 2019-09-30, where it was determined that, due to backwards-compatibility constraints, the transpiler should error if schemas allow nullable array elements, and mozilla-pipeline-schemas CI should fail if the transpiler can't transform the schemas.
