
jsonschema-transpiler's Issues

Provide an error handling mechanism to drop, cast, or panic on under-specified fields

As per mozilla/mozilla-schema-generator#25 (comment):

{
    "metrics": {
        "type": "object",
        "additionalProperties": False
    }
}

is cast into a JSON blob type in BigQuery.

[
  {
    "mode": "NULLABLE",
    "name": "metrics",
    "type": "STRING"
  }
]

Instead of this, the output could be an empty struct; in general, an empty object could be represented as an empty struct. This option would work in BigQuery since maps are an extension of the struct type, but it could cause ambiguity in the Avro/Parquet representations.
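
For illustration, the proposed empty-struct output might look roughly like the following; this is a sketch of the idea, not something the transpiler currently emits.

[
  {
    "fields": [],
    "mode": "NULLABLE",
    "name": "metrics",
    "type": "RECORD"
  }
]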

Prepare readme for release

Related: #35

IMO we should have the following information in the readme:

  1. What is it, where is it used
  2. Installation
  3. Usage examples
  4. How to contribute
  5. License & CoC

The new docs from #44 could move into their own document (docs/development.md) to avoid blowing up the readme too much.

What do you think?

Add integration tests against reference data

The test suite currently has a hand-curated set of expected outputs for the schemas. However, the resulting schemas may not be valid for certain edge cases. For each of the test cases under tests/resources/, there should be a JSON document that passes validation. This can be used to verify that the transformations are working correctly.

The following testing script was useful for validating sampled data from the AWS pipeline against mozilla-pipeline-schemas.

import json


def convert(data, schema):
    if schema.type == "string":
        if not isinstance(data, str):
            return json.dumps(data)
    if schema.type == "record":
        # iterate over all keys
        out = {}
        if not data:
            return out
        for key, value in data.items():
            # apply the appropriate transformations on the key
            # (format_key is assumed to be defined elsewhere in the script)
            key = format_key(key)
            field = schema.field_map.get(key)
            if not field:
                continue
            out[key] = convert(value, field.type)
        return out
    if schema.type == "union":
        for sub in schema.schemas:
            if sub.type == "null":
                continue
            out = convert(data, sub)
        return out
    if schema.type == "array":
        out = []
        if not data:
            return out
        for item in data:
            out.append(convert(item, schema.items))
        return out
    if schema.type == "map":
        out = {}
        for key, value in data.items():
            out[key] = convert(value, schema.values)
        return out
    # terminal node, do nothing
    return data

Support tuple validation in jsonschema arrays
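
For reference, tuple validation constrains each position of an array to its own schema (draft-07 syntax). One conceivable mapping is a struct-like target with one field per position, though that is only an assumption here.

{
  "type": "array",
  "items": [
    {"type": "string"},
    {"type": "integer"}
  ]
}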

Refactor `scripts/format-tests.py` into `build.rs`

The format-tests script is responsible for formatting and sorting the test cases under tests/resources. The build script always runs as part of the tests, whereas the standalone format script is easy to forget.

The logic is straightforward. As a closely related task, it may be useful to add sorting capabilities to the tests, since the ordering matters in the presentation of the unit tests.
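
For illustration, a minimal sketch of that build.rs pass might look like the following, assuming serde_json is available as a build dependency and a flat tests/resources layout (the real tree may be nested):

// A minimal sketch of the build.rs formatting pass. Note: writing back into
// the source tree is exactly what later breaks `cargo publish` (see the
// build.rs issue below).
use std::fs;

fn main() {
    for entry in fs::read_dir("tests/resources").expect("missing tests/resources") {
        let path = entry.expect("unreadable directory entry").path();
        if path.extension().and_then(|ext| ext.to_str()) != Some("json") {
            continue;
        }
        let text = fs::read_to_string(&path).expect("unreadable file");
        // serde_json::Value keeps object keys in a sorted map by default, so
        // re-serializing both pretty-prints and key-sorts the fixture.
        let value: serde_json::Value =
            serde_json::from_str(&text).expect("invalid JSON fixture");
        let pretty = serde_json::to_string_pretty(&value).expect("serialization failed");
        fs::write(&path, pretty + "\n").expect("failed to write fixture");
        println!("cargo:rerun-if-changed={}", path.display());
    }
}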

Gather Data Science feedback on union and tuple types

Wanted: Data science review of query interface

We have a few remaining ambiguous types in JSON schemas that are preventing some important ping fields from appearing as fields in BigQuery. Currently, these ambiguous values end up as part of the additional_properties JSON blob in BigQuery ping tables, so they are available but awkward and potentially expensive to query.

This issue lays out the proposed interface for presenting these as fields. We want to gather feedback now from users of the data, because deploying these changes will be to some extent irreversible; once we coerce these fields to a certain BQ type, we cannot change our minds and choose a different type for an existing field.

For both of these schema issues, we're going to use the event ping to demonstrate.

tl;dr Please play with the query given below in #88 (comment) and leave feedback in this issue about any potential gotchas or improvements you'd like to see with the proposed transformations.

Support `date-time` format in JSON schemas

Here's an example:

fbertsch-23817:sandbox frankbertsch$ echo '{"type": "date-time"}' > datetime.schema.json
fbertsch-23817:sandbox frankbertsch$ jsonschema-transpiler --type bigquery datetime.schema.json
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("unknown variant `date-time`, expected one of `null`, `boolean`, `number`, `integer`, `string`, `object`, `array`", line: 0, column: 0)', src/libcore/result.rs:1009:5
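
For reference, JSON Schema expresses this as a format on the string type rather than as a standalone type; supporting it would presumably mean mapping a field like the following to a BigQuery TIMESTAMP (an assumption about the desired behavior, not current output):

{"type": "string", "format": "date-time"}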

Union of string and bytes should be bytes

(Type::Atom(left), Type::Atom(right)) => {
    let atom = match (left, right) {
        (Atom::Boolean, Atom::Boolean) => Atom::Boolean,
        (Atom::Integer, Atom::Integer) => Atom::Integer,
        (Atom::Number, Atom::Number)
        | (Atom::Integer, Atom::Number)
        | (Atom::Number, Atom::Integer) => Atom::Number,
        (Atom::String, Atom::String) => Atom::String,
        (lhs, rhs) => {

The union of a string and bytes should probably be bytes; for now it'll be dropped or cast into a string.

{"oneOf": [{"type": "string"}, {"type": "string", "format": "bytes"}]}

Originally posted by @acmiyaguchi in #82 (comment)
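
A possible fix, assuming the AST gains an Atom::Bytes variant (it does not exist today), would be an extra arm in the same match, sketched here:

// Hypothetical: Atom::Bytes is not part of the AST yet. The union of bytes
// with itself or with a string collapses to bytes, since any valid string is
// also a valid byte sequence.
(Atom::Bytes, Atom::Bytes)
| (Atom::String, Atom::Bytes)
| (Atom::Bytes, Atom::String) => Atom::Bytes,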

Support PubSub schemas

Pub/Sub has SQL capabilities via Apache Calcite and Beam: https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-ui-walkthrough#assign-pubsub-schema

The format is YAML and fits within the scope of this tool. Here's the example taken from the docs page.

  - column: event_timestamp
    description: Pub/Sub event timestamp
    mode: REQUIRED
    type: TIMESTAMP
  - column: attributes
    description: Pub/Sub message attributes
    mode: NULLABLE
    type: MAP<STRING,STRING>
  - column: payload
    description: Pub/Sub message payload
    mode: NULLABLE
    type: STRUCT
    subcolumns:
    - column: tr_time_str
      description: Transaction time string
      mode: NULLABLE
      type: STRING
    - column: first_name
      description: First name
      mode: NULLABLE
      type: STRING
    - column: last_name
      description: Last name
      mode: NULLABLE
      type: STRING
    - column: city
      description: City
      mode: NULLABLE
      type: STRING
    - column: state
      description: State
      mode: NULLABLE
      type: STRING
    - column: product
      description: Product
      mode: NULLABLE
      type: STRING
    - column: amount
      description: Amount of transaction
      mode: NULLABLE
      type: FLOAT

Convert camel case bigquery column names to snake case

See mozilla/gcp-ingestion#671

The direct-to-parquet datasets coerce camel case keys to snake case, but right now our pipeline of pings into BigQuery does not.

I think this consistent naming would be desirable, and it would be best/simplest to handle it in the pipeline rather than deferring to views. This would require a coordinated change in the schema transpiler and in the BigQuery sink Dataflow jobs.
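
For illustration only, a naive camelCase-to-snake_case rename could look like the sketch below; the transpiler and the sink jobs would need to agree on the exact normalization rules (acronym runs, digits, existing underscores), which this sketch ignores.

// A minimal sketch of camelCase -> snake_case renaming. It handles the simple
// case only; runs of uppercase letters (e.g. "useHTTPS") and digits would need
// rules agreed on across the pipeline.
fn to_snake_case(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 4);
    for (i, ch) in name.chars().enumerate() {
        if ch.is_uppercase() {
            if i > 0 {
                out.push('_');
            }
            out.extend(ch.to_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    assert_eq!(to_snake_case("submissionTimestamp"), "submission_timestamp");
    assert_eq!(to_snake_case("city"), "city");
}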

Support bytes as a data type

There's some use of the bytes SQL type for storing arbitrary binary data. Bytes are generally not well formed in JSON, but they are supported in Avro and BigQuery. A low-impact solution is to create a custom bytes format under the string type, as follows:

{
  "type": "string",
  "format": "bytes"
}

This schema isn't used to validate the payload, because documents containing binary data may also contain control characters that invalidate the JSON. Instead, the schema is descriptive and is only used to generate the Avro/BigQuery schemas.
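
For illustration, a field declared with this format might then transpile to a BYTES column along these lines (a sketch; the field name is hypothetical):

[
  {
    "mode": "NULLABLE",
    "name": "payload",
    "type": "BYTES"
  }
]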

See:
https://json-schema.org/understanding-json-schema/reference/string.html#format
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#bytes-type

Refactor transpilation into a state-machine

The API for this application is simple. Implementing a new schema target involves the following process:

  1. Define a type (i.e. a serde annotated data-structure) that describes the schema format
  2. Implement Into from ast::Tag into the target type

The Into trait is limited because it assumes that the conversion will either succeed or panic. A panic within the into function fails fast, but does not provide much context. In #34, we would like to have proper error handling using the Result type.

A finite state machine (FSM) would provide a good interface for error handling. We should implement the following functions:

  1. A goto function to determine the next state - effectively an iterator defaulting to a depth-first traversal of the schema
  2. A failure function to generate an error result and indicate a possibility for backtracking

A single main function can then apply the goto functions until it reaches a successful state or collect (all possible) failures and surface them to the user.
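
A rough sketch of what that interface could look like follows; the names (State, TranspileError, run) are hypothetical and not part of the current code base.

// Hypothetical sketch of the proposed FSM-style conversion.
#[derive(Debug)]
struct TranspileError {
    path: String,    // location in the schema, e.g. "properties.metrics"
    message: String, // what went wrong at that node
}

// One state per node visited in a depth-first traversal of the schema AST.
trait State {
    // Determine the next state, or None once the traversal is complete.
    fn goto(self: Box<Self>) -> Result<Option<Box<dyn State>>, TranspileError>;

    // Build an error for this node so the driver can surface it (and possibly
    // backtrack) instead of panicking inside `Into`.
    fn failure(&self, message: &str) -> TranspileError;
}

// Drive the machine until it reaches a terminal state or fails; errors are
// returned to the caller rather than panicking.
fn run(mut state: Box<dyn State>) -> Result<(), TranspileError> {
    loop {
        match state.goto()? {
            Some(next) => state = next,
            None => return Ok(()),
        }
    }
}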

Fail mozilla-pipeline-schemas CI if schemas contain nullable array elements

jsonschema can allow array elements to be null, but BigQuery can only make fields REPEATED or NULLABLE, not both.

BigQuery parquet imports solve this by wrapping both the array and its elements in structs: an array is transformed into a struct with one REPEATED field called list, which contains structs with a single field called element. Both the outer struct and the element field can be NULLABLE, while list is REPEATED.

The transpiler currently converts an array of nullable elements in a jsonschema to a REPEATED field in a BigQuery schema, which cannot contain NULL. For example, {"properties":{"mylist":{"items":{"type":["integer","null"]},"type":"array"}},"type":"object"} -> [{"mode":"REPEATED","name":"mylist","type":"INT64"}]. This causes issues: if a jsonschema allows a message that BigQuery rejects during a file load operation, the whole file is rejected.
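
For illustration, the parquet-style wrapping described above would turn that same mylist field into roughly the following (a sketch; the transpiler does not currently emit this):

[
  {
    "mode": "NULLABLE",
    "name": "mylist",
    "type": "RECORD",
    "fields": [
      {
        "mode": "REPEATED",
        "name": "list",
        "type": "RECORD",
        "fields": [
          {"mode": "NULLABLE", "name": "element", "type": "INT64"}
        ]
      }
    ]
  }
]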

This was discussed in the GCP Technical check-in on 2019-09-30, where it was determined that, due to backwards-compatibility constraints, the transpiler should for now error if schemas allow nullable array elements, and mozilla-pipeline-schemas CI should fail if the transpiler can't transform the schemas.

`build.rs` causes `cargo publish` to fail

error: failed to verify package tarball

Caused by:
  Source directory was modified by build.rs during cargo publish. Build scripts should not modify anything outside of OUT_DIR.

The build.rs script is used to format testing resources and to generate tests. This causes issues with cargo publish, which must be run with the --no-verify flag.

Decide how to handle union types (like [string, int]) in BQ

Currently, we let fields go to additional_properties if they are a union like [string, int]. We would like a way to include such fields in the table structure.

Options for how to express in BQ

One option is to default to string in this case, so 4 and "4" in the JSON both become the string 4. In this case, we lose information about what the original type was, but that doesn't seem terribly important.

A variant on coercing to strings is that we could have it be a "JSON-formatted string field" such that 4 in JSON becomes the string 4 and "4" in JSON becomes the string "4" with the quotes retained. That would allow us to maintain original type information from the JSON, and maybe we'd be able to use JSON_EXTRACT functions on the field. The extra effort here doesn't seem worth it just for the original type information.

Another option is to turn this into a STRUCT<int INT64, string STRING> where only one of the values will be non-null.
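
For illustration, that struct option might appear in the BigQuery schema roughly as follows (a sketch; the sub-field names are just one possibility):

{
  "mode": "NULLABLE",
  "name": "field_name",
  "type": "RECORD",
  "fields": [
    {"mode": "NULLABLE", "name": "int", "type": "INT64"},
    {"mode": "NULLABLE", "name": "string", "type": "STRING"}
  ]
}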

Options for naming the fields

I'm not sure if we've previously discussed the idea of potentially modifying the field name as a way to give ourselves flexibility to change how we want to encode in the future without having to change the type of a field.

For example, if we decided to use a struct, we could change the field field_name to field_name_struct in the output BQ schema, so that if we decide some other representation works better, we could add it under a different name rather than having to change the type of field_name, which would necessitate recreating tables.

cc @acmiyaguchi
