Giter VIP home page Giter VIP logo

Comments (8)

acmiyaguchi avatar acmiyaguchi commented on August 20, 2024

Fixed in #37, it will almost always be casted into JSON.

from jsonschema-transpiler.

acmiyaguchi avatar acmiyaguchi commented on August 20, 2024

While this is no longer immediately broken, it should be supported properly as either an anonymous struct (_0, _1, _2, etc based on the min/max number of fields), or a regular struct with a single list item. The latter will require coordination from the decoding piece in mozilla/gcp-ingestion.

from jsonschema-transpiler.

relud avatar relud commented on August 20, 2024

an anonymous struct (_0, _1, _2, etc based on the min/max number of fields)

if we do this we should follow the field naming that BigQuery uses for anonymous structs (_f0, _f1, _f2, etc)

or a regular struct with a single list item

As discussed in the team meeting today, we could use a oneOf condition to specify a struct alongside tuples, providing both an upgrade path and the necessary info to coerce tuples to structs.

from jsonschema-transpiler.

acmiyaguchi avatar acmiyaguchi commented on August 20, 2024

As discussed in the team meeting today, we could use a oneOf condition to specify a struct alongside tuples, providing both an upgrade path and the necessary info to coerce tuples to structs.

This is a nice solution. It seems like there should be some idea of precedence among the different types, such as the Union[Tuple, Object] or Union[Tuple, String]. This should be handled by ast::Union::collapse:

/// Collapse a union of types into a structurally compatible type.
///
/// Typically, variant types are not allowed in a table schema. If a variant type
/// type is found, it will be converted into a JSON type. Because of the ambiguity
/// around finding structure in a JSON blob, the union of any type with JSON will
/// be consumed by the JSON type. In a similar fashion, a table schema is determined
/// to be nullable or required via occurances of null types in unions.
pub fn collapse(&self) -> Tag {

I'm a bit wary of adding too much custom logic into the encoder, but this seems like it strikes a good balance.

from jsonschema-transpiler.

jklukas avatar jklukas commented on August 20, 2024

if we do this we should follow the field naming that BigQuery uses for anonymous structs (_f0, _f1, _f2, etc)

If we do this, we could still support nice naming for event-type fields at the view layer. So for the event ping, the telemetry.events view would replace the _f0, etc. names with more descriptive variants.

from jsonschema-transpiler.

jklukas avatar jklukas commented on August 20, 2024

I'm working on a POC of what one day of event would look like with this scheme. I'm going to write up a short proposal to share the interface with data scientists and gather any feedback.

Here's the query to create what the _stable table will look like:

CREATE TEMP FUNCTION
  udf_js_json_extract_events (input STRING)
  RETURNS ARRAY<STRUCT< f0_ INT64,
  f1_ STRING,
  f2_ STRING,
  f3_ STRING,
  f4_ STRING,
  f5_ ARRAY<STRUCT<key STRING,
  value STRING>> >>
  LANGUAGE js AS """
    if (input == null) {
      return null;
    }
    var parsed = JSON.parse(input);
    var result = [];
      for (var event of parsed) {
        map_values = []
        for (var key in event[5]) {
          map_values.push({"key": key, "value": event[5][key]})
        }
        var structured = {
          "f0_": event[0],
          "f1_": event[1],
          "f2_": event[2],
          "f3_": event[3],
          "f4_": event[4],
          "f5_": map_values
        }
        result.push(structured)
    }
    return result;
""";
CREATE or replace TABLE
  `moz-fx-data-shared-prod.analysis.klukas_event_raw`
PARTITION BY
  DATE(submission_timestamp)
CLUSTER BY
  sample_id AS
SELECT
  * REPLACE((
    SELECT
      AS STRUCT payload.*,
      STRUCT( udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.parent')) AS parent,
        udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.content')) AS content,
        udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.extension')) AS extension,
        udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.gpu')) AS gpu,
        udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.dynamic')) AS dynamic ) AS events) AS payload)
FROM
  `moz-fx-data-shared-prod.telemetry_stable.event_v4`
WHERE
  DATE(submission_timestamp) = "2019-08-26"

from jsonschema-transpiler.

jklukas avatar jklukas commented on August 20, 2024

On top of the above table (which gives names f0_, etc. to fields), here is a view that gives logical names to the fields and additionally parses the timestamp value into a BQ TIMESTAMP type:

SELECT
  * REPLACE((
    SELECT
      AS STRUCT payload.* REPLACE ( STRUCT(ARRAY(
          SELECT
            AS STRUCT f0_ AS event_timestamp,
            TIMESTAMP_MILLIS(payload.process_start_timestamp + f0_) AS event_timestamp_parsed,
            f1_ AS event_category,
            f2_ AS event_method,
            f3_ AS event_object,
            f4_ AS event_string_value,
            f5_ event_map_values
          FROM
            UNNEST(payload.events.parent)) AS parent) AS events)) AS payload)
FROM
  `moz-fx-data-shared-prod.analysis.klukas_event_raw`

Users would then likely use this view with an UNNEST like:

SELECT
  document_id,
  parent_events.*
FROM
  `moz-fx-data-shared-prod.analysis.klukas_event`
CROSS JOIN
  UNNEST(payload.events.parent) AS parent_events
LIMIT
  2
Row document_id event_timestamp event_timestamp_parsed event_category event_method event_object event_string_value event_map_values.key event_map_values.value  
1 5f134697-3831-44b3-9416-9c4910b84fbb 266415 2019-08-25 20:21:26.415 UTC uptake.remotecontent.result uptake remotesettings network_error source settings-changes-monitoring  
                trigger timer  
2 b311fd60-bdc9-402d-861a-51f1bed7be0a 180960 2019-08-26 01:26:00.960 UTC uptake.remotecontent.result uptake remotesettings success source settings-changes-monitoring  
                trigger timer  
                age 260558

from jsonschema-transpiler.

jklukas avatar jklukas commented on August 20, 2024

Seeking feedback on this from data users in #88

from jsonschema-transpiler.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.