Comments (6)
Event arrays (and other tuple types)
Discussed in depth in #38, we model events in main ping, event ping, focus event ping, etc. as a "tuple" of 6 elements. Each of those elements has a specific type, but the types are different for the different positions.
We propose to have the jsonschema-transpiler support these "tuple" types as STRUCTs with fields named according to BigQuery's anonymous field syntax.
Values in the event ping at payload.events.parent
look like:
[
[
14721983,
"uptake.remotecontent.result",
"uptake",
"remotesettings",
"success",
{
"source": "settings-changes-monitoring",
"trigger": "timer",
"age": "240805"
}
],
...
]
The pipeline will turn this into a repeated record field payload.events.parent
in the BigQuery table like:
f0_ | f1_ | f2_ | f3_ | f4_ | f5_.key | f5_.value |
---|---|---|---|---|---|---|
14721983 | uptake.remotecontent.result | uptake | remotesettings | success | source | settings-changes-monitoring |
See table moz-fx-data-shared-prod.analysis.klukas_event_raw
for one day of data with events parsed into the above form.
Those field names aren't very useful, but they're easy to produce consistently in the pipeline and they match BQ conventions. With the prod data layout in BQ, though, we expect users to be accessing ping data through views, and we can provide more helpful names for these fields at the view layer.
So, in practice, a user would query the view telemetry.event
and get nicer names. I've produced an example of what the view will look like in moz-fx-data-shared-prod.analysis.klukas_event
:
SELECT
document_id,
parent_events.*
FROM
`moz-fx-data-shared-prod.analysis.klukas_event`
CROSS JOIN
UNNEST(payload.events.parent) AS parent_events
LIMIT
2
from jsonschema-transpiler.
Union types
It's great to have a concrete type everywhere, but we unfortunately have plenty of ambiguous cases where our JSON schema has to allow multiple types, such as ["integer", "boolean"]
because different clients send the field as different types.
We have prior art for this in how BigQuery handles importing Avro files that contain union types. In such cases, the field becomes a struct with one field for each type in the union named like <avro_type_name>_value
. For each row, only one of the fields in the struct will be populated with a non-null value.
Like in the tuple case, views will give us some flexibility to present these structs in different ways for different contexts.
A concrete example in event
pings and main pings is environment.addons.theme.foreignInstall
which can be either a boolean or an int. This will be transformed in BQ to a STRUCT<boolean_value BOOLEAN, long_value INT64>
.
from jsonschema-transpiler.
Putting it all together
Here is a query on top of view moz-fx-data-shared-prod.analysis.klukas_event
that shows the proposed user interface for these two types of data:
SELECT
-- A field from the top level of the ping
document_id,
-- A union-type field, where we only care about cases where the client had a boolean value
environment.addons.theme.foreign_install.boolean_value AS theme_foreign_install,
-- Unnested contents of events from process "parent"
parent_events.*
FROM
`moz-fx-data-shared-prod.analysis.klukas_event`
CROSS JOIN
UNNEST(payload.events.parent) AS parent_events
WHERE
environment.addons.theme.foreign_install.boolean_value = TRUE
LIMIT
10
Row | document_id | theme_foreign_install | event_timestamp | event_timestamp_parsed | event_category | event_method | event_object | event_string_value | event_map_values.key | event_map_values.value | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | f23484ce-8f18-4005-b0e2-7ade05cbf38e | true | 281505 | 2019-08-26 12:45:41.505 UTC | addonsManager | update | extension | 1 | updated_from | app | |
step | started | ||||||||||
addon_id | {b9db16a4-6edc-47ec-a1f4-b86292ed211d} | ||||||||||
2 | 2ef7818d-1efc-44c1-9bcf-0dad2490fdab | true | 11861113 | 2019-08-26 15:29:41.113 UTC | pwmgr | open_management | mainmenu | null |
from jsonschema-transpiler.
I'll always take rectangular data over nested data if you're offering it to me; Sunah says we can build an additional view on top to recreate something like the Parquet events
table. I think that would lead me to be less worried about the underlying representation.
I'm not sure how y'all came to the decision about how to represent histograms in main pings, but, similarly, representing the event extras as JSON objects rather than a key-value-struct-array would be a little goofy-looking but might be easier to work with.
The union type proposal makes sense to me; as a user, if I end up typing COALESCE(CAST(foo.integer_value AS STRING), foo.string_value)
a few times, that's not the end of the world. I wonder a little about whether we could get away with coercing ambiguous types to their string representation and throwing away the original type information but that seems worse in most respects.
from jsonschema-transpiler.
I wonder a little about whether we could get away with coercing ambiguous types to their string representation and throwing away the original type information but that seems worse in most respects.
It does feel like storing these values as a JSON string could be more convenient in some ways, but would introduce more ambiguities. We can preserve the original JSON types by including quotes in the JSON string if the original value was a string, but then the user would need to strip out the quotes. If we don't include the quotes, then we lose type information (which doesn't seem very important, but I could imagine cases where it turns out to be).
I'm imagining that in most cases union types are a historical artifact where ancient versions produce one of the types and will eventually fall off completely, so analyses will often be able to assume a single type and not worry about COALESCEing multiple potential types.
There might be room for a few UDFs to help with common cases of parsing these as well, but I think we'll understand that better once folks are making queries on top of these union types in the wild.
from jsonschema-transpiler.
We can preserve the original JSON types by including quotes in the JSON string if the original value was a string, but then the user would need to strip out the quotes.
Desktop Firefox only allows strings anyway for both the key and value, which we added to the event specification early on due to lack of union type support in Parquet. (See the blurb for "extra" here: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/collection/events.html#serialization-format)
from jsonschema-transpiler.
Related Issues (20)
- `build.rs` causes `cargo publish` to fail HOT 3
- Support `date-time` format in JSON schemas HOT 1
- Add integration tests against reference data
- Validate BigQuery schemas by round-trip via Avro HOT 1
- Validate input jsonschema before transpiling HOT 1
- Use `TIMESTAMP` instead of `DATETIME` in BigQuery HOT 1
- Consider supporting legacy BigQuery types HOT 4
- Provide an error handling mechanism to drop, cast, or panic on under-specified fields HOT 4
- Convert camel case bigquery column names to snake case HOT 8
- `--resolve drop` fails on simple schema with incompatible multi-typed items
- Support PubSub schemas
- Support bytes as a data type HOT 1
- Union of string and bytes should be bytes
- Add option to make all fields nullable for BQ schemas
- Decide how to handle union types (like [string, int]) in BQ HOT 10
- Fail mozilla-pipeline-schemas CI if schemas contain nullable array elements HOT 2
- doc.rs fails to build as of v1.6.0
- `<unknown>` property field is dropped from an object HOT 1
- [Proposal] Add python bindings
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from jsonschema-transpiler.