hgrsd / drivel

Infer a JSON schema from example data, produce nonsense synthetic data (drivel) according to the schema

License: MIT License

Rust 100.00%

json rust schema synthetic-data test-data-generator schema-inference

drivel

drivel is a command-line tool written in Rust for inferring a schema from an example JSON (or JSON lines) file, and generating synthetic data (the drivel in question) based on the inferred schema.

Features

Schema Inference: drivel can analyze JSON input and infer its schema, including data types, array lengths, and object structures.
Data Generation: Based on the inferred schema, drivel can generate synthetic data that adheres to the inferred structure.
Easy to integrate: drivel reads JSON input from stdin and writes its output to stdout, allowing for easy integration into pipelines and workflows.

Installation

Binaries and a shell-based installer are available for each release.

To install the drivel executable through Cargo, ensure you have the Rust toolchain installed and run:

cargo install drivel

To add drivel as a dependency to your project, e.g., to use the schema inference engine, run:

cargo add drivel

Usage

Infer a schema from JSON input, and generate synthetic data based on the inferred schema.

Usage: drivel [OPTIONS] <COMMAND>

Commands:
  describe  Describe the inferred schema for the input data
  produce   Produce synthetic data adhering to the inferred schema
  help      Print this message or the help of the given subcommand(s)

Options:
      --infer-enum                     Infer that some string fields are enums based on the number of unique values seen
      --enum-max-uniq <ENUM_MAX_UNIQ>  The maximum ratio of unique values to total values for a field to be considered an enum. Default = 0.1
      --enum-min-n <ENUM_MIN_N>        The minimum number of strings to consider when inferring enums. Default = 1
  -h, --help                           Print help
  -V, --version                        Print version
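
The three enum options combine into a single decision per string field. A minimal sketch of how such a heuristic could work, assuming the ratio is computed as unique values divided by total values and compared inclusively; `looks_like_enum` is a hypothetical helper, not drivel's actual implementation:

```rust
use std::collections::HashSet;

/// Hypothetical helper: decides whether the string values observed for one
/// field should be treated as an enum.
///
/// `max_uniq_ratio` plays the role of --enum-max-uniq and `min_n` the role
/// of --enum-min-n.
fn looks_like_enum(values: &[&str], max_uniq_ratio: f64, min_n: usize) -> bool {
    // Too few samples (or none at all) to draw a conclusion.
    if values.is_empty() || values.len() < min_n {
        return false;
    }
    // Count distinct values and compare the ratio against the threshold.
    let unique: HashSet<&str> = values.iter().copied().collect();
    let ratio = unique.len() as f64 / values.len() as f64;
    ratio <= max_uniq_ratio
}
```

With the default ratio of 0.1, a field that only ever contains "GET" or "POST" across 30 records would qualify under this sketch, while a field with a distinct value in every record would not.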

Examples

Consider a JSON file input.json:

{
  "name": "John Doe",
  "id": "0e3a99a5-0201-4444-9ab1-8343fac56233",
  "age": 30,
  "is_student": false,
  "grades": [85, 90, 78],
  "address": {
    "city": "New York",
    "zip_code": "10001"
  }
}

Running drivel in 'describe' mode:

cat input.json | drivel describe

Output:

{
  "age": int (30),
  "address": {
    "city": string (8),
    "zip_code": string (5)
  },
  "is_student": boolean,
  "grades": [
    int (78-90)
  ] (3),
  "name": string (8),
  "id": string (uuid)
}
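
The annotations in this output (integer ranges, string lengths, array lengths) suggest that inference amounts to mapping each value to a small schema and merging the schemas of sibling values. A minimal sketch under that assumption; `Value`, `Schema`, `infer` and `merge` are illustrative names, not drivel's actual types or API, and objects, floats, nulls and UUID detection are left out:

```rust
/// A tiny stand-in for a parsed JSON value (not drivel's or serde_json's type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Int(i64),
    Str(String),
    Array(Vec<Value>),
}

/// A simplified schema mirroring the annotations in the `describe` output.
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Boolean,
    Int { lo: i64, hi: i64 },
    Str { min_len: usize, max_len: usize },
    Array { len: usize, element: Option<Box<Schema>> },
    /// Fallback when two observations cannot be reconciled in this sketch.
    Unknown,
}

/// Combines two schemas observed for the same position by widening ranges.
/// Merging nested arrays is omitted here and falls through to `Unknown`.
fn merge(a: Schema, b: Schema) -> Schema {
    match (a, b) {
        (Schema::Boolean, Schema::Boolean) => Schema::Boolean,
        (Schema::Int { lo: a_lo, hi: a_hi }, Schema::Int { lo: b_lo, hi: b_hi }) => {
            Schema::Int { lo: a_lo.min(b_lo), hi: a_hi.max(b_hi) }
        }
        (
            Schema::Str { min_len: a_min, max_len: a_max },
            Schema::Str { min_len: b_min, max_len: b_max },
        ) => Schema::Str { min_len: a_min.min(b_min), max_len: a_max.max(b_max) },
        _ => Schema::Unknown,
    }
}

/// Infers a schema for a single value; array elements are inferred one by
/// one and folded together with `merge`.
fn infer(value: &Value) -> Schema {
    match value {
        Value::Bool(_) => Schema::Boolean,
        Value::Int(n) => Schema::Int { lo: *n, hi: *n },
        Value::Str(s) => {
            let len = s.chars().count();
            Schema::Str { min_len: len, max_len: len }
        }
        Value::Array(items) => Schema::Array {
            len: items.len(),
            element: items.iter().map(infer).reduce(merge).map(Box::new),
        },
    }
}
```

Applied to the `grades` array from input.json, this sketch yields an array of length 3 whose element is an integer in the range 78-90, matching the shape of the output above.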

Running drivel in 'produce' mode:

cat input.json | drivel produce -n 3

Output:

[
  {
    "address": {
      "city": "o oowrYN",
      "zip_code": "01110"
    },
    "age": 30,
    "grades": [83, 88, 88],
    "is_student": true,
    "name": "nJ heo D",
    "id": "9e0a7687-800d-404b-835f-e7d803b60380"
  },
  {
    "address": {
      "city": "oro wwNN",
      "zip_code": "11000"
    },
    "age": 30,
    "grades": [83, 88, 89],
    "is_student": false,
    "name": "oeoooeeh",
    "id": "c6884c6b-4f6a-4788-a048-e749ec30793d"
  },
  {
    "address": {
      "city": "orww ok ",
      "zip_code": "00010"
    },
    "age": 30,
    "grades": [85, 90, 86],
    "is_student": false,
    "name": "ehnDoJDo",
    "id": "71884608-2760-4853-8c12-e11149c642cd"
  }
]
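
The generated strings above appear to reuse the characters seen in the input ("o oowrYN" contains only characters from "New York"). A minimal sketch of that idea, assuming characters are drawn with replacement from the observed sample; `XorShift64` and `nonsense_string` are hypothetical names, the tiny generator exists only to keep the example free of external crates, and none of this is taken from drivel's source:

```rust
/// Minimal xorshift64 generator so the sketch needs no external crates.
/// Not suitable for anything beyond illustration.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must never be zero, so substitute a constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Builds a nonsense string of `len` characters drawn (with replacement)
/// from the characters seen in `sample`. Returns an empty string when the
/// sample itself is empty.
fn nonsense_string(sample: &str, len: usize, rng: &mut XorShift64) -> String {
    let pool: Vec<char> = sample.chars().collect();
    if pool.is_empty() {
        return String::new();
    }
    (0..len)
        .map(|_| pool[(rng.next_u64() % pool.len() as u64) as usize])
        .collect()
}
```

Calling `nonsense_string("New York", 8, &mut rng)` returns an eight-character string made up only of characters from "New York"; the exact result depends on the seed.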

Contributing

We welcome contributions from anyone interested in improving or extending drivel! Whether you have ideas for new features, bug fixes, or improvements to the documentation, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

drivel's Issues

Make configurable when to generate random strings vs choose from a sample

For some data sets, it might not be useful to generate random strings based on the sample of characters observed. Instead, it might be better to treat strings with multiple values as enums, where one of the enum variants is picked at random whenever a value is produced.

Support multi-object files

It currently panics when you give it a file with one object per line, like

{"user_id": null, "user_email": null, "user_name": null, "email_domain": null, "org": null, "method": "POST", "path": "/logged_in", "status_code": 200, "latency": 0.09175515174865723, "timestamp": "2023-09-05T21:38:34+0000", "response_len": 450, "properties": {}}
{"user_id": 2, "user_email": "[email protected]", "user_name": "Tim", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/web", "status_code": 200, "latency": 0.0032494068145751953, "timestamp": "2023-09-05T21:38:35+0000", "response_len": 672, "properties": {}}
{"user_id": 2, "user_email": "[email protected]", "user_name": "Tim", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/file", "status_code": 200, "latency": 0.3335745334625244, "timestamp": "2023-09-05T21:38:35+0000", "response_len": 85195, "properties": {}}
{"user_id": null, "user_email": null, "user_name": null, "email_domain": null, "org": null, "method": "POST", "path": "/logged_in", "status_code": 200, "latency": 0.04735732078552246, "timestamp": "2023-09-05T22:13:40+0000", "response_len": 459, "properties": {}}
{"user_id": 3, "user_email": "[email protected]", "user_name": "Justin", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/web", "status_code": 200, "latency": 0.002116680145263672, "timestamp": "2023-09-05T22:13:42+0000", "response_len": 672, "properties": {}}

An easy way to handle this might be to ignore everything after the first complete object, e.g. this works fine:

❯ cat usage_log.json | head -n1 | drivel describe

But realistically, I'd want it treated as if the top level is an implicit array.
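
A rough sketch of the implicit-array behaviour, assuming a fallback strategy: try the whole input as one document first, and only if that fails treat each non-blank line as its own document. `parse_documents` is a hypothetical helper, and the `parse` argument stands in for a real JSON parser such as `serde_json::from_str`:

```rust
/// Parses `input` as a single document if possible; otherwise treats every
/// non-blank line as its own document, as if the lines were elements of a
/// top-level array. `parse` stands in for a real JSON parser.
fn parse_documents<T, E>(
    input: &str,
    parse: impl Fn(&str) -> Result<T, E>,
) -> Result<Vec<T>, E> {
    // A regular JSON file parses in one go.
    if let Ok(single) = parse(input) {
        return Ok(vec![single]);
    }
    // Otherwise fall back to one document per line; the first line that
    // fails to parse aborts the whole operation with its error.
    input
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| parse(line))
        .collect()
}
```

An alternative that avoids the fallback would be a streaming parser that reads consecutive values from the same input, which `serde_json` offers through `Deserializer::into_iter`.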

Use clap for argument parsing

To enable expansion of functionality in future, we should be using clap for argument parsing rather than the rudimentary approach currently in use.

Audit dependency features and remove unnecessary deps

Make sure we select the smallest possible feature set for each of the dependencies we bring in.

Add support for user-defined schemas

Description generated by Claude 3

Currently, drivel infers the schema from the provided JSON input and generates data based on that inferred schema. While this is useful, there are cases where users may want to define their own schema and have drivel generate data based on that schema.

We should add a new feature that allows users to provide a schema file (in a format to be determined, possibly JSON Schema or a custom format) that describes the desired data structure. drivel should then generate data that conforms to this user-defined schema.

This feature would be beneficial in scenarios where:

The user wants to generate data with a specific structure that may not be easily inferred from a single JSON example.
The user wants to generate data with certain constraints or patterns that are not present in the example JSON.
The user wants to generate data for a schema that they have designed beforehand, without needing to provide a JSON example.

To implement this feature, we would need to:

Design a schema format (or adopt an existing one) that allows users to define the desired data structure, including field names, types, constraints, and relationships between fields.
Add a new command-line flag (e.g., --schema-file) that allows users to specify the path to their schema file.
Modify the produce mode to check for the presence of a user-defined schema file. If present, use that schema for data generation instead of inferring the schema from the input JSON.
Update the documentation to describe how to use this new feature, including examples of the schema format and how to generate data from a user-defined schema.

This feature would greatly enhance the flexibility and usefulness of drivel, allowing users to generate data for a wider variety of use cases and scenarios.

Graceful error handling on parsing failure

We shouldn't panic when the input from stdin isn't valid JSON and can't be parsed. Instead, we should surface a friendlier error to the user.
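
A minimal sketch of the control flow this implies: the parsing step returns a `Result`, and the caller reports failures on stderr with a non-zero exit status instead of unwrapping. `InputError`, `parse_input` and `run` are hypothetical names, and `parse_input` is only a stand-in check rather than a real JSON parser:

```rust
use std::fmt;
use std::process::ExitCode;

/// Hypothetical error type for input that cannot be parsed.
#[derive(Debug, PartialEq)]
struct InputError {
    reason: String,
}

impl fmt::Display for InputError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "could not parse input as JSON: {}", self.reason)
    }
}

impl std::error::Error for InputError {}

/// Stand-in for the real parsing step. It only checks that the input looks
/// like a JSON object or array, which is enough to show the control flow.
fn parse_input(input: &str) -> Result<&str, InputError> {
    let trimmed = input.trim();
    let looks_like_json = (trimmed.starts_with('{') && trimmed.ends_with('}'))
        || (trimmed.starts_with('[') && trimmed.ends_with(']'));
    if looks_like_json {
        Ok(trimmed)
    } else {
        Err(InputError {
            reason: "expected an object or array at the top level".to_string(),
        })
    }
}

/// Intended to be returned from `main`: reports failures on stderr and
/// signals them through the exit status instead of panicking.
fn run(input: &str) -> ExitCode {
    match parse_input(input) {
        Ok(document) => {
            println!("parsed {} bytes", document.len());
            ExitCode::SUCCESS
        }
        Err(err) => {
            eprintln!("error: {}", err);
            ExitCode::FAILURE
        }
    }
}
```

Keeping the message on stderr matters here because stdout carries the tool's actual output in a pipeline.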

Use multiple CPU cores for schema inference and data production

When working on seriously large JSON files (100s of MBs to GBs), schema inference starts to become a little bit slow. The same is true for producing a large number of values.

We can leverage a crate like rayon to utilise multiple CPU cores for the inference and production logic wherever the work can be parallelised. For inference, for instance, we can infer the type of each array element in parallel and then fold the merge function over the results. For production, we can produce all values in the root array in parallel.
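
To illustrate the infer-then-merge idea without adding a dependency, the sketch below uses scoped threads from the standard library rather than rayon. `IntRange`, `infer`, `merge` and `infer_parallel` are hypothetical names, and the schema is reduced to an integer range to keep the example short:

```rust
use std::thread;

/// Simplified inferred type for an integer field: the observed range.
#[derive(Debug, Clone, Copy, PartialEq)]
struct IntRange {
    lo: i64,
    hi: i64,
}

/// Merging two ranges is associative, which is what makes it safe to fold
/// chunks independently and combine the partial results afterwards.
fn merge(a: IntRange, b: IntRange) -> IntRange {
    IntRange { lo: a.lo.min(b.lo), hi: a.hi.max(b.hi) }
}

/// Infers the (degenerate) range of a single value.
fn infer(value: i64) -> IntRange {
    IntRange { lo: value, hi: value }
}

/// Infers the range of `values` using up to `workers` threads.
/// Returns `None` for an empty slice.
fn infer_parallel(values: &[i64], workers: usize) -> Option<IntRange> {
    if values.is_empty() {
        return None;
    }
    // Ceiling division so that at most `workers` chunks are created.
    let workers = workers.max(1);
    let chunk_size = (values.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn every worker first so the chunks are processed concurrently.
        let handles: Vec<_> = values
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().copied().map(infer).reduce(merge)))
            .collect();

        // Then join them and fold the per-chunk results together.
        handles
            .into_iter()
            .filter_map(|handle| handle.join().expect("worker thread panicked"))
            .reduce(merge)
    })
}
```

With rayon, the same shape would likely collapse to a parallel iterator followed by `map(infer)` and `reduce_with(merge)`, with the library handling chunking and scheduling.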

Automatically build and release binaries

Use a tool like https://opensource.axo.dev/cargo-dist/book/introduction.html to automatically build common binaries and make them available as a release whenever a tag is pushed.