drivel

drivel is a command-line tool written in Rust for inferring a schema from an example JSON (or JSON lines) file, and generating synthetic data (the drivel in question) based on the inferred schema.

Features

  • Schema Inference: drivel can analyze JSON input and infer its schema, including data types, array lengths, and object structures.
  • Data Generation: Based on the inferred schema, drivel can generate synthetic data that adheres to the inferred structure.
  • Easy to integrate: drivel reads JSON input from stdin and writes its output to stdout, allowing for easy integration into pipelines and workflows.

Installation

Binaries and a shell-based installer are available for each release.

To install the drivel executable through Cargo, ensure you have the Rust toolchain installed and run:

cargo install drivel

To add drivel as a dependency to your project, e.g., to use the schema inference engine, run:

cargo add drivel

Usage

Infer a schema from JSON input, and generate synthetic data based on the inferred schema.

Usage: drivel [OPTIONS] <COMMAND>

Commands:
  describe  Describe the inferred schema for the input data
  produce   Produce synthetic data adhering to the inferred schema
  help      Print this message or the help of the given subcommand(s)

Options:
      --infer-enum                     Infer that some string fields are enums based on the number of unique values seen
      --enum-max-uniq <ENUM_MAX_UNIQ>  The maximum ratio of unique values to total values for a field to be considered an enum. Default = 0.1
      --enum-min-n <ENUM_MIN_N>        The minimum number of strings to consider when inferring enums. Default = 1
  -h, --help                           Print help
  -V, --version                        Print version
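
The enum options above describe a ratio test over the string values seen for a field. A minimal sketch of how such a check could work is shown below. The function name `looks_like_enum` is hypothetical and not part of drivel's API, and whether drivel treats the threshold as inclusive is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not drivel's API): decide whether a string field
/// should be treated as an enum, given every value observed for it.
fn looks_like_enum(values: &[&str], max_uniq_ratio: f64, min_n: usize) -> bool {
    // Too few samples (or none at all): do not guess.
    if values.is_empty() || values.len() < min_n {
        return false;
    }
    let unique: HashSet<&str> = values.iter().copied().collect();
    let ratio = unique.len() as f64 / values.len() as f64;
    // Assumption: the threshold is inclusive.
    ratio <= max_uniq_ratio
}

fn main() {
    // 30 samples with 2 distinct values: ratio = 2 / 30, below 0.1.
    let mut methods = vec!["GET"; 25];
    methods.extend(vec!["POST"; 5]);
    println!("methods: {}", looks_like_enum(&methods, 0.1, 1));

    // 3 samples with 3 distinct values: ratio = 1.0, above 0.1.
    let names = ["Tim", "Justin", "Ann"];
    println!("names: {}", looks_like_enum(&names, 0.1, 1));
}
```

With the default ratio of 0.1, a field needs at least ten times as many samples as distinct values before it would qualify under this sketch.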

Examples

Consider a JSON file input.json:

{
  "name": "John Doe",
  "id": "0e3a99a5-0201-4444-9ab1-8343fac56233",
  "age": 30,
  "is_student": false,
  "grades": [85, 90, 78],
  "address": {
    "city": "New York",
    "zip_code": "10001"
  }
}

Running drivel in 'describe' mode:

cat input.json | drivel describe

Output:

{
  "age": int (30),
  "address": {
    "city": string (8),
    "zip_code": string (5)
  },
  "is_student": boolean,
  "grades": [
    int (78-90)
  ] (3),
  "name": string (8),
  "id": string (uuid)
}
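
The output above summarises each field by type plus a range or length, such as `int (78-90)` for the elements of `grades`. One way such ranges could be accumulated is by merging per-value observations, sketched below. The `Inferred` enum, `merge`, and `infer_int_array` are hypothetical and simplified; drivel's real types may differ.

```rust
/// Hypothetical, simplified schema type; drivel's real types may differ.
#[derive(Debug, Clone, PartialEq)]
enum Inferred {
    Int { min: i64, max: i64 },
    Str { min_len: usize, max_len: usize },
    Bool,
}

/// Merge two observations of the same field into one that covers both.
/// Returns None for mismatched types, which this sketch does not model.
fn merge(a: Inferred, b: Inferred) -> Option<Inferred> {
    use Inferred::*;
    match (a, b) {
        (Int { min: a0, max: a1 }, Int { min: b0, max: b1 }) => Some(Int {
            min: a0.min(b0),
            max: a1.max(b1),
        }),
        (Str { min_len: a0, max_len: a1 }, Str { min_len: b0, max_len: b1 }) => Some(Str {
            min_len: a0.min(b0),
            max_len: a1.max(b1),
        }),
        (Bool, Bool) => Some(Bool),
        _ => None,
    }
}

/// Infer the element schema of an integer array by merging one
/// single-value observation per element.
fn infer_int_array(items: &[i64]) -> Option<Inferred> {
    let mut acc: Option<Inferred> = None;
    for &n in items {
        let next = Inferred::Int { min: n, max: n };
        acc = match acc {
            None => Some(next),
            Some(prev) => Some(merge(prev, next)?),
        };
    }
    acc
}

fn main() {
    // The "grades" array from input.json.
    println!("{:?}", infer_int_array(&[85, 90, 78]));
    // Two string observations of different lengths widen the length range.
    let a = Inferred::Str { min_len: 8, max_len: 8 };
    let b = Inferred::Str { min_len: 3, max_len: 3 };
    println!("{:?}", merge(a, b));
    println!("{:?}", merge(Inferred::Bool, Inferred::Bool));
}
```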

Running drivel in 'produce' mode:

cat input.json | drivel produce -n 3

Output:

[
  {
    "address": {
      "city": "o oowrYN",
      "zip_code": "01110"
    },
    "age": 30,
    "grades": [83, 88, 88],
    "is_student": true,
    "name": "nJ heo D",
    "id": "9e0a7687-800d-404b-835f-e7d803b60380"
  },
  {
    "address": {
      "city": "oro wwNN",
      "zip_code": "11000"
    },
    "age": 30,
    "grades": [83, 88, 89],
    "is_student": false,
    "name": "oeoooeeh",
    "id": "c6884c6b-4f6a-4788-a048-e749ec30793d"
  },
  {
    "address": {
      "city": "orww ok ",
      "zip_code": "00010"
    },
    "age": 30,
    "grades": [85, 90, 86],
    "is_student": false,
    "name": "ehnDoJDo",
    "id": "71884608-2760-4853-8c12-e11149c642cd"
  }
]

Contributing

We welcome contributions from anyone interested in improving or extending drivel! Whether you have ideas for new features, bug fixes, or improvements to the documentation, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

drivel's People

Contributors

hgrsd

Forkers

turingbuilder

drivel's Issues

Support multi-object files

It currently panics when you give it a file with one object per line, like:

{"user_id": null, "user_email": null, "user_name": null, "email_domain": null, "org": null, "method": "POST", "path": "/logged_in", "status_code": 200, "latency": 0.09175515174865723, "timestamp": "2023-09-05T21:38:34+0000", "response_len": 450, "properties": {}}
{"user_id": 2, "user_email": "[email protected]", "user_name": "Tim", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/web", "status_code": 200, "latency": 0.0032494068145751953, "timestamp": "2023-09-05T21:38:35+0000", "response_len": 672, "properties": {}}
{"user_id": 2, "user_email": "[email protected]", "user_name": "Tim", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/file", "status_code": 200, "latency": 0.3335745334625244, "timestamp": "2023-09-05T21:38:35+0000", "response_len": 85195, "properties": {}}
{"user_id": null, "user_email": null, "user_name": null, "email_domain": null, "org": null, "method": "POST", "path": "/logged_in", "status_code": 200, "latency": 0.04735732078552246, "timestamp": "2023-09-05T22:13:40+0000", "response_len": 459, "properties": {}}
{"user_id": 3, "user_email": "[email protected]", "user_name": "Justin", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/web", "status_code": 200, "latency": 0.002116680145263672, "timestamp": "2023-09-05T22:13:42+0000", "response_len": 672, "properties": {}}

An easy way to handle this might be to ignore everything after the first complete object, e.g. this works fine:

❯ cat usage_log.json | head -n1 | drivel describe

But realistically, I'd want it treated as if the top level is an implicit array.
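
The implicit-array idea can be sketched as a pre-processing step that wraps the lines in a top-level array before parsing. This is only an illustration: `json_lines_to_array` is a hypothetical name, and it assumes every non-blank line is one complete JSON value, so it would not be safe for a pretty-printed single document.

```rust
/// Hypothetical pre-processing step: wrap JSON lines in a top-level array.
/// Assumes each non-blank line holds exactly one complete JSON value.
fn json_lines_to_array(input: &str) -> String {
    let records: Vec<&str> = input
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect();
    format!("[{}]", records.join(","))
}

fn main() {
    let input = "{\"status_code\": 200}\n{\"status_code\": 404}\n";
    // The result is a single JSON array containing both records.
    println!("{}", json_lines_to_array(input));
}
```

A real implementation would more likely parse values one at a time from the stream than rebuild the text, but the resulting shape (one array of records) is the same.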

Use clap for argument parsing

To enable expansion of functionality in future, we should be using clap for argument parsing rather than the rudimentary approach currently in use.

Add support for user-defined schemas

Description generated by Claude 3

Currently, drivel infers the schema from the provided JSON input and generates data based on that inferred schema. While this is useful, there are cases where users may want to define their own schema and have drivel generate data based on that schema.

We should add a new feature that allows users to provide a schema file (in a format to be determined, possibly JSON Schema or a custom format) that describes the desired data structure. drivel should then generate data that conforms to this user-defined schema.

This feature would be beneficial in scenarios where:

  • The user wants to generate data with a specific structure that may not be easily inferred from a single JSON example.
  • The user wants to generate data with certain constraints or patterns that are not present in the example JSON.
  • The user wants to generate data for a schema that they have designed beforehand, without needing to provide a JSON example.

To implement this feature, we would need to:

  • Design a schema format (or adopt an existing one) that allows users to define the desired data structure, including field names, types, constraints, and relationships between fields.
  • Add a new command-line flag (e.g., --schema-file) that allows users to specify the path to their schema file.
  • Modify the produce mode to check for the presence of a user-defined schema file. If present, use that schema for data generation instead of inferring the schema from the input JSON.
  • Update the documentation to describe how to use this new feature, including examples of the schema format and how to generate data from a user-defined schema.

This feature would greatly enhance the flexibility and usefulness of drivel, allowing users to generate data for a wider variety of use cases and scenarios.
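
As a rough sketch of the third step in the list above, `produce` could first decide where its schema comes from. Everything here is hypothetical: the `--schema-file` flag is only the example name proposed in this issue, `SchemaSource` and `schema_source` are illustrative, and the hand-rolled argument scan stands in for whatever argument parser the tool actually uses.

```rust
use std::path::PathBuf;

/// Hypothetical: where `produce` would take its schema from.
#[derive(Debug, PartialEq)]
enum SchemaSource {
    /// A user-defined schema file (format still to be decided).
    File(PathBuf),
    /// Current behaviour: infer the schema from JSON on stdin.
    InferFromStdin,
}

/// Scan the arguments for the proposed `--schema-file <PATH>` flag.
fn schema_source(args: &[String]) -> SchemaSource {
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        if arg.as_str() == "--schema-file" {
            if let Some(path) = iter.next() {
                return SchemaSource::File(PathBuf::from(path));
            }
        }
    }
    // No flag (or no path after it): fall back to inference.
    SchemaSource::InferFromStdin
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    println!("{:?}", schema_source(&args));
}
```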

Use multiple CPU cores for schema inference and data production

When working on seriously large JSON files (100s of MBs to GBs), schema inference starts to become a little bit slow. The same is true for producing a large number of values.

We can leverage a crate like rayon to utilise multiple CPU cores for the inference and production logic, where things can be parallelised. For inferring, for instance, we can infer the type of each array element in parallel and then fold the merge function over the results. For producing, we can generate all values in the root array in parallel.
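
The infer-then-merge approach parallelises well when the merge is associative, because chunks can be reduced independently and their partial results merged afterwards. The sketch below shows that shape using `std::thread::scope` from the standard library instead of rayon, purely so that it has no dependencies. `IntRange`, `merge`, `infer_chunk`, and `infer_parallel` are hypothetical stand-ins, not drivel's code.

```rust
use std::thread;

/// Hypothetical stand-in for an inferred schema: the range of integers seen.
#[derive(Debug, Clone, Copy, PartialEq)]
struct IntRange {
    lo: i64,
    hi: i64,
}

/// Associative merge: partial results can be combined in any grouping.
fn merge(a: IntRange, b: IntRange) -> IntRange {
    IntRange { lo: a.lo.min(b.lo), hi: a.hi.max(b.hi) }
}

/// Sequential inference over one chunk of the array.
fn infer_chunk(chunk: &[i64]) -> Option<IntRange> {
    chunk.iter().map(|&n| IntRange { lo: n, hi: n }).reduce(merge)
}

/// Split the input into chunks, infer each chunk on its own thread,
/// then merge the partial results.
fn infer_parallel(items: &[i64], workers: usize) -> Option<IntRange> {
    if items.is_empty() {
        return None;
    }
    let workers = workers.max(1);
    // Ceiling division, so at most `workers` chunks are created.
    let chunk_size = (items.len() + workers - 1) / workers;
    thread::scope(|s| {
        // Spawn every worker before joining any, so they run concurrently.
        let handles: Vec<_> = items
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || infer_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .filter_map(|h| h.join().expect("worker thread panicked"))
            .reduce(merge)
    })
}

fn main() {
    let data: Vec<i64> = (1..=1_000).collect();
    println!("{:?}", infer_parallel(&data, 4));
}
```

Production is simpler to split up, since each generated value is independent of the others and no merge step is needed.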
