finos / datahelix

The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation.

Home Page: https://finos.github.io/datahelix/

License: Apache License 2.0

Java 81.66% Gherkin 17.91% Dockerfile 0.03% Shell 0.13% PowerShell 0.10% Roff 0.17%
data-engineering data-generation data-generator java test-data-generator

datahelix's Introduction

FINOS - Archived

This project is archived, which means that it is in a read-only state. You can download and use this code; the project is entirely functional, and you are welcome to use it, but please be aware of the risk of bugs and security vulnerabilities. If you're interested in restoring development activities on this project, please email [email protected]

DataHelix Generator


The generation of representative test and simulation data is a challenging and time-consuming task. Although DataHelix was created to address a specific challenge in the financial services industry, you will find it a useful tool for the generation of realistic data for simulation and testing, regardless of industry sector. All this from a straightforward JSON data profile document.

DataHelix is a proud member of the Fintech Open Source Foundation and operates within the FINOS Data Technologies Program.

Key documents

  • For information on how to get started with DataHelix see our Getting Started guide.

  • For information on the syntax of DataHelix profiles see the User Guide.

  • For information on how to contribute to the project, and more technical information about DataHelix, see the Developer Guide.

  • For a high level road map see Road Map.

The Problem

When performing a wide range of software development tasks - functional or load testing of a system, prototyping an API, or pitching a service to a potential customer - sample data is a necessity, but generating and maintaining it can be difficult. The nature of some industries makes it particularly difficult to cleanly manage data schemas and sample datasets:

  • Regulatory and methodological change often forces data schema changes.
  • It is often difficult to completely remove legacy data due to obligations to maintain deprecated products. Because of this, schemas tend to become progressively more complicated, with special cases and exceptions.
  • Errors can be costly and reputation-damaging.
  • For legal and/or privacy reasons, it is normally impossible to include real data in samples.

For all the above reasons, it is common to handcraft sample datasets. This approach brings several problems:

  • It costs significant time up-front, and thereafter every time the schema changes.
  • It's very easy to introduce errors.
  • The sample data is unlikely to exhaustively cover all test cases.
  • The sample data is not self-documenting, and documentation is likely to become out of date.

For data generation, partial solutions are available in services/libraries such as TSimulus, Mockaroo or GenRocket. However, these have limitations:

  • They are limited to relatively simple data schemas, with limited support for dependencies between fields.
  • None of them offer a complete end-to-end solution of profiling existing data to discover trends and constraints, generating from those constraints, and validating against them.
  • Complex behaviour (if available) is modelled in an imperative style, forcing the user to design the process for generating the data using the library's toolbox, rather than a declarative style that describes the shape of the data and leaves it to the library to determine how to create it.

The Mission

We aim to solve (at least) the following user needs:

  • "I want to generate test cases for my validation procedures."
  • "I want to generate sample data to document my API, or to use in a non-production version of my API."
  • "I want to validate some data against a known specification or implementation."
  • "I want to measure my existing test data's coverage against the range of possible data."
  • "I want to generate an exhaustive set of data, for testing my API's robustness."

The Product

A suite of tools:

  • To generate data based on a declarative profile, either from the command line or through a RESTful API which can be called manually or through a web front end.
  • To create a data profile from a dataset, including identifying constraints and relationships between the dataset's fields, so that similarly-shaped mock data can be generated using the profile.
  • To validate a dataset against a data profile.

Contributing

  1. Fork it (https://github.com/finos/datahelix/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Read our contribution guidelines and Community Code of Conduct
  4. Commit your changes (git commit -am 'Add some fooBar')
  5. Push to the branch (git push origin feature/fooBar)
  6. Create a new Pull Request

NOTE: Commits and pull requests to FINOS repositories will only be accepted from those contributors with an active, executed Individual Contributor License Agreement (ICLA) with FINOS OR who are covered under an existing and active Corporate Contribution License Agreement (CCLA) executed with FINOS. Commits from individuals not covered under an ICLA or CCLA will be flagged and blocked by the FINOS Clabot tool. Please note that some CCLAs require individuals/employees to be explicitly named on the CCLA.

Need an ICLA? Unsure if you are covered under an existing CCLA? Email [email protected]

License

Copyright 2019 Scott Logic Ltd.

Distributed under the Apache License, Version 2.0.

SPDX-License-Identifier: Apache-2.0.

datahelix's People

Contributors

afroggattsl, amehta90, cakehurstryan, cdowding-sl, d-withers, dkdiep, elliehield, fbromley, harrybedfordsl, hashbyhayter, hchapman-sl, jharrissl, joelmatth, khanp-sl, leeyuiwah-sl, mattcline-sl, mrspaceman, ms14981, pdaulbyscottlogic, r-stuart, rstuart-scottlogic, scottlogic-alex, sl-mark, sl-slaing, steve-tennantsl, tom-hayden, tomgilbert84, tomhall321, tyankovasc, willsalt-sl


datahelix's Issues

Introduce dependency injection

My test setup looks like this:

private final FieldSpecMerger fieldSpecMerger = new FieldSpecMerger(
        new SetRestrictionsMerger(),
        new NumericRestrictionsMerger(),
        new StringRestrictionsMerger(),
        new NullRestrictionsMerger(),
        new TypeRestrictionsMerger(),
        new DateTimeRestrictionsMerger()
);

private final ConstraintReducer constraintReducer = new ConstraintReducer(
        new ConstraintFieldSniffer(),
        new FieldSpecFactory(
                new AutomatonFactory()
        ),
        fieldSpecMerger
);

private final DecisionTreeWalker dTreeWalker = new DecisionTreeWalker(
        constraintReducer,
        new RowSpecMerger(
                fieldSpecMerger
        )
);

It'd be nicer if I could just do:

private final DecisionTreeWalker dTreeWalker = injector.getInstance(DecisionTreeWalker.class);

This would be possible if we had a superclass for tests. Bonus points for doing it without inheritance (there's probably a clever way in JUnit5).

We already use a Guice injector in the Scala profiler project. I even integrated the injector into that project's tests, but we may want to do so in a Slightly Less Magic way than that.
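
One inheritance-free option in JUnit 5 is a ParameterResolver extension backed by Guice. The sketch below is illustrative only; GeneratorModule is a hypothetical module name standing in for wherever the generator's bindings would actually be configured.

import com.google.inject.Guice;
import com.google.inject.Injector;
import org.junit.jupiter.api.extension.ExtensionContext;
import org.junit.jupiter.api.extension.ParameterContext;
import org.junit.jupiter.api.extension.ParameterResolver;

public class GuiceParameterResolver implements ParameterResolver {
    // Hypothetical module; the real bindings would live wherever Guice is configured.
    private final Injector injector = Guice.createInjector(new GeneratorModule());

    @Override
    public boolean supportsParameter(ParameterContext parameterContext, ExtensionContext extensionContext) {
        // Assume Guice can construct any test parameter type that is requested.
        return true;
    }

    @Override
    public Object resolveParameter(ParameterContext parameterContext, ExtensionContext extensionContext) {
        return injector.getInstance(parameterContext.getParameter().getType());
    }
}

A test class would then opt in with @ExtendWith(GuiceParameterResolver.class) and declare, say, a DecisionTreeWalker parameter on its test methods instead of building the object graph by hand.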

Move user-facing documentation from wiki to source control

Feature Request

Description of Problem:

Our wiki contains a combination of user- and developer-facing content. Also, when we eg add new constraints we need to remember to update the wiki after a PR is merged.

Potential Solutions:

Put the user-facing stuff (eg, profile schema, constraint grammar, list of constraints) into .md files in the repo.

It'll likely be too much for one file, so we should use links between our files:

https://blog.github.com/2013-01-31-relative-links-in-markup-files/

Most of the important documentation right now concerns the profile, but when we document things like command line parameters we'll want to link to .md files in the module directories (ie: the generator and profiler dirs).

JSON Schema

Feature Request

Description of Problem:

Generated data and visualisation graphs do not show the expected results when there are errors in the profile used.
To make sure that the profile fits expectations, a schema file can be used to validate that the JSON has the correct hierarchy, ordering and values.

This will be used to validate the profile used via the command line or in a GUI editor/front end.

Potential Solutions:

  • manually write a JSON schema file, or infer a schema from a full-featured profile and edit it to fix specific cases.

  • refactor GenerationConfigValidator and VisualisationConfigValidator to subclass a CommandLineValidator, so that we can validate the profile from the common ancestor class.

  • use one of the following libraries for validation, to allow us to use draft-7 of the JSON schema:
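
For illustration only (the issue's library list was not captured here), a minimal sketch using one draft-7-capable option, the everit-org json-schema library; the Schema/SchemaLoader calls are that library's, but the ProfileSchemaValidator wrapper is a hypothetical name.

import org.everit.json.schema.Schema;
import org.everit.json.schema.ValidationException;
import org.everit.json.schema.loader.SchemaLoader;
import org.json.JSONObject;
import org.json.JSONTokener;

import java.io.InputStream;

public class ProfileSchemaValidator {
    // Validates a profile document against the JSON schema; throws on failure.
    public void validate(InputStream schemaStream, InputStream profileStream) {
        JSONObject rawSchema = new JSONObject(new JSONTokener(schemaStream));
        JSONObject profile = new JSONObject(new JSONTokener(profileStream));
        Schema schema = SchemaLoader.load(rawSchema);
        try {
            schema.validate(profile);
        } catch (ValidationException e) {
            // Report every violation, not just the first, before re-throwing.
            e.getAllMessages().forEach(System.err::println);
            throw e;
        }
    }
}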

Acceptance Criteria:

  • Re-version Schemas package to v0.1 - #693
  • Update example files to include schema directive and new schema version - #693
  • Check if schema store works with VS Code - #694
  • PoC multiple schema versions in a single schema file - #696
  • JSON schema file created (for v0.1 of the DataHelix profile) and committed to the repository. - #649
  • Documentation updated to describe how the schema could be used by a third party. - #697
  • Code changes made to validate the input profile against the schema. - #698
  • Documentation for the validator updated to show the use of the schema. - #698
  • Tests created to prove that the schema is valid, details of these tests should be contained in the documentation. - #699
  • Determine scenarios for testing the validation of a profile against a schema - #700
  • Implement/change framework to support testing validation of a profile against a schema - #700

Initiate generation from the Web App

In the Web App, I want to upload a profile file from my computer and be presented with some data (see #36 for current expected output method).

For this story, the generated file should match the columns in the profile, but all data should be nulled. Further functionality will be added in constraint-specific stories.

Improve "interesting" values generated for length constrained strings

Feature Request

Description of Problem:

Currently, when generating "interesting" values for strings that only have a min/max length, the generator produces a string for each possible length.

For example, for a string constrained to be shorter than 100 characters, it will generate 100 values.

Potential Solutions:

As with other data types, producing boundary values could be more useful. In the above example, it would generate strings of 0, 1 and 99 characters in length.
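
A minimal sketch of the idea (not the project's actual value-source API): enumerate only boundary lengths rather than every permitted length.

import java.util.LinkedHashSet;
import java.util.Set;

public class StringLengthBoundaries {
    // For minLength=0, maxLength=99 this yields {0, 1, 98, 99} instead of 100 lengths.
    public static Set<Integer> boundaryLengths(int minLength, int maxLength) {
        Set<Integer> lengths = new LinkedHashSet<>();
        lengths.add(minLength);
        if (minLength + 1 <= maxLength) {
            lengths.add(minLength + 1);
        }
        if (maxLength - 1 >= minLength) {
            lengths.add(maxLength - 1);
        }
        lengths.add(maxLength);
        return lengths;
    }
}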

Initiate profiling from the Web App

created = 12/07

  • Make frontend interact with Profiler API

We already have a dialog where the user can enter a local file path to some data. We need to make it so that when the submit action is dispatched, we talk to the Profiler, display a loading spinner(?) and then, on success or error, display appropriate updates to the user.

  • Make an HTTP API around the profiler

The frontend should be able to contact the profiler to request that it starts profiling and (eventually) returns a profile.

Various possible approaches:

  • Make a REST API. Accepts a POST and returns a profile as application/json.
  • Make a REST API. Accepts a POST and returns some 'session' information. Allows subsequent polling to check the status of the request, or additional POSTs to provide more information.
  • Make a WebSockets API. Maintain an open connection throughout profiling, by which progress messages/requests for interaction can be shared. (What if the connection drops? Does the user lose the entire profiling session?)

Our original design assumed users would trigger profiling through a CLI, but we've gradually come to the belief that profiling will be an interactive, iterative experience, and that the web app will be the best way for the profiling service to communicate with the user.

Acceptance criteria:

  • In the Web App there should be a way to trigger any type of profiling we currently support (at time of writing, just from a flat file on the filesystem).
  • The Web App should indicate that background processing is occurring.
  • On completion, the Web App should load the created profile into the profile editor.
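
As a rough illustration of the first approach above (accept a POST, respond with application/json), a minimal sketch using the JDK's built-in HttpServer; the /profile path and the hard-coded response body stand in for real profiler integration, which is not shown.

import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class ProfilerApiSketch {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/profile", exchange -> {
            if (!"POST".equals(exchange.getRequestMethod())) {
                exchange.sendResponseHeaders(405, -1); // method not allowed, no body
                return;
            }
            // Hypothetical: run profiling on the uploaded data, then serialise the resulting profile.
            byte[] body = "{\"fields\":[]}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}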

Derive numeric/datetime formats during profiling

A naive solution would flexibly parse the input data and then forget the original format, but it undermines the faithfulness of the data if, eg, input dates are expressed as 23/11/2013 but our sample data has 2013-11-23.

Dates are relatively simple to solve since it's usually easy to unambiguously deduce the format from a string. Numbers might be more awkward since you might need to examine multiple cases to derive the full formatting rules (eg, if there are non-fixed-length fractional components). Some possibilities would be especially painful (eg, if input has fixed number of significant figures).
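
A minimal sketch of the date case: try a set of candidate patterns and keep the first one that parses. The candidate list here is illustrative, not the profiler's actual one.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

public class DateFormatInference {
    private static final List<String> CANDIDATES =
            List.of("dd/MM/yyyy", "MM/dd/yyyy", "yyyy-MM-dd", "dd-MMM-yyyy");

    // "23/11/2013" -> Optional.of("dd/MM/yyyy"); ambiguous inputs return the first matching pattern.
    public static Optional<String> inferDateFormat(String sample) {
        for (String pattern : CANDIDATES) {
            try {
                LocalDate.parse(sample, DateTimeFormatter.ofPattern(pattern));
                return Optional.of(pattern);
            } catch (DateTimeParseException ignored) {
                // fall through to the next candidate
            }
        }
        return Optional.empty();
    }
}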

What if formats vary? Should we output in a comparable distribution, or just choose the most populous/recent?

Generate temporal data

Feature Request

Description of Problem:

Temporal constraints are recognised by the system and used to build temporal restrictions, but these are ignored during generation.

Potential Solutions:

Create a new IFieldValueGenerator. Pick a default formatting (at the databag generation phase, not the output phase) if not provided.
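
A minimal sketch of producing datetimes between temporal restrictions; this is not the IFieldValueGenerator interface itself, just the value-production idea.

import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Random;
import java.util.stream.Stream;

public class TemporalValueSketch {
    // Produces an unbounded stream of datetimes uniformly distributed between min and max.
    public static Stream<LocalDateTime> randomDateTimes(LocalDateTime min, LocalDateTime max, Random random) {
        long minEpoch = min.toEpochSecond(ZoneOffset.UTC);
        long maxEpoch = max.toEpochSecond(ZoneOffset.UTC);
        return Stream.generate(() -> {
            long epochSecond = minEpoch + (long) (random.nextDouble() * (maxEpoch - minEpoch));
            return LocalDateTime.ofInstant(Instant.ofEpochSecond(epochSecond), ZoneOffset.UTC);
        });
    }
}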

As an Analyst I want my data profile to reflect basic independent trends in my temporal fields

Tasks

  • Compute temporal min/max
  • Compute null prevalence
  • Implement a semantic classifier to recognise dates
    -- TemporalAnalyser:
    -- Takes in DataFrame and field
    -- Extract field type
    -- Overload for numeric & string
    -- Parse to fitting DF type
    -- Return new DF
    -- (in future consider retaining type to maintain schema of original data)
  • Spike: string format validation on Spark Columns

The profile should contain:

  • name
  • timestamp type (insert, update, future)
  • lower/upper temporal bounds
  • distribution of timestamps across range (e.g. equally spaced, bunched)
  • noise in distribution
  • null prevalence

(crossed out items reflect previous requirements - no longer part of MVP)
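
A minimal sketch, outside Spark, of two of the tasks above (temporal min/max and null prevalence); the DataFrame-based TemporalAnalyser itself is not shown.

import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

public class TemporalTrendsSketch {
    public static void summarise(List<LocalDateTime> fieldValues) {
        long nullCount = fieldValues.stream().filter(Objects::isNull).count();
        Optional<LocalDateTime> min = fieldValues.stream().filter(Objects::nonNull).min(Comparator.naturalOrder());
        Optional<LocalDateTime> max = fieldValues.stream().filter(Objects::nonNull).max(Comparator.naturalOrder());
        // Lower/upper temporal bounds and null prevalence for one field.
        System.out.printf("min=%s max=%s nullPrevalence=%.2f%n",
                min.orElse(null), max.orElse(null), (double) nullCount / fieldValues.size());
    }
}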

Profile schema proposal

{
	schemaVersion: "1",
	fields:[{
		name: "quantity",
		kind: "numeric",
		nullPrevalence: 0.5,
		distribution: {
			kind: "normal",
			meanAvg: 1234,
			stdDev: 1,
			min: 1,
			max: 10000
		},
		format: {
			// consider also .NET's syntax, 0.###
			kind: "sprintf",
			template: "%d"
		}
	}, {
		name: "message",
		kind: "text",
		nullPrevalence: 0.5,
		distribution: {
			kind: "perCharacterRandom",
			// after MVP: consider introducing length distribution object
			lengthMin: 1,
			lengthMax: 2,
			alphabet: "the quickbrownfx"
		},
		// text may not need an explicit formatter
		format: {
			kind: "sprintf",
			template: "%s"
		}
	}, {
		name: "comment",
		kind: "text",
		nullPrevalence: 0.5,
		distribution: {
			// lorem ipsum not required for MVP
			kind: "loremipsum",
			// after MVP: consider introducing length distribution object
			lengthMin: 1,
			lengthMax: 2,
			// future proposal:
			length: {
				distribution: {
					kind: "normal",
					// etc
				}
			}
		},
		// text may not need a formatter
		format: {
			kind: "sprintf",
			template: "%s"
		}
	}, {
		name: "vehicleType",
		kind: "enum",
		nullPrevalence: 0.5,
		distribution: {
			kind: "set",
			members: [{
				name: "salad",
				prevalence: 0.4
			}]
		},
		// enums may not need an explicit formatter
		format: {
			kind: "sprintf",
			template: "%s"
		}
	}, {
		name: "fmt1",
		kind: "temporal",
		nullPrevalence: 0.5,
		distribution: {
			kind: "normal",
			meanAvg: 1234,
			stdDev: 1,
			min: 1,
			max: 10000
		},
		// don't need expressive temporal formatting for MVP
		format: {
			kind: "epoch"
		}
	}, {
		name: "fmt2",
		kind: "temporal"
		nullPrevalence: 0.5,
		distribution: {
			kind: "normal",
			meanAvg: 1234,
			stdDev: 1,
			min: 1,
			max: 10000
		},
		// don't need expressive temporal formatting for MVP
		format: {
			// perhaps generalize to a date formatter template string
			kind: "ISO8601"
		}
	}]
}

Add command line options for different generation modes

Feature Request

Description of Problem:

We have different ways of generating data now, corresponding to multiple ways to build GenerationConfig objects, but the CLI just allows one. If we want to generate, for instance, exhaustive data, we have to make temporary code changes.

Our first milestone doesn't strictly depend on this functionality, but it's important for testing and feature exploration.

Potential Solutions:

Add some basic command line parameters. At least the following use cases:

  • Exhaustive (generates one file of valid data)
  • Interesting (generates one file of valid data)
  • Test cases (generates multiple files, valid and invalid data)

We'll almost certainly change this in future as we expand the options, possibly even to an approach that's based on job files rather than CLI parameters.
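
A minimal sketch of the parameter handling; the --generation-mode flag and the enum values below are illustrative placeholders, not the final CLI design.

public class GenerationModeArgs {
    // Placeholder modes mirroring the use cases listed above.
    public enum GenerationMode { EXHAUSTIVE, INTERESTING, TEST_CASES }

    public static GenerationMode parse(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("--generation-mode".equals(args[i])) {
                // e.g. "test-cases" -> TEST_CASES
                return GenerationMode.valueOf(args[i + 1].toUpperCase().replace('-', '_'));
            }
        }
        return GenerationMode.EXHAUSTIVE; // hypothetical default when no flag is given
    }
}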

Generate non-integral number types

Feature Request

Description of Problem:

All generated numeric data is currently in integer format, and (I think) generation fails on range constraints with non-integral values.

Potential Solutions:

We should introduce some way to distinguish in the profile between natural and real numbers. This would likely involve creating (eg) a RealNumberFieldValueSource, which is infinite, as compared to the finite integer sources.

What then should it output when generateAllValues is called?

Also, should we make a distinction between unconstrained real numbers and, eg, the number component of £3.01? The latter is really an integral number of pennies.
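
A minimal sketch of the "integral number of pennies" idea: an otherwise infinite real range becomes enumerable once a granularity is fixed. RealNumberFieldValueSource itself is not shown; this is only the enumeration step.

import java.math.BigDecimal;
import java.util.stream.Stream;

public class RealValueEnumeration {
    // realValues(new BigDecimal("0.00"), new BigDecimal("3.01"), new BigDecimal("0.01"))
    // yields 0.00, 0.01, ..., 3.01.
    public static Stream<BigDecimal> realValues(BigDecimal min, BigDecimal max, BigDecimal step) {
        return Stream.iterate(min, value -> value.compareTo(max) <= 0, value -> value.add(step));
    }
}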

As a user, I would like the JSON profile schema to include a description property

As a user, I would like the JSON profile schema to include a description property, so that an overview of what data the profile is expected to output can be seen from within the JSON file itself and I can understand the intention of the profile without having to decipher its logic.

e.g.

{
    "schemaVersion": "v3",
    "description": "Data profile for v1 of products database",
    "fields": [...],
    "rules": [...]
}

Reduce duplication where (eg) decisions have overlapping options

Feature Request

Description of Problem:

{
	"schemaVersion": "v3",
	"fields": [
		{ "name": "title" }
	],
	"rules": [
		{ "anyOf": [
			{ "field": "title", "is": "inSet", "values": [ "mr", "mrs" ] },
			{ "field": "title", "is": "inSet", "values": [ "mrs", "dr" ] }
		] }
	]
}

Outputs both null and "mrs" twice.

Potential Solutions:

On the field level, we can filter duplicates by putting a filter on eg the field-specific IDataBagSource. That works for cases like the profile above.

On the row level, we can filter duplicates across the entire dataset.

It gets more complicated when we have multiple fields affected by multiple decision nodes. The first step to examining this issue is probably to play around with weirder kinds of profile and try to contrive some misbehaving examples.
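
A minimal sketch of the field-level filter described above; the wrapper around the field-specific IDataBagSource is not shown, only the de-duplication itself.

import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class DistinctValuesFilter {
    // For the profile above this keeps "mr", "mrs", "dr" and null once each, dropping the repeats.
    public static <T> Stream<T> distinctValues(Stream<T> source) {
        Set<T> seen = new HashSet<>();
        return source.filter(seen::add); // Set.add returns false for values already seen
    }
}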

Introduce CI to generator and profiler

Feature Request

Description of Problem:

Presently, it's possible to make a pull request in which there are unnoticed test failures.

Potential Solutions:

We should add new entries to our AWS Codebuild setup and set up any needed hooks with GitHub. Additionally, produce documentation for our CI in Confluence (because GitHub should be just about the code?).

Our JUnit setup is more complex than it needs to be

Ah, it's possible that our JUnit dependencies have always been a bit wrong, and are still afflicted by that.
They were built by following some example to the letter, but I can no longer find that example.

It looks like the recommended way is actually a lot simpler.

JUnit 5 installation readme:

https://junit.org/junit5/docs/current/user-guide/#installation

Samples repository:

https://github.com/junit-team/junit5-samples

JUnit 5 only:

https://github.com/junit-team/junit5-samples/blob/master/junit5-jupiter-starter-maven/pom.xml

JUnit 5 + JUnit 4 support:

https://github.com/junit-team/junit5-samples/blob/master/junit5-migration-maven/pom.xml

In conclusion: I think we need to:

  • testCompile against junit-jupiter-api
  • include junit-jupiter-engine at testRuntime (well, I think there's no way to express this in Maven)
  • we don't need the junit-jupiter-params that they are recommending in that template
    • because we don't yet do any parameterized tests
  • we continue to need the hamcrest matchers
  • ensure we are using a recent version of maven-surefire-plugin
    • only relevant for command-line build; I think IntelliJ provides its own testrunner
  • let's see what happens if we drop JUnit vintage support (may need to lightly rewrite our imports to code against the JUnit 5 API)
  • let's see what happens if we drop junit-platform-runner (I don't know what it is)
    • confirm that IDE test run and command-line test run still work
  • let's see what happens if we drop junit
    • supposedly this provides IDE support
    • I suspect it has the side-effect of risking our accidentally compiling against JUnit 4

I would like examples of pre-written profiles included in the repository.

I would like examples of pre-written profiles included in the repository so that

  • I can produce test data as soon as I have the repository on my machine to verify my setup

  • I can see how a profile should be formatted

  • I can save time by modifying the example files for my needs

Follow the link below to my fork of this repo, where I've added a manual-profiles folder containing sample hand-written profiles:
https://github.com/hwilliams-sl/data-engineering-generator/tree/master/generator/manual-profiles

Dockerise application

  • Containerise webapp

Should:

  • Watch feature branches and automatically build/test them, updating pull requests accordingly
  • Build master and queue something for later deployment
  • On deployment, put it in some location we can access

SetRestrictionsMerger combines whitelist and blacklist incorrectly

Bug Report

Steps to Reproduce:

{
	"schemaVersion": "v3",
	"fields": [
		{ "name": "title" }
	],
	"rules": [
		{ "field": "title", "is": "inSet", "values": [ "mr", "mrs" ] },
		{ "not": { "field": "title", "is": "inSet", "values": [ "mrs", "dr" ] } }
	]
}

Expected Result:

A row where title = "mr".

Actual Result:

Only a null row is produced.
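
A minimal sketch of the expected merge semantics (not the actual SetRestrictionsMerger code): combining a whitelist with a blacklist should leave the whitelist minus the blacklisted values.

import java.util.LinkedHashSet;
import java.util.Set;

public class WhitelistBlacklistMerge {
    // {"mr", "mrs"} merged with blacklist {"mrs", "dr"} -> {"mr"}, matching the expected result above.
    public static <T> Set<T> merge(Set<T> whitelist, Set<T> blacklist) {
        Set<T> merged = new LinkedHashSet<>(whitelist);
        merged.removeAll(blacklist);
        return merged;
    }
}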

Generate valid ISIN codes

Feature Request

Description of Problem:

I want to be able to specify that a field must be populated by valid ISIN codes. I should also be able to restrict the ISIN codes generated by regex, such as by saying that it must start with "GB". Generated ISIN codes must be valid:

  • They must have correct overall structure
  • They must have the correct internal structure for any regional types implemented (eg, ISINs with the GB prefix must contain a valid SEDOL)
  • They must have correct checksums
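
A minimal sketch of the structure and checksum checks (regional rules such as the embedded SEDOL check are out of scope here); the check digit uses the standard letter-expansion plus Luhn algorithm.

public class IsinChecksum {
    // e.g. isValid("US0378331005") == true; a wrong check digit fails.
    public static boolean isValid(String isin) {
        if (isin == null || !isin.matches("[A-Z]{2}[A-Z0-9]{9}[0-9]")) {
            return false; // wrong overall structure
        }
        // Expand letters to two-digit numbers (A=10 ... Z=35); digits pass through unchanged.
        StringBuilder digits = new StringBuilder();
        for (char c : isin.toCharArray()) {
            if (Character.isLetter(c)) {
                digits.append(c - 'A' + 10);
            } else {
                digits.append(c);
            }
        }
        // Luhn check over the expanded digit string, including the check digit.
        int sum = 0;
        boolean doubleThis = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleThis) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleThis = !doubleThis;
        }
        return sum % 10 == 0;
    }
}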

Update README.md for more Eclipse instruction

Bug Report

This is to update README.md with more instructions about using the Eclipse IDE. We may also reorganise the information a bit, so that some content in profile/README.md is moved to the top-level README.md.

RegexStringGenerator.generateAllValues() intermittently produces values in the wrong order

Bug Report

Steps to Reproduce:

Run test shouldGenerateStringsInLexicographicalOrder in RegexStringGeneratorTests

Expected Result:

The test passes

Actual Result:

The test sometimes fails as the values produced are not in the right order. They should be produced in the order in which they are defined in the regex.

Additional Context:

When producing non-random strings, each state's transitions are sorted using state.getSortedTransitions(boolean to_first), and it looks like the issue is that this method sometimes returns transitions in reverse order.
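
One possible fix sketch: sort the transitions deterministically by character range instead of relying on getSortedTransitions(). This assumes the dk.brics.automaton types referenced above; it is not necessarily the fix that was adopted.

import dk.brics.automaton.State;
import dk.brics.automaton.Transition;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OrderedTransitions {
    // Orders a state's outgoing transitions by their lower (then upper) character bound,
    // giving a stable lexicographical traversal order.
    public static List<Transition> inLexicographicalOrder(State state) {
        List<Transition> transitions = new ArrayList<>(state.getTransitions());
        transitions.sort(Comparator.comparing(Transition::getMin).thenComparing(Transition::getMax));
        return transitions;
    }
}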

I would like to see data generated that covers at least any equivalence partitions

As a user I would like to see data generated that covers at least any equivalence partitions related to the input profile.
E.g. if, in the profile, different thresholds are defined for delivery costs:

Max weight | Cost
100g | £10.80
500g | £11.76
1kg | £13.32

Minimal data output should include the following values

99g | £10.80
100g | £10.80
101g | £11.76
499g | £11.76
500g | £11.76
501g | £13.32
999g | £13.32
1000g | £13.32

(No values over 1000g should be included)

Output "interesting" field values

Feature Request

Description of Problem:

For numeric/string types, we use IFieldValueSource::generateAllValues and apply a LimitingFieldValueSource to reduce the value set down. This means that for a range of 0 < X < 100, we either get 1-7 or 1-99. It would be better if we could deliberately pick boundary values - in this case, 1 and 99.

Potential Solutions:

Add a new method to IFieldValueSource that generates boundary (or otherwise interesting) values. Modify GenerationConfig to allow specifying what kind of data we want (minimum: sequential / boundary, but add random if trivial)

Introduce databag partitioning

Feature Request

Description of Problem:

We overgenerate, introducing performance issues and bloated datasets.

Potential Solutions:

Currently we convert decision trees to full rowspecs. Could we convert instead to partitioning trees, that can be subject to the same combination strategies we already use?
