Datafaker - Tool for faking data

1. Introduction

Datafaker is a tool for generating large-scale test data and streaming test data. It is compatible with Python 2.7 and Python 3.4+. You are welcome to download and use it. The GitHub repository is:

https://github.com/gangly/datafaker

The documentation is kept up to date on GitHub.

2. Background

During software development and testing, test data is often needed. Typical scenarios include:

  • Backend development. After creating a new table, you need to construct database test data and generate interface data for the front end to use.
  • Database performance testing. A large amount of test data must be generated to test database performance.
  • Stream data testing. For Kafka streaming data, test data must be generated continuously and written to Kafka.

At the time of our research, we found no open source tool that generates test data matching the structure of a MySQL table. The common approach is to insert a few rows into the database by hand. This approach has several drawbacks:

  • Wasted work hours. Different data must be constructed by hand for each field, depending on its data type.
  • Small data volume. Constructing a large amount of data by hand is impractical.
  • Not accurate enough. For example, you may need an email address (which must follow a certain format), a phone number (fixed number of digits), an IP address (fixed format), or an age (non-negative and within a range). Such test data follows restrictions or rules, and hand-built data may violate the range or format requirements, causing errors in the backend program.
  • Multi-table association. Because hand-built data is small in volume, keys in different tables may fail to match, so joins return no data.
  • Dynamic random writes. For streaming data, for example, you need to write random records to Kafka every few seconds, or insert random rows into MySQL dynamically. Doing this manually is cumbersome, and it is hard to keep count of the records written.
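
To make the "format and range" point concrete, here is a minimal sketch of rule-constrained field generators using only the Python standard library. It is not Datafaker's implementation; the function names (`fake_email`, `fake_phone`, `fake_ipv4`, `fake_age`) and the chosen limits are assumptions made for illustration.

```python
import random
import string

# Hypothetical helpers for illustration; these are not part of Datafaker.

def fake_email(rng):
    """Return an address of the form <8 lowercase letters>@<domain>."""
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return user + "@" + rng.choice(["example.com", "example.org"])

def fake_phone(rng, digits=11):
    """Return a digit string of fixed length whose first digit is never 0."""
    rest = "".join(rng.choice(string.digits) for _ in range(digits - 1))
    return str(rng.randint(1, 9)) + rest

def fake_ipv4(rng):
    """Return four dot-separated octets, each in 0..255."""
    return ".".join(str(rng.randint(0, 255)) for _ in range(4))

def fake_age(rng, low=0, high=120):
    """Return a non-negative age within the inclusive range [low, high]."""
    return rng.randint(low, high)

rng = random.Random(42)  # seeded so repeated runs produce the same row
row = {
    "email": fake_email(rng),
    "phone": fake_phone(rng),
    "ip": fake_ipv4(rng),
    "age": fake_age(rng),
}
print(row)
```

Because each generator encodes its own rule, every generated row satisfies the format constraints by construction, which is hard to guarantee when rows are typed in by hand.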

Datafaker was created to address these pain points. It is a test data construction tool for multiple data sources that can simulate most common data types. Datafaker has the following features:

  • Multiple data types. These include common database field types (integer, float, character) and custom types (IP address, email address, ID number, etc.).
  • Simulated multi-table association data. By declaring some fields as enumerated types (values randomly selected from a specified list), you can ensure that, with a large amount of data, rows in multiple tables can be joined and return results.
  • Batch and stream data generation, with a configurable interval for stream data.
  • Multiple output methods, including printing to the screen, files, and remote data sources.
  • Multiple data sources. Relational databases, Hive, and Kafka are currently supported; Mongo, ES, and other data sources are planned.
  • Configurable output format. Text and JSON are currently supported.
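
The enumerated-type idea behind multi-table association can be sketched as follows. This is an illustration under stated assumptions, not Datafaker's own code or rule syntax: the table layouts, the `USER_IDS` list, and `make_orders` are all hypothetical.

```python
import random

# Hypothetical enumerated list: the only values the join key may take.
USER_IDS = [101, 102, 103, 104, 105]

def make_orders(rng, n):
    """Generate n order rows whose user_id is drawn from the enumerated list."""
    return [
        {
            "order_id": i,
            "user_id": rng.choice(USER_IDS),
            "amount": round(rng.uniform(1, 500), 2),
        }
        for i in range(1, n + 1)
    ]

# One user row per enumerated id, so every order has a matching user.
users = [{"user_id": uid, "name": "user_%d" % uid} for uid in USER_IDS]
orders = make_orders(random.Random(7), 1000)

# Emulate an inner join on user_id in plain Python.
users_by_id = {u["user_id"]: u for u in users}
joined = [
    (o["order_id"], users_by_id[o["user_id"]]["name"])
    for o in orders
    if o["user_id"] in users_by_id
]
print(len(joined), "of", len(orders), "orders matched a user")
```

Since the key column can only take values from the shared list, no order is dropped by the join regardless of how many rows are generated.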

3. Architecture

Datafaker is written in Python and supports Python 2.7 and Python 3.4+. The current version has been released on PyPI.

[Figure: Datafaker architecture diagram]

The architecture diagram shows the tool's full execution flow. As the figure shows, data passes through five modules:

  • Parameter parser. Parses the command the user enters on the command line.
  • Metadata parser. Users can specify metadata from a local file or from a table in a remote data source. After the parser obtains the content, it parses the text into table field metadata and data construction rules.
  • Data construction engine. Based on the construction rules produced by the metadata parser, the engine simulates and generates data of different types.
  • Data routing. Depending on the output type, generation is split into batch data and stream data. The generation frequency of stream data can be specified. The data is then converted to the user-specified format for output to the different data sources.
  • Data source adapter. Adapts to the different data sources and imports the data into them.
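
The five modules can be pictured as a small pipeline. The sketch below is a toy model of that flow, not Datafaker's actual code: the flag names (`--rows`, `--interval`, `--format`), the `name | rule` metadata syntax, and every function name are assumptions invented for this example.

```python
import argparse
import itertools
import json
import random
import time

def parse_args(argv):
    """Stage 1, parameter parser (flag names are illustrative only)."""
    p = argparse.ArgumentParser()
    p.add_argument("--rows", type=int, default=3)
    p.add_argument("--interval", type=float, default=None)
    p.add_argument("--format", choices=["text", "json"], default="json")
    return p.parse_args(argv)

def parse_metadata(text):
    """Stage 2, metadata parser: turn 'name | rule' lines into pairs."""
    fields = []
    for line in text.strip().splitlines():
        name, rule = line.split("|")
        fields.append((name.strip(), rule.strip()))
    return fields

def build_row(fields, rng):
    """Stage 3, construction engine: apply each field's rule once."""
    row = {}
    for name, rule in fields:
        kind, _, arg = rule.partition(":")
        if kind == "int":
            low, high = (int(x) for x in arg.split(","))
            row[name] = rng.randint(low, high)
        elif kind == "enum":
            row[name] = rng.choice(arg.split(","))
        else:
            raise ValueError("unknown rule: " + rule)
    return row

def route(fields, rng, rows, interval):
    """Stage 4, routing: a finite batch, or an endless paced stream."""
    counter = range(rows) if interval is None else itertools.count()
    for _ in counter:
        yield build_row(fields, rng)
        if interval is not None:
            time.sleep(interval)  # pause between stream records

def format_row(row, fmt):
    """Convert a row to the requested output format."""
    if fmt == "json":
        return json.dumps(row)
    return ",".join(str(v) for v in row.values())

def screen_adapter(lines):
    """Stage 5, data source adapter: this one only prints to the screen."""
    for line in lines:
        print(line)

# Hypothetical metadata; the real rule syntax is described in the project docs.
META = """
id | int:1,1000
city | enum:Beijing,Shanghai,Shenzhen
"""

args = parse_args(["--rows", "3", "--format", "text"])
fields = parse_metadata(META)
rows = list(route(fields, random.Random(0), args.rows, args.interval))
screen_adapter(format_row(r, args.format) for r in rows)
```

Keeping routing as a generator means the same construction code serves both modes: batch mode stops after a fixed count, while stream mode keeps yielding until the consumer stops reading.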

4. Installation

Method 1: install from source.

Download the source code, unzip it, and install:

python setup.py install

Method 2: install with pip:

pip install datafaker

Upgrade the tool:

pip install datafaker --upgrade

Uninstall the tool:

pip uninstall datafaker

Install the required package for your data source:

data source           package                    note
mysql/tidb            mysql-python/mysqlclient   on Windows with Python 3, use mysqlclient
oracle                cx-Oracle                  requires some Oracle libraries
postgresql/redshift   psycopg2
sqlserver             pyodbc                     mssql+pyodbc://mssql-v
hbase                 happybase, thrift
es                    elasticsearch
hive                  pyhive
kafka                 kafka-python
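
Before connecting to a data source, it can help to check that its driver is importable. The sketch below is not part of Datafaker. The mapping from data source to import name reflects the usual import names of the packages listed above (for example, mysqlclient is imported as `MySQLdb` and kafka-python as `kafka`) and should be treated as an assumption to verify against your installed versions.

```python
import importlib.util

# Assumed import names for the driver packages listed above.
DRIVER_MODULES = {
    "mysql/tidb": ["MySQLdb"],
    "oracle": ["cx_Oracle"],
    "postgresql/redshift": ["psycopg2"],
    "sqlserver": ["pyodbc"],
    "hbase": ["happybase", "thrift"],
    "es": ["elasticsearch"],
    "hive": ["pyhive"],
    "kafka": ["kafka"],
}

def missing_modules(source):
    """Return the driver modules for `source` that cannot be found."""
    return [
        name
        for name in DRIVER_MODULES[source]
        if importlib.util.find_spec(name) is None
    ]

for source in sorted(DRIVER_MODULES):
    missing = missing_modules(source)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print("%-20s %s" % (source, status))
```

`find_spec` only locates the module without importing it, so the check is fast and does not trigger native library loading errors such as a missing Oracle client.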

5. Examples

Usage examples

6. Command parameters

Command-line parameter details

7. Construction rules

Construction rules

8. Notes

Notes

9. Release notes

Release notes


Give the project a star or buy the author a coffee.
