flink-sql-benchmark

TPC-DS benchmark

Generate test Hive dataset

  • Step 1: Prepare your environment

    Make sure you have Hadoop and Hive installed in your cluster. gcc is also needed to build the TPC-DS data generator.
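Before moving on, it can help to confirm that the three tools named in Step 1 are reachable. The following is a minimal sketch, not part of the repository; the function name missing_tools is hypothetical, and the check only looks at PATH (it does not validate versions or cluster configuration).

```shell
#!/bin/sh
# Hypothetical helper: print each given command that is NOT found on PATH.
# Prints nothing when every command is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# The tool list (hadoop, hive, gcc) follows Step 1.
missing="$(missing_tools hadoop hive gcc)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on one line.
  echo "Missing from PATH:" $missing
else
  echo "All prerequisites found."
fi
```

If hive is installed outside PATH, the check reports it as missing even though tpcds-setup.sh can still locate it through the HIVE_BIN environment variable.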

  • Step 2: Build the data generator

    cd hive-tpcds-setup

    Run ./tpcds-build.sh to download and build the TPC-DS data generator.

  • Step 3: Generate TPC-DS dataset

    cd hive-tpcds-setup

    Run ./tpcds-setup.sh <SCALE_FACTOR> to generate the dataset. The scale factor sets how much data is generated and roughly corresponds to gigabytes. For example, ./tpcds-setup.sh 10 generates about 10 GB of data, and ./tpcds-setup.sh 10000 produces the Hive database tpcds_bin_orc_10000. Note that the scale factor must be greater than 1.

    tpcds-setup.sh launches a MapReduce job to generate the data in text format. By default, the generated data is placed in /tmp/tpcds-generate/<SCALE_FACTOR> on your HDFS cluster. If that folder already exists, the MapReduce job is skipped.

    Once data generation is complete, tpcds-setup.sh loads the data into Hive tables. Make sure the hive executable is in your PATH. Alternatively, you can specify the path to your Hive executable via the HIVE_BIN environment variable.

    tpcds-setup.sh first creates external Hive tables based on the generated text files. These tables reside in a database named tpcds_text_<SCALE_FACTOR>. It then converts the text tables into an optimized format and places the converted tables in the database tpcds_bin_<FORMAT>_<SCALE_FACTOR>. By default, the optimized format is orc. You can choose a different format by setting the FORMAT environment variable. The following example creates a 1 TB test dataset in parquet format:

    FORMAT=parquet HIVE_BIN=/path/to/hive ./tpcds-setup.sh 1000

    Once the data is loaded into Hive, you can use the database tpcds_bin_<FORMAT>_<SCALE_FACTOR> to run the benchmark.
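The naming rules described in Step 3 can be sketched as a few shell helpers. This is an illustration only and is not taken from tpcds-setup.sh: the function names tpcds_data_dir, tpcds_text_db and tpcds_bin_db are hypothetical, and the helpers only build strings using the defaults stated above (/tmp/tpcds-generate and orc).

```shell
#!/bin/sh
# Hypothetical helpers that mirror the naming rules described in Step 3.
# They only build strings; they do not call Hadoop or Hive.

# HDFS directory that holds the generated text data for a scale factor.
tpcds_data_dir() {
  echo "/tmp/tpcds-generate/$1"
}

# Database of external text tables for a scale factor.
tpcds_text_db() {
  echo "tpcds_text_$1"
}

# Database of converted tables; the format is orc unless FORMAT is set.
tpcds_bin_db() {
  echo "tpcds_bin_${FORMAT:-orc}_$1"
}

scale=10000
echo "data dir: $(tpcds_data_dir "$scale")"
echo "text db:  $(tpcds_text_db "$scale")"
echo "bin db:   $(tpcds_bin_db "$scale")"
```

With the database name in hand, a query such as hive -e "USE tpcds_bin_orc_10000; SHOW TABLES;" should list the converted tables, assuming hive is on your PATH.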

Run benchmark in Flink

  • Step 1: Prepare your Flink environment.

  • Step 2: Build test jar.

    • Set the Flink version and the Hive version in pom.xml.

    • Run mvn clean install.

  • Step 3: Run

    • flink_home/bin/flink run -c com.ververica.flink.benchmark.Benchmark ./flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar --database tpcds_bin_orc_10000 --hive_conf hive_home/conf
    • optional --location: path to the SQL queries; by default, the queries bundled in the jar are used.
    • optional --queries: names of the SQL queries to run, e.g. 'q1.sql'. If the value is 'all', all queries are executed.
    • optional --iterations: the number of iterations to run per case; the default is 1.
    • optional --parallelism: the parallelism; the default is 800.
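To show how the required and optional arguments fit together, here is a minimal sketch of a wrapper that assembles the command from Step 3 and prints it as a dry run. The function benchmark_cmd and the variables FLINK_HOME, HIVE_HOME and JAR are assumptions made for illustration; only the class name, jar name and flags come from the text above. The wrapper does not handle paths that contain spaces.

```shell
#!/bin/sh
# Hypothetical wrapper around the `flink run` command from Step 3.
# FLINK_HOME, HIVE_HOME and JAR are assumed variables; adjust them to your setup.
FLINK_HOME="${FLINK_HOME:-flink_home}"
HIVE_HOME="${HIVE_HOME:-hive_home}"
JAR="${JAR:-./flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar}"

# Print the command line for a database. Any extra arguments are appended
# unchanged, so optional flags such as --queries or --iterations go there.
benchmark_cmd() {
  database="$1"
  shift
  echo "$FLINK_HOME/bin/flink run" \
    "-c com.ververica.flink.benchmark.Benchmark" \
    "$JAR" \
    "--database $database" \
    "--hive_conf $HIVE_HOME/conf" \
    "$@"
}

# Dry run: one query, three iterations, lower parallelism than the default 800.
benchmark_cmd tpcds_bin_orc_10000 --queries q1.sql --iterations 3 --parallelism 400
```

Printing the command instead of executing it makes it easy to review the arguments before submitting a long-running job; the printed line can then be pasted into a shell on a machine where Flink is installed.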

Run benchmark in other systems

Because the prepared test data is standard Hive data, other compute engines that integrate with Hive can also run the benchmark against it. Set up your own environment and test it.

If you have any questions, please contact:

Contributors

jingsongli, lirui-apache, stephanewen
