Giter VIP home page Giter VIP logo

eks-spark-benchmark's Introduction

This repository contains benchmark results and best practice to run Spark workloads on EKS.

Spark on Kubernetes

Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Kubernetes has first class support on Amazon Web Services and Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service.

Traditionally, data processing workloads have been run in dedicated setups like the YARN/Hadoop stack. However, unifying the control plane for all workloads on Kubernetes simplifies cluster management and can improve resource utilization. Engineers across several companies and organizations have been working on Kubernetes resource manager support as a cluster scheduler backend within Spark.

Starting with Spark 2.3, users can run Spark workloads in an existing Kubernetes cluster and take advantage of Apache Spark’s ability to manage distributed data processing tasks.

In order to run large scale spark applications in Kubernetes, there's still a lots of performance issues in Spark 2.4 or 3.0 we'd like users to know. This repo will talk about these performance optimization and best practice moving Spark workloads to Kubernetes.

Getting Started

To run TPC-DS benchmark on EKS cluster, please follow instructions

Performance Hits and Optimizations

TPC-DS Benchmark

Created by a third-party committee, TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions. According to its own homepage (https://www.tpc.org/tpcds/), it defines decision support systems as those that examine large volumes of data, give answers to real-world business questions, execute SQL queries of various operational requirements and complexities (e.g., ad-hoc, reporting, iterative OLAP, data mining), and are characterized by high CPU and IO load.

This benchmark includes 104 queries that exercise a large part of the SQL 2003 standards – 99 queries of the TPC-DS benchmark, four of which with two variants (14, 23, 24, 39) and “s_max” query performing a full scan and aggregation of the biggest table, store_sales.

We can evaluate and measure the performance of Spark SQL using the TPC-DS benchmark on Kubernetes (EKS) and Apache Yarn (CDH).

q64-v2.4, q70-v2.4, q82-v2.4 are very representative and typical. They are Network shuffle, CPU, I/O intensive queris. From the result, we can see performance on Kubernetes and Apache Yarn are very similar. kubernetes wins slightly on these three queries. In total, 10 iterations of the query have been performed and the median execution time is taken into consideration for comparison.

  • Spark on Kubernetes fetch more blocks from local rather than remote. Executors fetch local blocks from file and remote blocks need to be fetch through network. Fetch blocks locally is much more efficient compare to remote fetching.

  • Spark on Yarn seems take more time on JVM GC. Frequent GC will block executor process and have a big impact on the overall performance. We have not checked number of minor gc vs major gc, this need more investigation in the future. Memory management is also different in two different resource managers. Kubernetes request spark.executor.memory + spark.executor.memoryOverhead as total request and limit for executor pods, every pod has its own os cache space inside the container. While, Apache Yarn monitors pmem and vmem of containers and have system shared os cache.

  • Spark on Kubernetes uses more time on shuffleFetchWaitTime and shuffleWriteTime. Looks like executors on Kubernetes take more time to read and write shuffle data. Data locality is not available in Kubernetes, scheduler can not make decision to schedule workers in a network optimized way. They's probably the reason it takes longer on shuffle operations.

There're 68% of queries running faster on Kubernetes, 6% of queries has similar performance as Yarn. In total, our benchmark results shows TPC-DS queries against 1T dataset take less time to finish, it save ~5% time compare to YARN based cluster.

Credits and Thanks

  • @steveloughran gives a lot of helps to use S3A staging and magic committers and understand zero-rename committer deeply.

  • @moomindani help on the current status of S3 support for spark in AWS.

  • TPC-DS and TeraSort is pretty popular in big data area and there're few existing solutions. Some code snippets in the repo come from @Kisimple and @cern.

References

eks-spark-benchmark's People

Contributors

jeffwan avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.