spark-learning's Introduction

Based on the official Apache Spark docs:

  • https://spark.apache.org/docs/latest/quick-start.html
  • https://spark.apache.org/docs/latest/rdd-programming-guide.html#passing-functions-to-spark
  • https://spark.apache.org/docs/latest/sql-programming-guide.html

Configure Apache Spark on your local machine

How to submit a Spark job

spark-submit --class org.example.arydz.rdd.Main /home/arydz/workspace/learning/Spark-Learning/build/libs/spark-basic-0.1.jar

or

spark-submit --class org.example.arydz.sql.Main /home/arydz/workspace/learning/Spark-Learning/build/libs/spark-basic-0.1.jar

or

spark-submit --class org.example.arydz.sql.hive.Main /home/arydz/workspace/learning/Spark-Learning/build/libs/spark-basic-0.1.jar

MongoDB

Fill in later
https://github.com/vaquarkhan/springboot-microservice-apache-spark

https://www.bmc.com/blogs/mongodb-docker-container/

https://stackoverflow.com/questions/60522471/docker-compose-mongodb-docker-entrypoint-initdb-d-is-not-working

https://medium.com/faun/managing-mongodb-on-docker-with-docker-compose-26bf8a0bbae3

The current JS script is for learning purposes only; it forces Docker Compose to initialize MongoDB.

Spark

RDD - Which Storage Level to Choose?

  • If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

  • If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)

  • Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

  • Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
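
The guidance above can be read as a small decision procedure. The sketch below is one possible encoding of it in plain Java, with no Spark dependency. `StorageLevelAdvisor`, `choose`, and the four boolean parameters are hypothetical names introduced here for illustration; only the returned strings are meant to match constant names in Spark's `StorageLevel`. Treat it as a reading aid under these assumptions, not as a rule Spark applies on its own.

```java
// Hypothetical helper, not part of Spark: it turns the four rules of thumb
// above into a function that returns the name of a StorageLevel constant.
final class StorageLevelAdvisor {

    private StorageLevelAdvisor() {}

    /**
     * @param fitsDeserialized     the RDD fits comfortably in memory as plain objects
     * @param fitsSerialized       the RDD fits in memory once serialized
     * @param expensiveToRecompute recomputing a partition is costly, or the
     *                             lineage filters away a large amount of data
     * @param needsFastRecovery    tasks must keep running without waiting for
     *                             a lost partition to be recomputed
     */
    static String choose(boolean fitsDeserialized,
                         boolean fitsSerialized,
                         boolean expensiveToRecompute,
                         boolean needsFastRecovery) {
        String level;
        if (fitsDeserialized) {
            // Rule 1: the default level is the most CPU-efficient.
            level = "MEMORY_ONLY";
        } else if (fitsSerialized || !expensiveToRecompute) {
            // Rule 2: serialize to save space. Rule 3: stay off disk when
            // recomputing a partition is cheap.
            level = "MEMORY_ONLY_SER";
        } else {
            // Rule 3, other branch: spilling pays off for expensive lineages.
            level = "MEMORY_AND_DISK_SER";
        }
        // Rule 4: the "_2" variants replicate each partition on two nodes.
        return needsFastRecovery ? level + "_2" : level;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false, false));   // MEMORY_ONLY
        System.out.println(choose(false, true, false, false));  // MEMORY_ONLY_SER
        System.out.println(choose(false, false, true, false));  // MEMORY_AND_DISK_SER
        System.out.println(choose(true, true, false, true));    // MEMORY_ONLY_2
    }
}
```

In a real job the chosen level would be passed to `persist`, for example `rdd.persist(StorageLevel.MEMORY_ONLY_SER())` from the Java API. Whether an RDD "fits" has to be judged from the Storage tab of the Spark UI or from measurements; the booleans here are simply inputs you supply.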
