In this assignment you will be learning Hive and Spark.
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
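For example, a minimal HiveQL sketch (the table and column names here are hypothetical) of laying an SQL-queryable table over files already sitting in HDFS:

```sql
-- An external table over tab-separated files in HDFS; the data is
-- left in place, Hive only adds schema on top of it.
CREATE EXTERNAL TABLE tweets (user_id BIGINT, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/tweets';

-- Ordinary SQL over the files: the ten most active users.
SELECT user_id, count(*) AS n
FROM tweets
GROUP BY user_id
ORDER BY n DESC
LIMIT 10;
```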
Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers. It is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
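Spark's core model, chained transformations over partitioned data, can be sketched in plain Python with no cluster at all; Spark's value is running this same pattern in parallel across many machines. A word-count sketch (sample input lines are made up):

```python
# The classic word-count pattern that Spark distributes across a
# cluster, written here as ordinary Python over an in-memory list.
lines = ["spark is fast", "hive is sql on hadoop", "spark is general"]

words = [w for line in lines for w in line.split()]  # flatMap: line -> words
pairs = [(w, 1) for w in words]                      # map: word -> (word, 1)

counts = {}
for w, n in pairs:                                   # reduceByKey: sum counts
    counts[w] = counts.get(w, 0) + n
```

In PySpark the same steps would be `flatMap`, `map`, and `reduceByKey` on an RDD, with each stage executed in parallel over the data's partitions.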
Discuss Assignments
- Hive Tips and Best Practices
- Query Analysis and Partitions
- Parallel execution
- Spark Tips and Best Practices
- Serialization
- Creating and uploading a virtual environment (Hive)
- Generic big data Parquet files (demo only if a large dataset is available on the cluster)
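For the partitioning and parallel-execution items above, a HiveQL sketch (table, column, and partition names are hypothetical; the `SET` properties are standard Hive settings):

```sql
-- Partitioning stores each dt value in its own directory, so a
-- filter on dt prunes whole directories instead of scanning all data.
CREATE TABLE logs (msg STRING, ts BIGINT)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- Only the dt='2024-01-01' partition is read:
SELECT count(*) FROM logs WHERE dt = '2024-01-01';

-- EXPLAIN prints the query plan, the starting point for query analysis:
EXPLAIN SELECT count(*) FROM logs WHERE dt = '2024-01-01';

-- Parallel execution: let independent stages of a query run concurrently.
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;
```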
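On the serialization item: PySpark ships each task to the executors by serializing the function and the data it uses (Spark uses cloudpickle for this; plain `pickle` stands in for it in this sketch). Objects that cannot be pickled, such as open files, locks, or the SparkContext itself, are a common source of task-serialization errors.

```python
import pickle

# What the driver would send: a function plus the records it should
# process. `task` reads the module-level `multiplier`, which must be
# resolvable on the receiving side as well.
multiplier = 3

def task(x):
    return x * multiplier

payload = pickle.dumps((task, [1, 2, 3]))  # serialized on the "driver"
fn, data = pickle.loads(payload)           # deserialized on an "executor"
result = [fn(x) for x in data]             # -> [3, 6, 9]
```

The practical tip that follows from this: keep functions passed to Spark small and self-contained, and avoid capturing heavy or unpicklable state in them.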
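For the virtual-environment item, a shell sketch of packaging an environment for upload to the cluster (the environment name and HDFS path are hypothetical; the `pip install` and `hdfs` steps are commented out because they need network and cluster access):

```shell
# Create a virtual environment and archive it for distribution.
python3 -m venv hive_env
# hive_env/bin/pip install pandas              # install the job's dependencies
tar -czf hive_env.tar.gz hive_env              # archive for upload
# hdfs dfs -put hive_env.tar.gz /user/$USER/   # upload to the cluster
```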
- Hive adds some more structure to data and lets you write HiveQL, a SQL-like query language.
- Spark is a fast and general engine for large-scale data processing.
- Hortonworks has an introductory tutorial.
- Get hands-on experience with the industry standards Hive and Spark.
- Handle problems with large amounts of data.
- Go through an old Twitter deck on why Pig is good.