In this assignment you will be learning Hive and Spark.
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
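For example, a minimal HiveQL sketch (the table and column names here are hypothetical) of laying an SQL-queryable table over files already sitting in HDFS:

```sql
-- An external table over tab-separated files in HDFS; the data is
-- left in place, Hive only adds schema on top of it.
CREATE EXTERNAL TABLE tweets (user_id BIGINT, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/tweets';

-- Ordinary SQL over the files: the ten most active users.
SELECT user_id, count(*) AS n
FROM tweets
GROUP BY user_id
ORDER BY n DESC
LIMIT 10;
```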
Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers. It is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
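Spark's core model, chained transformations over partitioned data, can be sketched in plain Python with no cluster at all; Spark's value is running this same pattern in parallel across many machines. A word-count sketch (sample input lines are made up):

```python
# The classic word-count pattern that Spark distributes across a
# cluster, written here as ordinary Python over an in-memory list.
lines = ["spark is fast", "hive is sql on hadoop", "spark is general"]

words = [w for line in lines for w in line.split()]  # flatMap: line -> words
pairs = [(w, 1) for w in words]                      # map: word -> (word, 1)

counts = {}
for w, n in pairs:                                   # reduceByKey: sum counts
    counts[w] = counts.get(w, 0) + n
```

In PySpark the same steps would be `flatMap`, `map`, and `reduceByKey` on an RDD, with each stage executed in parallel over the data's partitions.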
Discuss Assignments
- Hive Tips and Best Practices
- Query Analysis and Partitions
- Parallel execution
- Spark Tips and Best Practices
- Serialization
- Creating and uploading a virtual environment (Hive)
- Generic big data Parquet files (demo only if a large dataset is available on the cluster)
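For the partitioning and parallel-execution items above, a HiveQL sketch (table, column, and partition names are hypothetical; the `SET` properties are standard Hive settings):

```sql
-- Partitioning stores each dt value in its own directory, so a
-- filter on dt prunes whole directories instead of scanning all data.
CREATE TABLE logs (msg STRING, ts BIGINT)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- Only the dt='2024-01-01' partition is read:
SELECT count(*) FROM logs WHERE dt = '2024-01-01';

-- EXPLAIN prints the query plan, the starting point for query analysis:
EXPLAIN SELECT count(*) FROM logs WHERE dt = '2024-01-01';

-- Parallel execution: let independent stages of a query run concurrently.
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;
```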
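On the serialization item: PySpark ships each task to the executors by serializing the function and the data it uses (Spark uses cloudpickle for this; plain `pickle` stands in for it in this sketch). Objects that cannot be pickled, such as open files, locks, or the SparkContext itself, are a common source of task-serialization errors.

```python
import pickle

# What the driver would send: a function plus the records it should
# process. `task` reads the module-level `multiplier`, which must be
# resolvable on the receiving side as well.
multiplier = 3

def task(x):
    return x * multiplier

payload = pickle.dumps((task, [1, 2, 3]))  # serialized on the "driver"
fn, data = pickle.loads(payload)           # deserialized on an "executor"
result = [fn(x) for x in data]             # -> [3, 6, 9]
```

The practical tip that follows from this: keep functions passed to Spark small and self-contained, and avoid capturing heavy or unpicklable state in them.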
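For the virtual-environment item, a shell sketch of packaging an environment for upload to the cluster (the environment name and HDFS path are hypothetical; the `pip install` and `hdfs` steps are commented out because they need network and cluster access):

```shell
# Create a virtual environment and archive it for distribution.
python3 -m venv hive_env
# hive_env/bin/pip install pandas              # install the job's dependencies
tar -czf hive_env.tar.gz hive_env              # archive for upload
# hdfs dfs -put hive_env.tar.gz /user/$USER/   # upload to the cluster
```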
- Hive adds some more structure to data and lets you write HiveQL, a SQL-like query language.
- Spark is a fast and general engine for large-scale data processing.
- Hortonworks has an introductory tutorial.
- Get hands-on experience with the industry standards Hive and Spark.
- Handle problems with large amounts of data.
- Go through an old Twitter deck on why Pig is good.