DATAGROKR

Section 1: Environment setup and data loading. 1. Go to Databricks community cloud and create a free account to get access to a single node Spark cluster: a. Go to https://databricks.com/try-databricks link and select the “COMMUNITY EDITION” (click on “get started” button). b. Provide all details and sign up. c. Verify your email id and select a password to get more information d. You can watch this video on YouTube to get started with Databricks community cloud. 2. Load the IPL dataset into the cluster a. Download the files needed to complete this assignment from here. Load the dataset into the cluster. This data set contains data of IPL Matches. b. We have sampled down the files and created a zipped file. You can download the zipped file and extract the file. c. Once you have downloaded the files, create a new Jupyter Notebook and import the data into the cluster. d. Create data frames for each of the data sets. Give proper column names and datatypes. (refer the schema provided with the data for reference e. Register those dataframes as tables. f. Please use pySpark or SparkSQL g. If you are new to Spark, refer to the Databricks and Spark documentation to learn about Notebooks, dataframes, loading data, etc. Below are some links that maybe useful for you to learn spark: i. https://docs.databricks.com/getting-started/spark/index.html ii. https://spark.apache.org/docs/latest/api/python/pyspark.sql.html iii. https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and- operations Notes: Databricks notebook supports Mix languages. You can override the default language by specifying the language magic command % at the beginning of a cell. The supported magic commands are: %python, %SQL, %sh, and %md. Read Documentation. Section 2: Data analysis using SparkSQL or pySpark 1. Use SparkSQL or pySpark to analyze the data and answer the following questions. a. Write a query to find the highest extra runs given by a team in a match. b. Write a query to find the Leading wicket taker in the IPL? Note: A wicket won't be accounted for the bowler, if the dismissal is run out, obstructing the field. c. Write a query to return a report for highest run scorer in matches which were affected by Duckworth-Lewis’s method (D/L method). d. Write a query to return a report for highest strike rate by a batsman in powerplay (1-6 overs) Note: strike rate = (Total Runs scored/Total balls faced by player) *100, Make sure that balls faced by player should be legal delivery (not wide balls or no balls) e. Write a query to return a report for highest extra runs in a venue (stadium, city). f. Write a query to return a report for the cricketers with the most number of players of the match award in neutral venues. g. Write a query to get a list of top 10 players with the highest batting average Note: Batting average is the total number of runs scored divided by the number of times they have been out (Make sure to include run outs (on non-striker end) as valid out while calculating average). h. Write a query to find out who has officiated (as an umpire) the most number of matches in IPL. i. Find venue details of the match where V Kohli scored his highest individual runs in IPL. 2. Please refer to the schema reference excel sheet provided in the zipped to understand the relationship between the tables. Please write the queries in the same Notebook. Section 3: Expose Data. 1. Create a Python SQLlite database and load data from the dataset (Section 1.1) into database. 2. Create a Class Database. Class will need to have 1. Constructor to initialize dB connection and other variables 2. Methods implemented to return result-set for each query in Section 2.1, result-set should be returned as a json/dict object. 3. Exception handling. 4. get_status method to ping database connectivity. 3. Feel free to create any additional classes or data structures you deem necessary. 4. Evaluation: Here is an input example from database import Database db = Database () qry1_result = db.get_query1_result (); 5. Please follow industry standards while writing the code and include basic schema and data validations. [Reference 2.2] 6. Programming language – Python 7. The code has to be in a Jupyter notebook.

kishankumarthakur / datagrokr Goto Github PK

datagrokr's Introduction

DATAGROKR

datagrokr's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent