Section 1: Environment setup and data loading. 1. Go to Databricks community cloud and create a free account to get access to a single node Spark cluster: a. Go to https://databricks.com/try-databricks link and select the “COMMUNITY EDITION” (click on “get started” button). b. Provide all details and sign up. c. Verify your email id and select a password to get more information d. You can watch this video on YouTube to get started with Databricks community cloud. 2. Load the IPL dataset into the cluster a. Download the files needed to complete this assignment from here. Load the dataset into the cluster. This data set contains data of IPL Matches. b. We have sampled down the files and created a zipped file. You can download the zipped file and extract the file. c. Once you have downloaded the files, create a new Jupyter Notebook and import the data into the cluster. d. Create data frames for each of the data sets. Give proper column names and datatypes. (refer the schema provided with the data for reference e. Register those dataframes as tables. f. Please use pySpark or SparkSQL g. If you are new to Spark, refer to the Databricks and Spark documentation to learn about Notebooks, dataframes, loading data, etc. Below are some links that maybe useful for you to learn spark: i. https://docs.databricks.com/getting-started/spark/index.html ii. https://spark.apache.org/docs/latest/api/python/pyspark.sql.html iii. https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and- operations Notes: Databricks notebook supports Mix languages. You can override the default language by specifying the language magic command % at the beginning of a cell. The supported magic commands are: %python, %SQL, %sh, and %md. Read Documentation. Section 2: Data analysis using SparkSQL or pySpark 1. Use SparkSQL or pySpark to analyze the data and answer the following questions. a. Write a query to find the highest extra runs given by a team in a match. b. Write a query to find the Leading wicket taker in the IPL? Note: A wicket won't be accounted for the bowler, if the dismissal is run out, obstructing the field. c. Write a query to return a report for highest run scorer in matches which were affected by Duckworth-Lewis’s method (D/L method). d. Write a query to return a report for highest strike rate by a batsman in powerplay (1-6 overs) Note: strike rate = (Total Runs scored/Total balls faced by player) *100, Make sure that balls faced by player should be legal delivery (not wide balls or no balls) e. Write a query to return a report for highest extra runs in a venue (stadium, city). f. Write a query to return a report for the cricketers with the most number of players of the match award in neutral venues. g. Write a query to get a list of top 10 players with the highest batting average Note: Batting average is the total number of runs scored divided by the number of times they have been out (Make sure to include run outs (on non-striker end) as valid out while calculating average). h. Write a query to find out who has officiated (as an umpire) the most number of matches in IPL. i. Find venue details of the match where V Kohli scored his highest individual runs in IPL. 2. Please refer to the schema reference excel sheet provided in the zipped to understand the relationship between the tables. Please write the queries in the same Notebook. Section 3: Expose Data. 1. Create a Python SQLlite database and load data from the dataset (Section 1.1) into database. 2. Create a Class Database. Class will need to have 1. Constructor to initialize dB connection and other variables 2. Methods implemented to return result-set for each query in Section 2.1, result-set should be returned as a json/dict object. 3. Exception handling. 4. get_status method to ping database connectivity. 3. Feel free to create any additional classes or data structures you deem necessary. 4. Evaluation: Here is an input example from database import Database db = Database () qry1_result = db.get_query1_result (); 5. Please follow industry standards while writing the code and include basic schema and data validations. [Reference 2.2] 6. Programming language – Python 7. The code has to be in a Jupyter notebook.
datagrokr's Introduction
datagrokr's People
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.