Giter VIP home page Giter VIP logo

datagrokr's Introduction

DATAGROKR

Section 1: Environment setup and data loading. 1. Go to Databricks community cloud and create a free account to get access to a single node Spark cluster: a. Go to https://databricks.com/try-databricks link and select the “COMMUNITY EDITION” (click on “get started” button). b. Provide all details and sign up. c. Verify your email id and select a password to get more information d. You can watch this video on YouTube to get started with Databricks community cloud. 2. Load the IPL dataset into the cluster a. Download the files needed to complete this assignment from here. Load the dataset into the cluster. This data set contains data of IPL Matches. b. We have sampled down the files and created a zipped file. You can download the zipped file and extract the file. c. Once you have downloaded the files, create a new Jupyter Notebook and import the data into the cluster. d. Create data frames for each of the data sets. Give proper column names and datatypes. (refer the schema provided with the data for reference e. Register those dataframes as tables. f. Please use pySpark or SparkSQL g. If you are new to Spark, refer to the Databricks and Spark documentation to learn about Notebooks, dataframes, loading data, etc. Below are some links that maybe useful for you to learn spark: i. https://docs.databricks.com/getting-started/spark/index.html ii. https://spark.apache.org/docs/latest/api/python/pyspark.sql.html iii. https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and- operations Notes: Databricks notebook supports Mix languages. You can override the default language by specifying the language magic command % at the beginning of a cell. The supported magic commands are: %python, %SQL, %sh, and %md. Read Documentation. Section 2: Data analysis using SparkSQL or pySpark 1. Use SparkSQL or pySpark to analyze the data and answer the following questions. a. Write a query to find the highest extra runs given by a team in a match. b. Write a query to find the Leading wicket taker in the IPL? Note: A wicket won't be accounted for the bowler, if the dismissal is run out, obstructing the field. c. Write a query to return a report for highest run scorer in matches which were affected by Duckworth-Lewis’s method (D/L method). d. Write a query to return a report for highest strike rate by a batsman in powerplay (1-6 overs) Note: strike rate = (Total Runs scored/Total balls faced by player) *100, Make sure that balls faced by player should be legal delivery (not wide balls or no balls) e. Write a query to return a report for highest extra runs in a venue (stadium, city). f. Write a query to return a report for the cricketers with the most number of players of the match award in neutral venues. g. Write a query to get a list of top 10 players with the highest batting average Note: Batting average is the total number of runs scored divided by the number of times they have been out (Make sure to include run outs (on non-striker end) as valid out while calculating average). h. Write a query to find out who has officiated (as an umpire) the most number of matches in IPL. i. Find venue details of the match where V Kohli scored his highest individual runs in IPL. 2. Please refer to the schema reference excel sheet provided in the zipped to understand the relationship between the tables. Please write the queries in the same Notebook. Section 3: Expose Data. 1. Create a Python SQLlite database and load data from the dataset (Section 1.1) into database. 2. Create a Class Database. Class will need to have 1. Constructor to initialize dB connection and other variables 2. Methods implemented to return result-set for each query in Section 2.1, result-set should be returned as a json/dict object. 3. Exception handling. 4. get_status method to ping database connectivity. 3. Feel free to create any additional classes or data structures you deem necessary. 4. Evaluation: Here is an input example from database import Database db = Database () qry1_result = db.get_query1_result (); 5. Please follow industry standards while writing the code and include basic schema and data validations. [Reference 2.2] 6. Programming language – Python 7. The code has to be in a Jupyter notebook.

datagrokr's People

Contributors

kishankumarthakur avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.