group4-525's Introduction

DSCI 525 Group 4

UBC MDS Web and Cloud Computing project

The purpose of this project is to build and deploy ensemble machine learning models in the cloud to predict daily rainfall in Australia from a relatively large data set (about 5.5 GB). The data can be found on figshare. We use the outputs of different climate models as features and the actual rainfall recordings as the target.

There are four milestones for this project:

Milestone 1: Tackling big data on a laptop

  • Downloading the data using the figshare API
  • Combining the data CSVs into a single CSV using the Dask and Pandas libraries
  • Loading the combined CSV into memory and performing a simple EDA
  • Performing a simple EDA in R

Team Members

Name             GitHub ID
Heidi Ye         @heidi-ye
Junting He       @JuntingHe
Kamal Moravej    @Kmoravej
Tanmay Sharma    @tanmaysharma19

group4-525's Issues

Develop your API

  • Create a new endpoint that accepts a POST request containing the features required by the machine learning model that you trained and saved in the previous milestone.
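
As a rough illustration of such an endpoint, here is a minimal sketch that uses only Python's standard-library `http.server` rather than a web framework. The `/predict` path, the `"data"` key, the feature count `N_FEATURES` and the `stub_predict` function are all hypothetical stand-ins; in the real service the saved model would replace the stub.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical feature count; the real value depends on the trained model.
N_FEATURES = 3


def stub_predict(features):
    """Placeholder for the saved model: returns the mean of the inputs."""
    return sum(features) / len(features)


def handle_predict(body, predict=stub_predict, n_features=N_FEATURES):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return 400, {"error": "body must be valid JSON"}
    features = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(features, list) or len(features) != n_features:
        return 400, {"error": f"'data' must be a list of {n_features} numbers"}
    if not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in features):
        return 400, {"error": "every feature must be a number"}
    return 200, {"prediction": predict(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only the (assumed) /predict path is served.
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def run(port=8080):
    """Serve the endpoint until interrupted."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `run()` starts the server, after which a client can POST a body such as `{"data": [1, 2, 3]}` to `/predict`. Keeping validation in `handle_predict` separates it from the HTTP plumbing, so it can be checked without starting a server.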

Perform a simple EDA in R

  1. Pick an approach to transfer the dataframe from Python to R.
    • Parquet file
    • Feather file
    • Pandas exchange
    • Arrow exchange
  2. Discuss why you chose this approach over others.

Data download

  1. Download the data from figshare to your local computer using the figshare API (you can make use of the requests library).
  2. Extract the zip file, again programmatically, similar to how we did it in class.
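
The two steps above might look roughly like the following sketch. It uses the standard library's `urllib.request` in place of `requests` so that it has no third-party dependencies, and it assumes the figshare v2 article endpoint returns JSON with a `files` list whose entries carry `name` and `download_url`. `ARTICLE_ID` and `FILE_NAME` are hypothetical placeholders, since the actual values are not given here.

```python
import json
import os
import urllib.request
import zipfile

# Hypothetical placeholders: substitute the real article ID and file name.
ARTICLE_ID = "0000000"
FILE_NAME = "data.zip"
API_URL = "https://api.figshare.com/v2/articles/{article_id}"


def find_download_url(article, file_name):
    """Return the download URL for file_name from figshare article metadata."""
    for entry in article["files"]:
        if entry["name"] == file_name:
            return entry["download_url"]
    raise KeyError(f"{file_name} is not listed in the article metadata")


def extract_zip(zip_path, out_dir):
    """Extract zip_path into out_dir and return the names of its members."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        return zf.namelist()


def download_and_extract(article_id, file_name, out_dir):
    """Fetch the article metadata, download one file, and unpack it."""
    os.makedirs(out_dir, exist_ok=True)
    with urllib.request.urlopen(API_URL.format(article_id=article_id)) as resp:
        article = json.load(resp)
    zip_path = os.path.join(out_dir, file_name)
    urllib.request.urlretrieve(find_download_url(article, file_name), zip_path)
    return extract_zip(zip_path, out_dir)
```

A call such as `download_and_extract(ARTICLE_ID, FILE_NAME, "data")` would then leave the extracted CSVs in a `data` folder (the folder name is also just an example).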

Project structure

Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.

Combining data CSVs

  1. Use one of the following options to combine data CSVs into a single CSV.
    • Pandas
    • Dask
  2. When combining the CSV files, make sure to add an extra column called “model” that identifies the model (tip: you can populate this column from the file name, e.g., for the file name “SAM0-UNICON_daily_rainfall_NSW.csv”, the model name is SAM0-UNICON).
  3. Compare the run times and memory usage of these options on different machines within your team, and summarize your observations in your milestone notebook.
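
One possible shape for the Pandas option is sketched below. It assumes every input file follows the `<model>_daily_rainfall_<region>.csv` naming shown in the tip; the function names are illustrative, not part of the assignment.

```python
import glob
import os

import pandas as pd


def model_from_filename(path):
    """Return the model name, i.e. the file-name text before '_daily_rainfall'."""
    return os.path.basename(path).split("_daily_rainfall")[0]


def combine_csvs(input_dir, output_path):
    """Append every rainfall CSV in input_dir to output_path, one file at a time."""
    paths = sorted(glob.glob(os.path.join(input_dir, "*_daily_rainfall_*.csv")))
    for i, path in enumerate(paths):
        df = pd.read_csv(path)
        df["model"] = model_from_filename(path)
        # Write the header only once, then append the remaining files.
        df.to_csv(output_path, mode="w" if i == 0 else "a", header=(i == 0), index=False)
    return len(paths)
```

Processing one file at a time keeps peak memory near the size of the largest input rather than the combined output. Wrapping the call between two `time.perf_counter()` readings gives a simple run time to compare across machines.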

Load the combined CSV to memory and perform a simple EDA

  1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).
    • Changing the dtype of your data
    • Loading only the columns you need
    • Loading in chunks
    • Dask
  2. Discuss your observations.
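
Three of the approaches listed above (changing dtype, loading selected columns, loading in chunks) can be sketched with Pandas as follows. The helper names are illustrative and the column to count is passed in as an argument, since the actual column names of the combined CSV are not stated here.

```python
import pandas as pd


def downcast_floats(df):
    """Return a copy of df with float64 columns stored as float32."""
    out = df.copy()
    for col in out.select_dtypes(include="float64").columns:
        out[col] = out[col].astype("float32")
    return out


def chunked_value_counts(csv_path, column, chunksize=1_000_000):
    """Count values of one column while holding only one chunk in memory."""
    total = None
    # usecols loads just the requested column; chunksize streams the file.
    for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize):
        counts = chunk[column].value_counts()
        total = counts if total is None else total.add(counts, fill_value=0)
    if total is None:
        return pd.Series(dtype="int64")
    # Adding with fill_value produces floats, so convert back to integers.
    return total.astype("int64")
```

Comparing `df.memory_usage(deep=True).sum()` before and after `downcast_floats` gives a concrete figure for the dtype change, at the cost of reduced floating-point precision.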

Final Submission

In the textbox provided on Canvas for the Milestone 1 assignment, include:

  • The URL of your project's public repository
  • The URL of your notebook for this milestone
