
525_group10's People

Contributors

andrew-tan, dbandrews, elabandari, mgaroub


525_group10's Issues

Team-work contract

https://docs.google.com/document/d/1_FoS7qASMOxAM9HmKmgOGSF-iroOtydWcAJU_Wf-Fjw/edit#

Similar to what you did in DSCI 522 and DSCI 524, create a team-work contract. The contract should outline how you are committed to working together so that you are accountable to one another. Again, you may start with your team contract document from previous project courses and adapt it for your new team. It is a fairly personal document, so please do not push it into your public repositories. Instead, save it somewhere your team can easily share it, and share a link to it, or a copy of it, with us in your Canvas submission to prove you did this.

Set up the server

  • Log in to the server (instance). The person who spins up the EC2 instance initially has sole access to the server, as only they hold the private key. If someone else wants to log in to that instance, they need to get hold of that private key.

  • Set up a common data folder for downloading data; this folder should be accessible by all users in the JupyterHub. The following commands make a folder and make it accessible to everyone. Want to learn more about basic UNIX commands?
    sudo mkdir -p /srv/data/my_shared_data_folder
    sudo chmod 777 /srv/data/my_shared_data_folder/

  • If you want a shared notebook environment, then check out this. If you plan to do this, make sure you install the "members" package on your server: run sudo apt-get install members.

  • Install AWS CLI. More details here.

NOTE: We are installing this on our EC2 instance, but we can install it anywhere to interact with S3. For example, you can install it on your local machine and move data to S3.

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install

  • Set up your access key and secret. Do it from your AWS console. Make sure you keep your "Access key ID" & secret key somewhere safe.

  • Use these credentials to configure the AWS CLI (aws configure). More details here. You can leave "Default region" and "output format" empty.

  • The AWS CLI can be used to interact with a lot of services. Check this out. To get a feel for it, we will use the CLI to interact with S3; hold off until the step Wrangle the data in preparation for machine learning. A programmatic sketch of the same idea follows this list.
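The milestone uses the CLI, but as an illustrative aside, moving data to S3 can also be done from Python with boto3. This is a minimal sketch, not a required step; the bucket name and file path are placeholders, and it relies on the credentials you set up with aws configure:

import boto3

# boto3 picks up the credentials configured earlier via `aws configure`
s3 = boto3.client("s3")

# upload a local file to a (placeholder) bucket
s3.upload_file("figshare/combined_data.csv", "my-example-bucket", "combined_data.csv")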

Please attach these screenshots from your group for grading.
Make sure you mask the IP address; refer here.

https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/images/3_result.png

Deploy your API

Once your API (app.py) is working, we're ready to deploy it! For this, do the following:

SSH into your EC2 instance from milestone 2. There is no issue if you want to spin up another EC2 instance; if you plan to do so, make sure you terminate any other running instances.
Make a file called app.py in your instance and copy what you developed above into it.

2.1 You can use the Linux editor vi. More details on the vi editor here. I do recommend doing it this way, and knowing some basics like :wq, :q!, dd will help.

2.2 Or else you can make a file on your laptop called app.py and copy it over to your EC2 instance using scp. Eg: scp -r -i "ggeorgeAD.pem" /Desktop/worker.py [email protected]:/

Download your model from s3 to your EC2 instance.
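You can do this with the AWS CLI as before; alternatively, a minimal boto3 sketch from Python might look like this (the bucket name and object key are placeholders, not the milestone's actual bucket):

import boto3

# download the trained model from a (placeholder) bucket to the current directory
s3 = boto3.client("s3")
s3.download_file("my-example-bucket", "model.joblib", "model.joblib")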

Presumably you already have pip or conda installed on your instance from your previous milestone. You should use one of those package managers to install the dependencies of your API, like flask, joblib, sklearn, etc.

4.1. You have installed packages in your TLJH using Installing pip packages. If you want to make them available to users outside of JupyterHub (which you want in this case, as we are logging into the EC2 instance as user ubuntu via ssh -i privatekey ubuntu@<host_name>), you can follow these instructions.

4.2. Alternatively, you can install the required packages from your terminal.

  • Install conda:
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
  • Install packages (there might be others):
    conda install flask scikit-learn joblib
Now you're ready to start your service: go ahead and run flask run --host=0.0.0.0 --port=8080. This will make your service available at your EC2 instance's IP address on port 8080. Please make sure that you run this from the directory where app.py and model.joblib reside.

You can now access your service by typing your EC2 instance's public IPv4 address appended with :8080 into a browser, so something like http://<your_EC2_ip>:8080.
You should use curl to send a post request to your service to make sure it's working as expected.

EG: curl -X POST http://your_EC2_ip:8080/predict -d '{"data":[1,2,3,4,53,11,22,37,41,53,11,24,31,44,53,11,22,35,42,53,12,23,31,42,53]}' -H "Content-Type: application/json"
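If you prefer to test from Python instead of curl, the same POST request can be sent with the requests library (a sketch; the IP address is a placeholder):

import requests

# send the 25 feature values to the (placeholder) EC2 endpoint
body = {"data": [1, 2, 3, 4, 53, 11, 22, 37, 41, 53, 11, 24, 31, 44, 53,
                 11, 22, 35, 42, 53, 12, 23, 31, 42, 53]}
resp = requests.post("http://<your_EC2_ip>:8080/predict", json=body)
print(resp.json())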

Now, what happens if you exit your connection with the EC2 instance? Can you still reach your service?

There are several options we could use to help us persist our server even after we exit our shell session. We'll be using screen. screen will allow us to create a separate session within which we can run flask and which won't shut down when we exit the main shell session. Read this to learn more about screen.
Now, create a new screen session (think of this as a new, separate shell) using: screen -S myapi. If you want to list already created sessions, do screen -list. If you want to get into an existing session, use screen -x myapi.
Within that session, start up your flask app. You can then exit the session by pressing Ctrl + A and then D. This detaches the session; once you log back into the EC2 instance, you can reattach it using screen -x myapi.
Feel free to exit your connection with the EC2 instance now and try accessing your service again with curl. You should find that the service has now persisted!

Perform a simple EDA in R

  • 1. Pick an approach to transfer the dataframe from Python to R (a sketch of the Parquet-file option follows this list).

  • Parquet file

  • Feather file

  • Pandas exchange

  • Arrow exchange

  • 2. Discuss why you chose this approach over others.
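For illustration, here is a minimal sketch of the Parquet-file approach on the Python side (the file paths are placeholders, and it assumes pyarrow is installed); R can then read the file back with arrow::read_parquet():

import pandas as pd

# assume df is the combined rainfall dataframe from the earlier milestone
df = pd.read_csv("figshare/combined_data.csv")

# write to Parquet so the dataframe can cross the Python/R boundary on disk
df.to_parquet("figshare/combined_data.parquet")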

Set up your browser, jupyter environment & connect to the master node

  • add screenshot

https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Milestones/milestone3/Milestone3.ipynb

2.1) Under cluster summary > Application user interfaces > On-cluster user interfaces: Click on Enable an SSH Connection.

2.2) From instructions in the popup from Step 2.1, use: Step 1: Open an SSH Tunnel to the Amazon EMR Master Node. Remember you are running this from your laptop terminal, and after running, it will look like this.

2.3) From instructions in the popup from Step 2.1, please ignore Step 2: Configure a proxy management tool. Instead, follow the instructions given here, under the section Example: Configure FoxyProxy for Firefox. Get FoxyProxy Standard here.

2.4) Move to the Application user interfaces tab and use the JupyterHub URL to access it.

2.4.1) Username: jovyan, Password: jupyter. These are the defaults; more details here.

2.5) [OPTIONAL] Remember, we are using EMR-managed JupyterHub, and its setup is different from TLJH. So before you add users to JupyterHub, run the following by SSHing into the master node. Follow the instructions under cluster summary > Connect to the Master Node Using SSH. Remember, you are running this from your laptop terminal. Once you get inside the server/instance, add your team members.

sudo docker exec jupyterhub useradd -m -s /bin/bash -N <username>
sudo docker exec jupyterhub bash -c "echo <username>:<password> | chpasswd"

2.6) Log into the master node from your laptop terminal (cluster summary > Connect to the Master Node Using SSH) and install the necessary packages. Below are the packages needed for the solution that I have; you might have to install other packages depending on your approach.

sudo yum install python3-devel
sudo pip3 install pandas
sudo pip3 install s3fs
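With s3fs installed alongside pandas, the master node can read objects straight from S3 by URL. A minimal sketch (the bucket path is a placeholder, and it assumes your combined CSV is already on S3):

import pandas as pd

# s3fs lets pandas resolve s3:// URLs transparently
df = pd.read_csv("s3://my-example-bucket/combined_data.csv")
print(df.shape)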

IMPORTANT: Make sure ssh -i ~/ggeorgeAD.pem -ND 8157 [email protected] is running in your terminal window before trying to access your jupyter URL. Sometimes the connection might drop; in that case, run that step again to access your JupyterHub.

Please attach these screenshots from your group for grading
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Milestones/milestone3/images/Task2.png

Set up your EMR cluster

  • add screenshot

https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Milestones/milestone3/Milestone3.ipynb

Follow the instructions shown during the lecture to set up your EMR cluster. Please make sure you follow the instructions below.

1.1) Go to advanced options.

1.2) Choose Release 6.2.0.

1.3) Check JupyterHub 1.1.0 & Spark 3.0.1.

1.4) Set core instances to 0 and master to 1.

1.5) Set the root device EBS volume size to 30 GB.

1.6) Cluster name:

1.7) Uncheck Termination protection.

1.8) Add a tag: enter "Owner" under the Key field. In the Value field in the Name row, give your name.

1.9) Select the keypair you used in your previous milestone (milestone 2).

1.10) For the EC2 security group, go with the default. Remember this is a managed service; as we learned from the shared responsibility model, AWS will take care of many things. EMR comes in the list of container services.

1.11) Wait for the cluster to start. This takes around 15 min; wait for your cluster status to become Waiting.

Please attach these screenshots from your group for grading
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Milestones/milestone3/images/Task1.png

Load the combined CSV to memory and perform a simple EDA

  • 1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts); a sketch of two of them follows this list.

  • Changing dtype of your data

  • Loading just the columns we want

  • Loading in chunks

  • Dask

  • 2. Discuss your observations.
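For instance, here is a minimal sketch of the dtype and chunking approaches. The file path and the column names "model" and "rain (mm/day)" are assumptions based on this dataset; adjust them to your actual columns:

import pandas as pd

# Approach 1: load only the columns we want, with smaller dtypes
df = pd.read_csv(
    "figshare/combined_data.csv",
    usecols=["model", "rain (mm/day)"],
    dtype={"model": "category", "rain (mm/day)": "float32"},
)
print(df["model"].value_counts())

# Approach 2: stream the file in chunks and aggregate the value counts
counts = None
for chunk in pd.read_csv("figshare/combined_data.csv", chunksize=1_000_000):
    vc = chunk["model"].value_counts()
    counts = vc if counts is None else counts.add(vc, fill_value=0)
print(counts)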

Specific expectations for this milestone - In the Notebook

  • In this milestone, we are looking for a well-documented and self-explanatory notebook exploring different options to tackle big data on your laptop.

  • Discuss any challenges or difficulties you faced when dealing with this large data on your laptops. Briefly explain your approach to overcoming the challenges, or the reasons why you were not able to overcome them.

Submission

Submission instructions

In the textbox provided on Canvas for the Milestone 2 assignment, include:

The URL of your public project's repository
The URL of your notebook for this milestone
Link to screenshots folder?

Comments

What is the shape of df_modelling (sanity check)?

Downloading the data

  • 1. Download the data from figshare to your local computer using the figshare API (you can make use of the requests library).

  • 2. Extract the zip file, again programmatically, similar to how we did it in class.

You can download the data and unzip it manually, but we learned about APIs, so we can do it in a reproducible way with the requests library, similar to how we did it in class; a sketch follows below.

There are 5 files in the figshare repo. The one we want is: data.zip
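Here is a minimal sketch of that workflow using the public figshare v2 API; the article id and the output directory are placeholders you would replace with the actual id of the rainfall dataset:

import os
import zipfile

import requests

article_id = 0  # placeholder: the figshare article id of the rainfall dataset
url = f"https://api.figshare.com/v2/articles/{article_id}"
output_dir = "figshare/"
os.makedirs(output_dir, exist_ok=True)

# list the files in the figshare article and pick out data.zip
files = requests.get(url).json()["files"]
data_file = next(f for f in files if f["name"] == "data.zip")

# download the zip, then extract it programmatically
zip_path = os.path.join(output_dir, "data.zip")
with open(zip_path, "wb") as f:
    f.write(requests.get(data_file["download_url"]).content)
with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall(output_dir)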

Code Review

Hey folks,

It would be great to get a second pair of eyes on both the task 3 and task 4 notebooks prior to submission.

Here is the assignment of reviewers:

  • Task 3 - Ela, Mo
  • Task 4 - Dustin, Kaicheng

Comments

File names in your Github repository could be improved.
The notebook needs to be better organized: you could have kept the rubric wording as section headings.
It is hard to follow what's being done, as there are very few comments in the code and very little analysis of your observations/results.
Could you add a comparison of performance between the different computers in your group?

Comments

Missing "security and access" in Task1-EMR-screenshot.png. Could you please add it?

Combining data CSVs

  • 1. Use one of the following options to combine the data CSVs into a single CSV.

  • Pandas

  • Dask

  • 2. When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON). A sketch of this is shown after the warning below.

  • 3. Compare the run times and memory usage of these options on different machines within your team, and summarize your observations in your milestone notebook.

Warning: Some of you might not be able to do this on your laptop. It's fine if you're unable to; just make sure you check the memory usage and discuss the reasons why you might not have been able to run this on your laptop.
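As referenced above, here is a minimal sketch of the pandas option; the figshare/ paths and the file-name pattern are assumptions based on the example file name:

import glob
import os

import pandas as pd

# gather the per-model CSVs; the pattern mirrors "SAM0-UNICON_daily_rainfall_NSW.csv"
files = glob.glob("figshare/*_daily_rainfall_NSW.csv")

dfs = []
for f in files:
    df = pd.read_csv(f)
    # derive the model name from the file name, e.g. SAM0-UNICON
    df["model"] = os.path.basename(f).split("_daily_rainfall_NSW.csv")[0]
    dfs.append(df)

pd.concat(dfs).to_csv("figshare/combined_data.csv", index=False)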

Develop your API

You have probably figured out how to set up primary URL endpoints from the sampleproject.ipynb notebook and have them process and return some data. Here we are going to create a new endpoint that accepts a POST request of the features required to run the machine learning model that you trained and saved in the last milestone (i.e., a user will post the 25 climate models' rainfall predictions, i.e., the features, needed to predict with your machine learning model). Your code should then process this data, use your model to make a prediction, and return that prediction to the user. To get you started with all this, I've given you a template which you should fill out to set up this functionality:

NOTE: You won't be able to test the flask module (or the API you make here) unless you go through steps in 2. Deploy your API. However, here you can make sure that you develop all your functions and inputs properly.
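Since the template itself isn't reproduced here, below is a minimal sketch of what such an app.py might look like; the model file name, route, and input format are assumptions, not the official template:

from flask import Flask, jsonify, request
from joblib import load

app = Flask(__name__)

# assumes the trained model was saved as model.joblib next to app.py
model = load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    content = request.get_json()  # e.g. {"data": [25 feature values]}
    prediction = model.predict([content["data"]])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)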

Creating repository and project structure

Similar to previous project courses, create a public repository under UBC-MDS org for your project.

  • Write a brief introduction of the project in the README.

  • Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.
