ostegm / dat_sf_10

This project forked from kebaler/dat_sf_10

Repository for data science 10 course

Python 100.00%

dat_sf_10's Issues

Homework 4 Review

Looks great, nice work!
@kebaler @ghego @craigsakuma

Homework 2 review

Hey Otto,

Overall I think you did a good job with the homework.

I think you did a great job with getting rid of the NaN in one go with:

df = pd.read_csv('iris.csv', names=column_labels, nrows=150)
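
For comparison, here is a minimal sketch of two ways to end up without the NaN row: capping the read with nrows, or reading everything and calling dropna(). The inline CSV text and the column names are made up for illustration (a tiny stand-in for iris.csv with a trailing line of empty fields), so treat it as a sketch rather than the homework's actual data.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris.csv: three data rows followed by a line of
# empty fields, which pandas parses as a row of NaN.
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    ",,,,\n"
)
# Column names are assumptions, not taken from the homework.
column_labels = ["sepal_length", "sepal_width", "petal_length", "petal_width", "Class"]

# Option 1: stop reading before the bad row (requires knowing the row count).
df_nrows = pd.read_csv(io.StringIO(raw), names=column_labels, nrows=3)

# Option 2: read everything, then drop any row containing NaN.
df_dropna = pd.read_csv(io.StringIO(raw), names=column_labels).dropna()
```

The nrows version depends on knowing how many good rows the file has, while dropna() removes incomplete rows wherever they appear.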

In case you don't have it in your toolkit, an alternative way to get rid of the NaN is to use df.dropna(). For the mapping, I saw that you first copied the Class column into a new Class_label column and then replaced the class names with numerical values using replace:

df['Class_label'] = df['Class']
df.Class_label.replace(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], [1, 2, 3], inplace=True)

An alternative, which I tried, is to create a dictionary from class name to number and apply it with the map function, since map works element-wise on a Series. In the lambda, x is each value in the column, and that value is used as the key to look up in the dictionary. So I had

data['Name_mapped'] = data['Name'].map(lambda x: name_dict[x])

I think one can also use the apply method on the column (applymap is the equivalent for a whole DataFrame), and it should all give the same result.
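
A small sketch of the dictionary-based mapping described above, with a made-up four-row column standing in for the real data. The variable names via_lambda, via_dict, and via_apply are only for illustration. It shows that map also accepts the dictionary directly, so the lambda is optional.

```python
import pandas as pd

# Made-up sample column; the real homework data has 150 rows.
data = pd.DataFrame(
    {"Name": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)
name_dict = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}

# map calls the function once per element; x is a value from the column,
# which is then used as a key into name_dict.
via_lambda = data["Name"].map(lambda x: name_dict[x])

# map also accepts the dictionary itself and performs the lookup directly.
via_dict = data["Name"].map(name_dict)

# apply on a Series is element-wise as well.
via_apply = data["Name"].apply(name_dict.get)

data["Name_mapped"] = via_dict
```

One difference worth knowing: with the lambda, a name missing from the dictionary raises a KeyError, while passing the dictionary directly yields NaN for that row.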

The KNN and KFold seem fine to me. The only thing I did differently was to convert the dataframe into an array with the values attribute:

X = data[['col_1','col_2','col_3','col_4']].values
y = data['Name_mapped'].values
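
To show how those arrays feed into KNN with k-fold cross-validation, here is a rough sketch using the copy of iris bundled with scikit-learn instead of the homework's CSV. The choice of 5 neighbors, 5 folds, and the random seed are assumptions for illustration, not values from the homework.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Bundled iris data as a DataFrame: four feature columns plus "target".
iris = load_iris(as_frame=True)
data = iris.frame

# .values turns the selected columns into plain NumPy arrays.
X = data[iris.feature_names].values
y = data["target"].values

# Assumed settings: 5 shuffled folds and k=5 neighbors.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

# One accuracy score per fold.
scores = cross_val_score(knn, X, y, cv=kf)
mean_accuracy = scores.mean()
```

Shuffling matters here because the iris rows are ordered by class, so unshuffled folds would each be dominated by one or two species.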

Also, you're not alone in getting the deprecation warning :). I got it too and I'll try to get help with it before class today.

I learned some interesting approaches while reviewing your homework. Hope this helped. If you have any questions, feel free to email me.

-Priya

@ghego, @craigsakuma, @kebaler

HW6 Review

Hey Otto,

Good stuff. You definitely put a lot of work into the column conversion! It was interesting that the bank-additional version didn't improve your CV score much. An interesting piece of the dataset is the "duration" column, which the UCI dataset library notes tends to be overly predictive (a duration of 0 always means "no") and is only known after the call has occurred.
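
A minimal sketch of leaving "duration" out of the features, using a made-up four-row frame. Only the column names "duration" and "y" follow the dataset; the rows and the "age" values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the bank marketing data; the rows are made up.
df = pd.DataFrame(
    {
        "age": [30, 45, 52, 38],
        "duration": [0, 210, 95, 0],
        "y": ["no", "yes", "no", "no"],
    }
)

# Check the pattern described above: calls with zero duration are all "no".
zero_calls_all_no = (df.loc[df["duration"] == 0, "y"] == "no").all()

# Leave out "duration" (only known after the call) along with the target.
X = df.drop(columns=["duration", "y"])
y = (df["y"] == "yes").astype(int)
```

Dropping the column gives a more honest picture of how the model would do when predicting before a call is made.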

One small error: you accidentally used a Gaussian estimator (left over from the original learning curve code on the site) instead of your Random Forest, which explains why the learning curve CV score for the Random Forest is significantly lower than your original score.
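
A sketch of what passing the Random Forest itself to the learning curve could look like, using scikit-learn's learning_curve on synthetic data. The dataset, the number of trees, the fold count, and the training sizes are all assumptions for illustration, not the homework's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the homework data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The estimator handed to learning_curve should be the model being evaluated,
# not the one left over from a copied example.
estimator = RandomForestClassifier(n_estimators=50, random_state=0)

train_sizes, train_scores, cv_scores = learning_curve(
    estimator, X, y, cv=5, train_sizes=[0.2, 0.6, 1.0]
)

# Average the per-fold scores for each training size.
mean_cv_scores = cv_scores.mean(axis=1)
```

With the right estimator, the CV score at the largest training size should land close to the score from a plain cross-validation of the same model.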

-Justin

Note on PCA

Hey Otto,

I checked out your HW3 since I was curious about your PCA implementation (and I think the student I was assigned dropped out of the course). Cool stuff, and I'd forgotten about that stats package. One thing that could improve your model is normalizing the scales so that features with large ranges don't overpower the rest of your data. Great idea putting class mean times into the spots for the data. I treated them all equally, which was probably a bad idea.

For PCA specifically, you were really close, actually. You had it implemented, in a way, by doing the analysis and seeing how to handle the variance. Here is another reason why scaling is important: PCA is very sensitive to scale, since it consolidates variance, and you don't want features with large numeric ranges to dominate the components (at least that's my understanding of it). The next step would be to choose a smaller number of components based on your scree plot and then use those components as the inputs to your linear regression. Reducing the number of dimensions may also improve your model's accuracy (remember that distance between points increases like crazy with higher dimensionality).
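
To make the scaling point concrete, here is a rough sketch on synthetic data where one feature has a much larger scale than the others. The data, the choice of two components, and the target are all invented for illustration; in practice the number of components would come from the scree plot.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three independent synthetic features; the second has a far larger scale.
X = np.column_stack(
    [
        rng.normal(0, 1, 200),
        rng.normal(0, 1000, 200),
        rng.normal(0, 1, 200),
    ]
)
# Made-up target that depends on the two small-scale features.
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(0, 0.1, 200)

# Without scaling, the large-scale feature soaks up nearly all the variance.
unscaled_ratio = PCA().fit(X).explained_variance_ratio_

# After standardizing, the variance is spread across the components.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

# Scale, reduce to an assumed two components, then fit a linear regression.
model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(X, y)
predictions = model.predict(X)
```

Comparing unscaled_ratio with scaled_ratio shows how much the raw scale alone can decide which direction PCA treats as most important.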

Best,
Justin

@ghego @craigsakuma @kebaler
