nslatysheva / data_science_blogging
Code and markdown files for blog posts
License: GNU General Public License v3.0
I have swapped them in my working version and I think it flows better. We should discuss it though
one illustration could show the nested nature of using an RF in ensembling, as it is an ensemble model itself
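To make the "ensemble within an ensemble" idea concrete, here is a minimal sketch (not from the post) that nests a random forest as one base model inside a voting ensemble; the synthetic data and the choice of co-models are illustrative assumptions:

```python
# Sketch: a random forest (itself an ensemble of trees) nested as one
# base learner inside a hard-voting ensemble. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=1)

ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('knn', KNeighborsClassifier()),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=1)),  # nested ensemble
])

scores = cross_val_score(ensemble, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

An illustration could then show the forest's individual trees as the inner layer and the voting classifier as the outer layer.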
write a system command for people to download the dataset from the source
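Something like the following could work; the URL below is a placeholder, since the actual dataset source isn't given here:

```shell
# Hypothetical source URL -- substitute the dataset's real location
DATA_URL="https://example.com/dataset.csv"

# Download to dataset.csv; fall back to wget if curl is unavailable
if command -v curl >/dev/null; then
    curl -fsSL "$DATA_URL" -o dataset.csv || echo "download failed"
else
    wget -q "$DATA_URL" -O dataset.csv || echo "download failed"
fi
```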
stick the optimization code there and go into more detail about optimization tradeoffs and such
I think that would be a great plot to show why it would be a good idea to do the ensembling.
for example when getting into random forests, we can refer to Guiseppe's post
instead of the custom multilayer perceptron
I didn't quite have the time to figure out the randomized searching code, maybe have a look?
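A possible starting point for the randomized search: `RandomizedSearchCV` has the same interface as `GridSearchCV` but samples a fixed number of candidates from the parameter space. This sketch recreates the setup on synthetic data (the real post would use its own `XTrain`/`yTrain`):

```python
# Randomized search over k for KNN: sample 20 candidate values of
# n_neighbors instead of exhaustively trying every one.
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=1)

# sample uniformly from the odd values 3, 5, ..., 149
param_dist = {'n_neighbors': list(range(3, 151, 2))}
search = RandomizedSearchCV(KNeighborsClassifier(), param_dist,
                            n_iter=20, cv=5, random_state=1)
search.fit(X, y)
print(search.best_params_)
```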
I think this could be at the top of the document, with an explanation of why it makes sense to mix these models in particular.
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation accuracy for the two KNN models
knn3scores = cross_val_score(knn3, XTrain, yTrain, cv=5)
print(knn3scores)
print("Mean of scores KNN3:", knn3scores.mean())
# [ 0.85714286  0.8206278   0.85201794  0.87892377  0.86936937]
# Mean of scores KNN3: 0.855616346648

knn99scores = cross_val_score(knn99, XTrain, yTrain, cv=5)
print(knn99scores)
print("Mean of scores KNN99:", knn99scores.mean())
# [ 0.85267857  0.83856502  0.82511211  0.9058296   0.87387387]
# Mean of scores KNN99: 0.859211834352
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)  # seed 1

# Grid search over odd values of k with 10-fold CV
knn = KNeighborsClassifier()
n_neighbors = np.arange(3, 151, 2)
grid = GridSearchCV(knn, {'n_neighbors': n_neighbors}, cv=10)
grid.fit(XTrain, yTrain)
cv_scores = grid.cv_results_['mean_test_score']  # grid_scores_ was removed from sklearn

# Training and test accuracy at each value of k
train_scores = []
test_scores = []
for n in n_neighbors:
    knn.n_neighbors = n
    knn.fit(XTrain, yTrain)
    train_scores.append(metrics.accuracy_score(yTrain, knn.predict(XTrain)))
    test_scores.append(metrics.accuracy_score(yTest, knn.predict(XTest)))

plt.plot(n_neighbors, train_scores, c="blue", label="Training Scores")
plt.plot(n_neighbors, test_scores, c="brown", label="Test Scores")
plt.plot(n_neighbors, cv_scores, c="black", label="CV Scores")
plt.xlabel('Number of K nearest neighbors')
plt.ylabel('Classification Accuracy')
plt.gca().invert_xaxis()  # larger k (lower model complexity) on the left
plt.legend(loc="upper left")
plt.show()
Hyperparameter optimization strategies to cover:
- grid search
- random search
- Bayesian optimization
- gradient descent (check whether implemented)