Dataset taken from Kaggle
Although this dataset was originally contributed to the UCI Machine Learning repository nearly 30 years ago, mushroom hunting (otherwise known as "shrooming") is enjoying new peaks in popularity. Learn which features spell certain death and which are most palatable in this dataset of mushroom characteristics. And how certain can your model be?
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family Mushroom drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be'' for Poisonous Oak and Ivy.
- Time period: Donated to UCI ML 27 April 1987
- Inspiration
What types of machine learning models perform best on this dataset? Which features are most indicative of a poisonous mushroom?
Attribute Information: (classes: edible=e, poisonous=p)
- cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
- cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
- cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
- bruises: bruises=t,no=f
- odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
- gill-attachment: attached=a,descending=d,free=f,notched=n
- gill-spacing: close=c,crowded=w,distant=d
- gill-size: broad=b,narrow=n
- gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
- stalk-shape: enlarging=e,tapering=t
- stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
- stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- veil-type: partial=p,universal=u
- veil-color: brown=n,orange=o,white=w,yellow=y
- ring-number: none=n,one=o,two=t
- ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
- spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
- population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
- habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
'''
Importing the foundamental libraries and reading the dataset with Pandas
'''
import pandas as pd
import numpy as np
data = pd.read_csv("mushrooms.csv")
data.head()
class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
5 rows ร 23 columns
target = 'class' # The class we want to predict
labels = data[target]
features = data.drop(target, axis=1) # Remove the target class from the dataset
Since we have only categorical features, we cannot feed them directly into sklearn classifiers. The technique we are going to use is called One Hot Encoding and what it does is basically add a new binary feature for each value the categorical feature has.
In pandas we can use the get_dummies function:
categorical = features.columns # Since every fearure is categorical we use features.columns
features = pd.concat([features, pd.get_dummies(features[categorical])], axis=1) # Convert every categorical feature with one hot encoding
features.drop(categorical, axis=1, inplace=True) # Drop the original feature, leave only the encoded ones
labels = pd.get_dummies(labels)['p'] # Encode the target class, 1 is deadly 0 is safe
'''
Split the dataset into training and testing, the 80% of the records are in the trainig set
'''
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features,labels, test_size=0.2, random_state=0)
'''
Train predict pipeline
'''
from sklearn.metrics import fbeta_score, accuracy_score
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
'''
inputs:
- learner: the learning algorithm to be trained and predicted on
- sample_size: the size of samples (number) to be drawn from training set
- X_train: features training set
- y_train: income training set
- X_test: features testing set
- y_test: income testing set
'''
results = {}
start = time() # Get start time
learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
end = time() # Get end time
results['train_time'] = end - start
start = time() # Get start time
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train[:300])
end = time() # Get end time
results['pred_time'] = end - start
results['acc_train'] = accuracy_score(y_train[:300],predictions_train)
results['acc_test'] = accuracy_score(y_test,predictions_test)
results['f_train'] = fbeta_score(y_train[:300],predictions_train, beta=0.5)
results['f_test'] = fbeta_score(y_test,predictions_test, beta=0.5)
print "{} trained on {} samples.".format(learner.__class__.__name__, sample_size)
return results
We use three different model:
- Gaussian Naive Bayes
- Random Forests
- kNN
The results are stored in the results dictionary:
from time import time
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
clf_A = GaussianNB()
clf_B = RandomForestClassifier()
clf_C = KNeighborsClassifier()
training_length = len(X_train)
samples_1 = int(training_length * 0.01)
samples_10 = int(training_length * 0.1)
samples_100 = int(training_length * 1)
results = {}
for clf in [clf_A, clf_B, clf_C]:
clf_name = clf.__class__.__name__
results[clf_name] = {}
for i, samples in enumerate([samples_1, samples_10, samples_100]):
results[clf_name][i] = \
train_predict(clf, samples, X_train, y_train, X_test, y_test)
GaussianNB trained on 64 samples.
GaussianNB trained on 649 samples.
GaussianNB trained on 6499 samples.
RandomForestClassifier trained on 64 samples.
RandomForestClassifier trained on 649 samples.
RandomForestClassifier trained on 6499 samples.
KNeighborsClassifier trained on 64 samples.
KNeighborsClassifier trained on 649 samples.
KNeighborsClassifier trained on 6499 samples.
The best model is Random Forest classifier, that achieved 100% accuracy with the test set! We don't even need hype params. tuning:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
We can now print the most important features: The results show that the foul odor is the most discriminant feature!
z = zip(clf.feature_importances_,X_train.columns)
z.sort(reverse=True)
z[:10]
[(0.15844522158060251, 'odor_f'),
(0.072093232716836098, 'gill-size_n'),
(0.071449650799149014, 'ring-type_p'),
(0.059524344656014208, 'stalk-surface-below-ring_k'),
(0.054395896612936, 'gill-color_b'),
(0.053292416415563093, 'odor_n'),
(0.051462205469969005, 'stalk-root_e'),
(0.037758414413626332, 'odor_p'),
(0.037439645501368912, 'stalk-surface-above-ring_k'),
(0.033770321762183406, 'odor_c')]