Dec 2021 | Data Analytics Project in TY BTech
I chose the problem of classifying mushrooms as edible or poisonous, using the UCI Mushroom Dataset.
The analysis covers the following visualization and modeling tasks:
- Number of poisonous and edible mushrooms in the dataset.
- Visualize how edible and poisonous mushrooms are distributed across the various habitats.
- Population parameters of the instances of edible and poisonous mushrooms.
- Plot a treemap showing the distribution of different gill colors of mushrooms.
- Visualize the occurrence of different ring types and the number of rings on those mushrooms.
- Test the accuracy of various classification models on the dataset to build an accurate prediction model for the edibility of mushrooms.
- Run the most accurate classification method on an unseen dataset and cross-check its accuracy.
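The counts behind the first two charts can be tabulated before any plotting. The following is a minimal sketch using only the Python standard library; the `records` list is made-up illustration data, not values from the UCI dataset, and the variable names are assumptions rather than the project's own code.

```python
from collections import Counter

# Hypothetical (class, habitat) pairs; real values would be read from the dataset.
records = [
    ("edible", "woods"), ("poisonous", "woods"), ("edible", "grasses"),
    ("poisonous", "paths"), ("edible", "grasses"), ("poisonous", "woods"),
]

# Number of edible and poisonous mushrooms (data for a simple bar chart).
class_counts = Counter(label for label, _ in records)

# Edibility per habitat (data for a grouped bar chart).
habitat_by_class = Counter((habitat, label) for label, habitat in records)

print(dict(class_counts))
for habitat in sorted({h for _, h in records}):
    edible = habitat_by_class[(habitat, "edible")]
    poisonous = habitat_by_class[(habitat, "poisonous")]
    print(f"{habitat}: edible={edible}, poisonous={poisonous}")
```

The same tallies can be passed to any plotting library to draw the bar charts.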
My approach to the problem is:
- Using one-hot encoding to convert the dataset's categorical features into a numeric form the classifiers can work with.
- Applying Logistic Regression, Random Forest and Decision Tree classifiers and comparing their accuracy.
- Using K-fold cross-validation to check the consistency and fit of each algorithm.
- Choosing the approach with the highest accuracy and the lowest standard deviation, and using it on the given test dataset to predict the classes of the mushrooms.
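The steps above can be sketched with scikit-learn as follows. This is an illustrative outline under stated assumptions, not the project's actual code: the rows are generated from a made-up labelling rule rather than loaded from the UCI dataset, the column values and the five-fold setting are placeholders, and ranking by mean accuracy with standard deviation as a tie-breaker is one possible reading of "maximum accuracy and minimum standard deviation".

```python
import random

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in data: three categorical columns and a toy labelling rule.
rng = random.Random(0)
ODORS = ["none", "almond", "foul", "pungent"]
GILL_COLORS = ["white", "brown", "pink", "buff"]
HABITATS = ["woods", "grasses", "urban", "paths"]

rows, labels = [], []
for _ in range(200):
    odor = rng.choice(ODORS)
    rows.append([odor, rng.choice(GILL_COLORS), rng.choice(HABITATS)])
    labels.append("p" if odor in ("foul", "pungent") else "e")  # toy rule only

X = np.array(rows)
y = np.array(labels)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# K-fold cross-validation; the encoder sits inside the pipeline so it is
# fitted on the training folds only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    pipeline = make_pipeline(OneHotEncoder(handle_unknown="ignore"), model)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")

# Highest mean accuracy first; lower standard deviation breaks ties.
best = max(results, key=lambda n: (results[n][0], -results[n][1]))
print("selected:", best)

# Refit the selected model on all training rows and predict unseen rows.
final_model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), models[best])
final_model.fit(X, y)
unseen = np.array([["foul", "white", "woods"], ["none", "brown", "urban"]])
predictions = final_model.predict(unseen)
print(predictions)
```

Keeping the encoder inside the pipeline avoids leaking category information from the validation folds, and `handle_unknown="ignore"` lets the model cope with categories that appear only in the unseen data.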