machine_learning's Issues

Data visualization(s) based on suitable visualization techniques including a principal component analysis (PCA).

Touch upon the following subjects, using visualizations where it appears sensible. Keep in mind the ACCENT principles and Tufte’s guidelines when you visualize the data.

  • Are there issues with outliers in the data?
  • Do the attributes appear to be normally distributed?
  • Are the variables correlated?
  • Does the primary machine learning modeling aim appear to be feasible based on your visualizations?
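A minimal sketch of how these checks might look in Python with pandas and matplotlib; the file name data.csv and the choice of plotting the first attribute are placeholders for your own data set:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; replace with your own data set.
df = pd.read_csv("data.csv").select_dtypes(include="number")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Outliers: box plots of each attribute.
df.boxplot(ax=axes[0])
axes[0].set_title("Outliers")

# Normality: histogram of one attribute (repeat for the others).
df.iloc[:, 0].hist(ax=axes[1], bins=30)
axes[1].set_title("Distribution of " + df.columns[0])

# Correlation: matrix of pairwise correlations shown as an image.
im = axes[2].imshow(df.corr(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_title("Correlation matrix")
fig.colorbar(im, ax=axes[2])

plt.tight_layout()
plt.show()
```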

There are three aspects that need to be described when you carry out the PCA for the report:

  • The amount of variation explained as a function of the number of PCA components included,
  • the principal directions of the considered PCA components (either find a way to plot them or interpret them in terms of the features),
  • the data projected onto the considered principal components.

If your attributes have very different scales, it may be relevant to standardize the data prior to the PCA.
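A minimal sketch of how the three aspects might be obtained with scikit-learn, including standardization; the file name data.csv is a placeholder for your own data set:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical file name; replace with your own data set.
X = pd.read_csv("data.csv").select_dtypes(include=[np.number]).to_numpy()

# Standardize so attributes with large scales do not dominate the PCA.
X_std = StandardScaler().fit_transform(X)

pca = PCA()
Z = pca.fit_transform(X_std)  # data projected onto the principal components

# Variance explained as a function of the number of components included.
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("cumulative variance explained:", np.round(cum_var, 3))

# Principal directions: each row of components_ is a direction expressed
# in terms of the original (standardized) attributes.
print("first principal direction:", np.round(pca.components_[0], 3))
```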

A description of your data set

  • What the problem of interest is (i.e. what is your data about),
  • Where you obtained the data,
  • What has previously been done with the data (i.e., if available, go through some of the original source papers, read what they did with the data, and summarize their results).
  • What the primary machine learning modeling aim is for the data, i.e. which attributes you feel are relevant when carrying out classification, regression, clustering, association mining, and anomaly detection in the later reports, and what you hope to accomplish using these techniques. For instance, which attribute do you wish to explain in the regression based on which other attributes? Which class label will you predict based on which other attributes in the classification task? If you need to transform the data to admit these tasks, explain roughly how you might do this (but don’t transform the data now!).

A detailed explanation of the attributes of the data

  • Describe whether the attributes are discrete/continuous and nominal/ordinal/interval/ratio,
  • Give an account of whether there are data issues (i.e. missing values or corrupted data) and, if so, describe them,
  • Describe the basic summary statistics of the attributes.

If your data set contains many similar attributes, you may restrict yourself to describing a few representative features (apply common sense).
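A minimal sketch of how the attribute types, data issues, and summary statistics might be inspected with pandas; the file name data.csv is a placeholder for your own data set:

```python
import pandas as pd

# Hypothetical file name; replace with your own data set.
df = pd.read_csv("data.csv")

# Attribute types: discrete vs. continuous can be read off the dtypes,
# but the nominal/ordinal/interval/ratio classification is a judgement call.
print(df.dtypes)

# Data issues: count missing values per attribute.
print(df.isna().sum())

# Basic summary statistics (mean, std, min, quartiles, max) per attribute.
print(df.describe())
```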

Regression

  • Explain which regression problem you have chosen to solve.
  • Apply linear regression with forward selection and consider whether transforming or combining attributes may be useful. For linear regression, plotting the residual error vs. the attributes can give some insight into whether including a transformation of a variable can improve the model, i.e. potentially describe parts of the residuals.
  • Explain how a new data observation is predicted according to the estimated model, i.e. what the effects of the selected attributes are in terms of predicting the data. (Note that interpreting the magnitude of the estimated coefficients in general requires that each attribute be normalized prior to the analysis.)
  • Fit an artificial neural network (ANN) model to the data.
  • Statistically evaluate whether there is a significant performance difference between the fitted ANN and linear regression models based on the same cross-validation splits (i.e., use a paired t-test). In addition, compare whether the performance of your models is better than simply predicting the output to be the average of the training data output.
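A minimal sketch of the statistical comparison, assuming a file data.csv with a regression target in a column named "target" (both are placeholders), 10-fold cross-validation, and an MLP as the ANN; forward selection is omitted here for brevity:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical file and target column; replace with your own regression problem.
df = pd.read_csv("data.csv")
y = df["target"].to_numpy()
X = df.drop(columns=["target"]).to_numpy()

# The same splits are reused for every model so the per-fold errors are paired.
splits = list(KFold(n_splits=10, shuffle=True, random_state=0).split(X))

def cv_errors(model):
    """Mean squared error on each held-out fold."""
    errors = []
    for train, test in splits:
        model.fit(X[train], y[train])
        errors.append(np.mean((y[test] - model.predict(X[test])) ** 2))
    return np.array(errors)

lin = make_pipeline(StandardScaler(), LinearRegression())
ann = make_pipeline(StandardScaler(), MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000))

lin_errors = cv_errors(lin)
ann_errors = cv_errors(ann)

# Baseline: predict the average of the training data output.
base_errors = np.array([np.mean((y[test] - y[train].mean()) ** 2) for train, test in splits])

# Paired t-test on the per-fold errors from the same cross-validation splits.
t, p = stats.ttest_rel(lin_errors, ann_errors)
print("linear vs. ANN: t =", t, " p =", p)
print("mean errors - linear:", lin_errors.mean(), " ANN:", ann_errors.mean(), " baseline:", base_errors.mean())
```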

Classification

  • Explain which classification problem you have chosen to solve.
  • Apply at least three of the following methods: Decision Trees, Logistic/Multinomial Regression, K-Nearest Neighbors (KNN), Naïve Bayes and Artificial Neural Networks (ANN). (Use cross-validation to select relevant parameters in an inner cross-validation loop and give in a table the performance results for the methods evaluated on the same cross-validation splits on the outer cross-validation loop, i.e. you should use two levels of cross-validation).
  • For the models you are able to interpret, explain how a new data observation is classified.
    (If you have multiple fitted models (i.e., one for each cross-validation split), either focus on one of these fitted models or consider fitting one model, using the optimal parameter setting estimated by cross-validation, to all the data.)
  • Statistically compare the performance of the two best performing models (i.e., use a paired t-test). In addition, compare whether the performance of your models is better than simply predicting all outputs to be the largest class in the training data.
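A minimal sketch of the two-level cross-validation and the comparison against the largest-class baseline, assuming a file data.csv with a class label in a column named "label", KNN and logistic regression as two of the chosen methods, and the parameter grids shown; all of these are placeholders for your own choices:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Hypothetical file and label column; replace with your own classification problem.
df = pd.read_csv("data.csv")
y = df["label"].to_numpy()
X = df.drop(columns=["label"]).to_numpy()

# Outer folds are shared by all methods so their error rates are paired.
outer = list(StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y))

def outer_cv_error(estimator, param_grid):
    """Two-level cross-validation: GridSearchCV is the inner loop selecting
    parameters, the outer loop measures the error rate on held-out folds."""
    errors = []
    for train, test in outer:
        inner = GridSearchCV(estimator, param_grid, cv=5)
        inner.fit(X[train], y[train])
        errors.append(np.mean(inner.predict(X[test]) != y[test]))
    return np.array(errors)

knn = Pipeline([("scale", StandardScaler()), ("clf", KNeighborsClassifier())])
logreg = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

knn_err = outer_cv_error(knn, {"clf__n_neighbors": [1, 3, 5, 7, 9]})
log_err = outer_cv_error(logreg, {"clf__C": [0.01, 0.1, 1, 10]})

# Baseline: always predict the largest class of each training fold.
base_err = np.array([np.mean(y[test] != pd.Series(y[train]).mode()[0]) for train, test in outer])

# Paired t-test on the per-fold error rates of the two best models.
t, p = stats.ttest_rel(knn_err, log_err)
print("KNN:", knn_err.mean(), " LogReg:", log_err.mean(), " baseline:", base_err.mean())
print("paired t-test: t =", t, " p =", p)
```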
