kventinel / hse-ml-project-mnist

hse-ml-project-mnist's Introduction

Task

K-means

Choose 3 to 6 features. Explain the choice.
Apply K-means:
1. At K=5
2. At K=9
3. In both cases: run 10 or more random initializations and choose the best solution according to the K-means criterion; present it in a table.
Interpret each partition found, using features from the data table as instructed in the lecture slides. Explain why you consider one of them better than the other from this perspective.
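
The restart procedure above can be sketched as follows. This is a minimal illustration, not the project's own code: the data matrix `X` is synthetic, the helper names `kmeans` and `best_of_restarts` are hypothetical, and the "K-means criterion" is assumed to be the summary squared Euclidean distance from each object to its cluster center.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=100):
    """Plain Lloyd iterations from one random start.

    Returns (labels, centers, criterion), where criterion is the sum of
    squared distances from every object to its nearest center.
    """
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    criterion = d2[np.arange(len(X)), labels].sum()
    return labels, centers, criterion

def best_of_restarts(X, k, n_init=10, seed=0):
    """Run n_init random initializations and keep the lowest criterion."""
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])

# Synthetic stand-in for 4 chosen features (assumption), z-scored.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(300, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

results = {k: best_of_restarts(X, k, n_init=10) for k in (5, 9)}
for k, (labels, centers, criterion) in results.items():
    sizes = np.bincount(labels, minlength=k)
    print(f"K={k}: criterion={criterion:.2f}, cluster sizes={sizes.tolist()}")
```

The `centers` array of each winning run is what would go into the table; comparing each row with the grand mean of the same feature is one way to interpret a cluster.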

Bootstrap

Take one of the partitions found in the previous work.

Take a feature, find the 95% confidence interval for its grand mean by using bootstrap.
Compare the within-cluster means for one of the features between two clusters using bootstrap.
Take a cluster, and compare the grand mean with the within-cluster mean for the feature by using bootstrap.

Note: each application of bootstrap should be done in both the pivotal and the non-pivotal versions.
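
A minimal sketch of the three bootstrap tasks above, on synthetic data. It assumes that "pivotal" means a Gaussian approximation of the bootstrap distribution (mean ± 1.96 standard deviations) and that "non-pivotal" means taking the 2.5% and 97.5% percentiles directly; check this against the lecture's definitions. All names (`bootstrap_means`, `pivotal_ci`, `non_pivotal_ci`) and the data are hypothetical.

```python
import numpy as np

def bootstrap_means(x, n_boot=5000, rng=None):
    """Means of n_boot resamples of x, drawn with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    return x[idx].mean(axis=1)

def pivotal_ci(samples, z=1.96):
    """95% interval assuming the bootstrap means are Gaussian."""
    m, s = samples.mean(), samples.std(ddof=1)
    return m - z * s, m + z * s

def non_pivotal_ci(samples, level=0.95):
    """Interval taken from the percentiles of the bootstrap means."""
    alpha = (1 - level) / 2 * 100
    lo, hi = np.percentile(samples, [alpha, 100 - alpha])
    return lo, hi

# Synthetic feature with a two-cluster partition (assumption).
rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
labels = np.repeat([0, 1], 100)

# 1. Confidence interval for the grand mean.
grand = bootstrap_means(feature, rng=rng)
grand_piv, grand_nonpiv = pivotal_ci(grand), non_pivotal_ci(grand)

# 2. Difference of within-cluster means between clusters 0 and 1.
c0 = bootstrap_means(feature[labels == 0], rng=rng)
c1 = bootstrap_means(feature[labels == 1], rng=rng)
diff_piv, diff_nonpiv = pivotal_ci(c0 - c1), non_pivotal_ci(c0 - c1)

# 3. Grand mean versus the mean of cluster 0. The two resamples are drawn
#    independently here, which is a simplification: the cluster is part
#    of the whole set.
gap_piv, gap_nonpiv = pivotal_ci(grand - c0), non_pivotal_ci(grand - c0)

print("grand mean:", grand_piv, grand_nonpiv)
print("cluster 0 - cluster 1:", diff_piv, diff_nonpiv)
print("grand - cluster 0:", gap_piv, gap_nonpiv)
```

In tasks 2 and 3, an interval that does not contain zero suggests the two means differ at the chosen confidence level.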

Contingency Table

Consider three nominal features (one of them, not more, may be taken from nominal features in your data).
Build two contingency tables over them: present a conditional frequency table and Quetelet relative index tables. Make comments on relations between categories of the common (to both tables) feature and two others.
Compute and visualize the chi-square summary Quetelet index over both tables. Comment on the meaning of the values in the data analysis context.
State how many observations would suffice to consider the features associated at the 95% confidence level and at the 99% confidence level.
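
The quantities above can be sketched for a single table as follows. The counts are invented for illustration. The sketch assumes the Quetelet relative index is q(l|k) = p_kl / (p_k+ · p_+l) − 1 and that the summary index is the sum of p_kl · q(l|k), which equals Pearson's chi-square divided by N. The critical values are the standard chi-square quantiles for 4 degrees of freedom, which is what a 3×3 table has.

```python
import numpy as np

# Hypothetical 3x3 co-occurrence counts for two nominal features.
counts = np.array([[30, 10, 10],
                   [10, 30, 10],
                   [10, 10, 30]], dtype=float)

N = counts.sum()
p = counts / N                         # joint relative frequencies p_kl
p_row = p.sum(axis=1, keepdims=True)   # marginals p_k+
p_col = p.sum(axis=0, keepdims=True)   # marginals p_+l

conditional = p / p_row                # conditional frequencies P(l | k)
quetelet = p / (p_row * p_col) - 1     # relative change of P(l) given k
contributions = p * quetelet           # per-cell terms of the summary index
phi2 = contributions.sum()             # chi-square summary index (chi2 / N)
chi2 = N * phi2

# Chi-square critical values for (3-1)*(3-1) = 4 degrees of freedom.
CRITICAL = {0.95: 9.488, 0.99: 13.277}
# Smallest N for which N * phi2 exceeds the critical value.
n_needed = {level: int(np.ceil(c / phi2)) for level, c in CRITICAL.items()}

print(np.round(conditional, 3))
print(np.round(quetelet, 3))
print(f"phi2={phi2:.3f}, chi2={chi2:.1f}, n_needed={n_needed}")
```

The `contributions` matrix is a natural thing to visualize (for example as a heat map), since its cells show which category pairs drive the association.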

PCA/SVD

In your data set, select a subset of 3-6 features related to the same aspect and explain your choice (may be the same subset that was used for k-means clustering).
Standardize the selected subset; compute its data scatter and SVD; determine the contributions of all the principal components to the data scatter, both in natural units and as percentages.
Compute and interpret a hidden ranking factor behind the selected features. The factor, like the features, should be expressed on a 0-100 rank scale (ranking normalization).
Visualize the data using the first two principal components, standardizing with two versions of normalization: (a) range normalization and (b) z-scoring. In these visualizations, use a distinct shape/color for the points of a group of objects that you specify in advance. Also, apply conventional PCA to find the first two principal components and visualize them; compare with the results for z-scoring. Comment on which of the normalizations is better and why.
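
A minimal numerical sketch of these steps, leaving out the plotting. The data are synthetic, and the helper names `standardize` and `svd_summary` are hypothetical. "Conventional PCA" is taken here to mean the eigendecomposition of the covariance matrix, and the ranking factor is assumed to be the first singular vector of the uncentered 0-100 data, rescaled so that its weights sum to one.

```python
import numpy as np

# Synthetic stand-in for 4 related features (assumption).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = base @ np.array([[1.0, 2.0, 0.5, 1.5]]) + 0.3 * rng.normal(size=(200, 4))

def standardize(X, how):
    """Center each feature, then divide by its range or standard deviation."""
    centered = X - X.mean(axis=0)
    scale = X.std(axis=0) if how == "zscore" else X.max(axis=0) - X.min(axis=0)
    return centered / scale

def svd_summary(Y):
    """SVD of Y plus the data scatter and each component's contribution."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    scatter = (Y ** 2).sum()
    contrib = s ** 2                       # natural contributions
    return U, s, Vt, scatter, contrib, 100 * contrib / scatter

Y_range = standardize(X, "range")
Y_z = standardize(X, "zscore")
U_r, s_r, _, scatter_r, contrib_r, pct_r = svd_summary(Y_range)
U_z, s_z, _, scatter_z, contrib_z, pct_z = svd_summary(Y_z)

# 2D coordinates for the scatter plots: first two principal components.
pc_range = U_r[:, :2] * s_r[:2]
pc_z = U_z[:, :2] * s_z[:2]

# Conventional PCA on the z-scored data via the covariance matrix.
evals, evecs = np.linalg.eigh(np.cov(Y_z, rowvar=False))
order = np.argsort(evals)[::-1]
pc_cov = Y_z @ evecs[:, order[:2]]

# Hidden ranking factor on a 0-100 scale (features rescaled, not centered).
X100 = 100 * (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
_, _, Vt_rank = np.linalg.svd(X100, full_matrices=False)
c = Vt_rank[0] * np.sign(Vt_rank[0].sum())   # fix the arbitrary sign
weights = c / c.sum()                        # loadings that sum to 1
rank_factor = X100 @ weights

print("contributions, % (range):", np.round(pct_r, 2))
print("contributions, % (z-score):", np.round(pct_z, 2))
print("ranking weights:", np.round(weights, 3))
```

Up to the sign of each axis, `pc_z` and `pc_cov` should coincide, which is the comparison the task asks for. The two coordinate arrays can be passed to any plotting library, with a boolean mask selecting the pre-specified group for a different marker.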

Correlation Coefficient

Find two features in your dataset with a more or less linear-looking scatter plot.
Display the scatter plot and comment on how suitable it is for building a linear regression.
Build a linear regression of one of the features over the other. Make a comment on the meaning of the slope.
Find the correlation and determinacy coefficients, and comment on the meaning of the latter.
Make a prediction of the target values for two or three given predictor values; make a comment.
Compare the mean relative absolute error of the regression over all points of your set with the determinacy coefficient, and comment.
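
The regression quantities above can be sketched as follows, using synthetic data in place of two real features. The "determinacy coefficient" is assumed to be the squared correlation coefficient, and the mean relative absolute error is assumed to be the average of |y − ŷ| / |y| expressed as a percentage; both should be checked against the lecture's definitions.

```python
import numpy as np

# Synthetic predictor and target with a roughly linear relation (assumption).
rng = np.random.default_rng(1)
x = rng.uniform(10, 50, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=4.0, size=200)

# Least-squares line y = intercept + slope * x.
slope, intercept = np.polyfit(x, y, 1)

# Correlation and determinacy (share of the variance of y explained by x).
r = np.corrcoef(x, y)[0, 1]
determinacy = r ** 2

# Predictions for three hypothetical predictor values.
x_new = np.array([15.0, 30.0, 45.0])
y_pred = intercept + slope * x_new

# Mean relative absolute error over all points, in percent.
residuals = y - (intercept + slope * x)
mrae = 100 * np.mean(np.abs(residuals / y))

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"r={r:.3f}, determinacy={determinacy:.3f}")
print("predictions:", np.round(y_pred, 2))
print(f"mean relative absolute error={mrae:.2f}%")
```

The slope is the average change in the target per unit change in the predictor. Note that `1 - determinacy` is a share of squared error relative to the variance of `y`, while `mrae` averages unsquared errors relative to `y` itself, so the two numbers are not directly interchangeable.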
