osda-assignment's Introduction

Lazy FCA

A big homework to merge Lazy Learning and Formal Concept Analysis. Created for the course Ordered Sets in Data Analysis (GitHub repo) taught in Data Science Master programme in HSE University, Moscow.

Task description

Intention of the homework

This homework serves as an entry point for students into the data science world.
So it introduces students to three main topics:

Typical machine learning project:
The pipeline of loading a dataset, feature engineering, designing a new predictive algorithm, results comparison;
Lazy learning:
Predicting labels for small or rapidly changing data;
Rule-learning (on part with FCA):
Viewing data as binary descriptions of objects (instead of points in a space of real numbers).

To-do list

Bare minimum (4 pts.)
- Find a dataset for binary classification:
  Ideal dataset would be: openly available, with various datatypes (numbers, categories, graphs, etc), with hundreds of rows;
- Describe scaling (binarization) strategy for the dataset features,
- Describe prediction quality measure best suited for the dataset
  (e.g. accuracy, f1 score, or any measure that you find reasonable),
- Adapt the pipeline from lazyfca.ipynb to your task.
Good solution (5-8 pts.):
- Improve the baseline algorithm:
  - Achieve better asymptotic time complexity;
  - Improve the runtime of examples comparison:
    Rewrite the intersections of sets into the intersection of the corresponding bitmasks. Would it make the algorithm faster?
    (numpy, bitarray, and other python packages may be of help);
  - Modify the algorithm to achieve better quality of predictions.
- Or simply design a new lazy algorithm to beat the baseline
  in terms of algorithmic time complexity, runtime, and prediction quality;
- Compare the proposed algorithm with popular rule-based models
  i.e. decision tree, random forest, XGBoost, CatBoost in terms of runtime and prediction quality.
Perfect solution (9-10 pts.)
- Incorporate pattern structures into the pipeline to avoid the scaling (binarization);
- Test the proposed algorithm on more datasets (at least 3 others)

Intro to baseline algorithm

Binary classification setup

This homework focuses on the task of binary classification. That is, we are given the data $X = \{x_1, x_2, x_3, ...\}$ and the corresponding labels $Y$ where each label $y \in Y$ is either True of False. The task is to make a "prediction" $\hat{y}$ of a label $y \in Y$ for each datum $x \in X$ as if $y$ is unknown.

To estimate the quality of predictions we split the data $X$ into two non-overlapping sets $X_{train}$ and $X_{test}$. Then we make make a prediction $\hat{y}$ for each test datum $x \in X_{test}$ based on the information obtained from the training data $X_{train}$. Finally, we measure how well predictions $\hat{y}$ represent the true test labels $y$.

The original data $X$ often comes in various forms: numbers, categories, texts, graphs, etc. So to make all these kinds of data look the same we scale (binarize) the dataset. Now we can say that there are a set of attributes $M$ where each attribute $m \in M$ is either describes $x$ or not. That is, we replace each $x \in X$ with its binary analogue $x_{bin} \subseteq M$.

Baseline algorithm

The following algorithm is used in the homework as a baseline algorithm for lazy FCA-based classification. And it is called "Generators framework".

Assume that we want to make a prediction for description $x \subseteq M$ given the set of training examples $X_{train} \subseteq 2^M$ and the labels $y_x \in \{False, True\}$ corresponding to each $x \in X_{train}$.

First, we split all examples $X_{train}$ to positive $X_{pos}$ and negative $X_{neg}$ examples: $$X_{pos} = \{x \in X_{train} \mid y_x \textrm{ is True} \}, \quad X_{neg} = X \setminus X_{pos}.$$

To classify the descriptions $x$ we follow the procedure:

Count the number of counterexamples for positive examples
For each positive example $x_{pos} \in X_{pos}$ we compute the intersection $x \cap x_{pos}$. Then, we count the counterexamples for this intersection, that is the number of negative examples $x_{neg} \in X_{neg}$ containing intersection $x \cap x_{pos}$;
Dually, count the number of counterexamples for negative examples
For each negative example $x_{neg} \in X_{neg}$ we compute the intersection $x \cap x_{neg}$. Then, we count the counterexamples for obtained intersection, that is the number of positive examples $x_{pos} \in X_{pos}$ containing intersection $x \cap x_{neg}$.

Finally, we compare the average number of counterexamples for positive and negative examples. We classify $x$ as being positive if the number of counterexamples for positive examples is smaller the one for negative examples.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

kartikj360 / osda-assignment Goto Github PK

osda-assignment's Introduction

Lazy FCA

Task description

Intention of the homework

To-do list

Intro to baseline algorithm

Binary classification setup

Baseline algorithm

osda-assignment's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent