
cluster-analysis's Introduction

Cluster Analysis

See the Wiki for more details.

About this repository

The data directory contains quite a few large files, so be sure to use a stable connection when checking out the entire repository.

There are four new datasets containing similarity scores between images in the Dresden image database. These sets are named after one of the cameras in their respective datasets: pentax, praktica, canon, and fuji.

To check out the canon and fuji sets you may need to first install Git Large File Storage (git lfs). You will also need bzip2 to unpack the files.

cluster-analysis's People

Contributors

anandgavai, arnikz, bakhshir, benvanwerkhoven, hannospreeuw, isazi, sonjageorgievska


cluster-analysis's Issues

Create a simple GPU implementation for the NCC computation

The NFI has already created a highly optimized GPU implementation, so this is not essential. But since we do not have their source code, it would be nice to have a working implementation of our own, in case we have to recompute some of the NCC scores or want to compute scores for bigger datasets. It is also a very nice exercise for anybody who wants to get more experience with GPU computing.
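As a starting point, here is a minimal CPU reference for the NCC of two equal-sized noise patterns in plain NumPy. This is a sketch of the standard zero-mean NCC definition, not the NFI's implementation; a GPU version would perform the same reduction with, e.g., CuPy or a custom kernel.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equal-sized patterns.

    Plain NumPy reference (standard zero-mean NCC); a GPU
    implementation would perform the same reduction in a kernel.
    """
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.random.rand(64, 64)
print(ncc(x, x))  # a pattern correlated with itself scores 1.0
```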

Investigate layered clustering, say first on camera brand, then model, then physical copies

This issue is basically a continuation of issue #10. First we need to get more experience with hierarchical clustering. Then we can see whether it is possible to create a final clustering in which we use multiple thresholds to differentiate between different physical copies of the same camera and cameras from different brands. The idea is that the inter-cluster distance will be much larger for cameras from different brands, and that we therefore cannot use a single threshold in a large dataset with many different cameras, some of which belong to the same brand and/or model.

See if we can somehow use t-SNE and other new clustering algorithms

The patterns we are comparing are very large, so it is much easier to have the clustering algorithms work only with the output of one of the similarity metrics, be it an edgelist or a matrix. If you know of other clustering algorithms that can work like that, they are certainly worth trying out.
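For reference, sklearn's t-SNE can consume a precomputed distance matrix directly, which fits the edgelist/matrix workflow described above. Note that t-SNE only embeds the points in a low-dimensional space; a clustering step would still have to run on its output. The matrix below is random, purely for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
sim = rng.random((10, 10))                # hypothetical similarity scores
dist = 1.0 - (sim + sim.T) / 2.0          # symmetrize and turn into distances
np.fill_diagonal(dist, 0.0)

# metric="precomputed" requires init="random" and a valid distance matrix
emb = TSNE(metric="precomputed", init="random", perplexity=5.0).fit_transform(dist)
print(emb.shape)  # (10, 2)
```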

Compute similarity scores for multiple rotations of the patterns

This is a feature that would be great to have. First, it will be important to also compare flipped patterns, to detect pictures that were taken with the camera upside down. Supporting other rotations would be nice as well, but significantly more work.
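A hypothetical helper for the flipped case: compare the second pattern in both its original and upside-down (180-degree) orientation and keep the better score. The score function here is a placeholder dot product, and extending `candidates` would cover further rotations.

```python
import numpy as np

def best_orientation_score(a, b, score=lambda x, y: float((x * y).sum())):
    """Score pattern b against a in its original and 180-degree-rotated
    orientation (camera held upside down) and return the better value."""
    candidates = [b, np.rot90(b, 2)]  # further rotations could be appended here
    return max(score(a, c) for c in candidates)

x = np.random.rand(8, 8)
print(best_orientation_score(x, np.rot90(x, 2)))  # matches its own rotation
```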

Do a qualitative comparison of the similarity metrics

We currently do not know which similarity metric is the 'best'. Perhaps it differs per dataset, or per clustering algorithm used afterwards. What are the benefits of using one metric over another? It would be very good if someone could look into this.

Investigate the applicability of heuristics to our problem

Currently we compute similarity scores between all the images in a dataset. This means that if there are N images in the set, we currently need to make (N over 2) comparisons. The idea is that using some heuristic, we can drastically reduce the required number of comparisons.
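To make the quadratic growth concrete (this is just the standard binomial coefficient): a heuristic that prunes the comparisons to roughly k candidates per image would cut the cost from (N over 2) to about N*k.

```python
from math import comb

# number of pairwise comparisons for N images: (N over 2) = N * (N - 1) / 2
for n in (100, 1_000, 10_000):
    print(n, comb(n, 2))
# 100 4950
# 1000 499500
# 10000 49995000
```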

Investigate if there is a 'best' way to convert similarity to distance for our similarity metrics

I've briefly tried a couple of different ways to convert our similarity metrics into distance metrics. We need this because most Python libraries for clustering expect a distance matrix rather than a similarity matrix. It would be great if someone could investigate why and how certain conversions work better than others, and which one is best for each metric. This is currently implemented in the function convert_similarity_to_distance() in camera_identification.py.
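Two candidate conversions are sketched below; they are purely illustrative and are not the code in convert_similarity_to_distance().

```python
import numpy as np

def to_distance(sim, method="max_minus"):
    """Illustrative similarity-to-distance conversions, not the
    repository's convert_similarity_to_distance() implementation."""
    sim = np.asarray(sim, dtype=float)
    if method == "max_minus":
        d = sim.max() - sim                      # linear shift, preserves ordering
    elif method == "reciprocal":
        d = 1.0 / (1.0 + np.maximum(sim, 0.0))   # compresses very large scores
    else:
        raise ValueError(method)
    np.fill_diagonal(d, 0.0)                     # a pattern is at distance 0 from itself
    return d

sim = np.array([[1.0, 0.5], [0.5, 1.0]])
print(to_distance(sim))
```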

Use hierarchical clustering from scipy.cluster.hierarchy to cluster the patterns

The program camera_identification.py currently uses scipy.cluster.hierarchy to create a dendrogram. A flat clustering is extracted from the dendrogram using a particular threshold. The problem is that the threshold to use is related to the expected distances between the clusters in the dataset. Ideally, we would like some way of dynamically determining the threshold. This could be a lot of work, as it may mean we have to implement our own algorithm for computing the linkage, in which case we would effectively be implementing our own hierarchical clustering algorithm.
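The linkage-plus-threshold step can be sketched as follows, on a toy 4x4 distance matrix (purely illustrative, not data from the repository):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# two tight pairs that are far apart
dist = np.array([[0.0, 0.1, 0.9, 0.9],
                 [0.1, 0.0, 0.9, 0.9],
                 [0.9, 0.9, 0.0, 0.1],
                 [0.9, 0.9, 0.1, 0.0]])

Z = linkage(squareform(dist), method="average")    # condensed form in, linkage out
labels = fcluster(Z, t=0.5, criterion="distance")  # the threshold we want to automate
print(labels)  # e.g. [1 1 2 2]
```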

Create a command-line Python tool for clustering

After discussing with @bakhshir and @arnikz, we decided that it is a good idea to start building a command-line tool for clustering data based on an edgelist or distance matrix. It would be nice to leave the exploratory phase and work towards a tool that can serve as an end product of the sprint sessions. For the time being, let's call the tool 'clustit'.

The current idea is that clustit could be used as follows:
./clustit <name of data file> <name of clustering algorithm> [optional parameters]

A tool like this should also make it easier to experiment with different clustering algorithms. Algorithms to support are hierarchical clustering (from scipy.cluster.hierarchy) and flat clustering algorithms (from sklearn).

Data files to support are edgelist files and distance matrices. If you pass a distance matrix, it should be possible to also supply a list of names.

What the output of the tool will look like is not entirely set in stone yet; perhaps a list of names with an ID for the cluster to which each belongs, or a list of clusters with the names of all nodes belonging to each cluster, something like that.
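A hypothetical argparse sketch of that interface is shown below; the option names and algorithm choices are illustrative, not an agreed design.

```python
import argparse

def build_parser():
    """Hypothetical sketch of the proposed 'clustit' command-line interface."""
    p = argparse.ArgumentParser(prog="clustit",
                                description="Cluster an edgelist or distance matrix.")
    p.add_argument("datafile", help="edgelist or distance-matrix file")
    p.add_argument("algorithm", choices=["hierarchical", "dbscan", "kmeans"],
                   help="clustering algorithm to apply")
    p.add_argument("--names", help="file with node names, for distance-matrix input")
    p.add_argument("--threshold", type=float,
                   help="flat-clustering threshold (hierarchical only)")
    return p

args = build_parser().parse_args(["scores.edgelist", "hierarchical", "--threshold", "0.5"])
print(args.algorithm, args.threshold)  # hierarchical 0.5
```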

Use flat clustering algorithms from sklearn to cluster the patterns

There are many clustering algorithms in sklearn; we need to investigate which of these we can use for our application. Then we can also compare the quality of the flat clusterings produced by the different algorithms. This is a lot of work, because each clustering algorithm has different parameters that need to be tuned to our problem to get good results.
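As one example, sklearn's DBSCAN accepts a precomputed distance matrix, which fits our matrix-based workflow; the toy matrix below is illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# toy distance matrix: two tight pairs that are far apart
dist = np.array([[0.0, 0.1, 0.9, 0.9],
                 [0.1, 0.0, 0.9, 0.9],
                 [0.9, 0.9, 0.0, 0.1],
                 [0.9, 0.9, 0.1, 0.0]])

labels = DBSCAN(eps=0.2, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)  # two clusters: [0 0 1 1]
```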

See if we can do without FFTs in the PCE computation

I suspect that the FFTs used within the PCE computation are actually unnecessary. Currently the patterns are only real-valued, and I think it may be possible to compute the cross correlation of two real-valued signals without using FFTs, but I would like someone with more knowledge on the subject to look into this.
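As a sanity check on the equivalence (though not on the cost): scipy.signal.correlate can compute the same cross correlation with and without FFTs. The direct method is O(n^2) versus O(n log n) for the FFT route, so whether dropping the FFTs pays off likely depends on the pattern size.

```python
import numpy as np
from scipy.signal import correlate

rng = np.random.default_rng(1)
a, b = rng.random(256), rng.random(256)

direct = correlate(a, b, mode="full", method="direct")  # no FFT, O(n^2)
viafft = correlate(a, b, mode="full", method="fft")     # FFT-based, O(n log n)
print(np.allclose(direct, viafft))  # True
```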

Find out what the best metric is to evaluate the clusterings that we create

At the end of camera_identification.py we currently print a couple of metrics to evaluate the flat clustering produced by the program. We need to investigate which metrics are actually best for evaluating the clusterings produced by our application. Currently, we print the adjusted Rand score, adjusted mutual information score, and the homogeneity, completeness, and V-measure scores using the metrics module of sklearn. Given the application, I also think a low false positive rate is very important.
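For reference, those scores come straight from sklearn.metrics and can be exercised on toy labels; the labels below are made up for illustration.

```python
from sklearn import metrics

truth = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth camera labels
pred  = [0, 0, 1, 2, 2, 2]   # one image placed in the wrong cluster

print(metrics.adjusted_rand_score(truth, pred))
print(metrics.adjusted_mutual_info_score(truth, pred))
print(metrics.homogeneity_completeness_v_measure(truth, pred))
```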
