
cluster-analysis's Introduction

Cluster Analysis

See the Wiki for more details.

About this repository

The data directory contains quite a few large files, so be sure to use a stable connection when checking out the entire repository.

There are four new datasets containing similarity scores between images in the Dresden image database. These sets are named after one of the cameras in their respective datasets: pentax, praktica, canon, and fuji.

To check out the canon and fuji sets you may need to first install Git Large File Storage (git lfs). You will also need bzip2 to unpack the files.

cluster-analysis's People

Contributors

anandgavai, arnikz, bakhshir, benvanwerkhoven, hannospreeuw, isazi, sonjageorgievska


cluster-analysis's Issues

Create a simple GPU implementation for the NCC computation

The NFI has already created a highly optimized GPU implementation, so this is not essential. But since we do not have their source code, it would be nice to have a working implementation of our own, in case we have to recompute some of the NCC scores or want to compute scores for bigger datasets. It is also a very nice exercise for anybody who wants to get more experience with GPU computing.
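As a starting point, here is a minimal CPU reference for the NCC of two equal-sized noise patterns in plain NumPy. This is a sketch of the standard zero-mean NCC definition, not the NFI's implementation; a GPU version would perform the same reduction with, e.g., CuPy or a custom kernel.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equal-sized patterns.

    Plain NumPy reference (standard zero-mean NCC); a GPU
    implementation would perform the same reduction in a kernel.
    """
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.random.rand(64, 64)
print(ncc(x, x))  # a pattern correlated with itself scores 1.0
```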

Investigate layered clustering, say first on camera brand, then model, then physical copies

This issue is basically a continuation of issue #10. First we need to get more experience with hierarchical clustering. Then we can see whether it is possible to create a final clustering in which we use multiple thresholds to differentiate between different physical copies of the same camera and cameras from different brands. The idea is that the inter-cluster distance will be much larger for cameras from different brands, and that we therefore cannot use a single threshold in a large dataset with many different cameras, some of which belong to the same brand and/or model.

See if we can somehow use t-SNE and other new clustering algorithms

The patterns we are comparing are very large, so it is much easier to have the clustering algorithms work only with the output of one of the similarity metrics, be it an edgelist or a matrix. If you know of other clustering algorithms that can work like that, they are certainly worth trying out.
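For reference, sklearn's t-SNE can consume a precomputed distance matrix directly, which fits the edgelist/matrix workflow described above. Note that t-SNE only embeds the points in a low-dimensional space; a clustering step would still have to run on its output. The matrix below is random, purely for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
sim = rng.random((10, 10))                # hypothetical similarity scores
dist = 1.0 - (sim + sim.T) / 2.0          # symmetrize and turn into distances
np.fill_diagonal(dist, 0.0)

# metric="precomputed" requires init="random" and a valid distance matrix
emb = TSNE(metric="precomputed", init="random", perplexity=5.0).fit_transform(dist)
print(emb.shape)  # (10, 2)
```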

Compute similarity scores for multiple rotations of the patterns

This is a feature that would be great to have. First, it will be important to also compare flipped patterns, to detect pictures that were taken with the camera upside down. Supporting other rotations would be nice as well, but significantly more work.
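A hypothetical helper for the flipped case: compare the second pattern in both its original and upside-down (180-degree) orientation and keep the better score. The score function here is a placeholder dot product, and extending `candidates` would cover further rotations.

```python
import numpy as np

def best_orientation_score(a, b, score=lambda x, y: float((x * y).sum())):
    """Score pattern b against a in its original and 180-degree-rotated
    orientation (camera held upside down) and return the better value."""
    candidates = [b, np.rot90(b, 2)]  # further rotations could be appended here
    return max(score(a, c) for c in candidates)

x = np.random.rand(8, 8)
print(best_orientation_score(x, np.rot90(x, 2)))  # matches its own rotation
```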

Do a qualitative comparison of the similarity metrics

We currently do not know which similarity metric is the 'best'. Perhaps it differs per dataset, or per clustering algorithm used afterwards. What are the benefits of using one metric over another? It would be very good if someone could look into this.

Investigate the applicability of heuristics to our problem

Currently we compute similarity scores between all the images in a dataset. This means that if there are N images in the set, we currently need to make (N over 2) comparisons. The idea is that using some heuristic, we can drastically reduce the required number of comparisons.
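To make the quadratic growth concrete (this is just the standard binomial coefficient): a heuristic that prunes the comparisons to roughly k candidates per image would cut the cost from (N over 2) to about N*k.

```python
from math import comb

# number of pairwise comparisons for N images: (N over 2) = N * (N - 1) / 2
for n in (100, 1_000, 10_000):
    print(n, comb(n, 2))
# 100 4950
# 1000 499500
# 10000 49995000
```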

Investigate if there is a 'best' way to convert similarity to distance for our similarity metrics

I've briefly tried a couple of different ways to convert our similarity metrics into distance metrics. We need this because most Python libraries for clustering expect a distance matrix rather than a similarity matrix. It would be great if someone could investigate why and how certain conversions work better than others, and which one is best for each metric. This is currently implemented in the function convert_similarity_to_distance() in camera_identification.py.
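Two candidate conversions are sketched below; they are purely illustrative and are not the code in convert_similarity_to_distance().

```python
import numpy as np

def to_distance(sim, method="max_minus"):
    """Illustrative similarity-to-distance conversions, not the
    repository's convert_similarity_to_distance() implementation."""
    sim = np.asarray(sim, dtype=float)
    if method == "max_minus":
        d = sim.max() - sim                      # linear shift, preserves ordering
    elif method == "reciprocal":
        d = 1.0 / (1.0 + np.maximum(sim, 0.0))   # compresses very large scores
    else:
        raise ValueError(method)
    np.fill_diagonal(d, 0.0)                     # a pattern is at distance 0 from itself
    return d

sim = np.array([[1.0, 0.5], [0.5, 1.0]])
print(to_distance(sim))
```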

Use hierarchical clustering from scipy.cluster.hierarchy to cluster the patterns

The program camera_identification.py currently uses scipy.cluster.hierarchy to create a dendrogram. A flat clustering is extracted from the dendrogram using a particular threshold. The problem is that the threshold to use is related to the expected distances between the clusters in the dataset. Ideally, we would like some way of dynamically determining the threshold. This could be a lot of work, as it may mean we have to implement our own algorithm for computing the linkage, in which case we would effectively be implementing our own hierarchical clustering algorithm.
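The linkage-plus-threshold step can be sketched as follows, on a toy 4x4 distance matrix (purely illustrative, not data from the repository):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# two tight pairs that are far apart
dist = np.array([[0.0, 0.1, 0.9, 0.9],
                 [0.1, 0.0, 0.9, 0.9],
                 [0.9, 0.9, 0.0, 0.1],
                 [0.9, 0.9, 0.1, 0.0]])

Z = linkage(squareform(dist), method="average")    # condensed form in, linkage out
labels = fcluster(Z, t=0.5, criterion="distance")  # the threshold we want to automate
print(labels)  # e.g. [1 1 2 2]
```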

Create a command-line Python tool for clustering

After discussing with @bakhshir and @arnikz, we decided that it is a good idea to start building a command-line tool for clustering data based on an edgelist or distance matrix. It would be nice to leave the exploratory phase and work towards a tool that can serve as an end product of the sprint sessions. For the time being, let's call the tool 'clustit'.

The current idea is that clustit could be used as follows:
./clustit <name of data file> <name of clustering algorithm> [optional parameters]

A tool like this should also make it easier to experiment with different clustering algorithms. Algorithms to support are hierarchical clustering (from scipy.cluster.hierarchy) and flat clustering algorithms (from sklearn).

Data files to support are edgelist files and distance matrices. If you pass a distance matrix, it should be possible to also supply a list of names.

What the output of the tool will look like is not entirely set in stone yet; perhaps a list of names with an ID for the cluster to which each belongs, or a list of clusters with the names of all nodes belonging to each cluster, something like that.
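A hypothetical argparse sketch of that interface is shown below; the option names and algorithm choices are illustrative, not an agreed design.

```python
import argparse

def build_parser():
    """Hypothetical sketch of the proposed 'clustit' command-line interface."""
    p = argparse.ArgumentParser(prog="clustit",
                                description="Cluster an edgelist or distance matrix.")
    p.add_argument("datafile", help="edgelist or distance-matrix file")
    p.add_argument("algorithm", choices=["hierarchical", "dbscan", "kmeans"],
                   help="clustering algorithm to apply")
    p.add_argument("--names", help="file with node names, for distance-matrix input")
    p.add_argument("--threshold", type=float,
                   help="flat-clustering threshold (hierarchical only)")
    return p

args = build_parser().parse_args(["scores.edgelist", "hierarchical", "--threshold", "0.5"])
print(args.algorithm, args.threshold)  # hierarchical 0.5
```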

Use flat clustering algorithms from sklearn to cluster the patterns

There are many clustering algorithms in sklearn; we need to investigate which of these we can use for our application. Then we can also compare the quality of the flat clusterings produced by the different algorithms. This is a lot of work, because each clustering algorithm has different parameters that need to be tuned to our problem to get good results.
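As one example, sklearn's DBSCAN accepts a precomputed distance matrix, which fits our matrix-based workflow; the toy matrix below is illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# toy distance matrix: two tight pairs that are far apart
dist = np.array([[0.0, 0.1, 0.9, 0.9],
                 [0.1, 0.0, 0.9, 0.9],
                 [0.9, 0.9, 0.0, 0.1],
                 [0.9, 0.9, 0.1, 0.0]])

labels = DBSCAN(eps=0.2, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)  # two clusters: [0 0 1 1]
```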

See if we can do without FFTs in the PCE computation

I suspect that the FFTs used within the PCE computation are actually unnecessary. Currently the patterns are only real-valued, and I think it may be possible to compute the cross correlation of two real-valued signals without using FFTs, but I would like someone with more knowledge on the subject to look into this.
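As a sanity check on the equivalence (though not on the cost): scipy.signal.correlate can compute the same cross correlation with and without FFTs. The direct method is O(n^2) versus O(n log n) for the FFT route, so whether dropping the FFTs pays off likely depends on the pattern size.

```python
import numpy as np
from scipy.signal import correlate

rng = np.random.default_rng(1)
a, b = rng.random(256), rng.random(256)

direct = correlate(a, b, mode="full", method="direct")  # no FFT, O(n^2)
viafft = correlate(a, b, mode="full", method="fft")     # FFT-based, O(n log n)
print(np.allclose(direct, viafft))  # True
```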

Find out what the best metric is to evaluate the clusterings that we create

At the end of camera_identification.py we currently print a couple of metrics to evaluate the flat clustering produced by the program. We need to investigate which metrics are actually best for evaluating the clusterings produced by our application. Currently, we print the adjusted Rand score, adjusted mutual information score, and the homogeneity, completeness, and V-measure scores using the metrics module of sklearn. Given the application, I also think a low false positive rate is very important.
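For reference, those scores come straight from sklearn.metrics and can be exercised on toy labels; the labels below are made up for illustration.

```python
from sklearn import metrics

truth = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth camera labels
pred  = [0, 0, 1, 2, 2, 2]   # one image placed in the wrong cluster

print(metrics.adjusted_rand_score(truth, pred))
print(metrics.adjusted_mutual_info_score(truth, pred))
print(metrics.homogeneity_completeness_v_measure(truth, pred))
```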
