
dcc's Introduction

Deep Continuous Clustering

Introduction

This is a PyTorch implementation of the DCC algorithm presented in the following paper:

Sohil Atul Shah and Vladlen Koltun. Deep Continuous Clustering.

If you use this code in your research, please cite our paper.

@article{shah2018DCC,
	author    = {Sohil Atul Shah and Vladlen Koltun},
	title     = {Deep Continuous Clustering},
	journal   = {arXiv:1803.01449},
	year      = {2018},
}

The source code and dataset are published under the MIT license. See LICENSE for details. In general, you can use the code for any purpose with proper attribution. If you do something interesting with the code, we'll be happy to know. Feel free to contact us.

Requirements

Pretraining SDAE

Note: Please find the required files and checkpoints for the MNIST dataset shared here.

Please create a new folder for each dataset under the data folder, following the structure of the mnist dataset. The training and validation data for each dataset must be placed under its respective folder.
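For a hypothetical new dataset named mydataset, the expected layout would look roughly like this (a sketch based on the MNIST example described below):

data/
├── mnist/
│   ├── traindata.mat
│   └── testdata.mat
└── mydataset/
    ├── traindata.mat
    └── testdata.mat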

We have already provided the train and test data files for the MNIST dataset. For example, one can start pretraining the SDAE from the console as follows:

$ python pretraining.py --data mnist --tensorboard --id 1 --niter 50000 --lr 10 --step 20000

Different settings for the total number of iterations, learning rate, and step size may be required for other datasets. Please find the details in the comment section inside the pretraining file.

Extracting Pretrained Features

The features from the pretrained SDAE network are extracted as follows:

$ python extract_feature.py --data mnist --net checkpoint_4.pth.tar --features pretrained

By default, the model checkpoint for the pretrained SDAE network is stored under results.

Copying mkNN graph

The copyGraph program is used to merge the preprocessed mkNN graph (built using the code provided by RCC) and the extracted pretrained features. Note that the mkNN graph is built on the original data, not on the SDAE features.

$ python copyGraph.py --data mnist --graph pretrained.mat --features pretrained.pkl --out pretrained

The above command assumes that the graph is stored in the pretrained.mat file and writes the merged output back to pretrained.mat.

DCC looks for a file named pretrained.mat, so please retain this name.

Running Deep Continuous Clustering

Once the features are extracted and the graph details merged, one can start training the DCC algorithm.

As a sanity check, we have also provided a pretrained.mat file and SDAE model files for the MNIST dataset, located under the data folder. For example, one can run DCC on MNIST from the console as follows:

$ python DCC.py --data mnist --net checkpoint_4.pth.tar --tensorboard --id 1

The other preprocessed graph files can be found in the Google Drive folder provided by RCC.

Evaluation

Towards the end of the DCC run, i.e., once the stopping criterion is met, DCC evaluates the cluster assignment for the entire dataset. The evaluation output is logged to TensorBoard. The penultimate evaluated output is the one reported in the paper.

Like RCC, the AMI definition followed here differs slightly from the default definition found in the sklearn package. To match the results listed in the paper, please modify it accordingly.
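For reference, here is a minimal sketch of computing AMI with sklearn. The exact AMI variant used by RCC/DCC is not spelled out here, so the average_method choice below is an assumption to verify against the paper:

from sklearn.metrics import adjusted_mutual_info_score

# toy labels for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

ami_default = adjusted_mutual_info_score(y_true, y_pred)                    # sklearn default normalization
ami_max = adjusted_mutual_info_score(y_true, y_pred, average_method='max')  # alternative 'max' normalization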

The tensorboard logs for both pretraining and DCC are stored under results in the "runs" folder (see runs/pretraining and runs/DCC in the example below). The final embedded features 'U' and the cluster assignment for each sample are saved in the 'features.mat' file under results.
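A minimal sketch for inspecting that output, assuming the results folder sits under the dataset folder as in the MNIST example; the cluster-assignment key name below ('cluster_id') is hypothetical, so list the keys first:

import scipy.io as sio

out = sio.loadmat('data/mnist/results/features.mat')   # assumed location of the saved output
print(out.keys())                                      # inspect the stored variable names
U = out['U']                                           # final embedded features, one row per sample
# labels = out['cluster_id']                           # hypothetical key; replace with the actual assignment key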

Creating input

The input files for SDAE pretraining, traindata.mat and testdata.mat, store the features of the N data samples as an N x D matrix. We followed a 4:1 ratio to split the training and validation data. The provided make_data.py can be used to build the training and validation data. The distinction between training and validation sets is used only during the pretraining stage. End-to-end training is unsupervised, so no such distinction exists there and all of the data is used.
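A minimal sketch of producing such files for a hypothetical dataset 'mydataset' (make_data.py is the provided tool; the 'X'/'Y' key names below are assumptions to check against custom_data.py):

import numpy as np
import scipy.io as sio

X = np.random.rand(1000, 50).astype(np.float32)   # N x D feature matrix
Y = np.random.randint(0, 5, size=1000)            # labels, used only for evaluation

split = int(0.8 * len(X))                         # 4:1 train/validation split
sio.savemat('data/mydataset/traindata.mat', {'X': X[:split], 'Y': Y[:split]})
sio.savemat('data/mydataset/testdata.mat', {'X': X[split:], 'Y': Y[split:]})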

To construct the mkNN edge set and to create the preprocessed input file, pretrained.mat, from the raw feature file, use edgeConstruction.py released by RCC. Please follow the instructions therein. Note that the mkNN graph is built on the complete dataset. For simplicity, the code (after the pretraining phase) arranges the data in the order [trainset, testset]. The mkNN construction should be consistent with this ordering.
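A minimal sketch of keeping that ordering when assembling the full feature matrix for graph construction (same hypothetical file names and key names as in the sketch above):

import numpy as np
import scipy.io as sio

train = sio.loadmat('data/mydataset/traindata.mat')
test = sio.loadmat('data/mydataset/testdata.mat')
full = np.concatenate([train['X'], test['X']], axis=0)   # [trainset, testset] order, matching what the DCC code expects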

Understanding Steps Through Visual Example

Generate 2D clustered data with

python make_data.py --data easy

This creates 3 clusters whose centers are collinear. We would then expect to need only a 1-dimensional latent space (either x or y) to uniquely project the data onto the line passing through the cluster centers.

(Figure: generated ground-truth clusters.)
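A minimal sketch of the kind of data this step presumably generates: three 2D Gaussian blobs with collinear centers, 600 samples in total (matching --samples 600 below):

import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[-4.0, -4.0], [0.0, 0.0], [4.0, 4.0]])   # centers collinear along y = x
X = np.concatenate([c + 0.3 * rng.standard_normal((200, 2)) for c in centers])
y = np.repeat(np.arange(3), 200)                             # 600 samples, 3 clusters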

Construct mKNN graph with

python edgeConstruction.py --dataset easy --samples 600

Pretrain SDAE with

python pretraining.py --data easy --tensorboard --id 1 --niter 500 --dim 1 --lr 0.0001 --step 300

You can debug the pretraining losses using tensorboard (needs tensorflow) with

tensorboard --logdir data/easy/results/runs/pretraining/1/

Then navigate to the http link that is logged to the console.

Extract pretrained features

python extract_feature.py --data easy --net checkpoint_2.pth.tar --features pretrained --dim 1

Merge preprocessed mkNN graph and the pretrained features with

python copyGraph.py --data easy --graph pretrained.mat --features pretrained.pkl --out pretrained

Run DCC with

python DCC.py --data easy --net checkpoint_2.pth.tar --tensorboard --id 1 --dim 1

Debug and show how the representatives shift over epochs with

tensorboard --logdir data/easy/results/runs/DCC/1/ --samples_per_plugin images=100

Pretraining and DCC together in one script

See easy_example.py for the previous easy-to-visualize example with all steps done in one script. Execute the script to perform the whole previous section at once. You can visualize the results, such as how the representatives drift over iterations, with the tensorboard command above by navigating to the Images tab.

With an autoencoder, the representatives shift over epochs. (Figure: shift with autoencoder.)

dcc's People

Contributors

ilyak93, lemonpi, shahsohil


dcc's Issues

Two questions

Hi,
I have a couple of questions:

1. In config.py:

Fraction of "change in label assignment of pairs" to be considered for stopping criterion - 1% of pairs
__C.STOPPING_CRITERION = 0.001

Shouldn't this be 0.01?

2. Is the lambda parameter the λ coefficient in the paper?

Thank you

Stopping threshold bug?

The paper says we stop when change in assignment is below the stopping threshold, but the code implements this as:

            if change_in_assign > stopping_threshold:
                flag += 1
            if flag == 4:
                break

This is a bug, right? It should be if change_in_assign < stopping_threshold:
I'll fix this in my pull request if so.

DCC example fails with matlab error

Running python DCC.py --data mnist --net checkpoint_4.pth.tar --tensorboard --id 1 results in this error:

Traceback (most recent call last):
  File "DCC.py", line 335, in <module>
    main(args)
  File "DCC.py", line 85, in main
    trainset = DCCPT_data(root=datadir, train=True, h5=args.h5)
  File "/Users/charlieminns/Desktop/DCC-master/pytorch/custom_data.py", line 19, in __init__
    data = sio.loadmat(osp.join(root, 'traindata.mat'), mat_dtype=True)
  File "/Users/charlieminns/miniconda3/envs/py37/lib/python3.7/site-packages/scipy/io/matlab/mio.py", line 217, in loadmat
    MR, _ = mat_reader_factory(f, **kwargs)
  File "/Users/charlieminns/miniconda3/envs/py37/lib/python3.7/site-packages/scipy/io/matlab/mio.py", line 72, in mat_reader_factory
    mjv, mnv = get_matfile_version(byte_stream)
  File "/Users/charlieminns/miniconda3/envs/py37/lib/python3.7/site-packages/scipy/io/matlab/miobase.py", line 241, in get_matfile_version
    raise ValueError('Unknown mat file type, version %s, %s' % ret)
ValueError: Unknown mat file type, version 51, 55

Please advise.

download error

$ git clone https://github.com/shahsohil/DCC.git
Cloning into 'DCC'...
remote: Enumerating objects: 76, done.
remote: Total 76 (delta 0), reused 0 (delta 0), pack-reused 76
Unpacking objects: 100% (76/76), done.
Downloading data/mnist/pretrained.mat (227 MB)
Error downloading object: data/mnist/pretrained.mat (bb0b757): Smudge error: Error downloading data/mnist/pretrained.mat (bb0b757ef4918b0f218c0e8b7d530c613ca60d5d1d0a81c1c1c33a34642fa057): batch response: This repository is over its data quota. Purchase more data packs to restore access.

Errors logged to R:\GitHub\DCC.git\lfs\logs\20190225T132833.506642.log
Use git lfs logs last to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: data/mnist/pretrained.mat: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

Does anyone know how to solve this?

Read/write issues for h5 files

Lines 32-34 of copyGraph.py are such that, if you use the command given in the README, your featurefile and outputfile will have the same name. But then you're reading from the same file with data0 that you're writing to with data2, so h5py throws an error:
"OSError: Unable to create file (unable to truncate a file which is already open)"

On a related note, was there any particularly compelling reason for storing the reuters dataset as an h5 in make_data.py rather than just using scipy.io.savemat as you did with the other datasets?

Put another way, why not just put everything in a dictionary (e.g. data['X'] is a numpy array with the data features and data['Y'] is a numpy array with the labels) and then pickle it?
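A minimal sketch of what this issue proposes, with key names taken from the issue's own example and a hypothetical file name:

import pickle
import numpy as np

data = {'X': np.random.rand(100, 20),             # feature matrix
        'Y': np.random.randint(0, 4, size=100)}   # labels

with open('mydata.pkl', 'wb') as f:               # hypothetical file name
    pickle.dump(data, f)

with open('mydata.pkl', 'rb') as f:
    data = pickle.load(f)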

Too many clusters after DCC

Hey @shahsohil, I'm trying to apply DCC to my own image dataset, and the results have too many clusters (1000 clusters for 3000 images). Is there any way I can force the algorithm to produce fewer clusters (say 5-10)? Could you please give me some suggestions? Thanks.

Error: checkpoint_2.pth.tar not created in easy/results/runs/; only checkpoint_0.pth.tar and checkpoint_1.pth.tar were created

1- !python2 make_data.py --data easy
2- !python2 edgeConstruction.py --dataset easy --samples 600
3- !python2 pretraining.py --data easy --tensorboard --id 1 --niter 500 --dim 1 --lr 0.0001 --step 300
4- !python2 extract_feature.py --data easy --net checkpoint_2.pth.tar --features pretrained --dim 1

Loaded easy dataset for finetuning
The endpoints are Delta1: 0.000, Delta2: 0.005
==> no checkpoint found at '/data/easy/results/checkpoint_2.pth.tar'
Traceback (most recent call last):
File "DCC.py", line 336, in
main(args)
File "DCC.py", line 120, in main
load_weights(args, outputdir, net)
File "DCC.py", line 221, in load_weights
raise ValueError
ValueError

Cannot reproduce results except for MNIST

I was successful in replicating the results using the provided MNIST dataset; however, I faced challenges reproducing the outcomes on other datasets with the default parameters. Specifically, I used the default autoencoder structure from the code at https://github.com/shahsohil/DCC and followed the same training process as the GitHub tutorial shows. I used the same pretraining commands for the other datasets as for MNIST, but the results were not as expected. Here are the AMI results I got (ours first, then the paper's): YTF: 0.69 | 0.88; Yale: 0.11 | 0.96; reuters: 0.02 | 0.57; RCV1: 0.04 | 0.50.
What were the training parameters used in your experiments?

Empty data folder on clone or download as zip

The MNIST data after cloning or downloading consists of empty files: each file is only around 130 bytes and its content is actually text along the lines of:

version https://git-lfs.github.com/spec/v1
oid sha256:e60446c5fac6df3e3f37769ca5b51669a2da7d6a3a6abc9fc9a8cc2b4244a18d
size 26650347

So it's actually a Git LFS pointer file instead of the actual file.
It would be best to provide an alternative source for the whole MNIST dataset, including the checkpoints, in addition to the .mat files that already exist.

For any future searches that encounter an error like

    magic_number = pickle_module.load(f)
_pickle.UnpicklingError: invalid load key, 'v'.

This is because the checkpoint is actually empty...

Ambiguity surrounding pretrained.mat

It's unclear to me from the documentation how pretrained.mat is supposed to be generated. pretraining.py takes in data/mydataset/traindata.mat and data/mydataset/testdata.mat and spits out data/mydataset/results/checkpoint_4.pth.tar, such that when extract_feature.py takes in checkpoint_4.pth.tar it spits out a matrix of n_train + n_test rows. But are we then supposed to run RCC's edgeConstruction module on traindata.mat, testdata.mat, or a combination of the two in order to produce pretrained.mat? If we do it on just one of them and then feed the resulting graph into copyGraph.py, it throws a shape mismatch error...

Regarding error when running (DCC.py)

Hello,

Thank you for this project.

When running DCC.py, I get the following run-time error:

builtins.IndexError: tensors used as indices must be long, byte or bool tensors

Can you help me with it?

Clustering results in one cluster with 99.99% of the data

Hi.
I tried a lot of different hyper-parameter settings and all of the data preprocessing suggested in the previously closed issues, but didn't manage to resolve this. The results ranged from a large number of clusters, always with one dominant cluster containing almost all of the data, to runs where the other clusters were singletons or held just a few examples.
With some hyper-parameters I got many almost-empty clusters besides the dominant one, and with others just a few, so only the number of clusters changed; there was always one dominant cluster alongside the nearly empty ones.
I also tried preprocessing not initially included in the code, such as a standard scaler, as well as all of the preprocessing methods mentioned in the code, applied at the "make_data" step.
My data is temporal; I tried both architectures, with a small tweak to the convolutional one to make it 1D.
(Attached: data heat maps before and after normalization.)

I suspect these architectures are not useful for this data. Otherwise I have no explanation except that the data isn't separable, which would be strange, because even simple dimensionality-reduction techniques such as PCA, plotted with t-SNE, show that there are some clusters. It's a really hard issue, given that I have tried everything except entirely new architectures.

Cannot read the MNIST data

I have tried to load the MNIST mat files and got:

Traceback (most recent call last):
  File "pretraining.py", line 226, in <module>
    main()
  File "pretraining.py", line 72, in main
    trainset = DCCPT_data(root=datadir, train=True, h5=args.h5)
  File "/home/pengx/workspace/DCC/pytorch/custom_data.py", line 19, in __init__
    data = sio.loadmat(osp.join(root, 'traindata.mat'), mat_dtype=True)
  File "/home/pengx/anaconda3/envs/python27/lib/python2.7/site-packages/scipy/io/matlab/mio.py", line 141, in loadmat
    MR, file_opened = mat_reader_factory(file_name, appendmat, **kwargs)
  File "/home/pengx/anaconda3/envs/python27/lib/python2.7/site-packages/scipy/io/matlab/mio.py", line 65, in mat_reader_factory
    mjv, mnv = get_matfile_version(byte_stream)
  File "/home/pengx/anaconda3/envs/python27/lib/python2.7/site-packages/scipy/io/matlab/miobase.py", line 241, in get_matfile_version
    raise ValueError('Unknown mat file type, version %s, %s' % ret)
ValueError: Unknown mat file type, version 51, 55

I am using Python 2.7.15 and scipy 1.0.0.
I also tried to load it in MATLAB R2016a and got:

Unable to read MAT-file /Users/killandy/Code/DCC/data/mnist/traindata.mat. Not a binary MAT-file. Try load -ASCII to read as
text.

I've checked issue #2, but it doesn't seem to apply to my case.

Thanks.

Gradients of U with respect to F (feature map)

Hey @shahsohil, can you clarify how I could use the DCC output (the Z or U representatives) to get a gradient of some future loss function L(Z) with respect to my feature transform parameters F(X|theta)? (I'm using Z in the diagram to match the notation in the paper, but I'm actually using U.)

My current understanding of the data flow is summarized by the flowchart below. The dashed arrows are routes along which the gradient can back-propagate. The green boxes hold parameters that require gradients in the PyTorch sense. The red dashed line for dF/dX means the gradient theoretically exists but the current implementation does not allow for it. Gradient with respect to the feature transform means with respect to the parameters of the feature transform (d/dF means d/dtheta).
(Figure: data-flow diagram.)

After DCC I have representatives U that I then use in some later steps of the pipeline. I can get a gradient wrt U, but from the flowchart above there doesn't seem to be any way of propagating that back to F. The whole point of the pipeline is to learn the parameters for F, so the current architecture doesn't seem to work. One way to address this is to bring the later processes using U inside of the DCC loop as terms in the cost function. Do you have any ideas (and is my interpretation of the data flow wrong)?

IndexError in easy_example.py

Running easy_example.py without any changes results in an IndexError at line 74 in DCCComputation.py (error below). This seems to arise because the largest epsilon is greater than NOISE_THRESHOLD but smaller than sqrt(DIM) * NOISE_THRESHOLD.

epsilon = epsilon[np.where(epsilon / np.sqrt(cfg.DIM) > cfg.RCC.NOISE_THRESHOLD)]

pytorch version: 1.3.0.dev20190819
numpy version: 1.16.4
scipy version: 1.2.1

Loaded `easy` dataset for finetuning
/home/sxie22/miniconda3/envs/sisso/lib/python3.7/site-packages/numpy/lib/function_base.py:392: RuntimeWarning: Mean of empty slice.
  avg = a.mean(axis)
/home/sxie22/miniconda3/envs/sisso/lib/python3.7/site-packages/numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in true_divide
  ret = ret.dtype.type(ret / rcount)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~/PycharmProjects/vanDuin/DCC/pytorch/easy_example.py in <module>
     86 args.M = 20
     87 args.lr = 0.001
---> 88 out = DCC.main(args, net=net)

~/PycharmProjects/vanDuin/DCC/pytorch/DCC.py in main(args, net)
    108 
    109     # computing and initializing the hyperparams
--> 110     _sigma1, _sigma2, _lambda, _delta, _delta1, _delta2, lmdb, lmdb_data = computeHyperParams(pairs, Z)
    111     oldassignment = np.zeros(len(pairs))
    112     stopping_threshold = int(math.ceil(cfg.STOPPING_CRITERION * float(len(pairs))))

~/PycharmProjects/vanDuin/DCC/pytorch/DCCComputation.py in computeHyperParams(pairs, Z)
     72     robsamp = min(cfg.RCC.MAX_NUM_SAMPLES_DELTA, robsamp)
     73     _delta2 = float(np.average(epsilon[:robsamp]) / 2)
---> 74     _sigma2 = float(3 * (epsilon[-1] ** 2))
     75 
     76     _delta1 = float(np.average(np.linalg.norm(Z - np.average(Z, axis=0)[np.newaxis, :], axis=1) ** 2))

IndexError: index -1 is out of bounds for axis 0 with size 0

In [2]: debug                                                                                        
> /home/sxie22/PycharmProjects/vanDuin/DCC/pytorch/DCCComputation.py(74)computeHyperParams()
     72     robsamp = min(cfg.RCC.MAX_NUM_SAMPLES_DELTA, robsamp)
     73     _delta2 = float(np.average(epsilon[:robsamp]) / 2)
---> 74     _sigma2 = float(3 * (epsilon[-1] ** 2))
     75 
     76     _delta1 = float(np.average(np.linalg.norm(Z - np.average(Z, axis=0)[np.newaxis, :], axis=1) ** 2))

ipdb> epsilon = np.linalg.norm(Z[pairs[:, 0].astype(int)] - Z[pairs[:, 1].astype(int)], axis=1)      
ipdb> epsilon = np.sort(epsilon)                                                                     
ipdb> epsilon[-1]                                                                                    
0.011799936
ipdb> np.sqrt(cfg.DIM)                                                                               
3.1622776601683795
ipdb> cfg.DIM                                                                                        
10

Can we train and see the clusters with unlabeled data?

Hi

I am currently working on clustering my custom textual data, for which I don't have any pre-defined labels. I tried the available code with a few changes so that labels are not considered, but while making the input for training it gave the error below.

File "pretraining.py", line 83, in main
'nepoch':nepoch, 'lrate':[args.lr], 'wdecay':[0.0], 'step':step}, use_cuda, trainloader, testloader)
File "pretraining.py", line 171, in pretrain
train(trainloader, net, index, optimizer, epoch, use_cuda)
File "pretraining.py", line 192, in train
outputs = net(inputs_Var, index)
File "/home/tiru/Desktop/topicmodel/topicmodel/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in call
result = self.forward(*input, **kwargs)
File "/home/tiru/Desktop/topicmodel/DCC/pytorch/SDAE.py", line 34, in forward
inp = x.view(-1, self.in_dim)
RuntimeError: invalid argument 2: size '[-1 x 2000]' is invalid for input with 324864 elements at /pytorch/torch/lib/TH/THStorage.c:37

So is it possible to train on text data that does not have any labels? If yes, how can we train and test it?

ModuleNotFoundError: No module named 'easydict'

Hello,
Thank you for sharing your project!

I have a problem when running easy_example.py: it can't find the module named 'easydict'. I have no idea how to fix it. Could you please help?

ModuleNotFoundError: No module named 'easydict'

matlab

Does this project require MATLAB?

Hyperparams and resulting numbers

Hi, thank you so much for your contribution and sharing your project! This is a great paper :)
I would highly appreciate your help with running the training process, as I am not sure how to run certain parts of it in order to reproduce your results from the paper.

  1. How should we run the edgeConstruction script? Which hyperparameters should we choose?
  2. Do you have available configurations for more datasets?
  3. Using the following commands I got fairly good numbers on MNIST, but lower than the reported ones. Can you perhaps guide me on how to get closer to the reported numbers?

Script lines:
python pretraining.py --data mnist --id 1 --niter 50000 --lr 10 --step 20000
python extract_feature.py --data mnist --net checkpoint_4.pth.tar --features pretrained
python edgeConstruction.py --dataset mnist --format mat --samples 70000 --prep 'minmax' --k 10 --algo 'mknn'
python copyGraph.py --data mnist --graph pretrained.mat --features pretrained.pkl --out pretrained
python DCC.py --data mnist --net checkpoint_4.pth.tar --id 1
The results I got:
ARI: 0.830861826385 AMI: 0.7969629161498257 NMI: 0.8647221174121507 ACC: 0.8187 K: 173

Thank you so much in advance!!

Interpreting output as fuzzy clustering?

Hi, I'm interested in using this algorithm as an intermediate step in a PyTorch pipeline.
Therefore I need the output of the clustering to be differentiable with respect to the input. The actual output of this algorithm is the position of the representatives, which then gets converted (non-differentiably) to cluster assignments via connected components over pairs whose distance is below a threshold.

Do you see a way to relax the clustering assignment such that I can differentiate through it?
Ultimately the output has to be numbers rather than indices/labels, so I'm thinking of a probability of belonging to each cluster? (But this contradicts the fact that we can't specify the number of clusters.)

Simple visual example dataset processed end-to-end

It would be good to have a really simple and small dataset that can be easily visualized (2D clusters like below) and trained end-to-end. This would be helpful to me because it would clarify where pretraining and building the mkNN graph come in. I'm planning to create and work with such a dataset and then submit a pull request; are there any gotchas I should be aware of?

(Figure: example of 2D clustered data.)

Error in (Running Deep Continuous Clustering) step

Hello
Thank you very much for sharing your project

I am trying to re-implement it, and as I go through the "Running Deep Continuous Clustering" stage of the README I get the error below while running DCCComputation.py. Any advice?

in computeHyperParams
_sigma2 = float(3 * (epsilon[-1] ** 2))

builtins.IndexError: index -1 is out of bounds for axis 0 with size 0

Clustering problems

Hello,
thank you very much for sharing your project!

I'm trying to apply this algorithm to a set of RGB images (cartoons); in particular, I have 2344 samples of dimension [227,227,3] from 7 classes. The algorithm is not able to cluster the images correctly: at the end I get ~0.2 ACC with 1220 clusters. I read all of the resolved issues in this repository carefully but could not solve my problem, so I list each step I took in order to get feedback about a possible mistake:

  1. I made my dataset using "make_data.py" with normalization to [-1,1]. At the end I have testdata.mat and traindata.mat. Each row in these matrices is the concatenation of the three channels, i.e. [R,G,B] -> [51529,51529,51529] (51529 = 227x227). Considering testdata.mat and traindata.mat together, I have a 2344x154587 matrix.

  2. Next I ran "pretraining.py" with --batch_size=256, --niter=1831 (to get 200 epochs as suggested), --step=733 (to get 80 epochs as suggested), --lr=0.01 (since the dimension of the data samples is higher than in the other datasets used with this framework, I thought this could be a good choice), and --dim=10.

  3. With the checkpoint_4.pth.tar file obtained after step 2, I extracted the features of the dataset, obtaining "pretrained.pkl".

  4. I constructed the graph on the original data using "edgeConstruction.py" with --algo knn, --k 10, --samples 2344 and obtained the "pretrained.mat" file.

  5. Then I ran "copyGraph.py" to produce the final "pretrained.mat" file.

  6. Finally I ran "DCC.py", leaving all the default values.

I also tried a higher k (k=20) and mknn instead of knn, but nothing seems to change. Do you have any idea why the algorithm does not work properly with my data?
Do you have any idea about the reason why the algorithm not work properly with my data?

Data Requirement

After reading your code, I have a question about the input data. Does the input data of the SDAE network have to be labeled? If so, isn't this still supervised learning?

Clarification of 'Z' and 'U'

Just to clarify, is 'Z' the representation after the SDAE but before fine-tuning with DCC, and 'U' the representation after fine-tuning with DCC? I got a little confused because in the paper it appears that 'Y' is the representation after the SDAE and 'Z' is the representation after fine-tuning...

Problem opening MNIST data

Hi,
I am trying to reproduce the results. When I try to open the provided MNIST data, the following error is raised:

>>> data = sio.loadmat('testdata.mat', mat_dtype=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xsede/users/xs-ttgump/.local/lib/python3.6/site-packages/scipy/io/matlab/mio.py", line 141, in loadmat
    MR, file_opened = mat_reader_factory(file_name, appendmat, **kwargs)
  File "/home/xsede/users/xs-ttgump/.local/lib/python3.6/site-packages/scipy/io/matlab/mio.py", line 65, in mat_reader_factory
    mjv, mnv = get_matfile_version(byte_stream)
  File "/home/xsede/users/xs-ttgump/.local/lib/python3.6/site-packages/scipy/io/matlab/miobase.py", line 241, in get_matfile_version
    raise ValueError('Unknown mat file type, version %s, %s' % ret)
ValueError: Unknown mat file type, version 54, 50

It seems like the mat files provided by the authors are not in the correct format.
Thanks.

Three questions and suggestions

I have some questions about the DCC method:

  1. I have learned that nowadays, because of the use of dropout and ReLU, layer-wise pretraining of the autoencoder is not necessary (see the ReLU paper). If layer-wise pretraining can be skipped, it saves a lot of time.
  2. Denoising autoencoders can improve clustering performance. The DCC model uses dropout as its denoising layer, but for some numerical data, in my experience (such as the protein data in the RCC paper), it is better to add Gaussian noise, which helps performance. So why not use both Gaussian noise and dropout?
  3. How do I extract the clusters learned by DCC? I mean, after training the DCC model, how can I extract the cluster assignment of each sample?

Thanks!

A question about mkNN

Hi @shahsohil, this work is very interesting! I have a question about the construction of the mkNN graph. In the project, I see that you use the original data to measure similarity and construct the mkNN graph. Is there a particular reason for this? Why not use the latent representation from the pretrained AE for the graph? If I have a large image patch, e.g. 256x256 with multiple input bands, constructing the graph in the original image space incurs a large computational cost.

Thank you.

Clustering result problem

Hi, thank you for your work. I applied this algorithm to my own data, but most of the data is assigned to the first cluster. What could be the cause of this, and what improvements should I make?
