README
PAPER
This is the implementation of this paper: Deep Learning via Semi-supervised Embedding by Jason Weston
Remote Data Description
>>> x = sio.loadmat('PaviaU.mat')
>>> x.keys()
dict_keys(['paviaU', '__globals__', '__header__', '__version__'])
>>> x['paviaU'].shape
(610, 340, 103)
>>> x['paviaU'][0]
array([[ 647, 499, 464, ..., 3221, 3238, 3250],
[ 604, 546, 527, ..., 2442, 2464, 2528],
[ 621, 746, 556, ..., 2308, 2345, 2361],
...,
[ 524, 525, 598, ..., 2936, 2940, 2950],
[ 528, 330, 70, ..., 2737, 2774, 2829],
[ 711, 483, 423, ..., 2562, 2539, 2488]], dtype=uint16)
>>> x = sio.loadmat('PaviaU_gt.mat')
>>> x.keys()
dict_keys(['__header__', '__version__', '__globals__', 'paviaU_gt'])
>>> x['paviaU_gt'].shape
(610, 340)
>>> x['paviaU_gt']
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[2, 2, 2, ..., 0, 0, 0],
[2, 2, 2, ..., 0, 0, 0],
[2, 2, 2, ..., 0, 0, 0]], dtype=uint8)
Read the pdf problem collection generated by classeur.
Semi-supervised Experiment Problem Collection
Summary
Compared with traditional classifiers, semi-supervised learning take the advantage the internel structure of data, labeled and unlabeled. The foundamental presumption of semi-supervised learning is that data within similar structure are likely to have same label. Based on this, scientists have developed different algorithms to capture such similarity amonge labeled and unlabeled data. The basic design of semi-supervised loss is the combination labeled loss with unlabeled loss .
Deep learning via semi-supervised embedding proposed by Weston, Mobahi and Collobert uses the deep neural network to capture the feature, or embed the data. The advantage of this algorithm is that deep neural network has strong ability of feature mapping, thus the embedding for unlabeled data would be more efficient. Deep learning algorithm prepares different losses for labeled and unlabeled data. A big difference between this algorithm and others is that the unlabeled loss does not simply remove the label, but use embedding algorithm. In the experiment, we found that different embedding algorithms would greatly influence the result.
Besides, we add cluster loss term to the loss. Originally the loss was designed as a combination of supervised loss and embedding algorithm based on manifold or graph(neighbor). To make the computation cheaper, we select the neighbor based on pixel instead of k nearest. The introduce of cluster loss like S3VM for neighbor data would reinforce the presumption. According to the experiment result, the model prefers cluster loss to embedding algorithm.
Semi-supervised discriminative Boltzmann machine was proposed by Larochelle and Bengio. Semi-supervised DRBM loss obey the basic design, it is a compond of labeled and unlabeled losses. Different from deep learning which uses embedding algorithm inside the mode, DRBM simply removes the label infomation from the log loss. This architecture makes the contrastive divergence approximation slightly different supervised DRBM.
Label propagation is a graph-based semi-supervised proposed by Bengio. This algorithm builds a graph with nodes representing the labeled and unlabeled data while edges describe the similarity between them.
The structure of semi-supervised ladder network proposed by Rasmus is an autoencoder with skip connections from the encoder to decoder. The learning task is similar to that in denoising autoencoders ut applied to every layer, not just the inputs. This architecture would make it easier for the higher layers to represent the details.
Deep Learning via Embedding Model
Describe Dataset in JSON
It's very annoying to adapt the model for new dataset when code lines like filename and tf.placeholder
shape have to be modified every time. So writing json
configuration file would be a better idea than changing the code mannually in source code or command line input.
Blind To Classifications
Try normalization anyway:
from sklearn import preprocessing
fn = '/home/yangminz/Semi-supervised_Embedding/dataset/PaviaU.mat'
dset = sio.loadmat(fn)
x = dset['paviaU'].reshape(shape_[0]*shape_[1], shape_[2])
x = preprocessing.scale(x)
x = x.reshape(shape_)
Taxing Coumputation
A profusion of computation lies in finding neighbors, or
The paper Deep Learning via Semi-supervised Embedding provides the solution or substitution for such arduous computation in section 3.1 Labeling Unlabeled Data as Neighbors(Building the Graph):
However, there are also many other ways of collecting neighboring unlabeled data that do not involve computing
$k$ -nn. For example, if one has access to unlabeled sequence data the following tricks can be used.
The trick is to take advantage of the physical properties of data itself. Here for paviaU
dataset, we pick the actual neighbor of a 'pixel' as its neighbor. Without distance computation and ordering, the computation is much faster.
According to the paper, the tricks can be subsumed into the following conditions by their datasets:
- image
- text
- speech
Difficult to Perform Batch Computation for Loss
To impletement loss with unlabeled samples, it's import to combine supervised loss with embedding loss. One can consult this stackoverflow post to figure out how to impletement embedding loss with unlabeled samples. There are some tricks for
Notice that it is difficult to impletement this embedding loss within a batch. Because a batch of data in tf.placeholder
cannot get [x[0], x[1], x[2]]
, then we must do something like:
for i in [0, 1, 2]:
for j in [0, 1, 2]:
W[i][j] = get_weight(x, i, j)
But as far as I know, it seems that this is impossible to impletement with Tensorflow
because it requires indexing within tf.placeholder
.
Fortunately the algoritm in the paper actually did not impletement batch form like:
Instead, it picks a pair of labeled, random neighbor, random unlabeled at every epoch. This makes it much more easier to impletement. It sounds like not fast. But with the new design of neighboring, the computation is not slow at all.
And remember to make the loss to scalar by tf.reduce_sum
or tf.reduce_mean
.
Weight Reusing
Originally, the code was designed like:
def forward(x):
y = Dense(128, activation='tanh')(x)
y = Dense(1, activation='sigmoid')(y)
return y
y1 = forward(x1)
y2 = forward(x2)
However, weights would not be reused in this way because everytime the script calls forward
, a new layer would be allocated when building up the graph. As a result, we get 2 seperate networks.
Layer initialization should be put outside the function:
layers = [
Dense(128, activation='tanh'),
Dense(1, activation='sigmoid')]
def forward(x):
y = x
for layer in layers:
y = layer(y)
y1 = forward(x1)
y2 = forward(x2)
It's much more convinent since the structure of the network can easily be adjusted and extended in list
like:
layer_3 = layer_1[:5] + layer_2[4:]
Such implemention is possible by Keras
.
nan
Problem
Distance and In EmbedNN experiment with g50c
dataset, the model sometimes converges to nan
when use different distance in unlabeled embedding. For Laplacian Eigenmaps,
can work while Euclidean distance:
would give nan
. There seems some delicate math backend.
And Siamese Networks embedding function: s $$ L(f_i, f_j , W_{ij}) = \max(0, m - ||f_i - f_j||2)^2, \text{ if } W{ij} = 0 $$
can work when
S3VM loss
Standard optimization for S3VM is:
Here we use the final unlabeled term in the combined loss which contains neighbor embedding loss based on manifold(graph) and this S3VM loss:
Note that here
A little experiment here can detect the model preference between $ L_{\text{neighbor}}$ and 100000
labeled data with absolute hinge loss and Siamese Networks embedding:
Alpha | 0.0 | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 | 0.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Beta | 0.0 | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
Accuracy | 66.08% | 39.16% | 36.64% | 45.52% | 14.35% | 48.74% | 41.80% | 41.46% | 51.82% | 56.18% | 63.75% | 66.14% |
Correct Distinguishing, Error Labeling
The classifier can distinguish classes from each other. However, the corresponding labels are misclassified. For example, [2, 2, 4, 3, 3]
would be classified as [1, 1, 6, 2, 2]
. They are correctly distinguished but wrongly classified.
The test data is the one to blame. One should be extremly careful when prepare "one-hot" labeling otherwise such weird error would occur.
For example, the following function
def OneHot(data, width):
tmp = np.zeros((data.size, width))
tmp[np.arange(data.size), data] = 1
return tmp
returns one at their indices, [0, 1, 0, 0]
for data=1
and width=4
. Sometimes similar function would give one-hot at their counting position, [1, 0, 0, 0]
. Here data=1
means the first digit rather than index 1
.
Improve the Accuracy
Tested with paviaU
.
- Change Embedding Algorithm Lapalacian Eigenmaps seems to work better than Siamese Networks:
nb_labeled | supervised | manifold | Alpha | cluster | Beta | Radius | Accuracy |
---|---|---|---|---|---|---|---|
100000 | 'abs_hinge' | 'SN' | 0.5 | 'S3VM' | 0.5 | 1 | 41.80% |
100000 | 'abs_hinge' | 'LE' | 0.5 | 'S3VM' | 0.5 | 1 | 65.52% |
- Absolute Quadratic Loss Use absolute quadratic loss may be better for labeled data:
nb_labeled | supervised | manifold | Alpha | cluster | Beta | Radius | Accuracy |
---|---|---|---|---|---|---|---|
100000 | 'abs_hinge' | 'LE' | 0.5 | 'S3VM' | 0.5 | 1 | 65.52% |
100000 | 'abs_quadratic' | 'LE' | 0.5 | 'S3VM' | 0.5 | 1 | 69.93% |
- Neighbor Radius For manifold(graph) loss, appropriate neighbor radius may help the accuracy(to avoid overfitting). But in fact this does not help:
nb_labeled | supervised | manifold | Alpha | cluster | Beta | Radius | Accuracy |
---|---|---|---|---|---|---|---|
100000 | 'abs_quadratic' | 'LE' | 0.5 | 'S3VM' | 0.5 | 1 | 69.93% |
100000 | 'abs_quadratic' | 'LE' | 0.5 | 'S3VM' | 0.5 | 2 | 67.34% |
100000 | 'abs_quadratic' | 'LE' | 0.5 | 'S3VM' | 0.5 | 3 | 62.05% |
- Other Model Structure
Try Auxiliary EmbedCNN:
This model gives a relatively high accuracy with
100000
labeled data,'abs_quadratic'
supervised loss and'S3VM'
unsupervised loss:
Alpha | 0.0 | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 | 0.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Beta | 0.0 | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
Accuracy | 68.10% | 58.66% | 62.87% | 63.00% | 62.16% | 69.43% | 69.58% | 70.43% | 71.62% | 74.18% | 77.37% | 68.17% |
But it does not mean the more intricate the architecture is, the higher the accuracy will be. If the auxiliary net for unsupervised data is another stack of convolutional and maxpooling layers, the accuracy can decrease.
Other Models
Semi-supervised Support Vector Machine: Memory Error
SVM would take a lot of memory, e.g.
It's better to be noticed that currently it is not easy at all to find a repository for multi-classification semi-supervised SVM. The most common code is multi-classification SVM or binary S3VM.
Discriminative Restricted Boltzmann Machine
Semi-supervised DRBM algorithm is proposed in paper Classification using Discriminative Restricted Boltzmann Machines:
$$ \mathcal{L}{\text{semi-sup}} (\mathcal{D}{\text{train}} , \mathcal{D}{\text{unlab}}) = \mathcal{L}{\text{TYPE}}(\mathcal{D}{\text{train}}) + \beta \mathcal{L}{\text{unsup}} (\mathcal{D}_{\text{unlab}}) $$
$$ = -\sum_{i=1}^{|\mathcal{D}{\text{train}}|} \log p(y_i | \mathbf{x_i}) - \beta \sum{i=1}^{|\mathcal{D}_{\text{unlab}}|} \log p(\mathbf{x_i}) $$
or
$$ = -\sum_{i=1}^{|\mathcal{D}{\text{train}}|} \log p(y_i | \mathbf{x_i}) -\alpha \sum{i=1}^{|\mathcal{D}{\text{train}}|} \log p(y_i , \mathbf{x_i}) - \beta \sum{i=1}^{|\mathcal{D}_{\text{unlab}}|} \log p(\mathbf{x_i}) $$
The implementation of semi-supervised DRBM consulted other people's code. The most common repository is RBM. I only found two repositories implemented by Tensorflow
or Theano
, both of them are supervised only.
The extension to semi-supervised is not difficult, as long as one pay attention to the contrastive divergence algorithm. The following code would illustrate:
def _calc_unsupervised_grads(self, x):
x0, h0, x1, h1 = self._gibbs_un_sampling_step(x)
h0 = tf.reshape(h0, [-1, self.num_hidden, 1])
x0 = tf.reshape(x0, [-1, self.num_visible, 1])
h1 = tf.reshape(h1, [-1, self.num_hidden, 1])
x1 = tf.reshape(x1, [-1, self.num_visible, 1])
# no gradient for U, but use zero to take place here
d_U_unsup = tf.constant(0.0)
d_W_unsup = tf.reduce_mean(tf.matmul(h0, x0, transpose_b=True) - tf.matmul(h1, x1, transpose_b=True), 0)
d_b_unsup = tf.reduce_sum(x0 - x1, 0)
d_c_unsup = tf.reduce_sum(h0 - h1, 0)
d_d_unsup = tf.constant(0.0)
return d_U_unsup, d_W_unsup, d_b_unsup, d_c_unsup, d_d_unsup
We know that the difference between supervised and semi-supervised is one-hot label y
. Thus make the gradients related to y
be tf.constant(0.0)
would make them contribute nothing to the loss. Meanwhile Tensorflow
can work even we have no method to figure out the shape of the tf.placeholder
batch.
All-0 Prediction
Sometimes the model would predict the same label(0
for dataset of [0, 1, 2, 3, 4]
for example) in EmbedCNN
and Ladder Network(LN)
.
For EmbedCNN
The problem for EmbedCNN can usually be solved by:
-
Try different initializers. Replace the default initializer with
lecun_uniform
for example. -
Clip gradient. Limit the gradient so they won't go to infinite. Code is usually like:
opt = tf.train.AdagradOptimizer(lr) gvs = opt.compute_gradients(loss_) clipped_gvs = [(tf.clip_by_value(grad, -1.0, 1.0), var) for grad, var in gvs] train_step = opt.apply_gradients(clipped_gvs)
-
Use different activations. For example, use
tanh
andsigmoid
instead of usingrelu
only. -
Sometimes directly use a 2-layer
dense
network would give a good result. -
Add
Dropout
layer to the network.
For LN
The major problem is houston
dataset. It's weird because the LN with the same architecture works well on paviaU
and indian
. But for houston
, the prediction will be [0, 0, 0, ..., 0]
. A little debug tip here:
-
Run the model in very short time(10 epoches for example) first and print all predicted labels. In this way, one can save a lot of time to verify if his model works before entering the time-costing training.
-
Make sure that training dataset is reasonable. The split of labeled and unlabeled training data should reasonable: each class should at least have a certain and balanced number of data.
-
Transform dataset. Shuffle, crop or PCA the true dataset and then train it to see if model can make multi-label prediction. If the model still make prediction like
[0, ..., 0]
, there must be something wrong with the dataset instead of the model. -
Remove the predicted class. If the model always predict
2
for dataset with[0, 1, 2, 3, 4]
, try remove the2
data and train with[0, 1, 3, 4]
to see if the model still only predict2
. -
Check the weights and layer output. Check weights after training to see if it is the weights that cause the problem. Especially the weight of the last layer.
Then we can finally locate the problem is weight and loss become nan
in training. This problem is most probably caused by infinite gradient or loss. Thus clip it would make the situation much better.
However, one may find that Tensorflow's clip_by_value
would give errors. This is mainly because the gradient itself is nan
and loss is inf
. Here in houston
dataset, when the labeled loss is around 2.5, the unlabeled loss is 3000, 5000, inf.
This problem can be traced back to the selection of unlabeled data instead of the architecture of the network, if one break at the batch input code line to check. There are so many duplicated unlabeled data. This will not affect EmbedCNN since it picks only one unlabeled sample at a time, but will greatly influence LN.
Now can check if it is the unlabeled data unshuffled. Because if not shuffled, the data in batch would be very similar or even the same arccording to the physical properties of dataset. The problem can be solved by shuffling.
Improve Ladder Networks, Houston
The original loss in the paper Semi-Supervised Learning with Ladder Network is:
Give the unsupervised part a coefficient alpha < 1.0
so that the unstructured unlabeled part would not strongly affect the labeled one:
Writing
We implement three models of deep learning via semi-supervised embedding for paivaU, indian and Houston datasets. For comparision purposes, we also provide the performance of label spreading model and ladder networks.
In the semi-supervised embedding experiment, we define the deep learning model with convolutional neural network. For output model, both labeled and unlabeled data are processed by convolutional and maxpooling layers with a window of size 2. The number of filters is {32, 16}. Then we use a two-layer neural network for classification with {64, 128} hidden neurons. The total loss is computed by absolute quadratic and Laplacian Eigenmap combined with unlabeled loss of S3VM respectively. There is a balance inside the unsupervised loss between embedding algorithm and S3VM loss. We find that the task prefers S3VM loss to embedding algorithm. According to the ressult, a combination of 0.1 Laplacian Eigenmap loss and 0.9 S3VM loss usually gives a good result with learning rate = 0.1.
The other two semi-supervised deep learning models share similar experiment settings. The input model uses convolutional, maxpooling and fully connected layers at first and actives unlabeled data to output by sigmoid. Labeled data will continue to propagate forward through the following layers. Such architecture makes the input model "narrow" in the middle. Besides the similar shared convolutional layers at first, auxiliary model uses a seperate neural network with sigmoid activation as a classifier for unlabeled data. The result reveals that the auxiliary model excels at the experiment because it makes the embedding and classification of unlabeled data different from the labeled data.
The computation bottleneck is finding neighbors, or computing matrix
The effectiveness of semi-supervised deep learning model can be achieved by the comparision with label spreading and ladder networks. We use the Scikit-learn tool to implement the label spreading model described in Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux. In Semi-Supervised Learning (2006), pp. 193-216 with Radial basis function(RBF) kernel. \texttt{gamma} for RBF kernel in the experiment is 0.25, and maximum number of iterations allowed is 5. The ladder network experiment has layers in size of {1000, 500, 250, 250, 250} between input and output layers. The learning rate is 0.02 and begins to decay after 15 epoches. Besides, we give the unlabeled loss a coefficient
paviaU | Houston | India | |
---|---|---|---|
Supervised CNN | 66.08% | 45.67% | 32.58% |
Ladder Network | 62.70% | 49.35% | 53.85% |
Label Spreading | 77.81% | 35.86% | 63.17% |
Output EmbedCNN | 71.88% | 66.31% | 47.07% |
Input EmbedCNN | 71.34% | 45.67% | 45.02% |
Auxiliary EmbedCNN | 77.37% | 75.66% | 64.78% |
Ref:
-
Deep Learning via Semi-supervised Embedding Deep learning via Semi-supervised Embedding, Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert Semi-Supervised Learning Literature Survey, Xiaojin Zhu
-
label spreading Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux. Semi-Supervised Learning (2006), pp. 193-216
-
Ladder Network Semi-Supervised Learning with Ladder Network by A Rasmus, H Valpola, M Honkala, M Berglund, and T Raiko
-
Descriminative Restricted Boltzmann Machine Classification using Discriminative Restricted Boltzmann Machines, Hugo Larochelle, Yoshua Bengio Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville