
Comments (4)

s9xie commented on September 13, 2024

@happynear
This is a good question, and probably a common one. Of course one can tune gamma on the validation set, but this is really annoying. We tried that, but soon came up with another way to implement our formulation while avoiding overfitting.

So if you look at our experiment configuration files, you can see that we adopted an early-stopping policy during training: we first train the network with DSN for a number of epochs (determined on the validation set), then discard all the companion losses and continue training the network with only the output loss.

Gamma is now implicitly and dynamically determined by the loss value reached at the point where we stop early; empirically, this is essential for DSN to achieve good performance.
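For readers trying to reproduce this, here is a minimal sketch of the schedule described above, assuming a PyTorch-style model whose forward pass returns the output-layer logits together with a list of hidden-layer logits; `num_dsn_epochs` and `alphas` are hypothetical placeholders for the validation-chosen stopping point and the per-layer weights:

```python
# Minimal DSN-style training loop (sketch, not the authors' original Caffe setup).
# Assumes model(x) returns (final_logits, [hidden_logits_1, ..., hidden_logits_M]).
def train(model, loader, criterion, optimizer, num_epochs, num_dsn_epochs, alphas):
    for epoch in range(num_epochs):
        # Early-stop policy: keep the companion losses only for the first
        # num_dsn_epochs epochs, then train with the output loss alone.
        use_companion = epoch < num_dsn_epochs
        for x, y in loader:
            final_logits, hidden_logits = model(x)
            loss = criterion(final_logits, y)  # overall (output-layer) loss
            if use_companion:
                for alpha_m, logits_m in zip(alphas, hidden_logits):
                    loss = loss + alpha_m * criterion(logits_m, y)  # companion losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```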


zhangliliang commented on September 13, 2024

In @happynear's comment, \gamma is set to prevent the hinge loss from becoming 0.
However, from my point of view, \gamma is set to *make* the hinge loss of the hidden layers become 0 (i.e., to make its gradient vanish), and \alpha_m plays the same role.
But I don't understand the purpose of making the gradient vanish in the paper. Is it to speed up training, since part of the backpropagation can then be skipped? @s9xie
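(For reference: the thresholding discussed here is, as I read the paper's formulation, the per-layer term \alpha_m [\ell_m - \gamma]_+, so once a hidden layer's companion loss drops below \gamma that term, and hence its gradient, vanishes. A rough sketch with hypothetical names:)

```python
import torch

def companion_term(hidden_loss, alpha_m, gamma):
    """Companion objective for one hidden layer: alpha_m * [hidden_loss - gamma]_+.

    The max(., 0) means the term (and its gradient) is exactly zero once the
    hidden layer's loss has dropped below the threshold gamma.
    """
    return alpha_m * torch.clamp(hidden_loss - gamma, min=0.0)
```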


s9xie commented on September 13, 2024

@happynear @zhangliliang Sorry, yes, it is not "preventing the hinge loss from being zero" but making it vanish. I assume that was a typo in the original question?

In our paper we explained:
"This way, the overall goal of producing good classification of the output layer is not altered and the companion objective just acts as a proxy or regularization."
Intuitively, we should emphasize the role of the overall loss during training; this "early stop" policy is a good way to avoid over-fitting the lower layers to their local losses.


sh0416 commented on September 13, 2024

@s9xie I am working on implementing your method. So you mean that you don't explicitly use gamma, right? I am also curious about the other hyperparameter, alpha, whose search space grows exponentially as the number of layers increases. In your paper you use a relatively small architecture, i.e., a 3-layer NN. How do you tune this hyperparameter?

Thanks,

