javierantoran / bayesian-neural-networks
PyTorch implementations of Bayes By Backprop, MC Dropout, SGLD, the Local Reparametrization Trick, KF-Laplace, SG-HMC and more.
License: MIT License
Hi JavierAntoran,
Thanks for the wonderful code; it has been really helpful for my work in this area. I'd like to ask a question about MC dropout. In BBB with local reparameterization, the activation values are sampled for each data point, instead of directly sampling a weight distribution, to reduce the computational complexity. So, in MC dropout, should we follow a similar procedure, e.g. drop hidden units for each data point in the training or testing phase? I notice that your MC dropout model seems to use the same dropout for a whole mini-batch, and the default batch size is 128. Should I change the batch size to 1 to achieve the goal of dropping hidden units for each data point?
Looking forward to your reply. Thanks a lot
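A quick way to check how dropout masks are sampled within a batch is to inspect a stochastic forward pass directly. A minimal sketch using standard torch.nn.functional.dropout (which may differ from the repo's exact layer):

import torch
import torch.nn.functional as F

# Activations for a mini-batch of 4 examples with 6 hidden units each.
h = torch.ones(4, 6)

# F.dropout samples an independent Bernoulli mask per element, so each
# example in the batch gets its own mask even within a single call.
h_dropped = F.dropout(h, p=0.5, training=True)
print(h_dropped)  # zeros land in different positions on each row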
Hi, in BBB we sample the outputs several times (5, for example) when testing; this is due to the Monte Carlo nature of the method. But how do we use these 5 outputs for one sample? Average them? Sum them? Or just take the best one as the sample's output vector? Is there any theory about how to use these sampled results? Thanks!
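As a general note, the usual practice is to average the predictive distributions rather than pick a best sample: the Monte Carlo estimate of the posterior predictive is p(y|x) ≈ (1/S) * sum_s p(y|x, w_s). A sketch for classification, assuming a generic model that returns logits:

import torch

def mc_predict(model, x, n_samples=5):
    # Average the softmax outputs of several stochastic forward passes;
    # `model` is a placeholder for any BBB / MC dropout network.
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
    )                     # (n_samples, batch, classes)
    return probs.mean(0)  # averaged distribution; argmax of this is the prediction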
Hi, thank you so much for uploading this code. As a novice, I don't know how to call the model or how the notebook works. Could you please give a general explanation? Thank you very much!
Hi! First of all, thank you for providing your code!
I think there is an error in class "spike_slab_2GMM" in the "priors.py" file:
normalised_like = self.pi1 + torch.exp(N1_ll - max_loglike) + self.pi2 + torch.exp(N2_ll - max_loglike)
After calculation and verification, I think it should be changed to:
normalised_like = self.pi1 * torch.exp(N1_ll - max_loglike) + self.pi2 * torch.exp(N2_ll - max_loglike)
I'm sorry to disturb you in your busy schedule~
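For reference, here is a numerically stable way to evaluate the log-density of a two-component Gaussian mixture with the same max-subtraction trick; a sketch only, with pi1, pi2 and the component log-densities standing in for the quantities in priors.py:

import torch

def mixture_loglike(N1_ll, N2_ll, pi1, pi2):
    # log(pi1 * exp(N1_ll) + pi2 * exp(N2_ll)), computed stably by factoring
    # out the elementwise maximum of the two component log-likelihoods.
    max_loglike = torch.max(N1_ll, N2_ll)
    normalised_like = pi1 * torch.exp(N1_ll - max_loglike) \
                      + pi2 * torch.exp(N2_ll - max_loglike)
    return torch.log(normalised_like) + max_loglike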
Thank you for the code.
How can I run it on CPU? Does the code only support CUDA?
Regards
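As a general note, PyTorch code can be made device-agnostic by selecting the device at runtime; the repo's scripts may hard-code .cuda() calls that would need the same treatment. A minimal sketch:

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)    # move parameters to GPU if available, else CPU
x = torch.randn(4, 10, device=device)  # create (or .to(device)) every batch
out = model(x)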
Hi @JavierAntoran @stratisMarkou ,
I am currently using SGLD in another project. I have some example code, where the noise added to the parameters is scaled by the learning rate:
noise = torch.randn(n.size()) * param_noise_sigma * lr
I believe this is wrong, so I scaled the noise with the square root of the learning rate instead:
noise = torch.randn(n.size()) * np.sqrt(lr)
This works well. However, you scale the learning rate by
1/np.sqrt(sigma)
If I do this, the noise gets far too big and the network does not learn. Also, from the original paper by Welling et al. it seems to me that the first approach is right, though I have since found that the variance of the noise then scales with lr**2, which is why the square root is necessary.
I hope you can explain why the 1/np.sqrt(sigma) factor is necessary. Thank you in advance!
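For comparison, a minimal sketch of one SGLD step as written in Welling and Teh (2011): the gradient term is scaled by lr/2 and the injected noise has variance equal to lr, i.e. standard deviation sqrt(lr). How this interacts with the extra 1/np.sqrt(sigma) scaling in the repo is exactly the question above, so the sketch only shows the textbook update:

import torch

def sgld_step(params, lr):
    # One vanilla SGLD update, assuming .grad already holds the (stochastic)
    # gradient of the negative log posterior for the current mini-batch.
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            noise = torch.randn_like(p) * (lr ** 0.5)  # std sqrt(lr) => variance lr
            p.add_(-0.5 * lr * p.grad + noise)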
For the bbp_hetero notebook: the Energy Efficiency dataset has 8 inputs and 2 outputs.
At the moment, output 1 is being used as an input to model output 2.
You can simply drop the last column of the dataset and model 8 inputs and 1 output.
I tried it and it worked fine.
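A sketch of the slicing this suggests, assuming the loaded array has the 8 input columns first, followed by the two targets (heating and cooling load):

import numpy as np

data = np.random.randn(768, 10)  # stand-in for the loaded Energy Efficiency array
X = data[:, :8]                  # the 8 inputs
y = data[:, 8]                   # keep only the first target; drop the last column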
Hi. Great repository! I was having trouble recreating the results from some of the papers you implemented, and I found your examples to be more complete than the original authors' repositories, which is very helpful.
I have a question regarding your calculation of the test set log likelihood of the UCI datasets...
In this file (https://github.com/JavierAntoran/Bayesian-Neural-Networks/blob/master/notebooks/regression/mc_dropout_hetero.ipynb) you use the following flow:
- make tensor of (samples, mu) and (samples, log_sigma)
- use means.mean() as mu and (means.var() + mean(sigma) ** 2) ** 0.5 as sigma
- calculate the sum of all log probabilities, divided by the number of data instances
Then, as a last step, you add log(y_sigma), which is the standard deviation of the y values before they were normalised. Why do you do this, and where does the necessity for it come from?
The original MC Dropout and Concrete Dropout repositories use some form of the logsumexp trick to calculate the log probabilities, but so far I have been unable to recreate the UCI results using their method. I get much closer with your method, but I cannot justify the last step to myself.
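One part that can be pinned down is where a log(y_sigma) term comes from when the targets are standardised before training: if y_norm = (y - y_mu) / y_sigma, then by change of variables log p(y) = log p(y_norm) - log(y_sigma), so a constant log(y_sigma) shifts every per-point log-likelihood (the sign flips if one accumulates losses instead of log-likelihoods). A sketch under those assumptions, showing both the moment-matched Gaussian described above and a logsumexp mixture alternative:

import math
import torch
import torch.distributions as dist

def test_loglike(mus, sigmas, y_norm, y_sigma):
    # mus, sigmas: (S, N) Monte Carlo predictions in normalised target space.
    # y_norm: (N,) normalised targets; y_sigma: the std used to normalise y.
    # A sketch, not the notebook's exact code.

    # Option 1: moment-match the S samples with a single Gaussian.
    mu = mus.mean(0)
    sigma = (mus.var(0) + (sigmas ** 2).mean(0)).sqrt()
    ll_mm = dist.Normal(mu, sigma).log_prob(y_norm)

    # Option 2: treat the S samples as an equally weighted Gaussian mixture
    # and use logsumexp, as in the original MC Dropout code.
    ll_mix = torch.logsumexp(dist.Normal(mus, sigmas).log_prob(y_norm), dim=0) \
             - math.log(mus.shape[0])

    # Change of variables back to the unnormalised targets.
    return (ll_mm - math.log(y_sigma)).mean(), (ll_mix - math.log(y_sigma)).mean()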
Hello again! In the BBB method, you sample the weights no_samples times and average the loss when training (bbp_homo.ipynb, def fit(self, x, y, no_samples)), but in the MC dropout method you don't sample and just use one loss as the final training loss (mc_dropout_heteroscedastic.ipynb, def fit(self, x, y)). However, when testing, both methods draw multiple samples. I think the only difference between BBB and MC dropout is that the approximate posterior is assumed to be Gaussian in BBB and Bernoulli in MC dropout, so why don't you sample in the MC dropout method when training? Thanks!
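For context, the two training objectives being contrasted can be written side by side; a sketch with generic stand-ins (model.kl() is a placeholder name, not the notebooks' exact API):

import torch

def bbb_training_loss(model, x, y, criterion, no_samples):
    # BBB: average the data-fit term over several weight samples per batch,
    # then add the KL term once.
    data_term = torch.stack(
        [criterion(model(x), y) for _ in range(no_samples)]
    ).mean()
    return data_term + model.kl()

def mc_dropout_training_loss(model, x, y, criterion):
    # MC dropout: a single stochastic forward pass per batch is already one
    # Monte Carlo sample from the Bernoulli posterior, so a single loss is used.
    return criterion(model(x), y)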
import torch

def softmax_CE_preact_hessian(last_layer_acts):
    side = last_layer_acts.shape[1]
    I = torch.eye(side, dtype=torch.bool)
    # for i != j: H_ij = -a_i * a_j -- note that these are activations, not pre-activations
    Hl = - last_layer_acts.unsqueeze(1) * last_layer_acts.unsqueeze(2)
    # for i == j: H_ii = a_i * (1 - a_i)
    Hl[:, I] = last_layer_acts * (1 - last_layer_acts)
    return Hl
This function calculates the Hessian of the softmax cross-entropy loss with respect to the last layer's pre-activations.
Why can the Hessian be obtained by this process?
Is this derivation shown in the paper (https://openreview.net/pdf?id=Skdvd2xAZ)?
Looking forward to your reply! Really thank you!
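For what it's worth, the closed form used here can be checked numerically: for a softmax followed by cross-entropy, the Hessian of the loss with respect to the pre-activations z is diag(a) - a a^T with a = softmax(z), which matches the off-diagonal -a_i * a_j and diagonal a_i * (1 - a_i) terms above. A small autograd check (a sketch, independent of the repo):

import torch
import torch.nn.functional as F

z = torch.randn(5)
target = torch.tensor([2])

loss_fn = lambda z_: F.cross_entropy(z_.unsqueeze(0), target)
H_autograd = torch.autograd.functional.hessian(loss_fn, z)

a = torch.softmax(z, dim=0)
H_analytic = torch.diag(a) - torch.outer(a, a)

print(torch.allclose(H_autograd, H_analytic, atol=1e-6))  # expected: True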
Here are two questions; could you please help me with them?
Hi, in bbp_homoscedastic.ipynb, you calculate the aleatoric uncertainty as the Gaussian sigma and the epistemic uncertainty as the standard deviation of the model's outputs, with the total uncertainty given by (aleatoric**2 + epistemic**2)**0.5. But according to the decomposition of predictive uncertainty, aleatoric uncertainty is the expected entropy of the model's predictions and epistemic uncertainty is the difference between the total entropy and the aleatoric entropy. I wonder whether these two ways are actually the same? In a classification task I can easily calculate aleatoric and epistemic uncertainty in terms of entropy, but I don't know how to calculate them your way.
Besides, I also have a question about the meaning of the uncertainties. We often say that a BNN is more robust because it can give the uncertainty (or confidence) of its output, but aleatoric and epistemic uncertainties are clearly different from the probabilistic perspective. For example, if the measurements are drawn from a Gaussian, we can say a value falls within -2 * sigma to 2 * sigma of the mean with about 95% probability. But a BNN cannot give the confidence of its output in that way (for example in a classification task). So, rather than using aleatoric and epistemic uncertainties in the loss function to make the model perform better, can we use them to give the confidence of the output, making the predictions more acceptable in areas like medical diagnosis? (For example, after a BNN prediction, I could say with 95% confidence that the patient is healthy (label 0) and 5% that he has lung cancer (label 1), rather than giving only one result from a point-estimated neural network.)
Thanks!
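For the classification case mentioned above, the entropy-based decomposition can be computed directly from Monte Carlo samples of the softmax output: total = H(E[p]), aleatoric = E[H(p)], epistemic = total - aleatoric (the mutual information). A sketch assuming probs holds the per-sample class probabilities:

import torch

def entropy_decomposition(probs, eps=1e-12):
    # probs: (S, N, C) softmax outputs from S stochastic forward passes.
    mean_p = probs.mean(0)                                      # (N, C)
    total = -(mean_p * (mean_p + eps).log()).sum(-1)            # H(E[p])
    aleatoric = -(probs * (probs + eps).log()).sum(-1).mean(0)  # E[H(p)]
    epistemic = total - aleatoric                               # mutual information
    return total, aleatoric, epistemic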
Hi, in bbp_homoscedastic.ipynb, it seems that you choose a single Gaussian prior rather than a scale-mixture prior. I think a mixture-of-Gaussians prior can better model the real distribution of the weights w. Thanks!
Hi @JavierAntoran ,
my name is Hao. I am looking for a sound method for explaining uncertainty, i.e. decomposing the uncertainty of a prediction back onto the input features. I found your paper "Getting a CLUE: A Method for Explaining Uncertainty Estimates" and think it is fantastic. Do you have an implementation of CLUE in this repo, or could you tell me where I can find it? I ask because the paper references this GitHub repository. I am researching Sepsis prediction from ICU data and would really appreciate your help.
Thanks in advance,
Hao
(New Feature)
Do you think Bayesian online learning would perform well with this code implementation?
I mean periodically retraining the BNN on small amounts of new data, using the previous posterior as the prior.
If so, do you have any recommendations/ideas on the best way to implement it (for example, reusing the full network from the previous iteration)?
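One hedged sketch of what "posterior becomes the prior" could look like for a mean-field Gaussian (BBB-style) layer; the attribute names w_mu, w_rho, prior_mu and prior_sigma are hypothetical, not the repo's:

import torch
import torch.nn.functional as F

def posterior_to_prior(layer):
    # After a round of training, freeze the learned variational posterior and
    # reuse it as the prior for the next round of online updates.
    with torch.no_grad():
        layer.prior_mu = layer.w_mu.detach().clone()
        layer.prior_sigma = F.softplus(layer.w_rho).detach().clone()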
Hi,
I’m trying to approximate the posterior predictive distribution that corresponds to the MC Dropout and Bayes By Backprop neural networks (which I see you say is possible in the MNIST classification section of the README). I’m new to Python, so I’m having a little trouble figuring out exactly how you do this / which part of the code carries it out.
I tried to go about it by playing around with your function get_weight_samples, but noticed that it gives me the same weights each time I train the same network. For example, when training the MC Dropout network, I assumed the output of get_weight_samples would change, as in theory different nodes are dropped each time during training. My confusion here makes me think that I may have misinterpreted what this function is supposed to do.
Any clarification would be greatly appreciated! I’m sorry if this wasn’t the right place to post a question of this nature – new to Github and still learning the ropes.
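As a general note on approximating the posterior predictive with MC dropout: dropout has to stay stochastic at test time, and the predictive distribution comes from averaging several forward passes rather than from reading the weights out once. A sketch assuming a generic model built with nn.Dropout layers:

import torch

def mc_dropout_predictive(model, x, n_samples=100):
    model.train()  # keep nn.Dropout layers active at test time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)  # predictive mean and its epistemic spread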
Hi, in the local reparametrisation trick, when computing the output of a convolution layer, we use alpha * mu^2 to replace sigma^2. But when computing the KL, should we also use alpha * mu^2 to replace the weight's variance? Why or why not? Thanks!
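For reference, with a fully factorised Gaussian posterior N(mu, sigma^2) over each weight and a Gaussian prior N(0, prior_sigma^2), the KL term is available in closed form directly from the weight parameters, whatever parameterisation (sigma^2 or alpha * mu^2) is used for the forward pass. A sketch of that closed form:

import torch

def gaussian_kl(mu, sigma, prior_sigma=1.0):
    # KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), summed over all weights.
    return (torch.log(prior_sigma / sigma)
            + (sigma ** 2 + mu ** 2) / (2 * prior_sigma ** 2)
            - 0.5).sum()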
Hi @JavierAntoran @stratisMarkou,
First of all, thanks for making all of this code available - it's been great to look through!
I'm currently spending some time working through Weight Uncertainty in Neural Networks in order to implement Bayes-by-Backprop. I was struggling to understand the difference between your implementation of Bayes-by-Backprop and your implementation of Bayes-by-Backprop with Local Reparameterization.
I was under the impression that the local reparameterization was the following:
Bayesian-Neural-Networks/src/Bayes_By_Backprop/model.py, lines 58 to 66 at commit 022b9ce
However, this same approach is used in both methods. The main difference I see in the code you've implemented is that the KL divergence is calculated in closed form in the Local Reparameterization version, due to the use of a Gaussian prior/posterior distribution.
I was wondering if my understanding of the local reparameterization method was wrong, or if I had simply misunderstood the code?
Any guidance would be much appreciated!
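For anyone comparing the two, the usual distinction is where the sampling happens; a hedged sketch of both forward passes for a single linear layer (generic names, not the repo's classes):

import torch

def bbb_forward(x, w_mu, w_sigma):
    # Standard Bayes-by-Backprop: sample the weight matrix once, then matmul.
    w = w_mu + w_sigma * torch.randn_like(w_sigma)
    return x @ w

def local_reparam_forward(x, w_mu, w_sigma):
    # Local reparameterisation: compute the mean and variance of the
    # pre-activations and sample those instead, one noise draw per example.
    act_mu = x @ w_mu
    act_var = (x ** 2) @ (w_sigma ** 2)
    return act_mu + act_var.sqrt() * torch.randn_like(act_mu)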
For the bbp_hetero notebook, should the plot be done as:
plt.fill_between(np.linspace(-5, 5, 200)* x_std + x_mean, means - aleatoric, means + aleatoric, color = c[0], alpha = 0.3, label = 'Aleatoric')
plt.fill_between(np.linspace(-5, 5, 200)* x_std + x_mean, means - epistemic, means + epistemic, color = c[1], alpha = 0.3, label = 'Epistemic')
At the moment, the boundaries are mixed.
Thanks for the attention so far,
Celso
In the class BBP_Heteroscedastic_Model_UCI, three layers are declared, but the call to layer3 is missing in the forward method. I added it and it worked fine.
Firstly, thank you for all these great notebooks, they've been very helpful in building a better understanding of these methods.
I am wondering where the function log_gaussian_loss originates from? I'm struggling to find a reference to it in the literature, though I'm very likely looking in the wrong places.
In MC dropout it seems that one output neuron is used for the prediction and another feeds into this loss function, and I'm struggling to get a simpler version working with a single output neuron and a different loss function. Where does this technique originate from?
Thanks again
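For context, this kind of loss is the negative log-likelihood of a heteroscedastic Gaussian, where the network predicts both a mean and a (log) noise scale for each input; it is commonly associated with Nix and Weigend (1994) and, in the deep-learning uncertainty literature, with Kendall and Gal (2017). A sketch of the two-head version (not necessarily the repo's exact signature):

import torch

def gaussian_nll(output, y):
    # output: (N, 2) with column 0 = predicted mean, column 1 = predicted log sigma.
    # Mean negative log-likelihood of y under N(mean, sigma^2),
    # up to the additive constant 0.5 * log(2 * pi).
    mu, log_sigma = output[:, 0], output[:, 1]
    return (log_sigma + 0.5 * ((y - mu) / log_sigma.exp()) ** 2).mean()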
Minor bug.
Instead of:
data = pd.read_csv('housing.data', header=0, delimiter="\s+").values
Use:
data = pd.read_csv('housing.data', header=None, delimiter="\s+").values