cpmpercussion / keras-mdn-layer Goto Github PK

An MDN Layer for Keras using TensorFlow's distributions module

License: MIT License

Python 15.64% Jupyter Notebook 84.36%

mdn mixture-density-network neural-network mdn-rnn tensorflow keras

keras-mdn-layer's Introduction

Keras Mixture Density Network Layer

A mixture density network (MDN) Layer for Keras using TensorFlow's distributions module. This makes it a bit more simple to experiment with neural networks that predict multiple real-valued variables that can take on multiple equally likely values.

This layer can help build MDN-RNNs similar to those used in RoboJam, Sketch-RNN, handwriting generation, and maybe even world models. You can do a lot of cool stuff with MDNs!

One benefit of this implementation is that you can predict any number of real-values. TensorFlow's Mixture, Categorical, and MultivariateNormalDiag distribution functions are used to generate the loss function (the probability density function of a mixture of multivariate normal distributions with a diagonal covariance matrix). In previous work, the loss function has often been specified by hand which is fine for 1D or 2D prediction, but becomes a bit more annoying after that.

Two important functions are provided for training and prediction:

get_mixture_loss_func(output_dim, num_mixtures): This function generates a loss function with the correct output dimensiona and number of mixtures.
sample_from_output(params, output_dim, num_mixtures, temp=1.0): This functions samples from the mixture distribution output by the model.

Installation

This project requires Python 3.6+, TensorFlow and TensorFlow Probability. You can easily install this package from PyPI via pip like so:

python3 -m pip install keras-mdn-layer

And finally, import the module in Python: import keras_mdn_layer as mdn

Alternatively, you can clone or download this repository and then install via python setup.py install, or copy the mdn folder into your own project.

Build

This project builds using poetry. To build a wheel use poetry build.

Examples

Some examples are provided in the notebooks directory.

To run these using poetry, run poetry install and then open jupyter poetry run jupyter lab.

There's scripts for fitting multivalued functions, a standard MDN toy problem:

There's also a script for generating fake kanji characters:

And finally, for learning how to generate musical touch-screen performances with a temporal component:

How to use

The MDN layer should be the last in your network and you should use get_mixture_loss_func to generate a loss function. Here's an example of a simple network with one Dense layer followed by the MDN.

from tensorflow import keras
import keras_mdn_layer as mdn

N_HIDDEN = 15  # number of hidden units in the Dense layer
N_MIXES = 10  # number of mixture components
OUTPUT_DIMS = 2  # number of real-values predicted by each mixture component

model = keras.Sequential()
model.add(keras.layers.Dense(N_HIDDEN, batch_input_shape=(None, 1), activation='relu'))
model.add(mdn.MDN(OUTPUT_DIMS, N_MIXES))
model.compile(loss=mdn.get_mixture_loss_func(OUTPUT_DIMS,N_MIXES), optimizer=keras.optimizers.Adam())
model.summary()

Fit as normal:

history = model.fit(x=x_train, y=y_train)

The predictions from the network are parameters of the mixture models, so you have to apply the sample_from_output function to generate samples.

y_test = model.predict(x_test)
y_samples = np.apply_along_axis(sample_from_output, 1, y_test, OUTPUT_DIMS, N_MIXES, temp=1.0)

See the notebooks directory for examples in jupyter notebooks!

Load/Save Model

Saving models is straight forward:

model.save('test_save.h5')

But loading requires cutom_objects to be filled with the MDN layer, and a loss function with the appropriate parameters:

m_2 = keras.models.load_model('test_save.h5', custom_objects={'MDN': mdn.MDN, 'mdn_loss_func': mdn.get_mixture_loss_func(1, N_MIXES)})

Acknowledgements

Hat tip to Omimo's Keras MDN layer for a starting point for this code.
Super hat tip to hardmaru's MDN explanation, projects, and good ideas for sampling functions etc.
Many good ideas from Axel Brando's Master's Thesis
Mixture Density Networks in Edward tutorial.

References

Christopher M. Bishop. 1994. Mixture Density Networks. Technical Report NCRG/94/004. Neural Computing Research Group, Aston University. http://publications.aston.ac.uk/373/
Axel Brando. 2017. Mixture Density Networks (MDN) for distribution and uncertainty estimation. Master’s thesis. Universitat Politècnica de Catalunya.
A. Graves. 2013. Generating Sequences With Recurrent Neural Networks. ArXiv e-prints (Aug. 2013). https://arxiv.org/abs/1308.0850
David Ha and Douglas Eck. 2017. A Neural Representation of Sketch Drawings. ArXiv e-prints (April 2017). https://arxiv.org/abs/1704.03477
Charles P. Martin and Jim Torresen. 2018. RoboJam: A Musical Mixture Density Network for Collaborative Touchscreen Interaction. In Evolutionary and Biologically Inspired Music, Sound, Art and Design: EvoMUSART ’18, A. Liapis et al. (Ed.). Lecture Notes in Computer Science, Vol. 10783. Springer International Publishing. DOI:10.1007/9778-3-319-77583-8_11

keras-mdn-layer's People

Contributors

Stargazers

Watchers

keras-mdn-layer's Issues

2D MDN example not working

I was running you 2D example on local machine and I came across this error

Traceback (most recent call last):
File "main.py", line 36, in
history = model.fit(x=x_input, y=y_input, batch_size=128, epochs=300, validation_split=0.15, callbacks=[keras.callbacks.TerminateOnNaN()])
File "/usr/local/lib/python3.5/dist-packages/keras/models.py", line 960, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1572, in fit
batch_size=batch_size)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1411, in _standardize_user_data
exception_prefix='target')
File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 153, in _standardize_input_data
str(array.shape))
ValueError: Error when checking target: expected mdn_1 to have shape (None, 50) but got array with shape (5000, 2)

The added layer must be an instance of class Layer. Found: <mdn.MDN object at 0x7fbd511f1860>

I get an error message when trying to use the code below How to use in the readme file.

I am using colab with Python 3.6.9 and TensorFlow 1.15.0.

!python --version

prints out Python 3.6.9 and

%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)

prints out 1.15.0.

I have installed the latest version of the keras-mdn-layer package with

!pip install keras-mdn-layer

Collecting keras-mdn-layer
  Downloading https://files.pythonhosted.org/packages/f3/90/7c9233a1b334bf91bc7f9ec2534eb40f7bb418900f35cbd201864c600cf6/keras-mdn-layer-0.3.0.tar.gz
Building wheels for collected packages: keras-mdn-layer
  Building wheel for keras-mdn-layer (setup.py) ... done
  Created wheel for keras-mdn-layer: filename=keras_mdn_layer-0.3.0-cp36-none-any.whl size=7054 sha256=e53024a3d12d2c6bc1faa4ef682c44b5dbf2e8f6cad7ad8876a9cbecb84b666b
  Stored in directory: /root/.cache/pip/wheels/b6/e3/ba/8fb07898b8c8e5d4c1a035add0b71629b2fbe82ee8a5f0a2c8
Successfully built keras-mdn-layer
Installing collected packages: keras-mdn-layer
Successfully installed keras-mdn-layer-0.3.0

Code (copied from the readme file without any changes):

import keras
import mdn

N_HIDDEN = 15  # number of hidden units in the Dense layer
N_MIXES = 10  # number of mixture components
OUTPUT_DIMS = 2  # number of real-values predicted by each mixture component

model = keras.Sequential()
model.add(keras.layers.Dense(N_HIDDEN, batch_input_shape=(None, 1), activation='relu'))
model.add(mdn.MDN(OUTPUT_DIMS, N_MIXES))
model.compile(loss=mdn.get_mixture_loss_func(OUTPUT_DIMS,N_MIXES), optimizer=keras.optimizers.Adam())
model.summary()

Error message :

Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:66: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:541: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4432: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-5428ceff4415> in <module>()
      8 model = keras.Sequential()
      9 model.add(keras.layers.Dense(N_HIDDEN, batch_input_shape=(None, 1), activation='relu'))
---> 10 model.add(mdn.MDN(OUTPUT_DIMS, N_MIXES))
     11 model.compile(loss=mdn.get_mixture_loss_func(OUTPUT_DIMS,N_MIXES), optimizer=keras.optimizers.Adam())
     12 model.summary()

/usr/local/lib/python3.6/dist-packages/keras/engine/sequential.py in add(self, layer)
    131             raise TypeError('The added layer must be '
    132                             'an instance of class Layer. '
--> 133                             'Found: ' + str(layer))
    134         self.built = False
    135         if not self._layers:

TypeError: The added layer must be an instance of class Layer. Found: <mdn.MDN object at 0x7f65b87e3da0>

What am I missing?

test time-distributed MDN-RNNs

Convert to a scipy distribution object

It would be nice to have a helper function that takes the output of the model and converts it to a scipy.stats rv object so that it can be used to compute density or cumulative density, etc.
I’m guessing it’s not complicated to do, looking at the example in https://github.com/cpmpercussion/keras-mdn-layer/blob/master/notebooks/MDN-2D-spiral-prediction.ipynb (specifically the cell starting with “Plot the means”).

A couple of questions concerning your 1D sine example

Hello @cpmpercussion,

Thank you so much for your contribution, it is very valuable for those who are not still very much familiar with Keras and Tensorflow, such as me! 🥇

Glancing through your 1D sine prediction example, I was pretty surprised of how accurate it is, provided that it is using 10 gaussians and only 15 activations within each hidden layer! Isn't usually more advisable to have more hidden nodes than output nodes to prevent from any information loss?

I am trying to use your code to emulate Bishop's example for an inverted sine, however, I am still not able to achieve very good prediction results, as you may see...
I seem to obtain much better results using Matlab for the very same set of hyperparameters:

Adam optimizer, step size = 1e-3, beta_1 = 0.9, beta_2 = 0.999
NSAMPLE = 1000, validation split = 0.3
Batch size = NSAMPLE (batch gradient descent)
N_HIDDEN = 20, N_MIXES = 3
1000 test samples
Nepochs = 3000

Please find also my Matlab image attached here (note that there 'validation' referest to test samples, and vice versa). The code I used (I barely introduced the modifications above from yours) is found bellow.

import keras
import mdn
import numpy as np
import matplotlib.pyplot as plt

## Generating some data:
NSAMPLE = 1000

x_data = np.random.uniform(0, 1, NSAMPLE)			# Predictor variable
y_data = x_data + 0.3*np.sin(2*np.pi*x_data) + np.random.uniform(-0.1, 0.1, NSAMPLE) # np.random.randn(n_row)
x_data, y_data = y_data, x_data

plt.figure(figsize=(8, 8))
plt.plot(x_data,y_data,'ro', alpha=0.3)
plt.show()

N_HIDDEN = 20
N_MIXES = 3

model = keras.Sequential()
model.add(keras.layers.Dense(N_HIDDEN, batch_input_shape=(None, 1), activation='tanh'))
model.add(mdn.MDN(1, N_MIXES))
model.compile(loss=mdn.get_mixture_loss_func(1,N_MIXES), optimizer=keras.optimizers.Adam()) #, metrics=[mdn.get_mixture_mse_accuracy(1,N_MIXES)])
model.summary()

history = model.fit(x=x_data, y=y_data, verbose=0, batch_size=NSAMPLE, epochs=3000, validation_split=0.3)

plt.figure(figsize=(10, 5))
plt.ylim([-3,3])
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.show()

## Sample on some test data:
x_test = np.float32(np.arange(0,1,0.001))
NTEST = x_test.size
print("Testing:", NTEST, "samples.")
x_test = x_test.reshape(NTEST,1) # needs to be a matrix, not a vector

# Make predictions from the model
y_test = model.predict(x_test)
# y_test contains parameters for distributions, not actual points on the graph.
# To find points on the graph, we need to sample from each distribution.

# Sample from the predicted distributions
y_samples = np.apply_along_axis(mdn.sample_from_output, 1, y_test, 1, N_MIXES, temp=1.0)

# Split up the mixture parameters (for future fun)
mus = np.apply_along_axis((lambda a: a[:N_MIXES]),1, y_test)
sigs = np.apply_along_axis((lambda a: a[N_MIXES:2*N_MIXES]),1, y_test)
pis = np.apply_along_axis((lambda a: mdn.softmax(a[2*N_MIXES:])),1, y_test)

# Plot the samples
plt.figure(figsize=(8, 8))
plt.plot(x_data,y_data,'ro', x_test, y_samples[:,:,0], 'bo',alpha=0.3)
plt.show()
# These look pretty good!

# Plot the means - this gives us some insight into how the model learns to produce the mixtures.
plt.figure(figsize=(8, 8))
plt.plot(x_data,y_data,'ro', x_test, mus,'bo',alpha=0.3)
plt.show()
# Cool!

# Let's plot the variances and weightings of the means as well.
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_subplot(111)
ax1.scatter(x_data,y_data,marker='o', c='r', alpha=0.3)
for i in range(N_MIXES):
    ax1.scatter(x_test, mus[:,i], marker='o', s=200*sigs[:,i]*pis[:,i],alpha=0.3)
plt.show()

Do you have any idea about why this is happening?

Thank you so much in advance, and may you have a nice day!

develop some notes on how to use this layer in projects

add a dance net example?

Negative Loss while training the model

Hi! I was training a mdn model with 4 outputs but after 2-3 epochs, the loss became negative. Is it possible that the loss can be negative ?

Loss function give negative values... bad news ?

Hi,

When training the MDN model, after some times, the custom "loss=mdn.get_mixture_loss_func(OUTPUT_DIMS,N_MIXES)" give negative values.

Note: Happens with Standard Scaling on Y. every thing else is quite simple:

What that means ? Is it a bug ? Is it still converging ?

thanks!

Check treatment of scale matrix vs covariance matrix in sampling procedure

There could be an issue with sampling due to (my) confusion about standard deviation and variance.

The samples are drawn using numpy like so (documentation) (line 238 of __init__.py)

sample = np.random.multivariate_normal(mus_vector, cov_matrix, 1)

But the output from the mixture density layer are treated as scale variables in tfp.distributions.MultivariateNormalDiag. This notes that:

covariance = scale @ scale.T

Thus, it seems we should have been squaring the cov_matrix before putting it into the multivariate normal sampling procedure. This could explain why we end up having to scale down the sigma variable so much in real-world applications.

A todo here is to get a definite answer and do some test to try out what's going on.

Reseting Model Weights

Hi, I was wondering whether reseting the model weights was available.

How could I apply SHAP model to my MDN?

input_parameters = [
    'Mass', 
    'Radius',
    'Fe/Mg bulk',
    'Mg/Si bulk',
    'k_2'
]

output_parameters = [
    'H2O_radial_frac',
    'Mantle_radial_frac',
    'Core_radial_frac',
    'Core_mass_frac',
    'P_CMB', 
    'T_CMB',
]

DROPOUT = 0.05
N_HIDDEN, N_MIXES = 512, 20  # N_MIXES is the number of mixtures
INPUT_DIMS = len(input_parameters)              
OUTPUT_DIMS = len(output_parameters)
ACT_FUN = 'relu'
my_callbacks = [
    EarlyStopping(
        monitor='val_loss',
        mode='min',
        patience=30, 
        verbose=0,
    ),
]
model = keras.Sequential()
model.add(Dense(N_HIDDEN, input_dim=INPUT_DIMS))
model.add(Activation(ACT_FUN))
model.add(Dropout(DROPOUT))
model.add(Dense(N_HIDDEN))
model.add(Activation(ACT_FUN))
model.add(Dropout(DROPOUT))
model.add(Dense(N_HIDDEN))
model.add(Activation(ACT_FUN))
model.add(Dropout(DROPOUT))
model.add(mdn.MDN(OUTPUT_DIMS, N_MIXES, name='mdn_outputs'))
model.summary()


model.compile(
    loss=mdn.get_mixture_loss_func(OUTPUT_DIMS,N_MIXES),
    optimizer=keras.optimizers.Adam(lr=0.0001)
)
history=model.fit(
    X_train_scaled,y_train_scaled,
    epochs=200,
    batch_size=512,
    validation_split=0.1,
    callbacks=my_callbacks,
    verbose=0)

pred = model.predict(X_test_scaled)
pred.shape

you will see

(7770, 260)

and then the SHAP model is applied (here is the github link https://github.com/slundberg/shap)

explainer = shap.KernelExplainer(model.predict, shap.sample(X_train_scaled, 10))
shap_values = explainer.shap_values(shap.sample(X_test_scaled, 10))
shap.summary_plot(shap_values, shap.sample(X_test_scaled, 10))

The output figure shows that SHAP model regards the MDN as a classification problem with 260 classes.
How could I calculate the shap value correctly?
Thanks!

Changing distribution, Bernoulli, Laplace

Hi,

It would be vey usefull to be able to tweak the type of distribution in the customized loss function. For example, for a binary classification problem, changing the distribution to Bernoulli (Or any other strategy to be able to do classification). For regression, changing Normal distribution to Laplace, Exponential, LogNormal or Gamma distribution to see if the results are better.

I tried to make some modifications myself on your code, but i'm really not sure about the final result !

Thanks !

Question relating the amount of output mixture parameters

Hi! Thanks for your implementation, great work. I am still learning about mixture density networks and have a question relating to the amount of output parameters.

In this implementation it seems that there are (2 * NUMBER_MIXTURES * OUTPUT_DIM) + (NUMBER_MIXTURES) parameters. For example, in the Kanji notebook this results in 2*10*3 + 10 = 70 outputs.

However, in some other implementations of MDN-RNN's such as the one from World Models (https://github.com/hardmaru/WorldModelsExperiments/blob/master/carracing/rnn/rnn.py) there seems to be 3 * NUMBER_MIXTURES * OUTPUT_DIM output parameters. I think the difference is that you only model NUMBER_MIXTURES mixing coefficients, and they model NUM_MIXTURES * OUTPUT_DIM mixture coefficients. In your Kanji notebook this would result in 3*10*3 = 90 outputs.

I was wondering why this difference between implementations exists? Is this due to the difference that your code does not model the covariance of the Gaussians or is that something unrelated? Thanks in advance!

[sampling] Explanation of sample_from_categorical

Hello, I am trying to understand your implementation.

In sample_from_output, you use the function sample_from_categorical to get the mixture component normal distribution you will sample from after.
It is commented:
# Alternative way to sample from categorical:
# m = np.random.choice(range(len(pis)), p=pis)

Do you have any reason for using your function sample_from_categorical instead of numpy's one ?

Transition to tf.keras

a short-term goal is to make tf.keras the main import and cut out regular keras. This is actually fully working in the tfkeras branch, but the examples haven't been updated.

I'm just selfishly waiting on finishing a research project before switching this over the master and releasing a new version.

Negative loss more clearly explanation

Hi,

I am trying to understand why the mdn loss here may be negative and hope you can help me!

I learned that the mdn loss function is defined as the negative log loss of PDF, the PDF is always bigger than 0 and smaller or equal to 1. Thus the negative log loss of PDF will always be positive. Why the negative loss is normal here.

I also check the get_mixture_loss_func and still cannot fully understand the negative loss logic, is the tfd.Mixture matters here?

Hope you can help me!

[activation] explanation of the value of the small addition in elu activation

Hello, I am trying to understand your implementation.

In the activation function:
def elu_plus_one_plus_epsilon(x):
"""ELU activation with a very small addition to help prevent NaN in loss."""
return (K.elu(x) + 1 + 1e-8)

Do you have any reason for using 1e-8 ? Default tensor flow fuzz factor K.epsilon() value is 1e-7, and going with a smaller value could (in theory) make some computations unstable. (don't have any example sorry).

Also, maybe using something like K.clip(K.elu(x) + 1 , K.epsilon()) or with your value K.clip(K.elu(x) + 1 , 1e-8) would make more sense in the activation function ?

Loading Saved Models

I was using the model for multi-dimensional prediction. Has some issues loading the model. What all custom objects and arguments need to be passed to load a saved Keras model.

Working with TF2.0

Hi and thanks for this MDN abstraction and NIME paper. Below is more of a feature request than an issue.

Have you thought of making your MDN layer work with TF2.0?

I am relatively new to TF and attempting to port your work to 2.0, you can see the changes I've made here. It seems to be working, except I am unable to save the model using model.save(), I get the error Unable to create link (name already exists)

If you have any thoughts, suggestions that would greatly appreciated as I'm pretty lost with that error ¯_(ツ)_/¯

Tests of sampling procedure

The sampling procedure sometimes seems to produce output with too much variance.

It should be straightforward to test this with values from a known distribution, do some sampling, and then calculate the sample mean and variance to make sure it is behaving correctly.

add a world model example

Bug when multiple outputs, std missing

Hi,

When using multiple y outputs with multiple mixtures, there seem to miss some std parameters:

For example, when 2 y outputs with 4 mixtures, when calling mdn_model.predict() that gives:

8 means
8 Probabilities
4 stds ???

I was expeting 8 stds as well, is there a reason why or it is a bug ?

Another example, if 2 y outputs with 2 mixtures:

Thanks!

Use of hyperthreading in hyperparameter sweeps

Hello back,

I aimed to run your code using Python hyperthreading module (I don't know if you are familiar with it) to speed up hyperparameter sweeps, as you may find in the file attached (it is in a .txt extension as I could not directly update a .py file).

However, most cases show how results hardly improve and strongly fluctuate even when using very low learning rates, opposite to when you do so sequentially without hyperthreading (where both loss and val_loss keep gradually decreasing).

Do you know if this is just an incompatibility issue or am I doing something wrong? How do you actually deal with hyperparameter sweeps? Are you using any scikit_tools or you do it manually such as I?

Thank you in advance, and best regards! May you have a nice weekend!

# x_data, y_data generated from a .csv file
SAMPLE = x_data.shape[0]
N_OUTPUTS = x_data.shape[1]
N_INPUTS = y_data.shape[1]

N_EPOCHS = [6000]
N_LAYERS = [1]
N_HIDDEN = [100]
N_MIXES = [8, 12]
DROPOUT = [0]
ACT_FUNCTION = 'tanh'
LR = [0.00005, 0.00001]
BATCH_SIZE = [NSAMPLE]
PTEST = [0.3]
beta1 = [0.9]
beta2 = [0.999]

def MDN(N_MIXES, LR, BATCH_SIZE, N_LAYERS, N_HIDDEN, DROPOUT, PTEST, N_EPOCHS, beta1, beta2):
	model = keras.Sequential()
	model.add(Dense(N_HIDDEN, batch_input_shape = (None, N_INPUTS), activation = ACT_FUNCTION))
	model.add(Dropout(DROPOUT))
	for layer in range(N_LAYERS - 1):
		model.add(Dense(N_HIDDEN, activation = ACT_FUNCTION))
		model.add(Dropout(DROPOUT))
	model.add(mdn.MDN(N_OUTPUTS, N_MIXES))
	return model	
	adam = keras.optimizers.Adam(lr=LR, beta_1 = beta1, beta_2 = beta2)
	model.compile(loss=mdn.get_mixture_loss_func(N_OUTPUTS,N_MIXES), optimizer=adam)	
	
	H = model.fit(x=x_data, y=y_data, verbose=0, batch_size=BATCH_SIZE, epochs=N_EPOCHS, validation_split=PTEST)
		
	return N_MIXES, LR, BATCH_SIZE, N_LAYERS, N_HIDDEN, DROPOUT, beta1, beta2, H.history['loss'], H.history['val_loss']
	
params = list(itertools.product(*[N_MIXES, LR, BATCH_SIZE, N_LAYERS, N_HIDDEN, DROPOUT, PTEST, N_EPOCHS, beta1, beta2]))

pool = ThreadPool()
results = pool.starmap(MDN, params)
pool.close()
pool.join()

Issues with temperature

Hello @cpmpercussion,

Here I am back again to ask you a new question about your code! This time, regarding temperature sampling.

My reasoning tells me that temperature sampling for 'mu' should be as close to 1 as possible, as the selection of one gaussian ought not to discriminate the others. When it comes to temperature sampling for sigma, I guess that we could make use of values of sigma_temp as close to 0 as possible, but I cannot tell the exact reason why.

I did some grid search for different values of sigma_temp and temp and my results seem to be better when both of them are as small as possible. But when I set both sigma_temp and temp to 1e-4, I receive the following regression line as output (my outputs are normalized between -1 and 1, and the blue line represents the expected -target values):

Does it exist a clear 'rule of thumb' for this model about how to perform temperature sampling for sigma and mu values? Do you find any reason why I am having this weird shape as output?

Many thanks in advance, and may you have a nice day!