How to add different priors for each of the ARD lengthscale hyperparameters in GPflow

I've seen how to add a prior to the lengthscales hyperparameter in the MCMC notebook:
model.kernel.lengthscales.prior = tfd.Gamma(f64(1.0), f64(1.0))
From the above, it seems we can only add a prior to a single lengthscale of the RBF (radial basis function) kernel. I would like to add a prior to each of the lengthscales of an ARD (Automatic Relevance Determination) kernel. Any suggestions on how to do this are much appreciated.

Your code snippet,
model.kernel.lengthscales.prior = tfd.Gamma(f64(1.0), f64(1.0))
does add a prior to all the lengthscales of your ARD RBF kernel, but it's the same prior for each dimension.
To assign different priors to different dimensions of the ARD lengthscale, you can simply use the batch feature of tfp Distributions, e.g.
model.kernel.lengthscales.prior = tfd.Gamma(
    concentration=np.array([1.0, 2.0]),
    rate=np.array([3.0, 4.0]),
)
You can probably make it work for different distributions, too (there's tfd.JointDistribution); for that you'll have to consult the tensorflow_probability docs.
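To make the batching concrete, here is a minimal library-free sketch of what such a batched Gamma prior contributes to the log-density: each lengthscale is scored under its own (concentration, rate) pair, and the per-dimension terms are summed. The helper names gamma_log_pdf and ard_log_prior and the example lengthscale values are illustrative assumptions, not part of the GPflow or TensorFlow Probability API.

```python
import math


def gamma_log_pdf(x, concentration, rate):
    """Log-density of a Gamma(concentration, rate) distribution at x > 0."""
    return (concentration * math.log(rate) - math.lgamma(concentration)
            + (concentration - 1.0) * math.log(x) - rate * x)


def ard_log_prior(lengthscales, concentrations, rates):
    """Sum of independent per-dimension Gamma log-densities."""
    return sum(gamma_log_pdf(l, a, b)
               for l, a, b in zip(lengthscales, concentrations, rates))


# Hypothetical lengthscales for a 2-D ARD kernel, scored under the
# Gamma(1, 3) and Gamma(2, 4) priors from the snippet above.
print(ard_log_prior([0.5, 1.5], [1.0, 2.0], [3.0, 4.0]))
```

Each dimension only ever sees its own pair of parameters, which is what makes the batched form equivalent to assigning a separate prior per lengthscale.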


How to implement weight decay in tensorflow as in Caffe

In Caffe we have a decay_ratio which is usually set as 0.0005. Then all trainable parameters, e.g., W matrix in FC6 will be decayed by:
W = W * (1 - 0.0005)
after we applied the gradient to it.
I have gone through many TensorFlow tutorials, but I do not see how people implement this weight decay to prevent numerical problems (very large absolute values).
In my experience, I often run into numerical problems after 100k iterations during training.
I have also gone through related questions on Stack Overflow, e.g.,
How to set weight cost strength in TensorFlow?
However, the solution there seems a little different from how it is implemented in Caffe.
Does anyone have similar concerns? Thank you.
The current answer is wrong in that it doesn't give you proper "weight decay as in cuda-convnet/caffe" but instead L2-regularization, which is different.
When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding a L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (don't know how to TeX here, so excuse my pseudo-notation):
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w and thus inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw
gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" at page 10)
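This can be checked numerically on a one-parameter toy problem. The sketch below is plain Python rather than TensorFlow; the quadratic loss, the hyperparameter values, the particular momentum formulation, and all function names are illustrative assumptions. With plain SGD, weight decay and L2-regularization with lambda = weight_decay / learning_rate produce the same iterates; with momentum they drift apart.

```python
def grad(w):
    """Gradient of the toy loss 0.5 * (w - 3)**2."""
    return w - 3.0


def run_sgd(w, lr, steps, wd=0.0, lam=0.0):
    """Plain SGD. wd is decoupled weight decay, lam is an L2 coefficient."""
    for _ in range(steps):
        w = w - lr * (grad(w) + lam * w) - wd * w
    return w


def run_momentum(w, lr, steps, mu=0.9, wd=0.0, lam=0.0):
    """SGD with one common form of momentum; wd and lam as in run_sgd."""
    v = 0.0
    for _ in range(steps):
        # The L2 term enters the velocity; the decay term bypasses it.
        v = mu * v + grad(w) + lam * w
        w = w - lr * v - wd * w
    return w


lr, wd = 0.1, 0.01
lam = wd / lr  # L2 coefficient that matches wd under plain SGD

print(run_sgd(1.0, lr, 50, wd=wd), run_sgd(1.0, lr, 50, lam=lam))
print(run_momentum(1.0, lr, 50, wd=wd), run_momentum(1.0, lr, 50, lam=lam))
```

The plain-SGD pair agrees up to floating-point rounding, while the momentum pair does not, because the L2 term is accumulated into the velocity and the decay term is not.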
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of above paper.
One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.
loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    # Run the decay step only after the main optimizer step has finished.
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then all sum up and optimize using SGD which, as shown above, corresponds to actual weight-decay.
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.
Edit: see also this PR which just got merged into TF.
This is a duplicate question:
How to define weight decay for individual layers in TensorFlow?
# Create your variables
weights = tf.get_variable('weights', collections=['variables'])
with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.pack(
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )
# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)
# Add the other loss components to the collection losses
# ...
# To calculate your total loss
tf.add_n(tf.get_collection('losses'), name='total_loss')
You can just set whatever lambda value you want to the weight decay. The above just adds the l2 norm to it.

Scikit-Learn's DPGMM fitting: number of components?

I'm trying to fit a mixed normal model to some data using scikit-learn's DPGMM algorithm. One of the advantages advertised on [0] is that I don't need to specify the number of components; which is good, because I do not know the number of components in my data. The documentation states that I only need to specify an upper bound. However, it looks very much like that is not true:
>>> data = numpy.random.normal(loc = 0.0, scale = 1.0, size = 1000)
>>> from sklearn.mixture import DPGMM
>>> d = DPGMM(n_components=5)
>>> d.fit(data.reshape(-1,1))
DPGMM(alpha=1.0, covariance_type='diag', init_params='wmc', min_covar=None,
n_components=5, n_iter=10, params='wmc', random_state=None, thresh=None,
tol=0.001, verbose=0)
>>> d.n_components
5
>>> d.means_
array([[-0.02283383],
[ 0.06259168],
[ 0.00390097],
[ 0.02934676],
[-0.05533165]])
As you can see, the fitting reports five components (the upper bound) even for data clearly sampled from just one normal distribution.
Am I doing something wrong? Did I misunderstand something?
Thanks a lot in advance,
Lukas
[0] http://scikit-learn.org/stable/modules/mixture.html#dpgmm
I recently had similar doubts about the results of this DPGMM implementation. If you check the provided example, you will notice that DPGMM always returns a model with n_components components; the trick is to remove the redundant ones. This can be done with the predict function.
Unfortunately, this important piece is hidden in a comment in the code example:
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant components
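As a sketch of that trick, independent of any particular scikit-learn version: once predict has assigned each sample to a component, only the components that actually receive samples need to be kept. The labels list below is made-up stand-in data for the output of d.predict(...), and used_components is a hypothetical helper.

```python
from collections import Counter


def used_components(labels, min_count=1):
    """Map component index -> sample count, keeping components with enough samples."""
    counts = Counter(labels)
    return {k: n for k, n in sorted(counts.items()) if n >= min_count}


# Stand-in for d.predict(data.reshape(-1, 1)); not real model output.
labels = [2, 2, 2, 0, 2, 2, 2, 2]

print(used_components(labels))               # components that own any sample
print(used_components(labels, min_count=2))  # drop near-empty components too
```

The number of keys in the returned dictionary is then the effective number of components, which can be much smaller than the n_components upper bound.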
Perhaps look at using an improved sklearn solution for this kind of problem, namely a Bayesian Gaussian Mixture. With this model, the suggested prior number of components must be given, but once trained, the model assigns weightings to each component, which essentially indicate their relevance. Here is a pretty cool visual demo of BGMM in action.
Once you have experimented with training a few BGMMs on your data, you can get a feel for a sensible estimate of the number of components for your problem.

Using SparseTensor as a trainable variable?

I'm trying to use SparseTensor to represent weight variables in a fully-connected layer.
However, it seems that TensorFlow 0.8 doesn't allow a SparseTensor to be used as a tf.Variable.
Is there any way to work around this?
I've tried
import tensorflow as tf
a = tf.constant(1)
b = tf.SparseTensor([[0,0]],[1],[1,1])
print a.__class__ # shows <class 'tensorflow.python.framework.ops.Tensor'>
print b.__class__ # shows <class 'tensorflow.python.framework.ops.SparseTensor'>
tf.Variable(a) # Variable is declared correctly
tf.Variable(b) # Fail
By the way, my ultimate goal in using SparseTensor is to permanently mask some of the connections in dense form, so that these pruned connections are ignored while calculating and applying gradients.
In my current implementation of an MLP, SparseTensor and its sparse matmul ops successfully produce inference outputs. However, the weights declared using SparseTensor are not updated as training proceeds.
As a workaround for your problem, you can provide a tf.Variable (until TensorFlow v0.8) for the values of a sparse tensor. The sparsity structure has to be pre-defined in that case, but the weights remain trainable.
weights = tf.Variable(<initial-value>)
sparse_var = tf.SparseTensor(<indices>, weights, <shape>) # v0.8
sparse_var = tf.SparseTensor(<indices>, tf.identity(weights), <shape>) # v0.9
TensorFlow doesn't currently support sparse tensor variables. However, it does support sparse lookups (tf.nn.embedding_lookup) and sparse gradient updates (tf.sparse_add) of dense variables. I suspect these two will suffice for your use case.
TensorFlow doesn't support training on sparse tensors yet. You can initialize a sparse tensor as you wish, then convert it into a dense tensor and create a variable from it like this:
# You need to correctly initialize the sparse tensor with indices, values and a shape
b = tf.SparseTensor(indices, values, shape)
b_dense = tf.sparse_tensor_to_dense(b)
b_variable = tf.Variable(b_dense)
Now you have initialized a sparse tensor as a variable. Now you need to take care of the gradient update (in other words, make sure the entries in the variable stay 0, since there is a non-vanishing gradient calculated in the backpropagation algorithm for them when using this naively).
In order to do this, TensorFlow optimizers have a method called tf.train.Optimizer.compute_gradients(loss, [list_of_variables]). This calculates all the gradients in the graph necessary to minimize the loss function, but doesn't apply them yet. This method returns a list of tuples in a form of (gradients, variable). You can modify these gradients freely, but in your case it makes sense to mask the gradients not needed to 0 (i.e. by creating another sparse tensor with default values 0.0 and values 1.0 where the weights in your network are present).
After modifying them, you call the optimizer method tf.train.Optimizer.apply_gradients(grads_and_vars) to actually apply the gradients. Example code would look like this:
# Create optimizer instance
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
# Get the gradients for your weights
grads_and_vars = optimizer.compute_gradients(loss, [b_variable])
# Modify the gradients at will
# In your case it would look similar to this
modified_grads_and_vars = [(tf.multiply(gv[0], mask_tensor), gv[1]) for gv in grads_and_vars]
# Apply modified gradients to your model
optimizer.apply_gradients(modified_grads_and_vars)
This makes sure your entries stay 0 in your weight matrix and no unwanted connections are created. You need to take care of all the other gradients for all other variables later.
The above code works with some minor corrections, like this:
def optimize(loss, mask_tensor):
    optimizer = tf.train.AdamOptimizer(0.001)
    grads_and_vars = optimizer.compute_gradients(loss)
    modified_grads_and_vars = [
        (tf.multiply(gv[0], mask_tensor[gv[1]]), gv[1]) for gv in grads_and_vars
    ]
    return optimizer.apply_gradients(modified_grads_and_vars)
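The effect of the mask is easy to see outside TensorFlow. The following is a minimal pure-Python sketch of one masked gradient-descent step on a small weight matrix; masked_sgd_step and all the numbers are hypothetical, and the mask holds 1.0 where a connection exists and 0.0 where it was pruned.

```python
def masked_sgd_step(weights, grads, mask, learning_rate):
    """One SGD step in which gradients of pruned entries (mask == 0) are zeroed."""
    return [
        [w - learning_rate * g * m for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(weights, grads, mask)
    ]


# Pruned connections start at 0.0 and have mask 0.0.
weights = [[0.5, 0.0], [0.0, -0.3]]
grads = [[0.1, 0.7], [-0.2, 0.4]]  # backprop yields nonzero grads everywhere
mask = [[1.0, 0.0], [0.0, 1.0]]

updated = masked_sgd_step(weights, grads, mask, learning_rate=0.1)
print(updated)
```

Even though the raw gradients at the pruned positions are nonzero, multiplying by the mask keeps those weights at exactly zero, so no connection is re-created during training.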

GMM in MATLAB gives different results for the same file

I constructed a Gaussian Mixture Model in Matlab with a dataset:
model = gmdistribution.fit(data,M,'Replicates',5);
with M = 3 Gaussian components. I tested new data with:
[P, l] = posterior(model,new_data);
I ran the program several times and didn't get the same result. Each run produces different log-likelihood values. I use the log-likelihood to make decisions, and this value for the same data (new_data) differs for each run. What does it depend on? How can I resolve this problem?
First, assuming that you're using a newish version of Matlab, the gmdistribution.fit documentation indicates that the fit method is deprecated and that fitgmdist should be used. See here for an example.
Second, the documentation for gmdistribution.fit indicates that if the 'Replicates' option is larger than 1, the 'randSample' start method will be used to produce the initial parameters. This may be the cause (or at least one of the causes) of your observed variability.
Finally, you can also try using rng before calling gmdistribution.fit to set the seed of the global random number stream (assuming the function doesn't use its own stream internally). Alternatively, you can try specifying an 'Options' parameter via statset:
seed = 1;
s = RandStream('mt19937ar','Seed',seed);
opts = statset('Streams',s);
model = gmdistribution.fit(data,M,'Replicates',5,'Options',opts);
I can't test this fully myself – see the gmdistribution class documentation for further details.
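The underlying issue is not specific to MATLAB: fitting a mixture model starts from randomly chosen initial parameters, so unseeded runs can end up at different solutions. As a language-neutral illustration in Python (not MATLAB, and not the gmdistribution API), the sketch below picks random initial centers with and without a fixed seed; pick_initial_centers and the data values are hypothetical.

```python
import random


def pick_initial_centers(data, k, seed=None):
    """Choose k data points as starting centers, optionally from a seeded stream."""
    rng = random.Random(seed)  # private stream, independent of the global one
    return rng.sample(data, k)


data = [0.1, 0.4, 2.3, 2.9, 5.0, 5.2, 7.7, 8.1]

print(pick_initial_centers(data, 3, seed=1))
print(pick_initial_centers(data, 3, seed=1))  # identical to the line above
print(pick_initial_centers(data, 3))          # may change from run to run
```

Passing a dedicated seeded stream, as the statset 'Streams' option does, plays the same role as the private random.Random instance here.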

LIBSVM in MATLAB/Octave - what's the output of libsvmread?

The second output of the libsvmread command is a set of features for each given training example.
For example, in the following MATLAB command:
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
This second variable (heart_scale_inst) holds content in a form that I don't understand, for example:
<1, 1> -> 0.70833
What is the meaning of it? How is it to be used (I can't plot it, the way it is)?
PS. If anyone could please recommend a good LIBSVM tutorial, I'd appreciate it. I haven't found anything useful and the README file isn't very clear... Thanks.
The definitive LIBSVM tutorial for beginners is called A Practical Guide to SVM Classification; it is available from the site of the LIBSVM authors.
The second output returned is called the instance matrix. It is a matrix, call it M, in which M(1,:) holds the features of data point 1, and so on. The matrix is sparse, which is why it prints out oddly. To see it in full, print full(M).
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
with heart_scale_label and heart_scale_inst you should be able to train an SVM by issuing:
mod = svmtrain(heart_scale_label,heart_scale_inst,'-c 1 -t 0');
I strongly suggest you read the guide linked above to learn how to set the c parameter (and, in the case of an RBF kernel, the gamma parameter), but the line above is how you would train with that data.
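To make the instance matrix more concrete, here is a rough Python sketch of what a reader for the LIBSVM text format does: each line is "label index:value index:value ..." with 1-based feature indices, and any feature that is omitted is zero. The functions parse_libsvm_lines and to_dense and the two sample lines are illustrative assumptions, not the contents of heart_scale or the libsvmread implementation.

```python
def parse_libsvm_lines(lines):
    """Parse LIBSVM-format lines into labels and a sparse {(row, col): value} map."""
    labels, entries = [], {}
    rows = [line for line in lines if line.strip()]
    for row, line in enumerate(rows, start=1):
        parts = line.split()
        labels.append(float(parts[0]))
        for token in parts[1:]:
            col, value = token.split(":")
            entries[(row, int(col))] = float(value)
    return labels, entries


def to_dense(entries, n_rows, n_cols):
    """Expand the sparse map into a full matrix, like full(M) in MATLAB."""
    return [[entries.get((r, c), 0.0) for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


# Made-up sample in LIBSVM format; feature 2 of row 1 is omitted, hence zero.
sample = ["+1 1:0.70833 3:-1", "-1 2:0.5"]
labels, entries = parse_libsvm_lines(sample)
print(labels)
print(entries)
print(to_dense(entries, 2, 3))
```

The sparse map is the analogue of the (1,1) -> 0.70833 style printout: only the nonzero entries are stored, keyed by row (data point) and column (feature).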
I think it is the probability with which the test case has been predicted to belong to the heart_scale label category.