Using hidden activations in loss function - neural-network

I want to create a custom loss function for a double-input double-output model in Keras that:
minimizes the reconstruction error of two autoencoders;
maximizes the correlation of the bottleneck features of the autoencoders.
For this I need to pass to the loss function:
both inputs;
both outputs / reconstructions;
output of intermediate layers for both (hidden activations).
I know I can pass both inputs and outputs to Model, but am struggling to find a way to pass the hidden activations.
I could create two new Models that have the output of the intermediate layers and pass that to loss, like:
intermediate_layer_model1 = Model(input=input1, output=autoencoder.get_layer('encoded1').output)
intermediate_layer_model2 = Model(input=input2, output=autoencoder.get_layer('encoded2').output)
autoencoder.compile(optimizer='adadelta', loss=loss(intermediate_layer_model1, intermediate_layer_model2))
But still, I would need to find a way to match the y_true in loss to the correct intermediate model.
What is the right way to approach this?
Edit
Here's an approach that I think should work. Simplified:
# autoencoder 1
input1 = Input(shape=(input_dim,))
encoded1 = Dense(encoding_dim, activation='relu', name='encoded1')(input1)
decoded1 = Dense(input_dim, activation='sigmoid', name='decoded1')(encoded1)
# autoencoder 2
input2 = Input(shape=(input_dim,))
encoded2 = Dense(encoding_dim, activation='relu', name='encoded2')(input2)
decoded2 = Dense(input_dim, activation='sigmoid', name='decoded2')(encoded2)
# merge encodings
merge_layer = merge([encoded1, encoded2], mode='concat', name='merge', concat_axis=1)
model = Model(input=[input1, input2], output=[decoded1, decoded2, merge_layer])
model.compile(optimizer='rmsprop', loss={
'decoded1': 'binary_crossentropy',
'decoded2': 'binary_crossentropy',
'merge': correlation,
})
Then in correlation I can split y_pred and do the calculations.

How about:
Defining a single model with a multiple outputs (be sure that you named a coding and reconstruction layer properly):
duo_model = Model(input=input, output=[coding_layer, reconstruction_layer])
Compiling your model with two different losses (or even performing a loss reweighting):
duo_model.compile(optimizer='rmsprop',
loss={'coding_layer': correlation_loss,
'reconstruction_layer': 'mse'})
Taking your final model as a:
encoder = Model(input=input, output=[coding_layer])
autoencoder = Model(input=input, output=[reconstruction_layer])
After proper compilation this should do the job.
When it comes to defining a proper correlation loss function there are two ways:
when coding layer and your output layer have the same dimension -
you could easly use predefinied cosine_proximity function from
Keras library.
when coding layer has different dimensonality -
you shoud first find embedding of coding vector and reconstruction vector to the same space and then - compute correlation there. Remember that this embedding should either be a Keras layer / function or Theano / Tensor flow operation (depending on which backend you are using). Of course you can compute both embedding and correlation function as a part of one loss function.

Related

NSLocalizedDescription = "The size of the output layer 'Identity' in the neural network does not match the number of classes in the classifier."

I just created a model that does a binary classification and has a dense layer of 1 unit at the end. I used Sigmoid activation. However, I get this error now when I wanna convert it to CoreML.
I tried to change the number of units to 2 and activation to softmax but still didn't work.
import coremltools as ct
#1. define input size
image_input = ct.ImageType(scale=1/255)
#2. give classifier
classifier_config = coremltools.ClassifierConfig(class_labels=[0, 1]) #ERROR here
#3. convert the model
coreml_model = coremltools.convert("mask_detection_model_surgical_mask.h5",
inputs=[image_input], classifier_config=classifier_config)
#4. load and resize an example image
example_image = Image.open("Unknown3.jpg").resize((256, 256))
# Make a prediction using Core ML
out_dict = coreml_model.predict({mymodel.input_names[0]: example_image})
print(out_dict["classLabels"])
# save to disk
#coreml_model.save("FINALLY.mlmodel")
I found the answer to my question.
Use Softmax activation and 2 Dense units as the final layer with either loss='binary_crossentropy' or `loss='categorical_crossentropy'
Good luck to hundreds of people who posted a similar question but received no answer.

pretrained densenet/vgg16/resnet50 + gp does not train on cifar10 data

I'm trying to train a hybrid model with GP on top of pre-trained CNN (Densenet, VGG and Resnet) with CIFAR10 data, mimic the ex2 function in the gpflow document. But the testing result is always between 0.1~0.2, which generally means random guess (Wilson+2016 paper shows hybrid model for CIFAR10 data should get accuracy of 0.7). Could anyone give me a hint of what could be wrong?
I've tried same code with simpler cnn models (2 conv layer or 4 conv layer) and both have reasonable results. I've tried to use different Keras applications: Densenet121, VGG16, ResNet50, neither works. I've tried to freeze the weights in the pre-trained models still not working.
def cnn_dn(output_dim):
base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(32,32,3))
bout = base_model.output
fcl = GlobalAveragePooling2D()(bout)
#for layer in base_model.layers:
# layer.trainable = False
output=Dense(output_dim, activation='relu')(fcl)
md=Model(inputs=base_model.input, outputs=output)
return md
#add gp on top, reference:ex2() function in
#https://nbviewer.jupyter.org/github/GPflow/GPflow/blob/develop/doc/source/notebooks/tailor/gp_nn.ipynb
#needs to slightly change build graph part because keras variable #sharing is not the same as tensorflow
#......
## build graph
with tf.variable_scope('cnn'):
md=cnn_dn(gp_dim)
f_X = tf.cast(md(X), dtype=float_type)
f_Xtest = tf.cast(md(Xtest), dtype=float_type)
#......
## predict
res=np.argmax(sess.run(my, feed_dict={Xtest:xts}),1).reshape(yts.shape)
correct = res == yts.astype(int)
print(np.average(correct.astype(float)))
I finally figure out that the solution is training larger iterations. In the original code, I just use 50 iterations as used in the ex2() function for MNIST data and it is not enough for more complicated network and CIFAR10 data. Adjusting some hyper-parameter (e.g. learning rate and activation function) also helps.

implementing a MLP model in keras for timeseries prediction but the model doesn't train well

I'm trying to come up with a MLP model for timeseries prediction following this blog post. I have 138 timeseries with a lookback_window=28 (splitted as 50127 timeseries for traing and 24255 timeseries for validation). I need to predict the next value (timesteps=28, n_features=1). I started from a 3 layer network but it didn't train well. I tried to make the network deeper by adding more layers/more hunits, but it doesn't improve. In the picture, you can see the result of prediction of the following model Here is my model code:
inp = Input(batch_shape=(batch_size, lookback_window))
first_layer = Dense(1000, input_dim=28, activation='relu')(inp)
snd_layer = Dense(500)(first_layer)
thirs_layer = Dense(250)(snd_layer)
tmp = Dense(100)(thirs_layer)
tmp2 = Dense(50)(tmp)
tmp3 = Dense(25)(tmp2)
out = Dense(1)(tmp3)
model = Model(inp, out)
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(train_data, train_y,
epochs=1000,
batch_size=539,
validation_data=(validation_data, validation_y),
verbose=1,
shuffle=False)
What am I missing? How can I improve it?
The main thing I noticed is that you are not using non-linearities in your layers. I would use relus for the hidden layers and linear layer for the final layer in case you want values larger than 1 / -1 to be possible. If you do not want them to be possible use tanh. By increasing the data you make the problem harder and therefore your mostly linear model is underfitting severely.
I managed to get better results by the following changes:
Using RMSprop instead of Adam with lr=0.001, and as #TommasoPasini mentioned added them to all Dense layers (expect the last one). It improves the results a lot!
epochs= 3000 instead of 1000.
But now I think it is overfitting. Here are the plots of the results and the validation and train loss:

Using SparseTensor as a trainable variable?

I'm trying to use SparseTensor to represent weight variables in a fully-connected layer.
However, it seems that TensorFlow 0.8 doesn't allow to use SparseTensor as tf.Variable.
Is there any way to go around this?
I've tried
import tensorflow as tf
a = tf.constant(1)
b = tf.SparseTensor([[0,0]],[1],[1,1])
print a.__class__ # shows <class 'tensorflow.python.framework.ops.Tensor'>
print b.__class__ # shows <class 'tensorflow.python.framework.ops.SparseTensor'>
tf.Variable(a) # Variable is declared correctly
tf.Variable(b) # Fail
By the way, my ultimate goal of using SparseTensor is to permanently mask some of connections in dense form. Thus, these pruned connections are ignored while calculating and applying gradients.
In my current implementation of MLP, SparseTensor and its sparse form of matmul ops successfully reports inference outputs. However, the weights declared using SparseTensor aren't trained as training steps go.
As a workaround to your problem, you can provide a tf.Variable (until Tensorflow v0.8) for the values of a sparse tensor. The sparsity structure has to be pre-defined in that case, the weights however remain trainable.
weights = tf.Variable(<initial-value>)
sparse_var = tf.SparseTensor(<indices>, weights, <shape>) # v0.8
sparse_var = tf.SparseTensor(<indices>, tf.identity(weights), <shape>) # v0.9
TensorFlow doesn't currently support sparse tensor variables. However, it does support sparse lookups (tf.embedding_lookup) and sparse gradient updates (tf.sparse_add) of dense variables. I suspect these two will suffice your use case.
TensorFlow doesn't support training on sparse tensors yet. You can initialize a sparse tensor as you wish, then convert it into a dense tensor and create a variable from it like that:
# You need to correctly initialize the sparse tensor with indices, values and a shape
b = tf.SparseTensor(indices, values, shape)
b_dense = tf.sparse_tensor_to_dense(b)
b_variable = tf.Variable(b_dense)
Now you have initialized a sparse tensor as a variable. Now you need to take care of the gradient update (in other words, make sure the entries in the variable stay 0, since there is a non-vanishing gradient calculated in the backpropagation algorithm for them when using this naively).
In order to do this, TensorFlow optimizers have a method called tf.train.Optimizer.compute_gradients(loss, [list_of_variables]). This calculates all the gradients in the graph necessary to minimize the loss function, but doesn't apply them yet. This method returns a list of tuples in a form of (gradients, variable). You can modify these gradients freely, but in your case it makes sense to mask the gradients not needed to 0 (i.e. by creating another sparse tensor with default values 0.0 and values 1.0 where the weights in your network are present).
After having modified them, you call the optimizer method tf.train.Optimizer.apply_gradients(grads_and_vars) to actually apply the gradients. An example code would look like this:
# Create optimizer instance
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
# Get the gradients for your weights
grads_and_vars = optimizer.compute_gradients(loss, [b_variable])
# Modify the gradients at will
# In your case it would look similar to this
modified_grads_and_vars = [(tf.multiply(gv[0], mask_tensor), gv[1] for gv in grads_and_vars]
# Apply modified gradients to your model
optimizer.apply_gradients(modified_grads_and_vars)
This makes sure your entries stay 0 in your weight matrix and no unwanted connections are created. You need to take care of all the other gradients for all other variables later.
The above code works with some minor correction like this.
def optimize(loss, mask_tensor):
optimizer = tf.train.AdamOptimizer(0.001)
grads_and_vars = optimizer.compute_gradients(loss)
modified_grads_and_vars = [
(tf.multiply(gv[0], mask_tensor[gv[1]]), gv[1]) for gv in grads_and_vars
]
return optimizer.apply_gradients(modified_grads_and_vars)

TensorFlow Training

Assuming I have a very simple neural network, like multilayer perceptron. For each layer the activation function is sigmoid and the network are fully connected.
In TensorFlow this might be defined like this:
sess = tf.InteractiveSession()
# Training Tensor
x = tf.placeholder(tf.float32, shape = [None, n_fft])
# Label Tensor
y_ = tf.placeholder(tf.float32, shape = [None, n_fft])
# Declaring variable buffer for weights W and bias b
# Layer structure [n_fft, n_fft, n_fft, n_fft]
# Input -> Layer 1
struct_w = [n_fft, n_fft]
struct_b = [n_fft]
W1 = weight_variable(struct_w, 'W1')
b1 = bias_variable(struct_b, 'b1')
h1 = tf.nn.sigmoid(tf.matmul(x, W1) + b1)
# Layer1 -> Layer 2
W2 = weight_variable(struct_w, 'W2')
b2 = bias_variable(struct_b, 'b2')
h2 = tf.nn.sigmoid(tf.matmul(h1, W2) + b2)
# Layer2 -> output
W3 = weight_variable(struct_w, 'W3')
b3 = bias_variable(struct_b, 'b3')
y = tf.nn.sigmoid(tf.matmul(h2, W3) + b3)
# Calculating difference between label and output using mean square error
mse = tf.reduce_mean(tf.square(y - y_))
# Train the Model
# Gradient Descent
train_step = tf.train.GradientDescentOptimizer(0.3).minimize(mse)
The design target for this model is to map a n_fft points fft spectrogram to another n_fft target spectrogram. Let's assume both the training data and target data are of size [3000, n_fft]. They are stored in variables spec_train and spec_target.
Now here comes the question. For TensorFlow is there any difference between these two trainings?
Training 1:
for i in xrange(200):
train_step.run(feed_dict = {x: spec_train, y_: spec_target})
Training 2:
for i in xrange(200):
for j in xrange(3000):
train = spec_train[j, :].reshape(1, n_fft)
label = spec_target[j, :].reshape(1, n_fft)
train_step.run(feed_dict = {x: train, y_: label})
Thank you very much!
In the first training version, you are training the entire batch of training data at once, which means that the first and the 3000th element of spec_train will be processed using the same model parameters in a single step. This is known as (Batch) Gradient Descent.
In the second training version, you are training a single example from the training data at once, which means that the 3000th element of spec_train will be processed using model parameters that have been updated 2999 times since the first element was most recently processed. This is known as Stochastic Gradient Descent (or it would be if the element was selected at random).
In general, TensorFlow is used with datasets that are too large to process in one batch, so mini-batch SGD (where a subset of the examples are processed in one step) is favored. Processing a single element at a time is theoretically desirable, but is inherently sequential and has high fixed costs because the matrix multiplications and other operations are not as computationally dense. Therefore, processing a small batch (e.g. 32 or 128) of examples at once is the usual approach, with multiple replicas training on different batches in parallel.
See this Stats StackExchange question for a more theoretical discussion of when you should use one approach versus the other.
Yes there's a difference. I think the second way loss function can be bit messy. It's more like online training. For each data point in the whole batch you update all of your parameters. But in the first way it's called the batch gradient where you take one batch at a time and take the average loss then update the parameters.
Please refer this link
https://stats.stackexchange.com/questions/49528/batch-gradient-descent-versus-stochastic-gradient-descent
First answer is really good in this link