Correctly applying dropout in CNTK

I'm applying dropout as follows in a feed-forward network with three hidden layers, using the Python API. My results are not very good, and I wonder if I'm misapplying the dropout layer. Is it better to apply it to the input of the dense layer, or internally, to the output of the first linear layer?
def dense_layer(input, output_dim, nonlinearity):
    r = linear_layer(input, output_dim)
    r = dropout(r, 0.25)
    r = nonlinearity(r)
    return r

If a dropout rate of 0 works better, why do you believe that you need dropout? Does your network overfit? Do you use any other regularization? It would be good to have more detail on the network architecture and the data.
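To make the placement question concrete, here is a minimal NumPy sketch of what a dropout layer does during training (so-called inverted dropout). It is not CNTK code: the weights `w` and `b`, the `rng` argument and the `training` flag are assumptions made for illustration. Dropout is applied after the nonlinearity here, which is one common placement, not the only valid one.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of the units, rescale the rest."""
    if not training or rate == 0.0:
        return x  # dropout is a no-op at evaluation time
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep  # True for units that are kept
    return x * mask / keep  # rescale so the expected activation is unchanged

def dense_layer(x, w, b, nonlinearity, rate, rng, training=True):
    """Hypothetical dense layer: linear map, nonlinearity, then dropout."""
    r = x @ w + b
    r = nonlinearity(r)
    return dropout(r, rate, rng, training)

rng = np.random.default_rng(0)
ones = np.ones((1000, 50))
dropped = dropout(ones, 0.25, rng)
```

With a rate of 0.25, roughly a quarter of the entries in `dropped` are zero and the rest are scaled by 1/0.75, so the mean stays close to 1.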

Related

Predict a 2D matrix from an image with keras keeping spatial information

I want to train a CNN to predict 100x100x1 matrices (heatmaps) from 224x224x3 images using Keras. My idea is to fine-tune the pretrained networks that Keras provides (ResNet, Xception, VGG16, etc.).
The first step is then to substitute the pretrained top layers for ones that meet my problem constraints. I am trying to predict 100x100x1 heatmap images whose values range from 0 to 1. Therefore I want the output of my network to be a 100x100x1 matrix. I believe that if I use Flatten and then a Dense layer of 1000x1x1 I will be losing spatial information, which I don't want (right?).
I want my code to be flexible, so that it runs independently of which pretrained architecture is being used (I have to run many experiments). Therefore I want to stack a Dense layer that connects to every unit of whatever kind of layer is before it (which will depend on the pretrained architecture I will be using).
Some answers relate to the fully convolutional approach, but that is not what I mean here. Both my X and Y have fixed shapes (224x224x3 and 100x100x1 respectively).
My problem is that I don't know how to stack the new layer(s) in such a way that the predictions/outputs of the net are 100x100x1 matrices.
As it has been suggested in the answers, I am trying to add a 100x100x1 Dense layer. But I don't seem to get it working:
If, for example, I do this:
x = self.base_model.output
predictions = keras.layers.Dense(input_shape = (None, 100,100), units= 1000, activation='linear')(x)
self.model = keras.models.Model(input=self.base_model.input, output=predictions)
Then I got this when I start training:
ValueError: Error when checking target: expected dense_1 to have 4 dimensions, but got array with shape (64, 100, 100)
The Y of the network are indeed batches of shape (64,100,100)
Any suggestions?
Also, which loss function should I use? As it has been suggested in the answers, I could use mse but I wonder, is there any loss function that is able to measure the spatial information of my desired 100x100x1 output?
Thanks in advance.
EDIT:
I semi-solved my problem thanks to @ncasas's answer:
I just added some deconvolutional layers until I got an output that was similar to 100x100x1. This is not what I wanted in the first place, since this implementation is not agnostic to the pretrained architecture it is built on top of. For Xception with input_shape = (224, 224, 3), these top layers give an output of 80x80x1:
x = self.base_model.output
x = keras.layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = keras.layers.UpSampling2D((3, 3))(x)
x = keras.layers.Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = keras.layers.UpSampling2D((2, 2))(x)
x = keras.layers.Conv2D(16, (3, 3), activation='relu')(x)
x = keras.layers.UpSampling2D((2, 2))(x)
predictions = keras.layers.Conv2D(filters=1,
                                  kernel_size=(3, 3),
                                  activation='sigmoid',
                                  padding='same')(x)
self.model = keras.models.Model(input=self.base_model.input, output=predictions)
where self.base_model is keras.applications.Xception(weights = 'imagenet', include_top = False, input_shape = (224, 224, 3))
I am finally using mse as loss function and it works just fine.
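The 80x80 figure can be reproduced with simple shape arithmetic. The sketch below traces the spatial size through the head shown above. It assumes the base model emits a 7x7 feature map for a 224x224 input (an assumption, though it is consistent with the reported 80x80 output); the helper names are made up for illustration.

```python
def same_conv(size, kernel=3):
    """A stride-1 convolution with padding='same' keeps the spatial size."""
    return size

def valid_conv(size, kernel=3):
    """A stride-1 convolution with the default 'valid' padding shrinks the size."""
    return size - kernel + 1

def upsample(size, factor):
    """UpSampling2D multiplies the spatial size by the given factor."""
    return size * factor

def head_output_size(base_size):
    """Spatial size after the Conv2D / UpSampling2D head from the question."""
    s = same_conv(base_size)  # Conv2D(8, padding='same')
    s = upsample(s, 3)        # UpSampling2D((3, 3))
    s = same_conv(s)          # Conv2D(16, padding='same')
    s = upsample(s, 2)        # UpSampling2D((2, 2))
    s = valid_conv(s)         # Conv2D(16) without padding='same'
    s = upsample(s, 2)        # UpSampling2D((2, 2))
    s = same_conv(s)          # final Conv2D(1, padding='same')
    return s

xception_size = head_output_size(7)  # 7 -> 21 -> 21 -> 42 -> 40 -> 80 -> 80
```

Because the result depends on `base_size`, a base model with a different final feature-map size yields a different output size, which is why this head is not architecture-agnostic.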

What you are describing is a multidimensional linear regression plus transfer learning.
In order to reuse the first layers of a trained Keras model, you can follow this post from the Keras blog, in section "Using the bottleneck features of a pre-trained network: 90% accuracy in a minute". For your case, the only difference is that:
For the layer before the last one, you should probably have something larger than 256.
The last layer would be a 10000-unit Dense layer with linear activations (i.e. no activation at all). You can either reshape your expected outputs from 100x100 to 10000, or add an extra reshape layer to the network to have a 100x100 output.
Keep in mind that between the convolutional part of the network and the multilayer perceptron part (i.e. the final Dense layer(s)) there must be a Flatten layer to place the obtained activation patterns in a single vector per sample (search the linked post for "Flatten"); the error you receive is because of that.
If you don't want to flatten the activation patterns, you may want to directly use deconvolutions in your last layers. For that, you can take a look at the keras autoencoder tutorial, at section "Convolutional autoencoder".
The usual loss function used for regression problems is mean squared error (MSE). It does not make sense to use cross entropy for regression, as explained here.
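The Flatten, Dense and reshape steps described above can be sketched as plain shape manipulation. This is NumPy, not Keras, so that it does not depend on a particular Keras version; in Keras the same steps would be the Flatten, Dense(10000) and Reshape((100, 100, 1)) layers. The 7x7x8 feature map and the random weights are placeholders, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 4

# Placeholder for the output of the convolutional base: (batch, height, width, channels).
features = rng.normal(size=(batch, 7, 7, 8))

# Flatten: one vector of activations per sample.
flat = features.reshape(batch, -1)  # shape (4, 392)

# Dense layer with 100 * 100 = 10000 units and linear (identity) activation.
W = rng.normal(scale=0.01, size=(flat.shape[1], 100 * 100))
b = np.zeros(100 * 100)
dense_out = flat @ W + b  # shape (4, 10000)

# Reshape back to heatmaps so the output matches the 100x100x1 targets.
heatmaps = dense_out.reshape(batch, 100, 100, 1)

# Mean squared error against placeholder targets in [0, 1].
targets = rng.uniform(size=(batch, 100, 100, 1))
mse = np.mean((heatmaps - targets) ** 2)
```

Flattening works for whatever feature-map shape the base produces, which keeps the head independent of the pretrained architecture, at the cost of a large weight matrix.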

You should find this paper helpful. Simply replace the fully connected layers with convolutional layers. Instead of a single prediction for the entire image, the result will be a heatmap of predictions for smaller portions of the image.
You should use the categorical_crossentropy loss function.

How to use Deep Neural Networks for regression?

I wrote this script (Matlab) for classification using Softmax. Now I want to use same script for regression by replacing the Softmax output layer with a Sigmoid or ReLU activation function. But I wasn't able to do that.
X=houseInputs ;
T=houseTargets;
%Train an autoencoder with a hidden layer of size 10 and a linear transfer function for the decoder. Set the L2 weight regularizer to 0.001, sparsity regularizer to 4 and sparsity proportion to 0.05.
hiddenSize = 10;
autoenc1 = trainAutoencoder(X,hiddenSize,...
'L2WeightRegularization',0.001,...
'SparsityRegularization',4,...
'SparsityProportion',0.05,...
'DecoderTransferFunction','purelin');
%%
%Extract the features in the hidden layer.
features1 = encode(autoenc1,X);
%Train a second autoencoder using the features from the first autoencoder. Do not scale the data.
hiddenSize = 10;
autoenc2 = trainAutoencoder(features1,hiddenSize,...
'L2WeightRegularization',0.001,...
'SparsityRegularization',4,...
'SparsityProportion',0.05,...
'DecoderTransferFunction','purelin',...
'ScaleData',false);
features2 = encode(autoenc2,features1);
%%
softnet = trainSoftmaxLayer(features2,T,'LossFunction','crossentropy');
%Stack the encoders and the softmax layer to form a deep network.
deepnet = stack(autoenc1,autoenc2,softnet);
%Train the deep network on the house data.
deepnet = train(deepnet,X,T);
%Estimate the outputs using the deep network, deepnet.
y = deepnet(X);

Regression is a different problem from classification. You have to change your loss function to something that fits regression, e.g. mean squared error, and of course change the number of output neurons to one (your last layer will output only a single value).

It is possible to use a neural network to perform a regression task, but it might be overkill for many tasks. True regression means mapping one set of continuous inputs to another set of continuous outputs:
f: x -> ý
Changing the architecture of a neural network to make it perform a regression task is usually fairly simple. Instead of mapping the continuous input data to a specific class as it is done using the Softmax function as in your case, you have to make the network use only a single output node.
This node will just sum the outputs of the previous layer (the last hidden layer) and multiply the summed activations by 1. During the training process this output ý will be compared to the correct ground-truth value y that comes with your dataset. As a loss function you may use the root-mean-squared error (RMSE).
Training such a network will result in a model that maps an arbitrary number of independent variables x to a dependent variable ý, which basically is a regression task.
To come back to your Matlab implementation, it would be incorrect to change the current Softmax output layer to an activation function such as a Sigmoid or ReLU. Instead you would have to implement a custom RMSE output layer for your network, which is fed with the sum of activations coming from the last hidden layer of your network.
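As an illustration of the single linear output node described above, here is a minimal NumPy sketch (Python rather than Matlab, and not the Neural Network Toolbox API). It trains one tanh hidden layer plus a linear output with plain gradient descent on the mean squared error, which has the same minimizer as RMSE. The layer size, learning rate and synthetic data are arbitrary assumptions.

```python
import numpy as np

def train_regressor(X, y, hidden=8, lr=0.1, epochs=500, seed=0):
    """Fit a one-hidden-layer network with a single linear output node."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)  # hidden layer
        y_hat = h @ W2 + b2       # linear output: weighted sum, no squashing
        err = y_hat - y
        losses.append(float(np.mean(err ** 2)))  # MSE before this update
        # Backpropagate the MSE gradient through both layers.
        g_out = 2.0 * err / n
        g_W2 = h.T @ g_out
        g_b2 = g_out.sum(axis=0)
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)  # derivative of tanh
        g_W1 = X.T @ g_h
        g_b1 = g_h.sum(axis=0)
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2

    def predict(X_new):
        return np.tanh(X_new @ W1 + b1) @ W2 + b2

    return predict, losses

# Synthetic continuous target (made up for the example).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, :1] - 2.0 * X[:, 1:]
predict, losses = train_regressor(X, y)
```

The only structural differences from a classifier are the single output unit with identity activation and the squared-error loss.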

Is it desirable to scale data for skflow.TensorFlowDNNClassifier?

My colleagues and this question on Cross Validated say you should transform data to zero mean and unit variance for neural networks. However, my performance was slightly worse with scaling than without.
I tried using:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
steps = 5000
def exp_decay(global_step):
    return tf.train.exponential_decay(
        learning_rate=0.1, global_step=global_step,
        decay_steps=steps, decay_rate=0.01)
random.seed(42) # to sample data the same way
classifier = skflow.TensorFlowDNNClassifier(
    hidden_units=[150, 150, 150],
    n_classes=2,
    batch_size=128,
    steps=steps,
    learning_rate=exp_decay)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Did I do something wrong or is scaling not necessary?

Scaling usually benefits linear models and models without regularization the most. For example, a simple mean squared error loss (as in TensorFlowLinearRegressor) without regularization won't work very well on unscaled data.
In your case you are using a classifier that runs softmax regularization and you are using a DNN, so scaling is not needed. DNNs themselves can model rescaling (via the bias and the weight on each feature in the first layer) if that's a useful thing to do.
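The claim that the first layer can absorb rescaling can be checked numerically. The NumPy sketch below uses made-up data and weights, and shows that a linear layer applied to standardized inputs equals a layer with adjusted weights and bias applied to the raw inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Raw features with a non-zero mean and non-unit variance (synthetic).
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma  # zero mean, unit variance per feature

# First-layer weights and bias acting on the scaled inputs.
W = rng.normal(size=(4, 3))
b = rng.normal(size=3)
out_scaled = X_scaled @ W + b

# Equivalent weights and bias acting directly on the raw inputs.
W_raw = W / sigma[:, None]    # divide each input row by that feature's std
b_raw = b - (mu / sigma) @ W  # fold the mean shift into the bias
out_raw = X @ W_raw + b_raw

same = np.allclose(out_scaled, out_raw)
```

This only shows that the network can represent the rescaling. Whether gradient descent finds such weights easily from a given initialization is a separate question, so comparing both variants empirically, as the question does, remains reasonable.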

TensorFlow or Theano: how do they know the loss function derivative based on the neural network graph?

In TensorFlow or Theano, you only tell the library what your neural network looks like and how the feed-forward pass should operate.
For instance, in TensorFlow, you would write:
with graph.as_default():
    _X = tf.constant(X)
    _y = tf.constant(y)
    hidden = 20
    w0 = tf.Variable(tf.truncated_normal([X.shape[1], hidden]))
    b0 = tf.Variable(tf.truncated_normal([hidden]))
    h = tf.nn.softmax(tf.matmul(_X, w0) + b0)
    w1 = tf.Variable(tf.truncated_normal([hidden, 1]))
    b1 = tf.Variable(tf.truncated_normal([1]))
    yp = tf.nn.softmax(tf.matmul(h, w1) + b1)
    loss = tf.reduce_mean(0.5*tf.square(yp - _y))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
I am using an L2-norm loss function, C=0.5*sum((y-yp)^2), and in the backpropagation step the derivative presumably has to be computed, dC=sum(y-yp). See (30) in this book.
My question is: how can TensorFlow (or Theano) know the analytical derivative for backpropagation? Or do they do an approximation? Or somehow do not use the derivative?
I have done the Udacity deep learning course on TensorFlow, but I am still struggling to make sense of how these libraries work.

The differentiation happens in the final line:
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
When you execute the minimize() method, TensorFlow identifies the set of variables on which loss depends, and computes gradients for each of these. The differentiation is implemented in ops/gradients.py, and it uses "reverse accumulation". Essentially it searches backwards from the loss tensor to the variables, applying the chain rule at each operator in the dataflow graph. TensorFlow includes "gradient functions" for most (differentiable) operators, and you can see an example of how these are implemented in ops/math_grad.py. A gradient function can use the original op (including its inputs, outputs, and attributes) and the gradients computed for each of its outputs to produce gradients for each of its inputs.
Page 7 of Ilya Sutskever's PhD thesis has a nice explanation of how this process works in general.
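The backward search with the chain rule described above can be sketched in a few lines of pure Python. This is a toy illustration of reverse accumulation on scalars, not TensorFlow's implementation; the `Var` class and `backward` function are invented for the example. Each operation records its inputs together with the local derivative, which plays the role of a "gradient function".

```python
class Var:
    """A scalar node in a dataflow graph that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # pairs of (input node, d(self)/d(input))

    @staticmethod
    def _wrap(other):
        return other if isinstance(other, Var) else Var(other)

    def __add__(self, other):
        other = Var._wrap(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        other = Var._wrap(other)
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        other = Var._wrap(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    """Walk from the output back to the inputs, applying the chain rule."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # post-order: inputs come before their consumers

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # consumers first, then their inputs
        for parent, local in node.parents:
            parent.grad += local * node.grad

# loss = 0.5 * (w * x + b - y)^2 for a single made-up example
w, x, b, y = Var(2.0), Var(3.0), Var(1.0), Var(5.0)
diff = w * x + b - y
loss = diff * diff * 0.5
backward(loss)
```

Here `w.grad` comes out as `(w*x + b - y) * x`, the analytical derivative, even though it was never written down explicitly; it is assembled from the per-operation local derivatives.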

How does the linear transfer function in perceptrons (artificial neural network) work?

I know how the step transfer function works but how does the linear transfer function work? What equation do you use?
Please relate the answer to an AND gate with two inputs and a bias.

First of all, in general you want to apply a linear transfer function only in the output layer of an MLP and "never" in the hidden layers, where non-linear transfer functions are typically used (logistic function, step, etc.).
The linear transfer function (in the form f(x) = x, called pure linear or purelin in the literature) is typically used for function approximation / regression tasks (this is intuitive because step and logistic functions give binary results, whereas the linear function gives continuous results).
Non-linear transfer functions are used for classification tasks.
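To relate this to the AND gate in the question, here is a small sketch of a single neuron with two inputs and a bias, evaluated once with a step transfer function and once with the linear transfer function f(net) = net. The weights (1, 1) and bias -1.5 are hand-picked for illustration, not learned.

```python
def step(net):
    """Step transfer function: fires when the net input is non-negative."""
    return 1 if net >= 0 else 0

def linear(net):
    """Linear (purelin) transfer function: f(net) = net."""
    return net

def neuron(x1, x2, transfer, w1=1.0, w2=1.0, bias=-1.5):
    """Weighted sum of the inputs plus bias, passed through the transfer function."""
    return transfer(w1 * x1 + w2 * x2 + bias)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
step_out = [neuron(a, b, step) for a, b in inputs]
linear_out = [neuron(a, b, linear) for a, b in inputs]
```

The step neuron returns the AND truth table directly, while the linear neuron returns the raw net input (negative for the first three cases, positive for the last), which would still have to be thresholded to read off a logic value.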

The non-linear transfer function (a.k.a. activation function) is the most important factor in giving a simple fully connected multilayer neural network its nonlinear approximation capability.
Nevertheless, a 'linear' activation function is, of course, one of the many alternatives you might want to adopt. The problem is that a pure linear transfer function (f(x) = x) in hidden layers doesn't make sense, which means it may be in vain to train a network whose hidden units are activated by a pure linear function.
We may understand this process with the following:
Assume f(x) = x is our activation function, and we try to train a single-hidden-layer network with 2 input units (x1, x2), 3 hidden units (a1, a2, a3) and 1 output unit (y).
Hence, the network tries to approximate the function:
# hidden units
a1 = f(w11*x1+w12*x2+b1) = w11*x1+w12*x2+b1
a2 = f(w21*x1+w22*x2+b2) = w21*x1+w22*x2+b2
a3 = f(w31*x1+w32*x2+b3) = w31*x1+w32*x2+b3
# output unit
y = c1*a1+c2*a2+c3*a3+b4
If we combine all these equations, we get:
y = c1*(w11*x1+w12*x2+b1) + c2*(w21*x1+w22*x2+b2) + c3*(w31*x1+w32*x2+b3) + b4
  = (c1*w11+c2*w21+c3*w31)*x1 + (c1*w12+c2*w22+c3*w32)*x2 + (c1*b1+c2*b2+c3*b3+b4)
  = A1*x1+A2*x2+C
As shown above, a linear activation degenerates the network into a single input-output linear map, regardless of the structure of the network. All the training process does is factorize A1, A2 and C into various factors.
Even ReLU, a very popular quasi-linear activation function in deep neural networks, is rectified rather than purely linear. In other words, a purely linear activation is not used in hidden layers unless you want to factorize coefficients.
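The collapse to A1*x1 + A2*x2 + C derived above can be checked numerically. The NumPy sketch below uses random weights for the 2-3-1 network (values made up for the example) and compares the two-layer linear network with the single equivalent linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2 inputs -> 3 hidden units with f(x) = x -> 1 output.
W = rng.normal(size=(3, 2))  # rows are (w11, w12), (w21, w22), (w31, w32)
b = rng.normal(size=3)       # hidden biases b1, b2, b3
c = rng.normal(size=3)       # output weights c1, c2, c3
b4 = rng.normal()            # output bias

def two_layer(x):
    a = W @ x + b      # hidden units with linear activation
    return c @ a + b4  # output unit

# Collapsed coefficients: (A1, A2) = c @ W and C = c @ b + b4.
A = c @ W
C = c @ b + b4

def single_layer(x):
    return A @ x + C

points = rng.normal(size=(5, 2))
deep_out = np.array([two_layer(x) for x in points])
flat_out = np.array([single_layer(x) for x in points])
```

Both functions agree on every input, so the hidden layer adds no expressive power when its activation is purely linear.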