TensorFlow Training - neural-network

Assume I have a very simple neural network, like a multilayer perceptron. For each layer the activation function is sigmoid and the network is fully connected.
In TensorFlow this might be defined like this:
sess = tf.InteractiveSession()
# Training Tensor
x = tf.placeholder(tf.float32, shape = [None, n_fft])
# Label Tensor
y_ = tf.placeholder(tf.float32, shape = [None, n_fft])
# Declaring variable buffer for weights W and bias b
# Layer structure [n_fft, n_fft, n_fft, n_fft]
# Input -> Layer 1
struct_w = [n_fft, n_fft]
struct_b = [n_fft]
W1 = weight_variable(struct_w, 'W1')
b1 = bias_variable(struct_b, 'b1')
h1 = tf.nn.sigmoid(tf.matmul(x, W1) + b1)
# Layer1 -> Layer 2
W2 = weight_variable(struct_w, 'W2')
b2 = bias_variable(struct_b, 'b2')
h2 = tf.nn.sigmoid(tf.matmul(h1, W2) + b2)
# Layer2 -> output
W3 = weight_variable(struct_w, 'W3')
b3 = bias_variable(struct_b, 'b3')
y = tf.nn.sigmoid(tf.matmul(h2, W3) + b3)
# Calculating difference between label and output using mean square error
mse = tf.reduce_mean(tf.square(y - y_))
# Train the Model
# Gradient Descent
train_step = tf.train.GradientDescentOptimizer(0.3).minimize(mse)
The design target for this model is to map an n_fft-point FFT spectrogram to another n_fft-point target spectrogram. Let's assume both the training data and target data are of size [3000, n_fft]. They are stored in the variables spec_train and spec_target.
Now here comes the question. For TensorFlow is there any difference between these two trainings?
Training 1:
for i in xrange(200):
    train_step.run(feed_dict = {x: spec_train, y_: spec_target})
Training 2:
for i in xrange(200):
    for j in xrange(3000):
        train = spec_train[j, :].reshape(1, n_fft)
        label = spec_target[j, :].reshape(1, n_fft)
        train_step.run(feed_dict = {x: train, y_: label})
Thank you very much!

In the first training version, you are training the entire batch of training data at once, which means that the first and the 3000th element of spec_train will be processed using the same model parameters in a single step. This is known as (Batch) Gradient Descent.
In the second training version, you are training a single example from the training data at once, which means that the 3000th element of spec_train will be processed using model parameters that have been updated 2999 times since the first element was most recently processed. This is known as Stochastic Gradient Descent (or it would be if the element was selected at random).
In general, TensorFlow is used with datasets that are too large to process in one batch, so mini-batch SGD (where a subset of the examples are processed in one step) is favored. Processing a single element at a time is theoretically desirable, but is inherently sequential and has high fixed costs because the matrix multiplications and other operations are not as computationally dense. Therefore, processing a small batch (e.g. 32 or 128) of examples at once is the usual approach, with multiple replicas training on different batches in parallel.
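For example, a minimal sketch of such a mini-batch variant of the question's Training 2 might look like this (the batch size of 128 and the per-epoch shuffle are assumptions, not part of the original code):
import numpy as np

batch_size = 128
num_examples = spec_train.shape[0]  # 3000 in the question

for epoch in xrange(200):
    # Shuffle once per epoch so each epoch sees different batches
    perm = np.random.permutation(num_examples)
    for start in xrange(0, num_examples, batch_size):
        idx = perm[start:start + batch_size]
        train_step.run(feed_dict={x: spec_train[idx, :], y_: spec_target[idx, :]})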
See this Stats StackExchange question for a more theoretical discussion of when you should use one approach versus the other.

Yes, there's a difference. In the second way the loss function can be a bit messy; it's more like online training, where for each data point in the whole dataset you update all of your parameters. The first way is called batch gradient descent, where you take the whole batch at a time, take the average loss, and then update the parameters.
Please refer to this link:
https://stats.stackexchange.com/questions/49528/batch-gradient-descent-versus-stochastic-gradient-descent
The first answer at that link is really good.

Related

What is the difference between applying batch norm in DNN vs. using just weights and bias?

Batch Norm is a set of operations applied to each layer's input values. It has the advantages of speeding up the learning of the network and introducing noise into each layer.
The operations can be summarized as follows:
$$\mu = \frac{1}{m} \sum{z^{(i)}}$$
$$\sigma^2 = \frac{1}{m} \sum{(z^{(i)} - \mu)^2}$$
$$z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
$$\tilde z^{(i)} = \gamma z^{(i)}_{norm} + \beta$$
The $\gamma, \beta$ are just scalar parameters that are multiplied with and added to the input value of each layer. The weights and bias at that layer do the same thing, so what are the differences between them?
Is adding the new learnable parameters $\gamma, \beta$ trying to achieve the same effect as doubling the hidden layer in a neural network?
Batch norm normalizes each layer's input over every mini-batch; plain weights and biases do not normalize anything.
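As a minimal NumPy sketch of the four operations above (assuming z is an [m, features] array of pre-activations and gamma, beta are per-feature parameters; the function and variable names are illustrative):
import numpy as np

def batch_norm_forward(z, gamma, beta, epsilon=1e-5):
    """Apply the four batch-norm operations from the formulas above.

    z:            pre-activations of shape [m, features] for one mini-batch
    gamma, beta:  learnable scale and shift, one value per feature
    """
    mu = z.mean(axis=0)                            # mean over the mini-batch
    sigma2 = z.var(axis=0)                         # variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(sigma2 + epsilon)  # normalize
    return gamma * z_norm + beta                   # scale and shift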

Implementing an MLP model in Keras for timeseries prediction, but the model doesn't train well

I'm trying to come up with an MLP model for timeseries prediction following this blog post. I have 138 timeseries with a lookback_window=28 (split into 50127 timeseries for training and 24255 timeseries for validation). I need to predict the next value (timesteps=28, n_features=1). I started from a 3-layer network but it didn't train well. I tried to make the network deeper by adding more layers/more hidden units, but it doesn't improve. In the picture, you can see the prediction results of the following model. Here is my model code:
inp = Input(batch_shape=(batch_size, lookback_window))
first_layer = Dense(1000, input_dim=28, activation='relu')(inp)
snd_layer = Dense(500)(first_layer)
thirs_layer = Dense(250)(snd_layer)
tmp = Dense(100)(thirs_layer)
tmp2 = Dense(50)(tmp)
tmp3 = Dense(25)(tmp2)
out = Dense(1)(tmp3)
model = Model(inp, out)
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(train_data, train_y,
                    epochs=1000,
                    batch_size=539,
                    validation_data=(validation_data, validation_y),
                    verbose=1,
                    shuffle=False)
What am I missing? How can I improve it?
The main thing I noticed is that you are not using non-linearities in your layers. I would use ReLUs for the hidden layers and a linear final layer in case you want values larger than 1 / smaller than -1 to be possible; if you do not want them to be possible, use tanh. By increasing the data you make the problem harder, and therefore your mostly linear model is underfitting severely.
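A minimal sketch of the question's model with ReLU non-linearities added to the hidden layers and the final layer left linear (layer names as in the question; this is one possible fix under those assumptions, not a guaranteed one):
from keras.layers import Input, Dense
from keras.models import Model

inp = Input(batch_shape=(batch_size, lookback_window))
first_layer = Dense(1000, activation='relu')(inp)
snd_layer = Dense(500, activation='relu')(first_layer)
thirs_layer = Dense(250, activation='relu')(snd_layer)
tmp = Dense(100, activation='relu')(thirs_layer)
tmp2 = Dense(50, activation='relu')(tmp)
tmp3 = Dense(25, activation='relu')(tmp2)
out = Dense(1)(tmp3)  # linear output so predictions outside [-1, 1] stay possible
model = Model(inp, out)
model.compile(loss='mean_squared_error', optimizer='adam')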
I managed to get better results with the following changes:
Using RMSprop instead of Adam with lr=0.001, and, as @TommasoPasini mentioned, adding ReLU activations to all Dense layers except the last one. It improves the results a lot! A sketch of that compile call is shown below.
epochs=3000 instead of 1000.
But now I think it is overfitting. Here are the plots of the results and of the validation and train loss.
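As a minimal sketch of that optimizer swap (model as defined in the question; lr=0.001 as stated above):
from keras.optimizers import RMSprop

# RMSprop with the learning rate mentioned above; the rest of the model is unchanged
model.compile(loss='mean_squared_error', optimizer=RMSprop(lr=0.001))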

Using hidden activations in loss function

I want to create a custom loss function for a double-input double-output model in Keras that:
minimizes the reconstruction error of two autoencoders;
maximizes the correlation of the bottleneck features of the autoencoders.
For this I need to pass to the loss function:
both inputs;
both outputs / reconstructions;
output of intermediate layers for both (hidden activations).
I know I can pass both inputs and outputs to Model, but am struggling to find a way to pass the hidden activations.
I could create two new Models that have the output of the intermediate layers and pass that to loss, like:
intermediate_layer_model1 = Model(input=input1, output=autoencoder.get_layer('encoded1').output)
intermediate_layer_model2 = Model(input=input2, output=autoencoder.get_layer('encoded2').output)
autoencoder.compile(optimizer='adadelta', loss=loss(intermediate_layer_model1, intermediate_layer_model2))
But still, I would need to find a way to match the y_true in loss to the correct intermediate model.
What is the right way to approach this?
Edit
Here's an approach that I think should work. Simplified:
# autoencoder 1
input1 = Input(shape=(input_dim,))
encoded1 = Dense(encoding_dim, activation='relu', name='encoded1')(input1)
decoded1 = Dense(input_dim, activation='sigmoid', name='decoded1')(encoded1)
# autoencoder 2
input2 = Input(shape=(input_dim,))
encoded2 = Dense(encoding_dim, activation='relu', name='encoded2')(input2)
decoded2 = Dense(input_dim, activation='sigmoid', name='decoded2')(encoded2)
# merge encodings
merge_layer = merge([encoded1, encoded2], mode='concat', name='merge', concat_axis=1)
model = Model(input=[input1, input2], output=[decoded1, decoded2, merge_layer])
model.compile(optimizer='rmsprop', loss={
    'decoded1': 'binary_crossentropy',
    'decoded2': 'binary_crossentropy',
    'merge': correlation,
})
Then in correlation I can split y_pred and do the calculations.
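A minimal sketch of such a correlation function, assuming the merge layer concatenates two encoding_dim-wide encodings along axis 1 and using a negated Pearson correlation so that minimizing the loss maximizes correlation (written against the Keras backend; y_true is ignored, so any dummy target of the right shape can be fed for the merge output):
from keras import backend as K

def correlation(y_true, y_pred):
    # y_pred is the concatenated [encoded1, encoded2]; y_true is ignored
    e1 = y_pred[:, :encoding_dim]
    e2 = y_pred[:, encoding_dim:]
    e1_c = e1 - K.mean(e1, axis=0)
    e2_c = e2 - K.mean(e2, axis=0)
    cov = K.mean(e1_c * e2_c, axis=0)
    denom = K.std(e1, axis=0) * K.std(e2, axis=0) + K.epsilon()
    # negative mean correlation: minimizing this maximizes correlation
    return -K.mean(cov / denom)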
How about:
Defining a single model with multiple outputs (be sure that you name the coding and reconstruction layers properly):
duo_model = Model(input=input, output=[coding_layer, reconstruction_layer])
Compiling your model with two different losses (or even performing a loss reweighting):
duo_model.compile(optimizer='rmsprop',
                  loss={'coding_layer': correlation_loss,
                        'reconstruction_layer': 'mse'})
Taking your final models as:
encoder = Model(input=input, output=[coding_layer])
autoencoder = Model(input=input, output=[reconstruction_layer])
After proper compilation this should do the job.
When it comes to defining a proper correlation loss function there are two ways:
when the coding layer and your output layer have the same dimension -
you could easily use the predefined cosine_proximity function from the
Keras library.
when the coding layer has a different dimensionality -
you should first find an embedding of the coding vector and the reconstruction vector into the same space and then compute the correlation there. Remember that this embedding should either be a Keras layer / function or a Theano / TensorFlow operation (depending on which backend you are using). Of course you can compute both the embedding and the correlation function as part of one loss function.
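For the equal-dimension case, a minimal sketch of that compile call reusing the layer names from the snippet above (the loss_weights values are illustrative assumptions, and the target fed for the coding output would be whatever vector the coding should correlate with):
duo_model.compile(optimizer='rmsprop',
                  loss={'coding_layer': 'cosine_proximity',
                        'reconstruction_layer': 'mse'},
                  loss_weights={'coding_layer': 0.5,
                                'reconstruction_layer': 1.0})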

Feed Forward - Neural Networks Keras

For my input to the feed-forward neural network that I have implemented in Keras, I just wanted to check that my understanding is correct.
[[ 25.26000023 26.37000084 24.67000008 23.30999947]
[ 26.37000084 24.67000008 23.30999947 21.36000061]
[ 24.67000008 23.30999947 21.36000061 19.77000046]...]
So the data above is a time window of 4 inputs per row of the array. My input layer is
model.add(Dense(4, input_dim=4, activation='sigmoid'))
model.fit(trainX, trainY, nb_epoch=10000,verbose=2,batch_size=4)
The batch_size is 4. In theory, when I call the fit function, will it go over all these inputs in each nb_epoch? And does the batch_size need to be 4 in order for this time window to work?
Thanks John
And batch_size is 4; in theory, when I call the fit function, will the function go over all these inputs in each nb_epoch?
Yes, each epoch is an iteration over all training samples.
And does the batch_size need to be 4 in order for this time window to work?
No, these are completely unrelated things. A batch is simply a subset of your training data which is used to compute an approximation of the true gradient of the cost function. The bigger the batch, the closer you get to the true gradient (and to the original gradient descent), but training gets slower. The closer you get to 1, the more stochastic and noisy the approximation becomes (and the closer you get to stochastic gradient descent). The fact that you matched batch_size and data dimensionality is just a coincidence and has no meaning.
Let me put this in a more general setting. What you do in gradient descent with an additive loss function (which neural nets usually use) is go against the gradient, which is
$$\nabla_\theta \frac{1}{N} \sum_{i=1}^N loss(x_i, pred(x_i), y_i|\theta) = \frac{1}{N} \sum_{i=1}^N \nabla_\theta \, loss(x_i, pred(x_i), y_i|\theta)$$
where loss is some loss function over your pred (prediction) as compared to y_i.
And the batch-based scenario (the rough idea) is that you do not need to go over all examples, but instead over some strict subset, like batch = {(x_1, y_1), (x_5, y_5), (x_89, y_89), ...}, and use an approximation of the gradient of the form
$$\frac{1}{|batch|} \sum_{(x_i, y_i) \in batch} \nabla_\theta \, loss(x_i, pred(x_i), y_i|\theta)$$
As you can see, this is not related in any sense to the space where the x_i live, so there is no connection to the dimensionality of your data.
Let me explain this with an example:
When you have 32 training examples and you call model.fit with a batch_size of 4, the neural network will be presented with 4 examples at a time, but one epoch will still be defined as one complete pass over all 32 examples. So in this case the network will go through 4 examples at a time and will, theoretically at least, call the forward pass (and the backward pass) 32 / 4 = 8 times.
In the extreme case when your batch_size is 1, that is plain old stochastic gradient descent. When your batch_size is greater than 1 but smaller than the full dataset, it is usually called mini-batch gradient descent.
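A hedged illustration of that independence using the question's fit call (each line is an alternative, valid choice; the specific batch_size values other than 4 are arbitrary and not from the original post):
# Any batch_size works with the 4-feature window; the window size only fixes input_dim.
model.fit(trainX, trainY, nb_epoch=10000, verbose=2, batch_size=1)             # pure SGD
model.fit(trainX, trainY, nb_epoch=10000, verbose=2, batch_size=8)             # mini-batches of 8
model.fit(trainX, trainY, nb_epoch=10000, verbose=2, batch_size=len(trainX))   # full-batch gradient descent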

Tensorflow max-margin loss training?

I want to train a neural network in tensorflow with a max-margin loss function using one negative sample per positive sample:
max(0, 1 - pos_score + neg_score)
What I'm currently doing is this:
The network takes three inputs: input1, and then one positive example input2_pos and one negative example input2_neg. (These are indices to a word embeddings layer.) The network is supposed to calculate a score that expresses how related two examples are.
Here's a simplified version of my code:
input1 = tf.placeholder(dtype=tf.int32, shape=[batch_size])
input2_pos = tf.placeholder(dtype=tf.int32, shape=[batch_size])
input2_neg = tf.placeholder(dtype=tf.int32, shape=[batch_size])
# f is a neural network outputting a score
pos_score = f(input1,input2_pos)
neg_score = f(input1,input2_neg)
cost = tf.maximum(0., 1. - pos_score + neg_score)
optimizer= tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
What I see when I run this is that the network just learns which input holds the positive example - it always predicts scores along the lines of:
pos_score = 0.9965983
neg_score = 0.00341663
How can I structure the variables/training so that the network learns the task instead?
I want just one network that takes two inputs and calculates a score expressing the correlation between them, and train it with max-margin loss.
Calculating scores for positive and negative separately does not seem like an option to me, since then it won't backpropagate properly. Another option seems to be randomizing inputs - but then for the loss function I need to know which example is the positive one - inputting that as another parameter would give away the solution again?
Any ideas?
Given your results (1 for every positive, 0 for every negative) it seems you have two different networks learning:
to predict 1 for the first one
to predict 0 for the second one
When using max-margin loss, you need to use the same network for computing both pos_score and neg_score. The way to do that is to share the variables. I will give you a small example using tf.get_variable():
with tf.variable_scope("network"):
    w = tf.get_variable("weights", shape=..., initializer=...)

def f(x, y):
    with tf.variable_scope("network", reuse=True):
        w = tf.get_variable("weights")
        res = w * (x - y)  # some computation
        return res
With this function f as model, the training will optimize the shared variable with name "network/weights".
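As a hedged usage sketch, both scores would then be computed with the same f so that a single set of weights drives both (placeholders and learning_rate as in the question; the tf.reduce_mean is an addition to make the cost a scalar, and f here is the toy scoring function from above standing in for the full network):
# Both calls reuse the same "network/weights" variable, so one network scores both pairs.
pos_score = f(input1, input2_pos)
neg_score = f(input1, input2_neg)

# Max-margin (hinge) cost from the question, averaged over the batch
cost = tf.reduce_mean(tf.maximum(0., 1. - pos_score + neg_score))
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)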