Which axis of the input of the RNN is used as the "temporal" axis in Keras? - classification

When using a SimpleRNN or LSTM for classical sentiment analysis algorithms (applied here to sentences of length <= 250 words/tokens):
model = Sequential()
model.add(Embedding(5000, 32, input_length=250)) # Output shape: (None, 250, 32)
model.add(SimpleRNN(100)) # Output shape: (None, 100)
model.add(Dense(1, activation='sigmoid')) # Output shape: (None, 1)
where is it specified which axis of the input of the RNN is used as the "temporal" axis?
To be more precise, after the Embedding layer, a given input sentence, e.g. "the cat sat on the mat", is encoded into a matrix x of shape (250, 32), where 250 is the max length (in words) of the input text, and 32 the dimension of the embedding. Then, where in Keras is it specified if this will be used:
h[t] = activation( W_h * x[:, t] + U_h * h[t-1] + b_h )
or this:
h[t] = activation( W_h * x[t, :] + U_h * h[t-1] + b_h )
(In both cases, y[t] = activation( W_y * h[t] + b_y ))
TL;DR: if an input for a RNN Keras layer is of size, say, (250, 32), which axis does it use as the temporal axis by default? Where is this detailed in the Keras or Tensorflow documentation?

The equation h[t] = activation( W_h * x[t, :] + U_h * h[t-1] + b_h ) is the one Keras/TensorFlow uses.
The TensorFlow documentation for RNN layers states:
inputs: A 3D tensor, with shape [batch, timesteps, feature].
Let us set the first axis, batch, aside for a moment and consider the remaining two axes.
In sentiment analysis, each word/token (converted to a vector) is one timestep.
In your example, the encoded matrix of shape (250, 32) means that each review (instance) has 250 words/tokens, i.e. 250 timesteps, each described by 32 features.
So, the equation, h[t] = activation( W_h * x[t, :] + U_h * h[t-1] + b_h ) can be translated as
h[t] = activation( W_h * x[Each Time Step/Word, All Features] + U_h * h[Previous Time Step/Word] + b_h )
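A minimal sketch (using tf.keras, with random numbers standing in for the embedding output; the shapes are taken from the example above) shows that the RNN layer iterates over axis 1, the 250 words, and returns one state vector per sentence:
import numpy as np
import tensorflow as tf

x = np.random.rand(2, 250, 32).astype("float32")  # 2 sentences, 250 timesteps, 32 features each
rnn = tf.keras.layers.SimpleRNN(100)
h = rnn(x)                                         # the layer loops over axis 1 (the 250 timesteps)
print(h.shape)                                     # (2, 100): one 100-dim hidden state per sentence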
Hope this clarifies your question. Happy Learning!

Related

Reconstructing Sklearn MLP Regression in MatLab

I am using Sklearn to train a MultiLayer Perceptron Regression on 12 features and one output. The StandardScaler() is fit to the training data and applied to all input data. After a training period with architectural optimization, I get a model that is seemingly quite accurate (<10% error). I now need to extract the weights and biases in order to implement the prediction in real time on a system that interacts with a person. This is being done with my_model.coefs_ for weights and my_model.intercepts_ for the biases. The weights are appropriately shaped for the number of nodes in my model and the biases have the appropriate lengths for each layer.
The problem is now that I implement the matrix algebra in MatLab and get wildly different predictions from what my_model.predict() yields.
My reconstruction process for an MLP with two hidden layers (11 nodes in the first hidden layer and 10 nodes in the second):
scale() % elementwise subtract feature mean and divide by feature stdev
scaled_obs = scale(raw_obs)
% Up to this point results from MatLab == Sklearn
weight1 = [12x11] % weights to transition from the input layer to the first hidden layer
weight2 = [11x10]
weight3 = [10x1]
bias1 = [11x1] % bias to add to the first layer after weight1 has been applied
bias2 = [10x1]
bias3 = [1x1]
my_prediction = ((( scaled_obs * w1 + b1') * w2 + b2') * w3 + b3);
I also tried
my_prediction2 = ((( scaled_obs * w1 .* b1') * w2 .* b2') * w3 .* b3); % because nothing worked...
for my specific data:
Sklearn prediction = 1.731
my_prediction = -50.347
my_prediction2 = -3.2075
Is there another weight/bias that I am skipping when extracting relevant params from my_model? Is my order of operations in the reconstruction flawed?
In my opinion my_prediction = ((( scaled_obs * w1 + b1') * w2 + b2') * w3 + b3); is structurally correct, but one part is missing: the activation function. Which activation function did you pass to the model? By default, MLPRegressor uses relu as the activation for every hidden layer, and the output layer uses a separate activation, the identity function, basically f(x) = x, so you don't have to do anything for that last step.
If you selected relu, or if you didn't select an activation at all (relu is the default), then you have to apply something like np.maximum(0, your_layer1_calculation) in numpy after each hidden layer; I am not sure how this is done in MATLAB.
So the final formula would be:
layer1 = np.dot(scaled_inputs, weight0) + bias0
layer2 = np.dot(np.maximum(0, layer1), weight1) + bias1
layer......
layer(n-1) = np.dot(np.maximum(0, layer(n-2)), weight(n-1)) + bias(n-1)
layer(n) = layer(n-1) # identity function
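Putting this together, here is a minimal numpy sketch (the helper name mlp_predict is hypothetical) of reproducing MLPRegressor.predict() from coefs_ and intercepts_, assuming the default relu hidden activation and the identity output activation:
import numpy as np

def mlp_predict(X_scaled, coefs, intercepts):
    # coefs and intercepts come from my_model.coefs_ and my_model.intercepts_
    a = X_scaled
    for W, b in zip(coefs[:-1], intercepts[:-1]):
        a = np.maximum(0.0, a @ W + b)        # hidden layers: relu(a*W + b)
    return a @ coefs[-1] + intercepts[-1]     # output layer: identity activation

# should match my_model.predict(scaled_obs):
# print(mlp_predict(scaled_obs, my_model.coefs_, my_model.intercepts_))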

Pytorch, what are the gradient arguments

I am reading through the documentation of PyTorch and found an example where they write
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)
where x was an initial variable, from which y was constructed (a 3-vector). The question is, what are the 0.1, 1.0 and 0.0001 arguments of the gradients tensor? The documentation is not very clear on that.
Explanation
For neural networks, we usually use a loss to assess how well the network has learned to classify the input image (or other tasks). The loss term is usually a scalar value. In order to update the parameters of the network, we need to calculate the gradient of the loss w.r.t. the parameters, which are actually the leaf nodes in the computation graph (by the way, these parameters are mostly the weights and biases of various layers such as Convolution, Linear and so on).
According to the chain rule, in order to calculate the gradient of the loss w.r.t. a leaf node, we can compute the derivative of the loss w.r.t. some intermediate variable, and the gradient of that intermediate variable w.r.t. the leaf variable, then take the dot product and sum everything up.
The gradient argument of a Variable's backward() method is used to calculate a weighted sum of the gradients of each element of the Variable w.r.t. the leaf Variable. These weights are just the derivatives of the final loss w.r.t. each element of the intermediate variable.
A concrete example
Let's take a concrete and simple example to understand this.
from torch.autograd import Variable
import torch
x = Variable(torch.FloatTensor([[1, 2, 3, 4]]), requires_grad=True)
z = 2*x
loss = z.sum(dim=1)
# do backward for first element of z
z.backward(torch.FloatTensor([[1, 0, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_() #remove gradient in x.grad, or it will be accumulated
# do backward for second element of z
z.backward(torch.FloatTensor([[0, 1, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()
# do backward for all elements of z, with weight equal to the derivative of
# loss w.r.t z_1, z_2, z_3 and z_4
z.backward(torch.FloatTensor([[1, 1, 1, 1]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()
# or we can directly backprop using loss
loss.backward() # equivalent to loss.backward(torch.FloatTensor([1.0]))
print(x.grad.data)
In the above example, the outcome of the first print is
2 0 0 0
[torch.FloatTensor of size 1x4]
which is exactly the derivative of z_1 w.r.t. x.
The outcome of the second print is:
0 2 0 0
[torch.FloatTensor of size 1x4]
which is the derivative of z_2 w.r.t. x.
Now if we use a weight of [1, 1, 1, 1] to calculate the derivative of z w.r.t. x, the outcome is 1*dz_1/dx + 1*dz_2/dx + 1*dz_3/dx + 1*dz_4/dx. So, not surprisingly, the output of the 3rd print is:
2 2 2 2
[torch.FloatTensor of size 1x4]
It should be noted that the weight vector [1, 1, 1, 1] is exactly the derivative of the loss w.r.t. z_1, z_2, z_3 and z_4. The derivative of the loss w.r.t. x is calculated as:
d(loss)/dx = d(loss)/dz_1 * dz_1/dx + d(loss)/dz_2 * dz_2/dx + d(loss)/dz_3 * dz_3/dx + d(loss)/dz_4 * dz_4/dx
So the output of the 4th print is the same as that of the 3rd print:
2 2 2 2
[torch.FloatTensor of size 1x4]
Typically, your computational graph has one scalar output, say loss. Then you can compute the gradient of loss w.r.t. the weights (w) by loss.backward(), where the default argument of backward() is 1.0.
If your output has multiple values (e.g. loss=[loss1, loss2, loss3]), you can compute the gradients of loss w.r.t. the weights by loss.backward(torch.FloatTensor([1.0, 1.0, 1.0])).
Furthermore, if you want to add weights or importances to different losses, you can use loss.backward(torch.FloatTensor([-0.1, 1.0, 0.0001])).
This computes -0.1*d(loss1)/dw + 1.0*d(loss2)/dw + 0.0001*d(loss3)/dw in a single backward pass and accumulates the result into w.grad.
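A small self-contained sketch (toy tensors with hypothetical names) of such a weighted backward call:
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)
loss = torch.stack([w.sum(), (w ** 2).sum(), w.prod()])   # loss1, loss2, loss3
loss.backward(torch.tensor([-0.1, 1.0, 0.0001]))
# w.grad now holds -0.1*d(loss1)/dw + 1.0*d(loss2)/dw + 0.0001*d(loss3)/dw
print(w.grad)   # tensor([3.9003, 5.9002])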
Here, the output of forward(), i.e. y, is a 3-vector.
The three values are the gradients at the output of the network. They are usually set to 1.0 if y is the final output, but can have other values as well, especially if y is part of a bigger network.
For eg. if x is the input, y = [y1, y2, y3] is an intermediate output which is used to compute the final output z,
Then,
dz/dx = dz/dy1 * dy1/dx + dz/dy2 * dy2/dx + dz/dy3 * dy3/dx
So here, the three values to backward are
[dz/dy1, dz/dy2, dz/dy3]
and then backward() computes dz/dx
I can no longer find the original code on the PyTorch website.
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)
The problem with the code above is that no function is given from which to calculate the gradients. This means we don't know how many parameters (arguments) the function takes, nor the dimensions of those parameters.
To fully understand this I created an example close to the original:
Example 1:
a = torch.tensor([1.0, 2.0, 3.0], requires_grad = True)
b = torch.tensor([3.0, 4.0, 5.0], requires_grad = True)
c = torch.tensor([6.0, 7.0, 8.0], requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients,retain_graph=True)
print(a.grad) # tensor([3.0000e-01, 3.0000e+00, 3.0000e-04])
print(b.grad) # tensor([1.2000e+00, 1.6000e+01, 2.0000e-03])
print(c.grad) # tensor([1.6667e-02, 1.4286e-01, 1.2500e-05])
I assumed our function is y=3*a + 2*b*b + torch.log(c) and the parameters are tensors with three elements inside.
You can think of gradients = torch.FloatTensor([0.1, 1.0, 0.0001]) as per-element weights: each element's gradient contribution is scaled by the corresponding weight before being accumulated into .grad.
As you may have heard, the PyTorch autograd system's calculation is equivalent to a Jacobian product.
In case you have a function, like we did:
y=3*a + 2*b*b + torch.log(c)
The Jacobian would be [3, 4*b, 1/c]. However, PyTorch does not form this Jacobian symbolically when it calculates the gradients at a certain point.
PyTorch uses forward pass and backward mode automatic differentiation (AD) in tandem.
There is no symbolic math involved and no numerical differentiation.
Numerical differentiation would be to calculate δy/δb, for b=1 and b=1+ε where ε is small.
If you don't pass gradients to y.backward():
Example 2
a = torch.tensor(0.1, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(0.1, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
y.backward()
print(a.grad) # tensor(3.)
print(b.grad) # tensor(4.)
print(c.grad) # tensor(10.)
You will simply get the result at a point, based on how you set your a, b, c tensors initially.
Be careful how you initialize your a, b, c:
Example 3:
a = torch.empty(1, requires_grad = True, pin_memory=True)
b = torch.empty(1, requires_grad = True, pin_memory=True)
c = torch.empty(1, requires_grad = True, pin_memory=True)
y=3*a + 2*b*b + torch.log(c)
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(a.grad) # tensor([3.3003])
print(b.grad) # tensor([0.])
print(c.grad) # tensor([inf])
If you use torch.empty() and don't use pin_memory=True you may have different results each time.
Also, note gradients are like accumulators so zero them when needed.
Example 4:
a = torch.tensor(1.0, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(1.0, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
y.backward(retain_graph=True)
y.backward()
print(a.grad) # tensor(6.)
print(b.grad) # tensor(8.)
print(c.grad) # tensor(2.)
Lastly, a few tips on the terms PyTorch uses:
PyTorch builds a dynamic computational graph during the forward pass and later calculates the gradients on it. This graph looks much like a tree.
So you will often hear that the leaves of this tree are the input tensors and the root is the output tensor.
Gradients are calculated by tracing the graph from the root to the leaf and multiplying every gradient in the way using the chain rule. This multiplying occurs in the backward pass.
Some time ago I created a PyTorch Automatic Differentiation tutorial that you may find interesting; it explains all the tiny details about AD.

Fitting a cross sectional surface profile to a generalized known formula to obtain coefficients and mathematically model the surface

I designed an optical system with an aspheric surface profile. I then had this lens manufactured and measured. I was given a cross-sectional graph from the measurement of the manufactured surface profile. (The surface has rotational symmetry.)
The formula being used to model said aspheric surface is:
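(The formula image is not reproduced here; judging from the fitting code in the answer below, it is presumably the standard even-asphere sag equation, with R the radius of curvature, k the conic constant and alpha_i the aspheric coefficients:)
z(r) = r^2 / ( R * (1 + sqrt(1 - (1 + k) * r^2 / R^2)) ) + alpha_2*r^2 + alpha_4*r^4 + alpha_6*r^6 + alpha_8*r^8 + alpha_10*r^10 + alpha_12*r^12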
How can I fit this generalized equation to my cross-sectional curve to obtain the corresponding alpha coefficients (those in the formula above)? I know the radius of curvature of the surface.
I have access to Python and Matlab (no toolboxes) to achieve this. I can also obtain digitized, tabulated data points from the curve.
Assuming you have an array of discrete r values and, for each value of this array, z(r): you want to fit a curve to estimate the parameters of an aspheric lens. I will use lmfit, as mentioned here, to show one way to do this using Python.
Importing the modules used for this:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model, Parameters
Define the function of an aspheric lens:
def asphere_complete(x, r0, k, a2, a4, a6, a8, a10, a12):
    r_squared = x ** 2.
    z_even_r = r_squared * (a2 + (r_squared * (a4 + r_squared * (a6 + r_squared * (a8 + r_squared * (a10 + (r_squared * a12)))))))
    square_root_term = 1 - (1 + k) * ((x / r0) ** 2)
    zg = (x ** 2) / (r0 * (1 + np.sqrt(square_root_term)))
    return z_even_r + zg
As you do not provide any data, I will use the following to create some example data, including artificial noise:
def generate_dummy_data(x, asphere_parameters, noise_sigma, seed=12345):
    np.random.seed(seed)
    return asphere_complete(x, **asphere_parameters) + noise_sigma * np.random.randn(x.shape[0])
The following function does the fitting and plots the resulting curve:
def fit_asphere(r, z, fit_parameters):
    # create two subplots to plot the original data and the fit in one plot and the residual in another
    fig, axarr = plt.subplots(1, 2, figsize=(10, 5))
    fit_plot = axarr[0]
    residuum_plot = axarr[1]
    # configure first plot:
    fit_plot.set_xlabel("r")
    fit_plot.set_ylabel("z")
    fit_plot.grid()
    # configure second plot:
    residuum_plot.set_xlabel("r")
    residuum_plot.set_ylabel("$\Delta$z")
    residuum_plot.grid()
    # plot original data
    fit_plot.plot(r, z, label="Input")
    # create an lmfit model and the parameters
    function_model = Model(asphere_complete)
    # The fitting procedure may throw ValueErrors, if the radicand gets negative
    try:
        result = function_model.fit(z, fit_parameters, x=r)
        # To plot the resulting curve remove the parameters which were just used for the constraints
        opt_parameters = dict(result.values)
        opt_parameters.pop('r_max', None)
        opt_parameters.pop('radicand', None)
        # calculate z-values of fitted curve:
        z_fitted = asphere_complete(r, **opt_parameters)
        # calculate residual values
        z_residual = z - z_fitted
        # plot fit and residual:
        fit_plot.plot(r, z_fitted, label="Fit")
        residuum_plot.plot(r, z_residual, label="Residual")
        # legends:
        fit_plot.legend(loc="best")
        residuum_plot.legend(loc="best")
        print(result.fit_report())
    except ValueError as val_error:
        print("Fit Failed: ")
        print(val_error)
To set the parameters of the example data I use the Parameters object of lmfit:
if __name__ == "__main__":
    parameters_dummy = Parameters()
    parameters_dummy.add('r0', value=-34.4)
    parameters_dummy.add('k', value=-0.98)
    parameters_dummy.add('a2', value=0)
    parameters_dummy.add('a4', value=-9.67e-9)
    parameters_dummy.add('a6', value=1.59e-10)
    parameters_dummy.add('a8', value=-5.0e-12)
    parameters_dummy.add('a10', value=0)
    parameters_dummy.add('a12', value=-1.0e-19)
Create the example data:
r = np.linspace(0, 35, 1000)
z = generate_dummy_data(r, parameters_dummy, 0.00001)
The reason to use lmfit instead of scipy's curve_fit is that the radicand of the square root may become negative. We need to ensure:
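(The inequality image is not reproduced here; from square_root_term in asphere_complete it is presumably:)
1 - (1 + k) * (r / r0)^2 >= 0   for all measured r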
Therefore, we need to define a constraint as mentioned here.
Let's start by defining the parameters we want to use in the fit. The basic radius is added straightforwardly:
parameters = Parameters()
parameters.add('r0', value=-30, vary=True)
To obey the inequality, add a variable radicand which is not allowed to become less than zero. Instead of letting k take part in the fitting normally, make it directly dependent on r0, r_max and radicand. We need to use r_max because the inequality is most restrictive at the maximal r. Solving the inequality for k leads to
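(The expression image is not reproduced here; solving the inequality at r = r_max and introducing the non-negative slack variable radicand presumably gives:)
k = (r0 / r_max)^2 * (1 - radicand) - 1,   with radicand >= 0   (equivalently k <= (r0 / r_max)^2 - 1)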
which is used as expr below. I use a bool flag to switch the constraint on/off:
keep_radicand_safe = True
if keep_radicand_safe:
    r_max = np.max(r)
    parameters.add('r_max', r_max, vary=False)
    parameters.add('radicand', value=0.98, vary=True, min=0)
    parameters.add('k', expr='(r0/r_max)**2*(1-radicand)-1')
else:
    parameters.add('k', value=-0.98, vary=True)
The remaining parameters are added straightforwardly:
parameters.add('a2', value=0, vary=False)
parameters.add('a4', value=0, vary=True)
parameters.add('a6', value=0, vary=True)
parameters.add('a8', value=0, vary=True)
parameters.add('a10', value=0, vary=False)
parameters.add('a12', value=0, vary=True)
Now we are ready to start and get our results:
fit_asphere(r, z, parameters)
plt.show()
On the console you should see the output:
[[Variables]]
r0: -34.3999435 +/- 6.1027e-05 (0.00%) (init = -30)
r_max: 35 (fixed)
radicand: 0.71508611 +/- 0.09385813 (13.13%) (init = 0.98)
k: -0.72477176 +/- 0.09066656 (12.51%) == '(r0/r_max)**2*(1-radicand)-1'
a2: 0 (fixed)
a4: 7.7436e-07 +/- 2.7872e-07 (35.99%) (init = 0)
a6: 2.5547e-10 +/- 6.3330e-11 (24.79%) (init = 0)
a8: -4.9832e-12 +/- 1.7115e-14 (0.34%) (init = 0)
a10: 0 (fixed)
a12: -9.8670e-20 +/- 2.0716e-21 (2.10%) (init = 0)
With the data I use above, you should see the fit fail if keep_radicand_safe is set to False.

How to monitor tensor values in Theano/Keras?

I know this question has been asked in various forms, but I can't really find any answer I can understand and use. So forgive me if this is a basic question, 'cause I'm a newbie to these tools (Theano/Keras).
Problem to Solve
Monitor variables in Neural Networks
(e.g. input/forget/output gate values in LSTM)
What I'm currently getting
No matter at which stage I try to get those values, I'm getting something like:
Elemwise{mul,no_inplace}.0
Elemwise{mul,no_inplace}.0
[for{cpu,scan_fn}.2, Subtensor{int64::}.0, Subtensor{int64::}.0]
[for{cpu,scan_fn}.2, Subtensor{int64::}.0, Subtensor{int64::}.0]
Subtensor{int64}.0
Subtensor{int64}.0
Is there any way I can monitor them (e.g. print to stdout, write to a file, etc.)?
Possible Solution
It seems like callbacks in Keras could do the job, but it doesn't work for me either. I'm getting the same thing as above.
My Guess
Seems like I'm making very simple mistakes.
Thank you very much in advance, everyone.
ADDED
Specifically, I'm trying to monitor input/forget/output gating values in LSTM.
I found that LSTM.step() is for computing those values:
def step(self, x, states):
    h_tm1 = states[0]  # hidden state of the previous time step
    c_tm1 = states[1]  # cell state from the previous time step
    B_U = states[2]    # dropout matrices for recurrent units?
    B_W = states[3]    # dropout matrices for input units?
    if self.consume_less == 'cpu':  # just cut x into 4 pieces in columns
        x_i = x[:, :self.output_dim]
        x_f = x[:, self.output_dim: 2 * self.output_dim]
        x_c = x[:, 2 * self.output_dim: 3 * self.output_dim]
        x_o = x[:, 3 * self.output_dim:]
    else:
        x_i = K.dot(x * B_W[0], self.W_i) + self.b_i
        x_f = K.dot(x * B_W[1], self.W_f) + self.b_f
        x_c = K.dot(x * B_W[2], self.W_c) + self.b_c
        x_o = K.dot(x * B_W[3], self.W_o) + self.b_o
    i = self.inner_activation(x_i + K.dot(h_tm1 * B_U[0], self.U_i))
    f = self.inner_activation(x_f + K.dot(h_tm1 * B_U[1], self.U_f))
    c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1 * B_U[2], self.U_c))
    o = self.inner_activation(x_o + K.dot(h_tm1 * B_U[3], self.U_o))
    with open("test_visualization.txt", "a") as myfile:
        myfile.write(str(i) + "\n")
    h = o * self.activation(c)
    return h, [h, c]
And as shown in the code above, I tried to write the value of i into a file, but it only gave me values like:
Elemwise{mul,no_inplace}.0
[for{cpu,scan_fn}.2, Subtensor{int64::}.0, Subtensor{int64::}.0]
Subtensor{int64}.0
So I tried i.eval() or i.get_value(), but both failed to give me values.
.eval() gave me this:
theano.gof.fg.MissingInputError: An input of the graph, used to compute Subtensor{::, :int64:}(<TensorType(float32, matrix)>, Constant{10}), was not provided and not given a value.Use the Theano flag exception_verbosity='high',for more information on this error.
and .get_value() gave me this:
AttributeError: 'TensorVariable' object has no attribute 'get_value'
So I backtracked those call chains (which line calls which functions) and tried to get values at every step I found, but in vain.
It feels like I'm falling into some basic pitfall.
I use the solution described in the Keras FAQ:
http://keras.io/getting-started/faq/#how-can-i-visualize-the-output-of-an-intermediate-layer
In detail:
from keras import backend as K
intermediate_tensor_function = K.function([model.layers[0].input],[model.layers[layer_of_interest].output])
intermediate_tensor = intermediate_tensor_function([thisInput])[0]
yields:
array([[ 3., 17.]], dtype=float32)
However I'd like to use the functional API but I can't seem to get the actual tensor, only the symbolic representation. For example:
model.layers[1].output
yields:
<tf.Tensor 'add:0' shape=(?, 2) dtype=float32>
I'm missing something about the interaction of Keras and Tensorflow here but I'm not sure what. Any insight much appreciated.
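For what it's worth, the same K.function approach from above also works with a functional-API model; here is a minimal sketch (reusing the model and thisInput names from above, and assuming the model is already built), since the symbolic tensor only becomes concrete numbers once it is evaluated with real input:
from keras import backend as K

get_layer1_output = K.function([model.layers[0].input], [model.layers[1].output])
layer1_values = get_layer1_output([thisInput])[0]   # a numpy array, not a symbolic tf.Tensor
print(layer1_values)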
One solution is to create a version of your network that is truncated at the LSTM layer whose gate values you want to monitor, and then replace the original layer with a custom layer in which the step function is modified to return not only the hidden layer values, but also the gate values.
For instance, say you want to access the gate values of a GRU. Create a custom layer GRU2 that inherits everything from the GRU class, but adapt the step function such that it returns a concatenation of the states you want to monitor, and then takes only the part containing the previous hidden layer activations when computing the next activations. I.e.:
def step(self, x, states):
    # get prev hidden layer from input that is concatenation of
    # prev hidden layer + reset gate + update gate
    x = x[:self.output_dim, :]
    ###############################################
    # This is the original code from the GRU layer
    #
    h_tm1 = states[0]  # previous memory
    B_U = states[1]    # dropout matrices for recurrent units
    B_W = states[2]
    if self.consume_less == 'gpu':
        matrix_x = K.dot(x * B_W[0], self.W) + self.b
        matrix_inner = K.dot(h_tm1 * B_U[0], self.U[:, :2 * self.output_dim])
        x_z = matrix_x[:, :self.output_dim]
        x_r = matrix_x[:, self.output_dim: 2 * self.output_dim]
        inner_z = matrix_inner[:, :self.output_dim]
        inner_r = matrix_inner[:, self.output_dim: 2 * self.output_dim]
        z = self.inner_activation(x_z + inner_z)
        r = self.inner_activation(x_r + inner_r)
        x_h = matrix_x[:, 2 * self.output_dim:]
        inner_h = K.dot(r * h_tm1 * B_U[0], self.U[:, 2 * self.output_dim:])
        hh = self.activation(x_h + inner_h)
    else:
        if self.consume_less == 'cpu':
            x_z = x[:, :self.output_dim]
            x_r = x[:, self.output_dim: 2 * self.output_dim]
            x_h = x[:, 2 * self.output_dim:]
        elif self.consume_less == 'mem':
            x_z = K.dot(x * B_W[0], self.W_z) + self.b_z
            x_r = K.dot(x * B_W[1], self.W_r) + self.b_r
            x_h = K.dot(x * B_W[2], self.W_h) + self.b_h
        else:
            raise Exception('Unknown `consume_less` mode.')
        z = self.inner_activation(x_z + K.dot(h_tm1 * B_U[0], self.U_z))
        r = self.inner_activation(x_r + K.dot(h_tm1 * B_U[1], self.U_r))
        hh = self.activation(x_h + K.dot(r * h_tm1 * B_U[2], self.U_h))
    h = z * h_tm1 + (1 - z) * hh
    #
    # End of original code
    ###########################################################
    # concatenate states you want to monitor, in this case the
    # hidden layer activations and gates z and r
    all = K.concatenate([h, z, r])
    # return everything
    return all, [h]
(Note that the only lines I added are at the beginning and end of the function).
If you then run your network with GRU2 as the last layer instead of GRU (with return_sequences = True for the GRU2 layer), you can just call predict on your network; this will give you all hidden layer and gate values.
The same thing should work for LSTM, although you might have to puzzle a bit to figure out how to store all the outputs you want in one vector and retrieve them again afterwards.
Hope that helps!
You can use Theano's printing module to print during execution (and not during definition, which is what you're doing now and the reason why you're not getting the values but their abstract definition).
Print
Just use the Print function. Don't forget to use the output of Print to continue your graph, otherwise the output will be disconnected and Print will most likely be removed during optimisation. And you will see nothing.
from keras import backend as K
from theano.printing import Print
def someLossFunction(x, ref):
    loss = K.square(x - ref)
    loss = Print('Loss tensor (before sum)')(loss)
    loss = K.sum(loss)
    loss = Print('Loss scalar (after sum)')(loss)
    return loss
Plot
A little bonus you might enjoy.
The Print class has a global_fn parameter to override the default callback used for printing. You can provide your own function and directly access the data, to build a plot for instance.
from keras import backend as K
from theano.printing import Print
import matplotlib.pyplot as plt
curve = []
# the callback function
def myPlottingFn(printObj, data):
    global curve
    # Store scalar data
    curve.append(data)
    # Plot it
    fig, ax = plt.subplots()
    ax.plot(curve, label=printObj.message)
    ax.legend(loc='best')
    plt.show()
def someLossFunction(x, ref):
    loss = K.sum(K.square(x - ref))
    # Callback is defined on the line below
    loss = Print('Loss scalar (after sum)', global_fn=myPlottingFn)(loss)
    return loss
BTW, the string you pass to Print('...') is stored in the Print object under the property name message (see the function myPlottingFn). This is useful for building multi-curve plots automatically.

How to implement multi-class hinge loss in tensorflow

I want to implement multi-class hinge loss in tensorflow. The formulation is as follows:
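(The formula image is not reproduced here; based on the mention of the second-highest score below and the implementations in the answers, it is presumably the max-margin multi-class hinge loss:)
loss(x, y) = max(0, 1 + max_{j != y} s_j(x) - s_y(x))
where s_j(x) is the unscaled score for class j and y is the true class.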
I find it difficult to get the second max prediction probability when the prediction is correct. I tried to use tf.nn.top_k to calculate it, but unfortunately tf.nn.top_k doesn't implement the gradient operation. So how can I implement this?
top_k has gradients, added in version 0.8 here
Adding another implementation with three lines of code
scores: unscaled scores, tensor, shape=(n_classes, batch_size), dtype=float32
classes: one-hot class labels, tensor, shape=(n_classes, batch_size), dtype=float32
This implements the above loss by choosing only the most violated class instead of considering all classes:
#H - hard negative for each sample
H = tf.reduce_max(scores * (1 - classes), 0)
L = tf.nn.relu((1 - scores + H) * classes)
final_loss = tf.reduce_mean(tf.reduce_max(L, 0))
Another implementation where we sum over all negative classes
# implements loss as sum_(j~=y) max(0, 1 - s(x, y) + s(x, j))
def multiclasshingeloss1(scores, classes):
    true_classes = tf.argmax(classes, 0)
    idx_flattened = tf.range(0, scores.get_shape()[1]) * scores.get_shape()[0] + \
        tf.cast(true_classes, dtype=tf.int32)
    true_scores = tf.gather(tf.reshape(tf.transpose(scores), [-1]),
                            idx_flattened)
    L = tf.nn.relu((1 - true_scores + scores) * (1 - classes))
    final_loss = tf.reduce_mean(L)
    return final_loss
You can minimize the transposes here based on your implementation.
My implementation is as follows but I think there must be more efficient implementations.
logits: unscaled scores, tensor, shape=(batch_size, n_classes)
label: tensor, shape=(batch_size, )
batch_size, n_classes: int
def multi_class_hinge_loss(logits, label, batch_size, n_classes):
    # get the correct logit
    flat_logits = tf.reshape(logits, (-1,))
    correct_id = tf.range(0, batch_size) * n_classes + label
    correct_logit = tf.gather(flat_logits, correct_id)
    # get the wrong maximum logit
    max_label = tf.argmax(logits, 1)
    top2, _ = tf.nn.top_k(logits, k=2, sorted=True)
    top2 = tf.split(1, 2, top2)
    for i in xrange(2):
        top2[i] = tf.reshape(top2[i], (batch_size, ))
    wrong_max_logit = tf.select(tf.equal(max_label, label), top2[1], top2[0])
    # calculate multi-class hinge loss
    return tf.reduce_mean(tf.maximum(0., 1. + wrong_max_logit - correct_logit))