How to implement multi-class hinge loss in tensorflow - neural-network

I want to implement multi-class hinge loss in tensorflow. The formulation is as follows:
I find it difficult to get the second max prediction probability when the prediction is correct. I tried to use tf.nn.top_k to calculate it, but unfortunately tf.nn.top_k doesn't implement the gradient operation. So how can I implement this?

tf.nn.top_k does have a gradient; it was added in version 0.8.

Adding another implementation with three lines of code
scores: unscaled scores, tensor, shape=(n_classes, batch_size), dtype=float32
classes: one-hot labels, tensor, shape=(n_classes, batch_size), dtype=float32
To implement the loss above using only the most violated class instead of all classes:
#H - hard negative for each sample
H = tf.reduce_max(scores * (1 - classes), 0)
L = tf.nn.relu((1 - scores + H) * classes)
final_loss = tf.reduce_mean(tf.reduce_max(L, 0))
Another implementation where we sum over all negative classes
# implements loss as sum_(j~=y) max(0, 1 - s(x, y) + s(x, j))
def multiclasshingeloss1(scores, classes):
    true_classes = tf.argmax(classes, 0)
    # index of each sample's true-class score in the flattened, transposed score matrix
    idx_flattened = tf.range(0, scores.get_shape()[1]) * scores.get_shape()[0] + \
        tf.cast(true_classes, dtype=tf.int32)
    true_scores = tf.gather(tf.reshape(tf.transpose(scores), [-1]),
                            idx_flattened)
    L = tf.nn.relu((1 - true_scores + scores) * (1 - classes))
    final_loss = tf.reduce_mean(L)
    return final_loss
You can minimize the transposes here based on your implementation.
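For checking either TensorFlow version on small inputs, a plain NumPy reference can help. The following is only a sketch under some assumptions: the function names are hypothetical, it uses a (batch_size, n_classes) layout with integer labels rather than the (n_classes, batch_size) one-hot layout above, it masks the true class with -inf instead of multiplying by (1 - classes), and the summed variant averages per-sample sums rather than all entries.

```python
import numpy as np

def hinge_max_violator(scores, labels, margin=1.0):
    """Mean over samples of max(0, margin + max_{j != y} s_j - s_y).

    scores: float array, shape (batch_size, n_classes)
    labels: int array, shape (batch_size,)
    """
    rows = np.arange(scores.shape[0])
    true_scores = scores[rows, labels]
    # Mask out the true class so it cannot be picked as the hardest negative.
    masked = scores.copy()
    masked[rows, labels] = -np.inf
    hardest_negative = masked.max(axis=1)
    return np.maximum(0.0, margin + hardest_negative - true_scores).mean()

def hinge_sum_negatives(scores, labels, margin=1.0):
    """Mean over samples of sum_{j != y} max(0, margin + s_j - s_y)."""
    rows = np.arange(scores.shape[0])
    true_scores = scores[rows, labels][:, None]
    losses = np.maximum(0.0, margin + scores - true_scores)
    losses[rows, labels] = 0.0  # the true class contributes nothing
    return losses.sum(axis=1).mean()

scores = np.array([[2.0, 0.5, -1.0],
                   [0.2, 0.1, 0.4]])
labels = np.array([0, 1])
print(hinge_max_violator(scores, labels))
print(hinge_sum_negatives(scores, labels))
```

The first sample is classified with a margin larger than 1 and contributes nothing; the second is misclassified and contributes to both losses.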

My implementation is as follows but I think there must be more efficient implementations.
logits: unscaled scores, tensor, shape=(batch_size, n_classes)
label: tensor, shape=(batch_size, )
batch_size, n_classes: int
def multi_class_hinge_loss(logits, label, batch_size, n_classes):
    # get the correct logit
    flat_logits = tf.reshape(logits, (-1,))
    correct_id = tf.range(0, batch_size) * n_classes + label
    correct_logit = tf.gather(flat_logits, correct_id)
    # get the wrong maximum logit
    max_label = tf.argmax(logits, 1)
    top2, _ = tf.nn.top_k(logits, k=2, sorted=True)
    top2 = tf.split(1, 2, top2)
    for i in xrange(2):
        top2[i] = tf.reshape(top2[i], (batch_size, ))
    # if the top prediction is the true label, the best wrong logit is the runner-up
    # (tf.select is the pre-1.0 name of tf.where)
    wrong_max_logit = tf.select(tf.equal(max_label, label), top2[1], top2[0])
    # calculate multi-class hinge loss
    return tf.reduce_mean(tf.maximum(0., 1. + wrong_max_logit - correct_logit))


How to run an exponential decay mixed model?

I am not familiar with nonlinear regression and would appreciate some help with running an exponential decay model in R. Please see the graph for what the data look like. My hunch is that an exponential model might be a good choice. I have one fixed effect and one random effect: y ~ x + (1|random factor). How do I get starting values for the exponential model in R (please assume that I know nothing about nonlinear regression)? How do I subsequently run a nonlinear model with these starting values? Could anyone please help me with the logic as well as the R code?
As I am not familiar with nonlinear regression, I haven't been able to attempt it in R.
raw plot
The correct syntax will depend on your experimental design and model but I hope to give you a general idea on how to get started.
We begin by generating some data that should match the type of data you are working with. You had mentioned a fixed factor and a random one. Here, the fixed factor is represented by the variable treatment and the random factor is represented by the variable grouping_factor.
## Setting this seed should allow you to reach the same result as me
example_data <- expand.grid(treatment = c("A", "B"),
grouping_factor = c('1', '2', '3'),
replication = c(1, 2, 3),
xvar = 1:15)
The next step is to create some "observations". Here, we use an exponential function y=a∗exp(c∗x) and some random noise to create some data. Also, we add a constant to treatment A just to create some treatment differences.
example_data$y <- ave(example_data$xvar, example_data[, c('treatment', 'replication', 'grouping_factor')],
FUN = function(x) {expf(x = x,
a = 10,
c = -0.3) + rnorm(1, 0, 0.6)})
example_data$y[example_data$treatment == 'A'] <- example_data$y[example_data$treatment == 'A'] + 0.8
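The question also asks how starting values for an exponential model can be obtained. One common approach, shown here as a Python sketch rather than R, is to fit a straight line to log(y), because y = a∗exp(c∗x) implies log(y) = log(a) + c∗x. The function name `exp_start_values` and the simulated data are assumptions made for illustration only, and this is not necessarily how a self-starting model function picks its starting values.

```python
import numpy as np

def exp_start_values(x, y):
    """Rough starting values (a, c) for y = a * exp(c * x).

    Taking logs gives log(y) = log(a) + c * x, a straight line,
    so an ordinary least-squares line fit yields log(a) and c.
    Only points with y > 0 can be used.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = y > 0
    c, log_a = np.polyfit(x[keep], np.log(y[keep]), 1)
    return np.exp(log_a), c

rng = np.random.default_rng(0)
x = np.arange(1, 16)
# multiplicative noise keeps every simulated y positive
y = 10 * np.exp(-0.3 * x) * np.exp(rng.normal(0, 0.05, x.size))
a0, c0 = exp_start_values(x, y)
print(a0, c0)
```

The values returned this way are only meant to be good enough for a nonlinear optimizer to converge from; the final estimates come from the nonlinear fit itself.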
All right, now we start fitting the model.
## Create a grouped data frame
exampleG <- groupedData(y ~ xvar|grouping_factor, data = example_data)
## Fit a separate model to each grouped level
fitL <- nlsList(y ~ SSexpf(xvar, a, c), data = exampleG)
## Combine the per-group fits into a single nonlinear mixed model
fit1 <- nlme(fitL)
## Grab the coefficients of the general model
fxf <- fixed.effects(fit1)
## Add treatment as a fixed effect. Also, use the coefficients from the previous
## regression model as starting values.
fit2 <- update(fit1, fixed = a + c ~ treatment,
start = c(fxf[1], 0,
fxf[2], 0))
Looking at the model output, it will give you information like the following:
Nonlinear mixed-effects model fit by maximum likelihood
Model: y ~ SSexpf(xvar, a, c)
Data: exampleG
AIC BIC logLik
475.8632 504.6506 -229.9316
Random effects:
Formula: list(a ~ 1, c ~ 1)
Level: grouping_factor
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
a.(Intercept) 3.254827e-04 a.(In)
c.(Intercept) 1.248580e-06 0
Residual 5.670317e-01
Fixed effects: a + c ~ treatment
Value Std.Error DF t-value p-value
a.(Intercept) 9.634383 0.2189967 264 43.99329 0.0000
a.treatmentB 0.353342 0.3621573 264 0.97566 0.3301
c.(Intercept) -0.204848 0.0060642 264 -33.77976 0.0000
c.treatmentB -0.092138 0.0120463 264 -7.64867 0.0000
a.(In) a.trtB c.(In)
a.treatmentB -0.605
c.(Intercept) -0.785 0.475
c.treatmentB 0.395 -0.792 -0.503
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.93208903 -0.34340037 0.04767133 0.78924247 1.95516431
Number of Observations: 270
Number of Groups: 3
Then, if you wanted to visualize the model fit, you could do the following.
## Here we store the model predictions for visualization purposes
predictionsDf <- cbind(example_data,
predict_nlme(fit2, interval = 'conf'))
## Here we make a graph to check it out
ggplot() +
  geom_ribbon(data = predictionsDf,
              aes( x = xvar , ymin = Q2.5, ymax = Q97.5, fill = treatment),
              color = NA, alpha = 0.3)+
  geom_point(data = example_data, aes( x = xvar, y = y, col = treatment))+
  geom_line(data = predictionsDf, aes(x = xvar, y = Estimate, col = treatment), size = 1.1)
This shows the model fit.

I want to use NumPy to simulate the inference process of a quantized MobileNet V2 network, but the result differs from the PyTorch one

Python version: 3.8
Pytorch version: 1.9.0+cpu
Platform: Anaconda Spyder5.0
To reproduce this problem, just copy every code below to a single file.
The ILSVRC2012_val_00000293.jpg file used in this code is shown below, you also need to download it and then change its destination in the code.
Some background of this problem:
I am now working on a project that aims to develop a hardware accelerator to complete the inference process of the MobileNet V2 network. I used pretrained quantized Pytorch model to simulate the outcome, and the result comes out very well.
In order to use hardware to complete this task, I want to know every input and output as well as the intermediate variables produced while running this piece of PyTorch code. I used a package named torchextractor to fetch the output of the first layer, which in this case is a 3*3 convolution layer.
import numpy as np
import torchvision
import torch
from torchvision import transforms, datasets
from PIL import Image
from torchvision import transforms
import torchextractor as tx
import math
##### Processing of input image
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
# Resize/crop sizes are assumed from the standard ImageNet pipeline;
# a 224x224 input matches the 112x112 layer-1 output used further down
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])
#image file destination
filename = r"D:\Project_UM\MobileNet_VC709\MobileNet_pytorch\ILSVRC2012_val_00000293.jpg"
input_image = Image.open(filename)
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)
#----First verify that the torchextractor class should not influent the inference outcome
# ofmp of layer1 before putting into torchextractor
a,b,c = quantize_tensor(input_batch)# to quantize the input tensor and return an int8 tensor, scale and zero point
input_qa = torch.quantize_per_tensor(torch.tensor(input_batch.clone().detach()), b, c, torch.quint8)# Using quantize_per_tensor method of torch
# Load a quantized mobilenet_v2 model
model_quantized = torchvision.models.quantization.mobilenet_v2(pretrained=True, quantize=True)
with torch.no_grad():
    output = model_quantized.features[0][0](input_qa)# Ofmp of layer1, datatype : quantized_tensor
# print("FM of layer1 before tx_extractor:\n",output.int_repr())# Ofmp of layer1, datatype : int8 tensor
output1_clone = output.int_repr().detach().numpy()# Clone ofmp of layer1, datatype : ndarray
# ofmp of layer1 after adding torchextractor
model_quantized_ex = tx.Extractor(model_quantized, ["features.0.0"])#Capture of the module inside first layer
model_output, features = model_quantized_ex(input_batch)# Forward propagation
# feature_shapes = {name: f.shape for name, f in features.items()}
# print(features['features.0.0']) # Ofmp of layer1, datatype : quantized_tensor
out1_clone = features['features.0.0'].int_repr().numpy() # Clone ofmp of layer1, datatype : ndarray
if np.array_equal(out1_clone, output1_clone):
    print('Model with torchextractor attached output the same value as the original model')
else:
    print('Torchextractor method influence the outcome')
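When checking that two integer feature maps are identical, note that comparing `x.all()` with `y.all()` only compares two booleans, so it can report a match for arrays that differ. A minimal sketch with made-up arrays (not data from the model) shows the difference from an element-wise comparison:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[1, 2], [3, 5]])   # differs from a in one element

# .all() reduces each array to a single bool ("are all elements non-zero?"),
# so this compares True with True and reports a match that is not there.
print(a.all() == b.all())        # True, although a != b

# Element-wise comparison, reduced afterwards, is what was intended.
print(np.array_equal(a, b))      # False
print((a == b).mean())           # fraction of matching elements: 0.75
```

The last line doubles as a hit rate, which is convenient when the two arrays are expected to agree only approximately.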
Here I define a numpy quantization scheme based on the quantization scheme proposed in "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference".
# Convert a normal regular tensor to a quantized tensor with scale and zero_point
def quantize_tensor(x, num_bits=8):# to quantize the input tensor and return an int8 tensor, scale and zero point
    qmin = 0.
    qmax = 2.**num_bits - 1.
    min_val, max_val = x.min(), x.max()
    scale = (max_val - min_val) / (qmax - qmin)
    initial_zero_point = qmin - min_val / scale
    zero_point = 0
    if initial_zero_point < qmin:
        zero_point = qmin
    elif initial_zero_point > qmax:
        zero_point = qmax
    else:
        zero_point = initial_zero_point
    # print(zero_point)
    zero_point = int(zero_point)
    q_x = zero_point + x / scale
    q_x.clamp_(qmin, qmax).round_()
    q_x = q_x.round().byte()
    return q_x, scale, zero_point
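To see what such an affine scheme does to actual numbers, here is a small NumPy-only round trip. It is a sketch, not a copy of the function above: the names `quantize` and `dequantize` are hypothetical, and it rounds the zero point instead of truncating it with int(), which is an assumption on my part.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine quantization: q = clip(round(x / scale) + zero_point, qmin, qmax)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.clip(np.round(qmin - x.min() / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Inverse mapping: x is approximately scale * (q - zero_point)."""
    # widen to int32 first so the subtraction cannot wrap around in uint8
    return scale * (q.astype(np.int32) - zero_point)

x = np.linspace(-2.0, 3.0, 11)
q, scale, zp = quantize(x)
x_hat = dequantize(q, scale, zp)
print(q)
print(np.abs(x - x_hat).max(), "<=", scale)
```

The reconstruction error stays within one quantization step, which is the kind of off-by-one difference a comparison against a reference implementation can be expected to show.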
# #############################################################################################
# --------- Simulate the inference process of layer0: conv33 using numpy
# #############################################################################################
# get the input_batch quantized buffer data
input_scale = b.item()
input_zero = c
input_quantized = a[0].detach().numpy()
# get the layer0 output scale and zero_point
output_scale = model_quantized.features[0][0].state_dict()['scale'].item()
output_zero = model_quantized.features[0][0].state_dict()['zero_point'].item()
# get the quantized weight with scale and zero_point
weight_scale = model_quantized.features[0][0].state_dict()["weight"].q_scale()
weight_zero = model_quantized.features[0][0].state_dict()["weight"].q_zero_point()
weight_quantized = model_quantized.features[0][0].state_dict()["weight"].int_repr().numpy()
# print(weight_quantized)
# print(weight_quantized.shape)
# bias_quantized,bias_scale,bias_zero= quantize_tensor(model_quantized.features[0][0].state_dict()["bias"])# to quantize the input tensor and return an int8 tensor, scale and zero point
# print(bias_quantized.shape)
bias = model_quantized.features[0][0].state_dict()["bias"].detach().numpy()
# print(input_quantized)
Then I wrote a quantized 2D convolution using numpy, hoping to figure out every detail of the PyTorch data flow during inference.
#%% numpy simulated layer0 convolution function define
def conv_cal(input_quantized, weight_quantized, kernel_size, stride, out_i, out_j, out_k):
    weight = weight_quantized[out_i]
    input = np.zeros((input_quantized.shape[0], kernel_size, kernel_size))
    for i in range(weight.shape[0]):
        for j in range(weight.shape[1]):
            for k in range(weight.shape[2]):
                input[i][j][k] = input_quantized[i][stride*out_j+j][stride*out_k+k]
    # print(input,"\n")
    # print(weight)
    return np.multiply(weight,input).sum()
def QuantizedConv2D(input_scale, input_zero, input_quantized, output_scale, output_zero, weight_scale, weight_zero, weight_quantized, bias, kernel_size, stride, padding, ofm_size):
    output = np.zeros((weight_quantized.shape[0],ofm_size,ofm_size))
    input_quantized_padding = np.full((input_quantized.shape[0],input_quantized.shape[1]+2*padding,input_quantized.shape[2]+2*padding),0)
    zero_temp = np.full(input_quantized.shape,input_zero)
    input_quantized = input_quantized - zero_temp
    for i in range(input_quantized.shape[0]):
        for j in range(padding,padding + input_quantized.shape[1]):
            for k in range(padding,padding + input_quantized.shape[2]):
                input_quantized_padding[i][j][k] = input_quantized[i][j-padding][k-padding]
    zero_temp = np.full(weight_quantized.shape, weight_zero)
    weight_quantized = weight_quantized - zero_temp
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            for k in range(output.shape[2]):
                # output[i][j][k] = (weight_scale*input_scale)*conv_cal(input_quantized_padding, weight_quantized, kernel_size, stride, i, j, k) + bias[i] #floating_output
                output[i][j][k] = weight_scale*input_scale/output_scale*conv_cal(input_quantized_padding, weight_quantized, kernel_size, stride, i, j, k) + bias[i]/output_scale + output_zero
                output[i][j][k] = round(output[i][j][k])
                # int_output
    return output
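The arithmetic for a single output value can be checked in isolation. The sketch below uses made-up scales, zero points and random integer data (none of them taken from the MobileNet model) and compares the "integer accumulate, then rescale" formula with a plain dequantize, compute in float, requantize reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quantization parameters, chosen only for illustration.
s_x, z_x = 0.02, 114      # input scale / zero point (uint8 range)
s_w, z_w = 0.005, 0       # weight scale / zero point (int8 range)
s_y, z_y = 0.05, 0        # output scale / zero point (uint8 range)

# int32 copies so that subtracting a zero point cannot wrap around
q_x = rng.integers(0, 256, size=(3, 3, 3)).astype(np.int32)     # one 3x3x3 input patch
q_w = rng.integers(-128, 128, size=(3, 3, 3)).astype(np.int32)  # one 3x3x3 filter
bias = 0.3                                                      # float bias

# Integer accumulation followed by a single rescale.
acc = ((q_x - z_x) * (q_w - z_w)).sum()
q_y = np.clip(np.round(s_x * s_w / s_y * acc + bias / s_y) + z_y, 0, 255)

# Reference: dequantize, compute in float, then quantize the result.
y_float = (s_x * (q_x - z_x) * s_w * (q_w - z_w)).sum() + bias
q_ref = np.clip(np.round(y_float / s_y) + z_y, 0, 255)

print(q_y, q_ref)
```

The two results can differ by at most one step from floating-point rounding. Larger gaps usually point to mismatched inputs (for example a different scale or zero point) rather than to the formula itself.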
Here I input the same image, weight, and bias together with their zero_point and scale, then compare this "numpy simulated" result to the PyTorch calculated one.
quantized_model_out1_int8 = np.squeeze(features['features.0.0'].int_repr().numpy())
out1_np = QuantizedConv2D(input_scale, input_zero, input_quantized, output_scale, output_zero, weight_scale, weight_zero, weight_quantized, bias, 3, 2, 1, 112)
np.save("out1_np.npy",out1_np)
# clamp negative values to 0
for i in range(quantized_model_out1_int8.shape[0]):
    for j in range(quantized_model_out1_int8.shape[1]):
        for k in range(quantized_model_out1_int8.shape[2]):
            if(out1_np[i][j][k] < 0):
                out1_np[i][j][k] = 0
flag = np.zeros(quantized_model_out1_int8.shape)
for i in range(quantized_model_out1_int8.shape[0]):
    for j in range(quantized_model_out1_int8.shape[1]):
        for k in range(quantized_model_out1_int8.shape[2]):
            if(quantized_model_out1_int8[i][j][k] == out1_np[i][j][k]):
                flag[i][j][k] = 1
                out1_np[i][j][k] = 0
                quantized_model_out1_int8[i][j][k] = 0
# Compare the simulated result to extractor fetched result, gain the total hit rate
print('hit rate:', flag.sum() / flag.size)
If a "numpy simulated" value equals the extracted one, call it a hit. The total hit rate shows that numpy gets 92% of the values right. The problem is that I have no idea why the remaining 8% of the values come out wrong.
Comparison of two outcomes:
The picture below shows the values that differ between the Numpy result and the PyTorch result; the sample channel is index[1]. The Numpy result is in the upper left corner and the PyTorch result is in the upper right corner. I have set all values that agree to 0. As you can see, most of the values differ by just 1 (this can be viewed as the error caused by the precision loss of fixed-point arithmetic), but some differ by a lot, e.g. value[1][4], 121 vs. 76 (I don't know why).
Focus on one strange value:
This code replays the calculation of value[1][4]. I was expecting that trial and error would lead me to the number I wanted, 76, but no matter what I tried, it did not output 76. I paste the code here in case you want to try it.
#%% A test code to check the calculation process
weight_quantized_sample = weight_quantized[2]
M_t = input_scale * weight_scale / output_scale
ifmap_t = np.int32(input_quantized[:,1:4,7:10])
weight_t = np.int32(weight_quantized_sample)
bias_t = bias[2]
bias_q = bias_t/output_scale
res_t = 0
for ch in range(3):
    ifmap_offset = ifmap_t[ch]-np.int32(input_zero)
    weight_offset = weight_t[ch]-np.int32(weight_zero)
    res_ch = np.multiply(ifmap_offset, weight_offset)
    res_ch = res_ch.sum()
    res_t = res_t + res_ch
res_mul = M_t*res_t
# for n in range(1, 30):
# res_mul = multiply(n, M_t, res_t)
res_t = round(res_mul + output_zero + bias_q)
Could you help me out with this? I have been stuck here for a long time.
I implemented my own version of quantized convolution and got a hit rate between 99.999% and 100% (a single mismatched value is off by 1, which I consider a rounding issue). The paper linked in the question helped a lot.
But I found that your formulas are the same as mine, so I don't know what your issue was. As I understand it, quantization in PyTorch is hardware dependent.
Here is my code:
def my_Conv2dRelu_b2(input_q, conv_layer, output_shape):
    '''
    input_q: quantized tensor
    conv_layer: quantized convolution layer
    output_shape: the pre-computed shape of the result
    '''
    output = np.zeros(output_shape)
    # extract needed float numbers from quantized operations
    weights_scale = conv_layer.weight().q_per_channel_scales()
    input_scale = input_q.q_scale()
    weights_zp = conv_layer.weight().q_per_channel_zero_points()
    input_zp = input_q.q_zero_point()
    # extract needed convolution parameters
    padding = conv_layer.padding
    stride = conv_layer.stride
    # extract float numbers for results
    output_zp = conv_layer.zero_point
    output_scale = conv_layer.scale
    # cast to int32 so that subtracting a zero point cannot wrap around
    conv_weights_int = conv_layer.weight().int_repr().numpy().astype(np.int32)
    input_int = input_q.int_repr().numpy().astype(np.int32)
    biases = conv_layer.bias().detach().numpy()
    for k in range(input_q.shape[0]):
        for i in range(conv_weights_int.shape[0]):
            output[k][i] = manual_convolution_quant(
                input_int[k], conv_weights_int[i], biases[i], padding, stride,
                image_zp=input_zp, image_scale=input_scale,
                kernel_zp=weights_zp[i].item(), kernel_scale=weights_scale[i].item(),
                result_zp=output_zp, result_scale=output_scale
            )
    return output
def manual_convolution_quant(image, kernel, b, padding, stride, image_zp, image_scale, kernel_zp, kernel_scale,
                             result_zp, result_scale):
    H = image.shape[1]
    W = image.shape[2]
    new_H = H // stride[0]
    new_W = W // stride[1]
    results = np.zeros([new_H, new_W])
    M = image_scale * kernel_scale / result_scale
    bias = b / result_scale
    # pad with the input zero point, which represents the real value 0
    paddedIm = np.pad(
        image,
        [(0, 0), (padding[0], padding[0]), (padding[1], padding[1])],
        mode="constant",
        constant_values=image_zp,
    )
    s = kernel.shape[1]
    for i in range(new_H):
        for j in range(new_W):
            patch = paddedIm[
                :, i * stride[0]: i * stride[0] + s, j * stride[1]: j * stride[1] + s
            ]
            res = M * ((kernel - kernel_zp) * (patch - image_zp)).sum() + result_zp + bias
            if res < 0:
                res = 0
            results[i, j] = round(res)
    return results
Code to compare pytorch and my own version.
def calc_hit_rate(array1, array2):
    # number of positions where both arrays agree
    good = (array1 == array2).astype(int).sum()
    total = array1.size
    return good / total
# during inference
y2 = model.conv1(y1)
y2_int = torch.int_repr(y2)
y2_int_manual = my_Conv2dRelu_b2(y1, model.conv1, y2.shape)
print(f'y2 hit rate= {calc_hit_rate(y2.int_repr().numpy(), y2_int_manual)}') #hit_rate=1.0

Fitting a cross sectional surface profile to a generalized known formula to obtain coefficients and mathematically model the surface

I designed an optical system with an a-spheric surface profile. I then had this lens manufactured and measured. I was given a cross sectional graph from the measurement of the manufactured surface profile. (The surface holds rotational symmetry)
The formula being used to model said aspheric surface is:
How can I fit this generalized equation with my cross sectional curve to obtain corresponding alpha coefficients to the curve? (alpha coefficients are referring to those in the provided formula) I know the radius of curvature of the surface.
I have access to Python and Matlab (no toolboxes) to achieve this. I can also obtain digitized, tabulated data points from the curve.
Assuming you have an array of discrete r values and, for each of them, z(r), you want to fit a curve to estimate the parameters of an aspheric lens. I will use lmfit to show one way to do this in Python.
Importing the modules used for this:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model, Parameters
Define the function of an aspheric lens:
def asphere_complete(x, r0, k, a2, a4, a6, a8, a10, a12):
    r_squared = x ** 2.
    # even polynomial terms a2*r^2 + a4*r^4 + ... + a12*r^12 (Horner form)
    z_even_r = r_squared * (a2 + (r_squared * (a4 + r_squared * (a6 + r_squared * (a8 + r_squared * (a10 + (r_squared * a12)))))))
    square_root_term = 1 - (1 + k) * ((x / r0) ** 2)
    # conic base surface
    zg = (x ** 2) / (r0 * (1 + np.sqrt(square_root_term)))
    return z_even_r + zg
As you do not provide any data, I will use the following to create some example data, including artificial noise:
def generate_dummy_data(x, asphere_parameters, noise_sigma, seed=12345):
    # fix the seed so the artificial noise is reproducible
    np.random.seed(seed)
    return asphere_complete(x, **asphere_parameters) + noise_sigma * np.random.randn(x.shape[0])
The following function does the fitting and plots the resulting curve:
def fit_asphere(r, z, fit_parameters):
    # create two subplots to plot the original data and the fit in one plot and the residual in another
    fig, axarr = plt.subplots(1, 2, figsize=(10, 5))
    fit_plot = axarr[0]
    residuum_plot = axarr[1]
    # configure first plot:
    fit_plot.set_title("Data and fit")
    # configure second plot:
    residuum_plot.set_title("Residual")
    # plot original data
    fit_plot.plot(r, z, label="Input")
    # create an lmfit model and the parameters
    function_model = Model(asphere_complete)
    # The fitting procedure may throw ValueErrors, if the radicand gets negative
    try:
        result = function_model.fit(z, fit_parameters, x=r)
        # report the fitted parameters on the console
        print(result.fit_report())
        # To plot the resulting curve remove the parameters which were just used for the constraints
        opt_parameters = dict(result.values)
        opt_parameters.pop('r_max', None)
        opt_parameters.pop('radicand', None)
        # calculate z-values of fitted curve:
        z_fitted = asphere_complete(r, **opt_parameters)
        # calculate residual values
        z_residual = z - z_fitted
        # plot fit and residual:
        fit_plot.plot(r, z_fitted, label="Fit")
        residuum_plot.plot(r, z_residual, label="Residual")
        # legends:
        fit_plot.legend()
        residuum_plot.legend()
        plt.show()
    except ValueError as val_error:
        print("Fit Failed: ")
        print(val_error)
To set the parameters of the example data I use the Parameters object of lmfit:
if __name__ == "__main__":
    parameters_dummy = Parameters()
    parameters_dummy.add('r0', value=-34.4)
    parameters_dummy.add('k', value=-0.98)
    parameters_dummy.add('a2', value=0)
    parameters_dummy.add('a4', value=-9.67e-9)
    parameters_dummy.add('a6', value=1.59e-10)
    parameters_dummy.add('a8', value=-5.0e-12)
    parameters_dummy.add('a10', value=0)
    parameters_dummy.add('a12', value=-1.0e-19)
Create the example data:
r = np.linspace(0, 35, 1000)
z = generate_dummy_data(r, parameters_dummy, 0.00001)
The reason to use lmfit instead of scipy's curve_fit is that the radicand of the square root may become negative. We need to ensure:
1 - (1 + k) * (r / r0)**2 >= 0
Therefore, we need to define a constraint, which lmfit supports through parameter expressions.
Let's start to define our parameters we want to use in fitting. The basic radius is added straightforward:
parameters = Parameters()
parameters.add('r0', value=-30, vary=True)
To obey the inequality, add a variable radicand which is not allowed to become less than zero. Instead of letting k take part in the fit normally, make it directly dependent on r0, r_max and radicand. We need to use r_max because the inequality is most problematic for the maximal r. Setting radicand = 1 - (1 + k) * (r_max / r0)**2 and solving for k leads to
k = (r0 / r_max)**2 * (1 - radicand) - 1
which is used as expr below. I use a bool flag to switch the constraint on or off:
keep_radicand_safe = True
if keep_radicand_safe:
    r_max = np.max(r)
    parameters.add('r_max', r_max, vary=False)
    parameters.add('radicand', value=0.98, vary=True, min=0)
    parameters.add('k', expr='(r0/r_max)**2*(1-radicand)-1')
else:
    parameters.add('k', value=-0.98, vary=True)
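As a quick sanity check of the constraint expression, the following sketch (plain NumPy, with a hypothetical helper name `k_from_radicand`) computes k from r0, r_max and radicand and then recovers the radicand at r_max from that k. It only verifies the algebra and does not involve lmfit.

```python
import numpy as np

def k_from_radicand(r0, r_max, radicand):
    """Conic constant that makes 1 - (1 + k) * (r_max / r0)**2 equal to `radicand`."""
    return (r0 / r_max) ** 2 * (1.0 - radicand) - 1.0

r0, r_max = -34.4, 35.0
for radicand in (0.0, 0.5, 0.98):
    k = k_from_radicand(r0, r_max, radicand)
    recovered = 1.0 - (1.0 + k) * (r_max / r0) ** 2
    print(f"radicand={radicand:.2f}  k={k:.6f}  recovered={recovered:.6f}")

# For any r below r_max the radicand can only be larger, so it stays non-negative.
r = np.linspace(0.0, r_max, 200)
k = k_from_radicand(r0, r_max, 0.3)
print((1.0 - (1.0 + k) * (r / r0) ** 2).min())
```

Because the optimizer varies radicand with a lower bound of 0 rather than k itself, every trial value of k keeps the square root real over the whole data range.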
The remaining parameters are added straightforward:
parameters.add('a2', value=0, vary=False)
parameters.add('a4', value=0, vary=True)
parameters.add('a6', value=0, vary=True)
parameters.add('a8', value=0, vary=True)
parameters.add('a10', value=0, vary=False)
parameters.add('a12', value=0, vary=True)
Now we are ready to start and get our results:
fit_asphere(r, z, parameters)
On the console you should see the output:
r0: -34.3999435 +/- 6.1027e-05 (0.00%) (init = -30)
r_max: 35 (fixed)
radicand: 0.71508611 +/- 0.09385813 (13.13%) (init = 0.98)
k: -0.72477176 +/- 0.09066656 (12.51%) == '(r0/r_max)**2*(1-radicand)-1'
a2: 0 (fixed)
a4: 7.7436e-07 +/- 2.7872e-07 (35.99%) (init = 0)
a6: 2.5547e-10 +/- 6.3330e-11 (24.79%) (init = 0)
a8: -4.9832e-12 +/- 1.7115e-14 (0.34%) (init = 0)
a10: 0 (fixed)
a12: -9.8670e-20 +/- 2.0716e-21 (2.10%) (init = 0)
With the data I use above, you should see the fit fail if keep_radicand_safe is set to False.

How Can I change Theano gradients during backpropagation wrt the current output of the network?

I am trying to code up an example of the Inverting Gradient method from DEEP REINFORCEMENT LEARNING IN PARAMETERIZED ACTION SPACE (equation 11) in Lasagne/Theano. Basically what I am trying to do is ensure the output of the network is within some specified bounds, in this case [-1, 1].
I have been looking at an example that inverts the gradient, which has helped, but at this point I am stuck. I think the best place to perform this operation is in the gradient computation method, so I copied rmsprop and am trying to edit the gradients before the updates are applied.
This is what I have so far
def rmspropWithInvert(loss_or_grads, params, p, learning_rate=1.0, rho=0.9, epsilon=1e-6):
    clip = 2.0
    grads = lasagne.updates.get_or_compute_grads(loss_or_grads, params)
    # grads = theano.gradient.grad_clip(grads, -clip, clip)
    grads_ = []
    for grad in grads:
        grads_.append(theano.gradient.grad_clip(grad, -clip, clip) )
    grads = grads_
    a, p_ = T.scalars('a', 'p_')
    z_lazy = ifelse(T.gt(a,0.0), (1.0-p_)/(2.0), (p_-(-1.0))/(2.0))
    f_lazyifelse = theano.function([a,p_], z_lazy,
                                   mode=theano.Mode(linker='vm'))
    # compute the parameter vector to invert the gradients by
    ps = theano.shared(
        np.zeros((3, 1), dtype=theano.config.floatX),
        broadcastable=(False, True))
    for i in range(3):
        ps[i] = f_lazyifelse(grads[-1][i], p[i])
    # Apply vector through computed gradients
    grads2 = []
    for grad in grads.reverse():
        grads2.append(theano.mul(ps, grad))
        ps = grad
    grads = grads2.reverse()
    print "Grad Update: " + str(grads[0])
    updates = OrderedDict()
    # Using theano constant to prevent upcasting of float32
    one = T.constant(1)
    for param, grad in zip(params, grads):
        value = param.get_value(borrow=True)
        accu = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                             broadcastable=param.broadcastable)
        accu_new = rho * accu + (one - rho) * grad ** 2
        updates[accu] = accu_new
        updates[param] = param - (learning_rate * grad /
                                  T.sqrt(accu_new + epsilon))
    return updates
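The scaling rule itself is easier to see outside of a symbolic graph. The sketch below applies it to plain NumPy arrays. It mirrors the two branches of the ifelse expression above with bounds [-1, 1], assumes that a positive gradient entry means "increase the parameter", and uses a hypothetical function name. It says nothing about how to wire the rule into Theano updates.

```python
import numpy as np

def invert_gradients(grad, p, p_min=-1.0, p_max=1.0):
    """Scale each gradient entry by the remaining headroom of its parameter.

    grad > 0 is taken to mean "increase p": scale by (p_max - p) / (p_max - p_min).
    Otherwise:                               scale by (p - p_min) / (p_max - p_min).
    """
    grad = np.asarray(grad, dtype=float)
    p = np.asarray(p, dtype=float)
    width = p_max - p_min
    scale = np.where(grad > 0, (p_max - p) / width, (p - p_min) / width)
    return grad * scale

grad = np.array([0.5, 0.5, -0.5, 0.5])
p = np.array([0.0, 0.9, -0.9, 1.2])
print(invert_gradients(grad, p))
```

Gradients shrink as a parameter approaches the bound they push towards, and the last entry shows the inversion: once p exceeds p_max the scale becomes negative and the gradient changes sign.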
Maybe someone more skilled with Theano/Lasagne will see a solution? Conceptually I think the computation is easy but coding everything in the update step symbolically has proven challenging for me. I am still getting used to Theano.

How to monitor tensor values in Theano/Keras?

I know this question has been asked in various forms, but I can't really find any answer I can understand and use. So forgive me if this is a basic question, 'cause I'm a newbie to these tools(theano/keras)
Problem to Solve
Monitor variables in Neural Networks
(e.g. input/forget/output gate values in LSTM)
What I'm currently getting
no matter in which stage I'm getting those values, I'm getting something like :
[for{cpu,scan_fn}.2, Subtensor{int64::}.0, Subtensor{int64::}.0]
[for{cpu,scan_fn}.2, Subtensor{int64::}.0, Subtensor{int64::}.0]
Is there any way I can monitor (e.g. print to stdout, write to a file, etc.) them?
Possible Solution
Seems like callbacks in Keras can do the job, but it doesn't work either for me. I'm getting same thing as above
My Guess
Seems like I'm making very simple mistakes.
Thank you very much in advance, everyone.
Specifically, I'm trying to monitor input/forget/output gating values in LSTM.
I found that LSTM.step() is for computing those values:
def step(self, x, states):
    h_tm1 = states[0] # hidden state of the previous time step
    c_tm1 = states[1] # cell state from the previous time step
    B_U = states[2] # dropout matrices for recurrent units?
    B_W = states[3] # dropout matrices for input units?
    if self.consume_less == 'cpu': # just cut x into 4 pieces in columns
        x_i = x[:, :self.output_dim]
        x_f = x[:, self.output_dim: 2 * self.output_dim]
        x_c = x[:, 2 * self.output_dim: 3 * self.output_dim]
        x_o = x[:, 3 * self.output_dim:]
    else:
        x_i = K.dot(x * B_W[0], self.W_i) + self.b_i
        x_f = K.dot(x * B_W[1], self.W_f) + self.b_f
        x_c = K.dot(x * B_W[2], self.W_c) + self.b_c
        x_o = K.dot(x * B_W[3], self.W_o) + self.b_o
    i = self.inner_activation(x_i + K.dot(h_tm1 * B_U[0], self.U_i))
    f = self.inner_activation(x_f + K.dot(h_tm1 * B_U[1], self.U_f))
    c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1 * B_U[2], self.U_c))
    o = self.inner_activation(x_o + K.dot(h_tm1 * B_U[3], self.U_o))
    with open("test_visualization.txt", "a") as myfile:
        myfile.write(str(i) + "\n")
    h = o * self.activation(c)
    return h, [h, c]
And as it's in the code above, I tried to write the value of i into a file, but it only gave me values like :
[for{cpu,scan_fn}.2, Subtensor{int64::}.0, Subtensor{int64::}.0]
So I tried i.eval() or i.get_value(), but both failed to give me values.
.eval() gave me this:
theano.gof.fg.MissingInputError: An input of the graph, used to compute Subtensor{::, :int64:}(<TensorType(float32, matrix)>, Constant{10}), was not provided and not given a value.Use the Theano flag exception_verbosity='high',for more information on this error.
and .get_value() gave me this:
AttributeError: 'TensorVariable' object has no attribute 'get_value'
So I backtracked those chains(which line calls which functions..) and tried to get values at every steps I found but in vain.
Feels like I'm in some basic pitfalls.
I use the solution described in the Keras FAQ:
In detail:
from keras import backend as K
intermediate_tensor_function = K.function([model.layers[0].input],[model.layers[layer_of_interest].output])
intermediate_tensor = intermediate_tensor_function([thisInput])[0]
array([[ 3., 17.]], dtype=float32)
However I'd like to use the functional API but I can't seem to get the actual tensor, only the symbolic representation. For example:
<tf.Tensor 'add:0' shape=(?, 2) dtype=float32>
I'm missing something about the interaction of Keras and Tensorflow here but I'm not sure what. Any insight much appreciated.
One solution is to create a version of your network that is truncated at the LSTM layer whose gate values you want to monitor, and then replace the original layer with a custom layer in which the step function is modified to return not only the hidden layer values, but also the gate values.
For instance, say you want to access the gate values of a GRU. Create a custom layer GRU2 that inherits everything from the GRU class, but adapt the step function such that it returns a concatenation of the states you want to monitor, and then takes only the part containing the previous hidden layer activations when computing the next activations. I.e.:
def step(self, x, states):
    # get prev hidden layer from input that is concatenation of
    # prev hidden layer + reset gate + update gate
    x = x[:self.output_dim, :]
    # This is the original code from the GRU layer
    h_tm1 = states[0] # previous memory
    B_U = states[1] # dropout matrices for recurrent units
    B_W = states[2]
    if self.consume_less == 'gpu':
        matrix_x = K.dot(x * B_W[0], self.W) + self.b
        matrix_inner = K.dot(h_tm1 * B_U[0], self.U[:, :2 * self.output_dim])
        x_z = matrix_x[:, :self.output_dim]
        x_r = matrix_x[:, self.output_dim: 2 * self.output_dim]
        inner_z = matrix_inner[:, :self.output_dim]
        inner_r = matrix_inner[:, self.output_dim: 2 * self.output_dim]
        z = self.inner_activation(x_z + inner_z)
        r = self.inner_activation(x_r + inner_r)
        x_h = matrix_x[:, 2 * self.output_dim:]
        inner_h = K.dot(r * h_tm1 * B_U[0], self.U[:, 2 * self.output_dim:])
        hh = self.activation(x_h + inner_h)
    else:
        if self.consume_less == 'cpu':
            x_z = x[:, :self.output_dim]
            x_r = x[:, self.output_dim: 2 * self.output_dim]
            x_h = x[:, 2 * self.output_dim:]
        elif self.consume_less == 'mem':
            x_z = K.dot(x * B_W[0], self.W_z) + self.b_z
            x_r = K.dot(x * B_W[1], self.W_r) + self.b_r
            x_h = K.dot(x * B_W[2], self.W_h) + self.b_h
        else:
            raise Exception('Unknown `consume_less` mode.')
        z = self.inner_activation(x_z + K.dot(h_tm1 * B_U[0], self.U_z))
        r = self.inner_activation(x_r + K.dot(h_tm1 * B_U[1], self.U_r))
        hh = self.activation(x_h + K.dot(r * h_tm1 * B_U[2], self.U_h))
    h = z * h_tm1 + (1 - z) * hh
    # End of original code
    # concatenate states you want to monitor, in this case the
    # hidden layer activations and gates z and r
    all = K.concatenate([h, z, r])
    # return everything
    return all, [h]
(Note that the only lines I added are at the beginning and end of the function).
If you then run your network with GRU2 as last layer instead of GRU (with return_sequences = True for the GRU2 layer), you can just call predict on your network, this will give you all hidden layer and gate values.
The same thing should work for LSTM, although you might have to puzzle a bit to figure out how to store all the outputs you want in one vector and retrieve them again afterwards.
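Retrieving the individual pieces afterwards is a matter of slicing the last axis. A minimal sketch, assuming the prediction has shape (batch, timesteps, 3 * output_dim) with h, z and r concatenated in that order; the helper name is hypothetical and the array below merely stands in for the result of predict:

```python
import numpy as np

def split_monitored(outputs, output_dim):
    """Split the concatenated [h, z, r] along the last axis."""
    h = outputs[..., :output_dim]
    z = outputs[..., output_dim:2 * output_dim]
    r = outputs[..., 2 * output_dim:3 * output_dim]
    return h, z, r

# Stand-in for the network prediction with return_sequences=True:
# shape (batch, timesteps, 3 * output_dim).
output_dim = 4
outputs = np.arange(2 * 5 * 3 * output_dim, dtype=float).reshape(2, 5, 3 * output_dim)
h, z, r = split_monitored(outputs, output_dim)
print(h.shape, z.shape, r.shape)
```

For an LSTM the same idea would apply with more slices, one per gate or state that was concatenated in the step function.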
Hope that helps!
You can use theano's printing module for printing during execution (and not during definition, which is what you're doing and the reason why you're not getting values, but their abstract definition).
Just use the Print function. Don't forget to use the output of Print to continue your graph, otherwise the output will be disconnected and Print will most likely be removed during optimisation. And you will see nothing.
from keras import backend as K
from theano.printing import Print
def someLossFunction(x, ref):
    loss = K.square(x - ref)
    loss = Print('Loss tensor (before sum)')(loss)
    loss = K.sum(loss)
    loss = Print('Loss scalar (after sum)')(loss)
    return loss
A little bonus you might enjoy.
The Print class has a global_fn parameter to override the default print callback. You can provide your own function and access the data directly, to build a plot for instance.
from keras import backend as K
from theano.printing import Print
import matplotlib.pyplot as plt
curve = []
# the callback function
def myPlottingFn(printObj, data):
    global curve
    # Store scalar data
    curve.append(data)
    # Plot it
    fig, ax = plt.subplots()
    ax.plot(curve, label=printObj.message)
    plt.show()
def someLossFunction(x, ref):
    loss = K.sum(K.square(x - ref))
    # Callback is defined line below
    loss = Print('Loss scalar (after sum)', global_fn=myPlottingFn)(loss)
    return loss
BTW the string you passed to Print('...') is stored in the print object under property name message (see function myPlottingFn). This is useful for building multi-curves plot automatically