PyTorch LayerNorm vs nnabla layer_normalization -> outputs differ for the same normalized input size, why? - neural-network

PyTorch: layernorm_before = nn.LayerNorm(hidden_size) with an input size (normalized shape) of 768; the output torch size is 768.
nnabla: the input is (1, 197, 768); batch_axis=1 refers to the axis of size 197 and batch_axis=2 to the axis of size 768.
Scenario 1: layernorm = PF.layer_normalization(x, batch_axis=2). Here the size passed to nnabla is 768, but the output shape is (1, 197, 1).
Scenario 2: layernorm = PF.layer_normalization(x, batch_axis=1). Here the size passed to nnabla is 197, but the output shape is (1, 1, 768).
Scenario 2 matches the PyTorch output, but I expected scenario 1 to match it. Why is this the behaviour?
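For reference, a minimal PyTorch-only sketch (my addition, not nnabla code) showing which axis nn.LayerNorm(768) normalizes over, and why the per-token mean and variance have shape (1, 197, 1):
import torch
import torch.nn as nn

x = torch.randn(1, 197, 768)
ln = nn.LayerNorm(768)                         # same as layernorm_before above
ref = ln(x)                                    # output keeps the input shape (1, 197, 768)
mean = x.mean(dim=-1, keepdim=True)            # per-token statistics, shape (1, 197, 1)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias
print(torch.allclose(ref, manual, atol=1e-5))  # True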

Related

I want to use NumPy to simulate the inference process of a quantized MobileNet V2 network, but the outcome differs from the one produced by PyTorch.

Python version: 3.8
Pytorch version: 1.9.0+cpu
Platform: Anaconda Spyder5.0
To reproduce this problem, just copy all of the code below into a single file.
The ILSVRC2012_val_00000293.jpg file used in this code is shown below; you also need to download it and then change its path in the code.
Some background on this problem:
I am working on a project that aims to develop a hardware accelerator to complete the inference process of the MobileNet V2 network. I used a pretrained quantized PyTorch model to simulate the outcome, and the result came out very well.
In order to use hardware to complete this task, I need to know all inputs, outputs, and intermediate variables produced while running this piece of PyTorch code. I used a package named torchextractor to fetch the output of the first layer, which in this case is a 3x3 convolution layer.
import numpy as np
import torchvision
import torch
from torchvision import transforms, datasets
from PIL import Image
from torchvision import transforms
import torchextractor as tx
import math
#########################################################################################
##### Processing of input image
#########################################################################################
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
#image file destination
filename = "D:\Project_UM\MobileNet_VC709\MobileNet_pytorch\ILSVRC2012_val_00000293.jpg"
input_image = Image.open(filename)
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)
#########################################################################################
#########################################################################################
#########################################################################################
#----First verify that the torchextractor class should not influence the inference outcome
# ofmp of layer1 before putting into torchextractor
a,b,c = quantize_tensor(input_batch)  # quantize the input tensor and return an int8 tensor, scale and zero point (quantize_tensor is defined further below; define it before running this line)
input_qa = torch.quantize_per_tensor(torch.tensor(input_batch.clone().detach()), b, c, torch.quint8)  # using the quantize_per_tensor method of torch
# Load a quantized mobilenet_v2 model
model_quantized = torchvision.models.quantization.mobilenet_v2(pretrained=True, quantize=True)
model_quantized.eval()
with torch.no_grad():
    output = model_quantized.features[0][0](input_qa)  # ofmp of layer1, datatype: quantized tensor
# print("FM of layer1 before tx_extractor:\n", output.int_repr())  # ofmp of layer1, datatype: int8 tensor
output1_clone = output.int_repr().detach().numpy()  # clone ofmp of layer1, datatype: ndarray
#########################################################################################
#########################################################################################
#########################################################################################
# ofmp of layer1 after adding torchextractor
model_quantized_ex = tx.Extractor(model_quantized, ["features.0.0"])#Capture of the module inside first layer
model_output, features = model_quantized_ex(input_batch)# Forward propagation
# feature_shapes = {name: f.shape for name, f in features.items()}
# print(features['features.0.0']) # Ofmp of layer1, datatype : quantized_tensor
out1_clone = features['features.0.0'].int_repr().numpy() # Clone ofmp of layer1, datatype : ndarray
if np.array_equal(out1_clone, output1_clone):  # note: `out1_clone.all() == output1_clone.all()` would not compare element-wise
    print('Model with torchextractor attached outputs the same values as the original model')
else:
    print('The torchextractor method influences the outcome')
Here I define a NumPy quantization scheme based on the scheme proposed in the paper "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference".
# Convert a regular tensor to a quantized tensor with scale and zero_point
def quantize_tensor(x, num_bits=8):  # quantize the input tensor and return an int8 tensor, scale and zero point
    qmin = 0.
    qmax = 2.**num_bits - 1.
    min_val, max_val = x.min(), x.max()
    scale = (max_val - min_val) / (qmax - qmin)
    initial_zero_point = qmin - min_val / scale
    zero_point = 0
    if initial_zero_point < qmin:
        zero_point = qmin
    elif initial_zero_point > qmax:
        zero_point = qmax
    else:
        zero_point = initial_zero_point
    # print(zero_point)
    zero_point = int(zero_point)
    q_x = zero_point + x / scale
    q_x.clamp_(qmin, qmax).round_()
    q_x = q_x.round().byte()
    return q_x, scale, zero_point
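As a quick sanity check on this helper (my addition, not part of the original post), the returned scale and zero point should approximately reconstruct the input via x ≈ scale * (q_x - zero_point):
# rough round-trip check of quantize_tensor; the residual should be on the order of scale / 2
x_test = torch.randn(4, 4)
q_test, s_test, zp_test = quantize_tensor(x_test)
x_rec = s_test * (q_test.float() - zp_test)
print((x_test - x_rec).abs().max())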
#%%
# #############################################################################################
# --------- Simulate the inference process of layer0: conv33 using numpy
# #############################################################################################
# get the input_batch quantized buffer data
input_scale = b.item()
input_zero = c
input_quantized = a[0].detach().numpy()
# get the layer0 output scale and zero_point
output_scale = model_quantized.features[0][0].state_dict()['scale'].item()
output_zero = model_quantized.features[0][0].state_dict()['zero_point'].item()
# get the quantized weight with scale and zero_point
weight_scale = model_quantized.features[0][0].state_dict()["weight"].q_scale()
weight_zero = model_quantized.features[0][0].state_dict()["weight"].q_zero_point()
weight_quantized = model_quantized.features[0][0].state_dict()["weight"].int_repr().numpy()
# print(weight_quantized)
# print(weight_quantized.shape)
# bias_quantized,bias_scale,bias_zero= quantize_tensor(model_quantized.features[0][0].state_dict()["bias"])# to quantize the input tensor and return an int8 tensor, scale and zero point
# print(bias_quantized.shape)
bias = model_quantized.features[0][0].state_dict()["bias"].detach().numpy()
# print(input_quantized)
print(type(input_scale))
print(type(output_scale))
print(type(weight_scale))
Then I wrote a quantized 2D convolution using NumPy, hoping to figure out every detail of the PyTorch data flow during inference.
#%% numpy simulated layer0 convolution function define
def conv_cal(input_quantized, weight_quantized, kernel_size, stride, out_i, out_j, out_k):
    weight = weight_quantized[out_i]
    input = np.zeros((input_quantized.shape[0], kernel_size, kernel_size))
    for i in range(weight.shape[0]):
        for j in range(weight.shape[1]):
            for k in range(weight.shape[2]):
                input[i][j][k] = input_quantized[i][stride*out_j+j][stride*out_k+k]
    # print(np.dot(weight, input))
    # print(input, "\n")
    # print(weight)
    return np.multiply(weight, input).sum()
def QuantizedConv2D(input_scale, input_zero, input_quantized, output_scale, output_zero, weight_scale, weight_zero, weight_quantized, bias, kernel_size, stride, padding, ofm_size):
    output = np.zeros((weight_quantized.shape[0], ofm_size, ofm_size))
    input_quantized_padding = np.full((input_quantized.shape[0], input_quantized.shape[1]+2*padding, input_quantized.shape[2]+2*padding), 0)
    zero_temp = np.full(input_quantized.shape, input_zero)
    input_quantized = input_quantized - zero_temp
    for i in range(input_quantized.shape[0]):
        for j in range(padding, padding + input_quantized.shape[1]):
            for k in range(padding, padding + input_quantized.shape[2]):
                input_quantized_padding[i][j][k] = input_quantized[i][j-padding][k-padding]
    zero_temp = np.full(weight_quantized.shape, weight_zero)
    weight_quantized = weight_quantized - zero_temp
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            for k in range(output.shape[2]):
                # output[i][j][k] = (weight_scale*input_scale)*conv_cal(input_quantized_padding, weight_quantized, kernel_size, stride, i, j, k) + bias[i]  # floating-point output
                output[i][j][k] = weight_scale*input_scale/output_scale*conv_cal(input_quantized_padding, weight_quantized, kernel_size, stride, i, j, k) + bias[i]/output_scale + output_zero
                output[i][j][k] = round(output[i][j][k])
                # int output
    return output
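To summarize what QuantizedConv2D computes per output element (my paraphrase of the code above, following the paper's notation):
output[i][j][k] = round((input_scale * weight_scale / output_scale) * sum((q_input - input_zero) * (q_weight - weight_zero)) + bias[i] / output_scale + output_zero)
where the sum runs over the receptive field of that output position.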
Here I feed in the same image, weights, and bias together with their zero points and scales, then compare this "numpy simulated" result to the one calculated by PyTorch.
quantized_model_out1_int8 = np.squeeze(features['features.0.0'].int_repr().numpy())
print(quantized_model_out1_int8.shape)
print(quantized_model_out1_int8)
out1_np = QuantizedConv2D(input_scale, input_zero, input_quantized, output_scale, output_zero, weight_scale, weight_zero, weight_quantized, bias, 3, 2, 1, 112)
np.save("out1_np.npy",out1_np)
for i in range(quantized_model_out1_int8.shape[0]):
    for j in range(quantized_model_out1_int8.shape[1]):
        for k in range(quantized_model_out1_int8.shape[2]):
            if out1_np[i][j][k] < 0:
                out1_np[i][j][k] = 0
print(out1_np)
flag = np.zeros(quantized_model_out1_int8.shape)
for i in range(quantized_model_out1_int8.shape[0]):
    for j in range(quantized_model_out1_int8.shape[1]):
        for k in range(quantized_model_out1_int8.shape[2]):
            if quantized_model_out1_int8[i][j][k] == out1_np[i][j][k]:
                flag[i][j][k] = 1
                out1_np[i][j][k] = 0
                quantized_model_out1_int8[i][j][k] = 0
# Compare the simulated result to extractor fetched result, gain the total hit rate
print(flag.sum()/(112*112*32)*100, '%')
If the "numpy simulated" results are the same as the extracted one, call it a hit. Print the total hit rate, it shows that numpy gets 92% of the values right. Now the problem is, I have no idea why the rest 8% of values come out wrong.
Comparison of two outcomes:
The picture below shows the values that differ between the NumPy result and the PyTorch result; the sample channel is index [1]. The upper left corner is the NumPy one and the upper right corner is the PyTorch one. I have set all values that are the same in both to 0. As you can see, most values differ only by 1 (which can be viewed as the error introduced by the precision loss of fixed-point arithmetic), but some have large differences, e.g. the value [1][4], 121 vs. 76 (I don't know why).
Focus on one strange value:
This code replays the calculation of the value [1][4]. Originally I expected that a trial-and-error process could lead me to my wanted number of 76, but no matter what I tried, it never output 76. If you want to try this, I paste the code here for your convenience.
#%% A test code to check the calculation process
weight_quantized_sample = weight_quantized[2]
M_t = input_scale * weight_scale / output_scale
ifmap_t = np.int32(input_quantized[:,1:4,7:10])
weight_t = np.int32(weight_quantized_sample)
bias_t = bias[2]
bias_q = bias_t/output_scale
res_t = 0
for ch in range(3):
    ifmap_offset = ifmap_t[ch] - np.int32(input_zero)
    weight_offset = weight_t[ch] - np.int32(weight_zero)
    res_ch = np.multiply(ifmap_offset, weight_offset)
    res_ch = res_ch.sum()
    res_t = res_t + res_ch
res_mul = M_t * res_t
# for n in range(1, 30):
#     res_mul = multiply(n, M_t, res_t)
res_t = round(res_mul + output_zero + bias_q)
print(res_t)
Could you help me out of this? I have been stuck here for a long time.
I implemented my own version of quantized convolution and went from a 99.999% to a 100% hit rate (the single mismatching value was off by 1, which I consider a rounding issue). The link to the paper in the question helped a lot.
I found, however, that your formulas are the same as mine, so I don't know what your issue was. As I understand it, quantization in PyTorch is hardware dependent.
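To illustrate that last point, here is a small, hedged check of which quantized backend a given PyTorch build actually uses (generic PyTorch, not specific to the model in the question):
import torch
print(torch.backends.quantized.engine)             # backend used for quantized kernels, e.g. 'fbgemm' on x86 or 'qnnpack' on ARM
print(torch.backends.quantized.supported_engines)  # engines available in this build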
Here is my code:
def my_Conv2dRelu_b2(input_q, conv_layer, output_shape):
    '''
    Args:
        input_q: quantized input tensor
        conv_layer: quantized Conv2d layer
        output_shape: the pre-computed shape of the result
    Returns:
    '''
    output = np.zeros(output_shape)
    # extract needed float numbers from quantized operations
    weights_scale = conv_layer.weight().q_per_channel_scales()
    input_scale = input_q.q_scale()
    weights_zp = conv_layer.weight().q_per_channel_zero_points()
    input_zp = input_q.q_zero_point()
    # extract needed convolution parameters
    padding = conv_layer.padding
    stride = conv_layer.stride
    # extract float numbers for results
    output_zp = conv_layer.zero_point
    output_scale = conv_layer.scale
    conv_weights_int = conv_layer.weight().int_repr()
    input_int = input_q.int_repr()
    biases = conv_layer.bias().numpy()
    for k in range(input_q.shape[0]):
        for i in range(conv_weights_int.shape[0]):
            output[k][i] = manual_convolution_quant(
                input_int[k].numpy(),
                conv_weights_int[i].numpy(),
                biases[i],
                padding=padding,
                stride=stride,
                image_zp=input_zp, image_scale=input_scale,
                kernel_zp=weights_zp[i].item(), kernel_scale=weights_scale[i].item(),
                result_zp=output_zp, result_scale=output_scale
            )
    return output
def manual_convolution_quant(image, kernel, b, padding, stride, image_zp, image_scale, kernel_zp, kernel_scale,
                             result_zp, result_scale):
    H = image.shape[1]
    W = image.shape[2]
    new_H = H // stride[0]
    new_W = W // stride[1]
    results = np.zeros([new_H, new_W])
    M = image_scale * kernel_scale / result_scale
    bias = b / result_scale
    paddedIm = np.pad(
        image,
        [(0, 0), (padding[0], padding[0]), (padding[1], padding[1])],
        mode="constant",
        constant_values=image_zp,
    )
    s = kernel.shape[1]
    for i in range(new_H):
        for j in range(new_W):
            patch = paddedIm[
                :, i * stride[0]: i * stride[0] + s, j * stride[1]: j * stride[1] + s
            ]
            res = M * ((kernel - kernel_zp) * (patch - image_zp)).sum() + result_zp + bias
            if res < 0:
                res = 0
            results[i, j] = round(res)
    return results
Code to compare PyTorch and my own version:
def calc_hit_rate(array1, array2):
    good = (array1 == array2).astype(int).sum()  # builtin int instead of the deprecated np.int
    all = array1.size
    return good / all
# during inference
y2 = model.conv1(y1)
y2_int = torch.int_repr(y2)
y2_int_manual = my_Conv2dRelu_b2(y1, model.conv1, y2.shape)
print(f'y2 hit rate= {calc_hit_rate(y2.int_repr().numpy(), y2_int_manual)}')  # hit_rate=1.0
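One difference worth pointing out between the question's code and mine (an observation from reading both snippets, not something I verified on the asker's setup): I read the weight quantization parameters per output channel via q_per_channel_scales() / q_per_channel_zero_points(), while the question reads a single q_scale() / q_zero_point() for the whole weight tensor. A quick way to see which scheme a layer actually uses:
# hedged sketch: inspect how the weights of a quantized conv layer are quantized
w = model_quantized.features[0][0].weight()
print(w.qscheme())  # per_tensor_affine vs. per_channel_affine determines which scales/zero points apply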

N-dimensional GP Regression

I'm trying to use GPflow for a multidimensional regression. But I'm confused by the shapes of the mean and variance.
For example: a 2-dimensional input space X of shape (20,20) is supposed to be predicted. My training samples are of shape (8,2), which means 8 training samples overall across the two dimensions. The y-values are of shape (8,1), which of course means one ground-truth value per combination of the 2 input dimensions.
If I now use model.predict_y(X) I would expect to receive a mean of shape (20,20) but obtain a shape of (20,1). The same goes for the variance. I think this problem comes from the shape of the y-values, but I have no idea how to fix it.
bound = 3
num = 20
X = np.random.uniform(-bound, bound, (num,num))
print(X_sample.shape) # (8,2)
print(Y_sample.shape) # (8,1)
k = gpflow.kernels.RBF(input_dim=2)
m = gpflow.models.GPR(X_sample, Y_sample, kern=k)
m.likelihood.variance = sigma_n
m.compile()
gpflow.train.ScipyOptimizer().minimize(m)
mean, var = m.predict_y(X)
print(mean.shape) # (20, 1)
print(var.shape) # (20, 1)
It sounds like you may be confused between the shape of a grid of input positions and the shape of the numpy arrays: if you want to predict on a 20 x 20 grid in two dimensions, you have 400 points in total, each with 2 values. So X (the one that you pass to m.predict_y()) should have shape (400, 2). (Note that the second dimension needs to match that of X_sample!)
To construct this array of shape (400,2) you can use np.meshgrid (e.g., see What is the purpose of meshgrid in Python / NumPy?).
m.predict_y(X) only predicts the marginal variance at each test point, so the returned mean and var both have shape (400,1) (same length as X). You can of course reshape them to the 20 x 20 values on your grid.
(It is also possible to compute the full covariance, for the latent f this is implemented as m.predict_f_full_cov, which for X of shape (400,2) would return a 400x400 matrix. This is relevant if you want consistent samples from the GP, but I suspect that goes well beyond this question.)
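A minimal sketch of that workflow (variable names are illustrative, and m is assumed to be the fitted GPR model from the question):
# build a 20 x 20 prediction grid as a (400, 2) array, predict, then reshape back
xx, yy = np.meshgrid(np.linspace(-3, 3, 20), np.linspace(-3, 3, 20))
Xgrid = np.column_stack([xx.ravel(), yy.ravel()])  # shape (400, 2)
mean, var = m.predict_y(Xgrid)                     # each has shape (400, 1)
mean_grid = mean.reshape(xx.shape)                 # back to (20, 20) for plotting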
I was indeed making the mistake of not flattening the arrays, which in turn produced the error. Thank you for the fast response, STJ!
Here is an example of the working code:
# Generate data
bound = 3.
x1 = np.linspace(-bound, bound, num)
x2 = np.linspace(-bound, bound, num)
x1_mesh,x2_mesh = np.meshgrid(x1, x2)
X = np.dstack([x1_mesh, x2_mesh]).reshape(-1, 2)
z = f(x1_mesh, x2_mesh) # evaluation of the function on the grid
# Draw samples from feature vectors and function by a given index
size = 2
np.random.seed(1991)
index = np.random.choice(range(len(x1)), size=(size,X.ndim), replace=False)
samples = utils.sampleFeature([x1,x2], index)
X1_sample = samples[0]
X2_sample = samples[1]
X_sample = np.column_stack((X1_sample, X2_sample))
Y_sample = utils.samplefromFunc(f=z, ind=index)
# Change noise parameter
sigma_n = 0.0
# Construct models with initial guess
k = gpflow.kernels.RBF(2,active_dims=[0,1], lengthscales=1.0,ARD=True)
m = gpflow.models.GPR(X_sample, Y_sample, kern=k)
m.likelihood.variance = sigma_n
m.compile()
#print(X.shape)
mean, var = m.predict_y(X)
mean_square = mean.reshape(x1_mesh.shape) # Shape: (num,num)
var_square = var.reshape(x1_mesh.shape) # Shape: (num,num)
# Plot mean
fig = plt.figure(figsize=(16, 12))
ax = plt.axes(projection='3d')
ax.plot_surface(x1_mesh, x2_mesh, mean_square, cmap=cm.viridis, linewidth=0.5, antialiased=True, alpha=0.8)
cbar = ax.contourf(x1_mesh, x2_mesh, mean_square, zdir='z', offset=offset, cmap=cm.viridis, antialiased=True)
ax.scatter3D(X1_sample, X2_sample, offset, marker='o',edgecolors='k', color='r', s=150)
fig.colorbar(cbar)
for t in ax.zaxis.get_major_ticks(): t.label.set_fontsize(fontsize_ticks)
ax.set_title("$\mu(x_1,x_2)$", fontsize=fontsize_title)
ax.set_xlabel("\n$x_1$", fontsize=fontsize_label)
ax.set_ylabel("\n$x_2$", fontsize=fontsize_label)
ax.set_zlabel('\n\n$\mu(x_1,x_2)$', fontsize=fontsize_label)
plt.xticks(fontsize=fontsize_ticks)
plt.yticks(fontsize=fontsize_ticks)
plt.xlim(left=-bound, right=bound)
plt.ylim(bottom=-bound, top=bound)
ax.set_zlim3d(offset,np.max(z))
which leads to the plot below (red dots are the sample points drawn from the function). Note: code not refactored whatsoever :)

Deploy and use LeNet on Matlab using MatCaffe

I am having an issue with MatCaffe. I trained LeNet with my own dataset (2 classes, 0 or 1) in Python successfully and am trying to deploy it on MATLAB now. The net architecture is from caffe/examples/mnist/lenet.prototxt. All the input images I feed into the net always return 1 (I tried both positive and negative images from training).
Below is my code:
deployNet = 'lenet_deploy.prototxt';
caffeModel = 'weight.caffemodel';
caffe.set_mode_cpu();
net = caffe.Net(deployNet, caffeModel, 'test');
net.blobs('data').reshape([28 28 1 1]);
net.reshape();
patch_data = imread('cropped.jpg'); % already in greyscale
patch_data = imresize(patch_data, [28 28],'bilinear');
imshow(patch_data)
input_data = {patch_data};
scores = net.forward(input_data);
highest = max(scores{1});
disp(i);
disp(highest);
The highest score is always 1, even for negative images. I tried deploying it in Python and it works great. I am guessing the issue is with the way I pre-process the input.
Found out the issue. I forgot to multiply the image by the training scale and to transpose the width and height: since MATLAB is 1-indexed and column-major, the usual 4 blob dimensions in MATLAB are [width, height, channels, num], and width is the fastest dimension. So just add in 2 more lines of code:
deployNet = 'lenet_deploy.prototxt';
caffeModel = 'weight.caffemodel';
caffe.set_mode_cpu();
net = caffe.Net(deployNet, caffeModel, 'test');
net.blobs('data').reshape([28 28 1 1]);
net.reshape();
patch = imread('cropped.jpg'); % already in greyscale
patch = single(patch) * 0.00390625; % multiply with the training scale (0.00390625 = 1/256)
patch = permute(patch, [2,1,3]); %permute width and height
input_data = {patch};
scores = net.forward(input_data);
highest = scores{1};

Predicting same values for entire test Set in MATLAB using LIBSVM

I am using Support Vector Regression(SVR) in libsvm package to predict outputs. Kernel : RBF
Train set size : 729x40
Test set size : 137x40
The output on the train set seems fine when measured against the ground truth, but the predictions on the test set are all the same; it outputs the same value for every sample.
After checking the related posts, I normalized the data and played with the values of gamma (10-100000), but the problem still persists.
trainGT=games(((games(:,46)>=2010) & (games(:,46)<2015) & (games(:,1)~=8)),43);
featuresTrain=lastGame(games,true,1);
testGT=games((games(:,46)>=2015 & (games(:,1)~=8)),43);
featureTest=lastGame(games,false,1);
eval(['model = svmtrain( trainGT, featuresTrain,''-s 4 -t 2 -c 10 -g 10 ' ''');']);
w = (model.sv_coef' * full(model.SVs));
b = -model.rho;
predictionsTrain = svmpredict(trainGT, featuresTrain,model);
predictionsTest = svmpredict(zeros(length(testGT),1), featureTest, model);
My output is as follows
optimization finished, #iter = 1777
epsilon = 0.630588
obj = -19555.036253, rho = -17.470386
nSV = 681, nBSV = 118
Mean squared error = 305.214 (regression)
Squared correlation coefficient = -1.#IND (regression)
All my predictionTest values are 17.4704 (which is the same as -rho, i.e. the bias term b, in the output). Can someone please help me with this? Thanks.

image size using imread and size do not match - matlab

Why, when I read an image with im = imread('pears.png'), do I see 486x732x3 uint8 in the workspace, but when I run [height, width] = size(im) I get height 486 and width 2196? When I read m83.tif everything is fine. Is it related to color? Which types of file will be multiplied by 3, then?
Following the documentation of imread:
A = imread(FILENAME,FMT) ...
If the file contains a grayscale image, A is an M-by-N array
If the file contains a truecolor image, A is an M-by-N-by-3 array.
Thus, what you want are only the first two dimensions. However, since im is three-dimensional, size returns the size of a reshaped array when you query only two output arguments: you will see that your second dimension is the product size(im,2)*size(im,3).
Just use
[height, width, ~] = size(im)
This will query three dimensions but drop the third output parameter. Also using:
height = size(im,1);
width = size(im,2);
will work and is a safe solution.
See the following code for more clarification:
>> im = ones([40,20,3]);
>> [h,w] = size(im)
h =
40
w =
60
>> [h,w,~] = size(im)
h =
40
w =
20
>> h = size(im,1)
h =
40
>> w = size(im,2)
w =
20