Padding with even kernel size in a convolutional layer in Keras (Theano) - neural-network

I need to now how data is padded in a 1d convolutional layer using Keras with Theano as backend. I use a "same" padding.
Assuming we have an output_length of 8 and a kernel_size of 4. According to the original Keras code we have padding of 8//4 == 2. However, when adding two zeros at the left and the right end of my horizontal data, I could compute 9 convolutions instead of 8.
Can somebody explain me how data is padded? Where are zeros added and how do I compute the number of padding values on the right and left side of my data?

How to test the way keras pads the sequences:
A very simple test you can do is to create a model with a single convolutional layer, enforce its weights to be 1 and its biases to be 0, and give it an input with ones to see the output:
from keras.layers import *
from keras.models import Model
import numpy as np
#creating the model
inp = Input((8,1))
out = Conv1D(filters=1,kernel_size=4,padding='same')(inp)
model = Model(inp,out)
#adjusting the weights
ws = model.layers[1].get_weights()
ws[0] = np.ones(ws[0].shape) #weights
ws[1] = np.zeros(ws[1].shape) #biases
model.layers[1].set_weights(ws)
#predicting the result for a sequence with 8 elements
testData=np.ones((1,8,1))
print(model.predict(testData))
The output of this code is:
[[[ 2.] #a result 2 shows only 2 of the 4 kernel frames were activated
[ 3.] #a result 3 shows only 3 of the 4 kernel frames were activated
[ 4.] #a result 4 shows the full kernel was used
[ 4.]
[ 4.]
[ 4.]
[ 4.]
[ 3.]]]
So we can conclude that:
Keras adds the padding before performing the convolutions, not after. So the results are not "zero".
Keras distributes the padding equally, and when there is an odd number, it goes first.
So, it made the input data look like this before applying the convolutions
[0,0,1,1,1,1,1,1,1,1,0]

Related

Problem understanding Loss function behavior using Flux.jl. in Julia

So. First of all, I am new to Neural Network (NN).
As part of my PhD, I am trying to solve some problem through NN.
For this, I have created a program that creates some data set made of
a collection of input vectors (each with 63 elements) and its corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # ensamble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say)Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector is normalized to unity (represents x,y,z coordinates), and the following 60 numbers are also normalized to unity and corresponds to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
Flux.train!(loss,ps,datatrain,opt);
push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly, however, it comes to my attention that as I train my model with an increasing data set(Nₜᵣ = 10,15,25, etc...), the loss function seams to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening?. I can not see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tryed changing the activ. functions
I have not tryed changing the optimization method
Considerations: I need a training data set of near 10000 input vectors, and so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly?. Say, If every single data vector is made of 63 numbers, is it correctly to group them in an array? and then pile them into an ´´´Array{Array{Float64,1},1}´´´?. I have no experience using NN and flux. How can I made a data set of 10000 I/O vectors differently? Can this be the issue?. (I am very inclined to this)
Can this behavior be related to the chosen act. functions? (I am not inclined to this)
Can this behavior be related to the opt. algorithm? (I am not inclined to this)
Am I training my model wrong?. Is the iteration loop really iterations or are they epochs. I am struggling to put(differentiate) this concept of "epochs" and "iterations" into practice.
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function by doing something like using the mean squared cost.

non-linear neural network regression - quadratic function is not being estimated correctly

I have mostly used ANNs for classification and only recently started to try them out for modeling continuous variables. As an exercise I generated a simple set of (x, y) pairs where y = x^2 and tried to train an ANN to learn this quadratic function.
The ANN model:
This ANN has 1 input node (ie. x), 2 hidden layers each with 2 nodes in each layer, and 1 output node. All four hidden nodes use the non-linear tanh activation function and the output node has no activation function (since it is regression).
The Data:
For the training set I randomly generated 100 numbers between (-20, 20) for x and computed y=x^2. For the testing set I randomly generated 100 numbers between (-30, 30) for x and also computed y=x^2. I then transformed all x so that they are centered around 0 and their min and max are approximately around -1.5 and 1.5. I also transformed all y similarly but made their min and max about -0.9 and 0.9. This way, all the data falls within that mid range of the tanh activation function and not way out at the extremes.
The Problem:
After training the ANN in Keras, I am seeing that only the right half of the polynomial function is being learned, and the left half is completely flat. Does anyone have any ideas why this may be happening? I tried playing around with different scaling options, as well as hidden layer specifications but no luck on that left side.
Thanks!
Attached is the code I used for everything and the image shows the plot of the scaled training x vs the predicted y. As you can see, only half of the parabola is recovered.
import numpy as np, pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
seed = 10
n = 100
X_train = np.random.uniform(-20, 20, n)
Y_train = X_train ** 2
X_test = np.random.uniform(-30, 30, n)
Y_test = X_test ** 2
#### Scale the data
x_cap = max(abs(np.array(list(X_train) + list(X_test))))
y_cap = max(abs(np.array(list(Y_train) + list(Y_test))))
x_mean = np.mean(np.array(list(X_train) + list(X_test)))
y_mean = np.mean(np.array(list(Y_train) + list(Y_test)))
X_train2 = (X_train-x_mean) / x_cap
X_test2 = (X_test-x_mean) / x_cap
Y_train2 = (Y_train-y_mean) / y_cap
Y_test2 = (Y_test-y_mean) / y_cap
X_train2 = X_train2 * (1.5 / max(X_train2))
Y_train2 = Y_train2 * (0.9 / max(Y_train2))
# define base model
def baseline_model1():
# create model
model1 = Sequential()
model1.add(Dense(2, input_dim=1, kernel_initializer='normal', activation='tanh'))
model1.add(Dense(2, input_dim=1, kernel_initializer='normal', activation='tanh'))
model1.add(Dense(1, kernel_initializer='normal'))
# Compile model
model1.compile(loss='mean_squared_error', optimizer='adam')
return model1
np.random.seed(seed)
estimator1 = KerasRegressor(build_fn=baseline_model1, epochs=100, batch_size=5, verbose=0)
estimator1.fit(X_train2, Y_train2)
prediction = estimator1.predict(X_train2)
plt.scatter(X_train2, prediction)
enter image description here
You should also consider adding more width to you hidden layer. I changed from 2 to 5 and got a very good fit. I also used more epochs as suggested from rvinas
Your network is very sensible to the initial parameters. The following will help:
Change your kernel_initializer to glorot_uniform. Your network is very small and glorot_uniform will work better in consonance with the tanh activations. Glorot uniform will encourage your weights to be initially within a more reasonable range (since it takes into account the fan-in and fan-out of each layer).
Train your model for more epochs (i.e. 1000).

How to use groups parameter in PyTorch conv2d function

I am trying to compute a per-channel gradient image in PyTorch. To do this, I want to perform a standard 2D convolution with a Sobel filter on each channel of an image. I am using the torch.nn.functional.conv2d function for this
In my minimum working example code below, I get an error:
import torch
import torch.nn.functional as F
filters = torch.autograd.Variable(torch.randn(1,1,3,3))
inputs = torch.autograd.Variable(torch.randn(1,3,10,10))
out = F.conv2d(inputs, filters, padding=1)
RuntimeError: Given groups=1, weight[1, 1, 3, 3], so expected
input[1, 3, 10, 10] to have 1 channels, but got 3 channels instead
This suggests that groups need to be 3. However, when I make groups=3, I get a different error:
import torch
import torch.nn.functional as F
filters = torch.autograd.Variable(torch.randn(1,1,3,3))
inputs = torch.autograd.Variable(torch.randn(1,3,10,10))
out = F.conv2d(inputs, filters, padding=1, groups=3)
RuntimeError: invalid argument 4: out of range at
/usr/local/src/pytorch/torch/lib/TH/generic/THTensor.c:440
When I check that code snippet in the THTensor class, it refers to a bunch of dimension checks, but I don't know where I'm going wrong.
What does this error mean? How can I perform my intended convolution with this conv2d function? I believe I am misunderstanding the groups parameter.
If you want to apply a per-channel convolution then your out-channel should be the same as your in-channel. This is expected, considering each of your input channels creates a separate output channel that it corresponds to.
In short, this will work
import torch
import torch.nn.functional as F
filters = torch.autograd.Variable(torch.randn(3,1,3,3))
inputs = torch.autograd.Variable(torch.randn(1,3,10,10))
out = F.conv2d(inputs, filters, padding=1, groups=3)
whereas, filters of size (2, 1, 3, 3) or (1, 1, 3, 3) will not work.
Additionally, you can also make your out-channel a multiple of in-channel. This works for instances where you want to have multiple convolution filters for each input channel.
However, This only makes sense if it is a multiple. If not, then pytorch falls back to its closest multiple, a number less than what you specified. This is once again expected behavior. For example a filter of size (4, 1, 3, 3) or (5, 1, 3, 3), will result in an out-channel of size 3.

Using Keras LSTM to predict a single example after using batch training

I have a network model that is trained using batch training. Once it is trained, I want to predict the output for a single example.
Here is my model code:
model = Sequential()
model.add(Dense(32, batch_input_shape=(5, 1, 1)))
model.add(LSTM(16, stateful=True))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
I have a sequence of single inputs to single outputs. I'm doing some test code to map characters to next characters (A->B, B->C, etc).
I create an input data of shape (15,1,1) and an output data of shape (15, 1) and call the function:
model.fit(x, y, nb_epoch=epochs, batch_size=5, shuffle=False, verbose=0)
The model trains, and now I want to take a single character and predict the next character (input A, it predicts B). I create an input of shape (1, 1, 1) and call:
pred = model.predict(x, batch_size=1, verbose=0)
This gives:
ValueError: Shape mismatch: x has 5 rows but z has 1 rows
I saw one solution was to add "dummy data" to your predict values, so the input shape for the prediction would be (5,1,1) with data [x 0 0 0 0] and you would just take the first element of the output as your value. However, this seems inefficient when dealing with larger batches.
I also tried to remove the batch size from the model creation, but I got the following message:
ValueError: If a RNN is stateful, a complete input_shape must be provided (including batch size).
Is there another way? Thanks for the help.
Currently (Keras v2.0.8) it takes a bit more effort to get predictions on single rows after training in batch.
Basically, the batch_size is fixed at training time, and has to be the same at prediction time.
The workaround right now is to take the weights from the trained model, and use those as the weights in a new model you've just created, which has a batch_size of 1.
The quick code for that is
model = create_model(batch_size=64)
mode.fit(X, y)
weights = model.get_weights()
single_item_model = create_model(batch_size=1)
single_item_model.set_weights(weights)
single_item_model.compile(compile_params)
Here's a blog post that goes into more depth:
https://machinelearningmastery.com/use-different-batch-sizes-training-predicting-python-keras/
I've used this approach in the past to have multiple models at prediction time- one that makes predictions on big batches, one that makes predictions on small batches, and one that makes predictions on single items. Since batch predictions are much more efficient, this gives us the flexibility to take in any number of prediction rows (not just a number that is evenly divisible by batch_size), while still getting predictions pretty rapidly.
#ClimbsRocks showed a nice workaround. I cannot provide a "correct" answer in sense of "this is how Keras intends it to be done", but I can share another workaround which might help somebody depending on the use-case.
In this workaround I use predict_on_batch(). This method allows to pass a single sample out of a batch without throwing an error. Unfortunately, it returns a vector in the shape the target has according to the training-settings. However, each sample in the target yields then the prediction for your single sample.
You can access it like this:
to_predict = #Some single sample that would be part of a batch (has to have the right shape)#
model.predict_on_batch(to_predict)[0].flatten() #Flatten is optional
The result of the prediction is exactly the same as if you would pass an entire batch to predict().
Here some cod-example.
The code is from my question which also deals with this issue (but in a sligthly different manner).
sequence_size = 5
number_of_features = 1
input = (sequence_size, number_of_features)
batch_size = 2
model = Sequential()
#Of course you can replace the Gated Recurrent Unit with a LSTM-layer
model.add(GRU(100, return_sequences=True, activation='relu', input_shape=input, batch_size=2, name="GRU"))
model.add(GRU(1, return_sequences=True, activation='relu', input_shape=input, batch_size=batch_size, name="GRU2"))
model.compile(optimizer='adam', loss='mse')
model.summary()
#Summary-output:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
GRU (GRU) (2, 5, 100) 30600
_________________________________________________________________
GRU2 (GRU) (2, 5, 1) 306
=================================================================
Total params: 30,906
Trainable params: 30,906
Non-trainable params: 0
def generator(data, batch_size, sequence_size, num_features):
"""Simple generator"""
while True:
for i in range(len(data) - (sequence_size * batch_size + sequence_size) + 1):
start = i
end = i + (sequence_size * batch_size)
yield data[start : end].reshape(batch_size, sequence_size, num_features), \
data[end - ((sequence_size * batch_size) - sequence_size) : end + sequence_size].reshape(batch_size, sequence_size, num_features)
#Task: Predict the continuation of a linear range
data = np.arange(100)
hist = model.fit_generator(
generator=generator(data, batch_size, sequence_size, num_features),
steps_per_epoch=total_batches,
epochs=200,
shuffle=False
)
to_predict = np.asarray([[np.asarray([x]) for x in range(95,100,1)]]) #Only single element of a batch
correct = np.asarray([100,101,102,103,104])
print( model.predict_on_batch(to_predict)[0].flatten() )
#Output:
[ 99.92908 100.95854 102.32129 103.28584 104.20213 ]

(matlab) MLP with relu and softmax not working with mini-batch SGD and produces similar predictions on MNIST dataset

I implemented a multilayer perceptron with 1 hidden layer on MNIST dataset. The activation function in hidden layer is leaky(0.01) ReLu and output layer has a softmax activation function. The learning method is mini-batch SGD. The network structure is 784*30*10. The problem is I found the predictions the network made, for each input sample, are quite similar. That means the model would always like to think the image is some certain number. Thanks #Lemm Ras for pointing out the label-data mismatching problem in previous data_shuffle function and now fixed. But after some batch training, I found the predictions are still some kind of similar: That's confusing.
Another issue is the update value is too small comparing with original weight, in the MLP code, I add variable 'cc' and 'dd' to record the ratio between their weight_update and weight,
cc=W_OUTPUT_Update./W_OUTPUT;
dd=W_MLP_Update./W_MLP;
During debugging, the magnitude for cc is 10^-4(0.0001) and dd is also 10^-4. This might be the reason that the accuracy doesn't seems improved a lot.
After several days debugging. I have no idea why that happens and how to solve it, it made me stuck for one week. Can someone help me please?
The screenshot is the value of A2 after softmax function.
[dimension, images, labels, labels_matrix, train_amount, test_labels_matrix, test_images, test_labels, test_amount] = load_mnist_data(); %initialize str
images=images(:,1:10000); % for debugging, get part of whole data set
labels=labels(1:10000,1);
labels_matrix=labels_matrix(:,1:10000);
test_images=test_images(:,1:500);
test_labels=test_labels(1:500,1);
train_amount=10000;
test_amount=500;
% initialize the structure
[ W_MAD, W_MLP, W_OUTPUT] = initialize_structure(dimension, train_amount, test_amount);
epoch=100;
correct_rate=zeros(1,epoch); %record testing accuracy
corr=zeros(1,epoch); %record training accuracy
lr=0.2;
lamda=0;
batch_size=50;
for i=1:epoch
sprintf('MLP in iteration %d over %d', i, epoch)
%shuffle data
[labels_shuffled labels_matrix_shuffled images_shuffled]=shuffle_data(labels, labels_matrix,images);
[ cor, W_MLP, W_OUTPUT ] = train_mlp_relu(lr, leaky, lamda, momentum_gamma, batch_size,W_MLP, W_OUTPUT, W_MAD, power, images_shuffled, train_amount, labels_shuffled, labels_matrix_shuffled);
corr(i)=cor/train_amount;
% test
correct_rate(i) = structure_test( W_MAD, W_MLP, W_OUTPUT, test_images, test_labels, test_amount );
end
% plot results
plot(1:epoch,correct_rate);
Here's the training MLP function, please ignore L2 regularization parameter lamda which is currently set as 0.
%MLP with batch size batch_size
cor=0;
%leaky=(1/batch_size);
leaky=0.001;
for i=1:train_amount/batch_size
batch_images=images(:,batch_size*(i-1)+1:batch_size*i);
batch_labels=labels_matrix(:,batch_size*(i-1)+1:batch_size*i);
%from MAD to MLP
V1=W_MLP'*batch_images;
V1(1,:)=1; %set bias unit as 1
V1_dirivative=ones(size(V1));
V1_dirivative(find(V1<0))=leaky;
A1=relu(V1,leaky); % A stands for activation
V2=W_OUTPUT'* A1;
A2=softmax(V2);
%write these scope control codes into functions.
%train error
[val idx]=max(A2);
idx=idx-1; %because index(idx) for matrix vaires from 1 to 10 while label varies from 0 to 9.
res=labels(batch_size*(i-1)+1:batch_size*i)-idx';
cor=cor+sum(res(:)==0);
%softmax loss, due to relu applied nodes that has
%contribution to activate neurons has gradient 1; while <0 nodes
%has no contribution
delta_softmax=-(1/batch_size)*(batch_labels-A2);
delta_output=W_OUTPUT*delta_softmax.*V1_dirivative;
%update
W_OUTPUT_Update=lr*(1/batch_size)*A1*delta_softmax'+lamda*W_OUTPUT;
cc=W_OUTPUT_Update./W_OUTPUT;
W_MLP_Update=lr*(1/batch_size)*batch_images*delta_output'+lamda*W_MLP;
dd=W_MLP_Update./W_MLP;
k=mean(A2,2);
W_OUTPUT=W_OUTPUT-W_OUTPUT_Update;
W_MLP=W_MLP-W_MLP_Update;
end
end
Here is the softmax function:
function [ val ] = softmax( val )
val=exp(val);
val=val./repmat(sum(val),10,1);
end
The labels_matrix is the aimed output matrix for A2 and created as:
labels_matrix=full(sparse(labels+1,1:train_amount,1));
test_labels_matrix=full(sparse(test_labels+1,1:test_amount,1));
And Relu:
function [ val ] = relu( val,leaky )
val(find(val<0))=leaky*val(find(val<0));
end
Data shuffle
%this version is wrong, due to it only shuffles label and data without doing the same shuffling on the 'labels_matrix' which is used to calculate MLP's delta in output layer. It destroyed the link between data and label.
% function [ label, data ] = shuffle_data( label, data )
% [row column]=size(data);
% array=randperm(column);
% data=data(:,array);
% label=label(array);
% %if shuffle respect to row then use the code below
% %data=data(randperm(row),:);
% end
function [ label, label_matrix, data ] = shuffle_data( label, label_matrix, data )
[row column]=size(data);
array=randperm(column);
data=data(:,array);
label=label(array);
label_matrix=label_matrix(:, array);
%if shuffle respect to row then use the code below
%data=data(randperm(row),:);
end
Data loading:
function [ dimension, images, labels, labels_matrix, train_amount, test_labels_matrix, test_images, test_labels, test_amount] = load_mnist_data()
%%load training and testing data, labels
data_location='C:\Users\yz39g15\Documents\MATLAB\common\mnist test\for the report/modify/train-images.idx3-ubyte';
label_location='C:\Users\yz39g15\Documents\MATLAB\common\mnist test\for the report/modify/train-labels.idx1-ubyte';
test_data_location='C:\Users\yz39g15\Documents\MATLAB\common\mnist test\for the report/modify/t10k-images.idx3-ubyte';
test_label_location='C:\Users\yz39g15\Documents\MATLAB\common\mnist test\for the report/modify/t10k-labels.idx1-ubyte';
images = loadMNISTImages(data_location);
labels = loadMNISTLabels(label_location);
test_images=loadMNISTImages(test_data_location);
test_labels=loadMNISTLabels(test_label_location);
%%data centralization
[dimension train_amount]=size(images);
[dimension test_amount]=size(test_images);
%%complete normalization
%%transform labels from index to matrix in order to apply square loss function in output layer
labels_matrix=full(sparse(labels+1,1:train_amount,1));
test_labels_matrix=full(sparse(test_labels+1,1:test_amount,1));
end
When you are shuffling the images, the association data-label is lost. Since this association must survive, what you need is to enforce the same shuffling for both data and labels.
In order to do so you could, for instance, Create an external shuffled index list: shuffled=randperm(N), with N the number of images and then pass to the train method either the list created or the elements of images and label addressed by the shuffled list.