I'm making a neural network that will classify multiple Bluetooth RSSI values into certain locations.
For example, this is the normalized RSSI input from multiple Bluetooth receivers:
[0.1, 0.6, 0, 0, 0, 0, 0, 0, 0.8, 0.6599999999999999, 0.9, 0.36317567567567566]
And this would be the output, classifying it into a location:
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
This is the model:
from keras.models import Sequential
from keras.layers import Dense

# create model
model = Sequential()
# Input layer has 12 inputs and 24 neurons
model.add(Dense(24, input_dim=12, kernel_initializer='normal', activation='relu'))
# Output layer has 11 outputs
model.add(Dense(11, kernel_initializer='uniform', activation='sigmoid'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
I train the network with 3226 training samples, and test it afterwards with 263 extra samples.
After the first 50 epochs the training samples have an accuracy of about 85%. The test samples are then about 74% accurate.
But if I continue training, the accuracy of the test samples actually goes down. 50 more epochs of training will result in 88% accuracy on the training samples and 62% accuracy on the test samples.
I've tried multiple objective functions, but the result is the same: the more I train, the worse it gets.
Could this be because the loss function tries to be too binary, perhaps?
Is there a loss function available that would positively reward the result if the correct class just has the highest value of them all?
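For what it's worth, a test accuracy that falls while training accuracy keeps rising is usually addressed with regularization and early stopping rather than a different loss, and for a one-hot location target a softmax output with categorical cross-entropy already rewards the correct class having the highest value. A minimal sketch under those assumptions (X and Y stand for the training arrays; the dropout rate and patience are arbitrary choices, not part of the original code):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(24, input_dim=12, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.5))                      # arbitrary rate; regularizes against overfitting
model.add(Dense(11, activation='softmax'))   # the 11 outputs form a probability distribution
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# stop training when the held-out loss stops improving instead of running a fixed epoch count
early_stop = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X, Y, validation_split=0.1, epochs=200, batch_size=32, callbacks=[early_stop])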
Related
My training data shapes are as follows:
x_train (5000, 300)
y_train (5000, 500)
So I am using 300 data points to predict 500 data points, and there are 5000 sets for training.
Using an ANN model to make the prediction is straightforward:
model = Sequential()
model.add(Dense(50, input_dim=x_train.shape[1], activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(y_train.shape[1]))   # one linear output per predicted value
model.compile(optimizer='adam', loss='mse')
# early_stop is used below but was not defined in the post; a typical definition:
early_stop = EarlyStopping(monitor='val_loss', patience=5)   # from keras.callbacks
model.fit(x_train, y_train, validation_data=(x_vali, y_vali), epochs=30, batch_size=64, verbose=1, callbacks=[early_stop])
However, I am not sure how to change this into a Gaussian process neural network or a Bayesian neural network.
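One lightweight option, if a full Gaussian-process or Bayesian network is more than you need, is Monte Carlo dropout: keep dropout active at prediction time and average several stochastic forward passes to get a mean prediction and an uncertainty estimate. A rough sketch reusing the layer sizes above (the dropout rate and the number of samples are arbitrary choices, not part of the original code):

import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Dropout

inputs = Input(shape=(x_train.shape[1],))
h = Dense(50, activation='relu')(inputs)
h = Dropout(0.2)(h, training=True)           # training=True keeps dropout on at predict time
h = Dense(50, activation='relu')(h)
h = Dropout(0.2)(h, training=True)
outputs = Dense(y_train.shape[1])(h)

mc_model = Model(inputs, outputs)
mc_model.compile(optimizer='adam', loss='mse')
mc_model.fit(x_train, y_train, epochs=30, batch_size=64)

# Draw several stochastic predictions; their spread is a rough uncertainty estimate
samples = np.stack([mc_model.predict(x_vali) for _ in range(50)])
mean_pred, std_pred = samples.mean(axis=0), samples.std(axis=0)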
I am trying to do binary classification on a subset of the MNIST dataset. The goal is to predict whether a sample is a 6 or an 8. So I have 784 pixel features for each sample and 8201 samples in the dataset. I built a network with one input layer, 2 hidden layers, and one output layer. I am using sigmoid as the activation function for the output layer and relu for the hidden layers. I have no idea why I am getting 0% accuracy at the end.
#import libraries
from keras.models import Sequential
from keras.layers import Dense
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import os
np.random.seed(7)
os.chdir('C:/Users/olivi/Documents/Python workspace')
#data loading
data = pd.read_csv('MNIST_CV.csv')
#Y target label
Y = data.iloc[:,0]
#X: features
X = data.iloc[:,1:]
X_train, X_test, y_train, y_test = train_test_split(X, Y,test_size=0.25,random_state=42)
# create model
model = Sequential()
model.add(Dense(392, kernel_initializer='normal', input_dim=784, activation='relu'))
model.add(Dense(196,kernel_initializer='normal', activation='relu'))
model.add(Dense(98,kernel_initializer='normal', activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# Training the model
model.fit(X_train, y_train, epochs=100, batch_size=50)
print(model.predict(X_test,batch_size= 50))
score = model.evaluate(X_test, y_test)
print("\n Testing Accuracy:", score[1])
If you use binary cross-entropy, your labels should be either 0 or 1 (representing "is not number 6" or "is number 6" respectively).
If your Y target labels right now are the values 6 and 8, it'll fail.
Since you are choosing a subset of MNIST, you have to check how many different digit classes there are in your sample (in both the training and the test set).
So:
classes=len(np.unique(Y))
Then you should one-hot encode Y:
from keras.utils import np_utils
Y_train = np_utils.to_categorical(y_train, classes)
Y_test = np_utils.to_categorical(y_test, classes)
After that, change the last layer of your neural net to:
model.add(Dense(classes, activation='softmax'))
and compile with loss='categorical_crossentropy', the usual pairing for one-hot labels.
Finally:
model.predict_classes(X_test,batch_size= 50)
Be sure both the training and test sets contain the same number of classes for Y.
After the prediction, find where the 6s and 8s are located using np.where(), select this subsample, and test your accuracy.
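Alternatively, if you would rather keep the single sigmoid output and binary cross-entropy from the original model, a minimal sketch is to remap the labels to 0/1 before splitting (this assumes the first CSV column holds the digits 6 and 8):

# 6 -> 0 ("is not an 8"), 8 -> 1 ("is an 8")
Y = (data.iloc[:, 0] == 8).astype(int)
# scaling the pixels to [0, 1] also helps the relu/sigmoid layers train
X = data.iloc[:, 1:] / 255.0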
I have inputs that are binary (0, 1) and outputs that are binary (0, 1). More than 80% of the time, the binary input is equal to the binary output. However, when I train a Keras neural network, the accuracy only goes to about 0.6. There are 1000 such inputs. Here is the network setup in Keras:
model = Sequential()
model.add(Dense(12, input_dim=1, kernel_initializer='uniform', activation='relu'))
model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
This seems very strange. What could the problem be?
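As a sanity check, a single sigmoid unit trained with binary cross-entropy should already reach roughly the 80% base rate on data like this, since predicting output = input is right more than 80% of the time. A minimal sketch (x and y stand for the 1000 binary inputs and outputs):

from keras.models import Sequential
from keras.layers import Dense

baseline = Sequential()
baseline.add(Dense(1, input_dim=1, activation='sigmoid'))   # plain logistic regression
baseline.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
baseline.fit(x, y, epochs=100, batch_size=32, verbose=0)
print(baseline.evaluate(x, y, verbose=0))   # [loss, accuracy]; accuracy should approach ~0.8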
Assume that we have a convolutional neural network trained to classify (w.l.o.g. grayscale) images, in TensorFlow.
Given the trained net and a test image, one can trace which pixels of it are salient, or "equivalently" which pixels are most responsible for the output classification of the image. A nice explanation, with implementation details in Theano, is given in this article.
Assume that for the first layer of convolutions, the one directly connected to the input image, we have the gradient of the classification function w.r.t. the parameters of every convolutional kernel.
How can one propagate the gradient back to the input layer, so as to compute a partial derivative on every pixel of the image?
Propagating and accumulating the gradient back would give us the salient pixels (those with a large-magnitude derivative).
To find the gradient w.r.t. the kernels of the first layer, so far I have:
Replaced the usual loss operator with the output layer operator.
Used the compute_gradients function.
All in all, it looks like:
opt = tf.train.GradientDescentOptimizer(1)
grads = opt.compute_gradients(output)        # list of (gradient, variable) pairs
grad_var = [grad for grad, var in grads]
g1 = sess.run([grad_var[0]])
Here "output" is the max of the output layer of the NN.
And g1 is a (k, k, 1, M) tensor, since I used M convolutional kernels of size k x k in the first layer.
Now I need to find the correct way to propagate g1 back to every input pixel, so as to compute their derivatives w.r.t. the output.
To compute the gradients, you don't need to use an optimizer; you can use tf.gradients directly.
With this function, you can directly compute the gradient of output with respect to the image input, whereas the optimizer's compute_gradients method can only compute gradients with respect to Variables.
The other advantage of tf.gradients is that you can specify the gradients of the output you want to backpropagate.
So here is how to get the gradients of an input image with respect to output[1, 1]: we have to set the output gradients to 0 everywhere except at index [1, 1].
input = tf.ones([1, 4, 4, 1])
filter = tf.ones([3, 3, 1, 1])
output = tf.nn.conv2d(input, filter, [1, 1, 1, 1], 'SAME')
grad_output = np.zeros((1, 4, 4, 1), dtype=np.float32)
grad_output[0, 1, 1, 0] = 1.
grads = tf.gradients(output, input, grad_output)
sess = tf.Session()
print(sess.run(grads[0]).reshape((4, 4)))
# prints [[ 1. 1. 1. 0.]
# [ 1. 1. 1. 0.]
# [ 1. 1. 1. 0.]
# [ 0. 0. 0. 0.]]
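The same idea gives a per-pixel saliency map for a trained classifier: back-propagate the score of the class of interest to the input placeholder and look at the magnitude. A sketch, assuming logits is the pre-softmax output tensor, x is the input placeholder, image is a (height, width) grayscale array, and class_idx is the class you care about (none of these names come from the code above):

import numpy as np

saliency = tf.gradients(logits[0, class_idx], x)[0]        # d(class score) / d(input pixels)
sal_map = sess.run(saliency, feed_dict={x: image[None, :, :, None]})
sal_map = np.abs(sal_map)[0, :, :, 0]                       # large magnitude = salient pixel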
I am trying to implement a neural network with ReLU.
input layer -> 1 hidden layer -> relu -> output layer -> softmax layer
Above is the architecture of my neural network.
I am confused about the backpropagation of this ReLU.
For derivative of RELU, if x <= 0, output is 0.
if x > 0, output is 1.
So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Can someone explain the backpropagation of my neural network architecture 'step by step'?
if x <= 0, output is 0. if x > 0, output is 1
The ReLU function is defined as: for x > 0 the output is x, i.e. f(x) = max(0, x).
So for the derivative f'(x) it's actually:
if x < 0, output is 0; if x > 0, output is 1.
The derivative f'(0) is not defined, so it's usually set to 0, or you modify the activation function to be f(x) = max(e, x) for a small e.
Generally: a ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden layer, except that instead of tanh(x), sigmoid(x), or whatever other activation, it uses f(x) = max(0, x).
If you have written code for a working multilayer network with sigmoid activation, it's literally a one-line change. Nothing about forward or back propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
If you have a layer made out of a single ReLU, as your architecture suggests, then yes, you kill the gradient at 0. During training, the ReLU will return 0 to your output layer, which will either return 0 or 0.5 if you're using logistic units, and the softmax will squash those. So a value of 0 under your current architecture doesn't make much sense for the forward propagation part either.
See for example this. What you can do is use a "leaky ReLU", which uses a small slope (such as 0.01) for inputs below 0 instead of a flat zero.
I would reconsider this architecture, however; it doesn't make much sense to me to feed a single ReLU into a bunch of other units and then apply a softmax.
Here is a good example: use ReLU to implement XOR.
Reference: http://pytorch.org/tutorials/beginner/pytorch_with_examples.html
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

# N is batch size (sample size); D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 4, 2, 30, 1

# Create input and output data for XOR
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 0.002
loss_col = []
for t in range(200):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # using ReLU as the activation function
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()  # sum-of-squares loss
    loss_col.append(loss)
    print(t, loss, y_pred)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)       # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)    # the hidden layer's error
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0                      # the derivative of ReLU
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

plt.plot(loss_col)
plt.show()
You can read more about the derivative of ReLU here: http://kawahara.ca/what-is-the-derivative-of-relu/
So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Yes! If the weighted sum of the inputs and bias of the neuron (the activation function's input) is less than zero and the neuron uses the ReLU activation function, the value of the derivative is zero during backpropagation and the input weights to this neuron do not change (they are not updated).
Can someone explain the backpropagation of my neural network architecture 'step by step'?
A simple example can show one step of backpropagation. This example covers the complete process of one step, but you can also check only the part related to the ReLU. It is similar to the architecture introduced in the question and, for simplicity, uses one neuron in each layer. The architecture is as follows:
input x -> hidden neuron (weight w2, bias b2, activation f) -> output neuron (weight w3, bias b3, activation g) -> output h
f and g represent ReLU and sigmoid, respectively, and b represents the bias.
Step 1:
First, the output is calculated:
z2 = w2 * x + b2
a2 = f(z2) = max(0, z2)
z3 = w3 * a2 + b3
h = a3 = g(z3) = 1 / (1 + exp(-z3))
This merely represents the output calculation: "z" and "a" stand for the weighted sum of a neuron's inputs and the output of its activation function, respectively.
So h is the estimated value. Suppose the real value is y, and take a squared-error loss E = (y - h)^2 / 2.
Weights are now updated with backpropagation.
The new weight is obtained by calculating the gradient of the error function with respect to the weight and subtracting this gradient (times the learning rate r) from the previous weight, i.e.:
w_new = w_old - r * dE/dw
In backpropagation, the gradient for the neuron(s) of the last layer is calculated first. The chain rule is used:
dE/dw3 = dE/dh * dh/dz3 * dz3/dw3 = (h - y) * g'(z3) * a2
The three terms used above are:
the difference between the actual value and the estimated value, (h - y);
the output of the neuron feeding this weight, a2;
and the derivative of the activation function; given that the activation function in the last layer is a sigmoid, we have g'(z3) = g(z3) * (1 - g(z3)).
And the above expression does not necessarily become zero.
Now we go to the second (hidden) layer. In the second layer we will have:
dE/dw2 = dE/dh * dh/dz3 * dz3/da2 * da2/dz2 * dz2/dw2 = (h - y) * g'(z3) * w3 * f'(z2) * x
It consists of four main terms:
the difference between the actual value and the estimated value, (h - y);
the output of the neuron feeding this weight (here the input x);
the sum of the loss derivatives of the connected neurons in the next layer (here (h - y) * g'(z3) * w3);
and the derivative of the activation function. Since the activation function is a ReLU we have:
if z2 <= 0 (z2 is the input of the ReLU function): f'(z2) = 0, so dE/dw2 = 0 and w2 is not updated.
Otherwise f'(z2) = 1, and the gradient is not necessarily zero.
So if the weighted-sum input of a ReLU neuron is less than or equal to zero, the loss derivative for its incoming weights is zero and those weights will not update.
*To repeat: it is the weighted sum of the neuron's inputs that must be less than or equal to zero to kill gradient descent for that neuron.
The example given is a very simple one, just to illustrate the backpropagation process.
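A quick numeric sketch of this single step (the values are arbitrary and chosen so that the ReLU input comes out negative):

import math

x, y = 1.0, 0.0                             # input and target
w2, b2 = -0.5, 0.1                          # hidden (ReLU) neuron parameters
w3, b3 = 0.8, 0.0                           # output (sigmoid) neuron parameters
r = 0.1                                     # learning rate

# Forward pass
z2 = w2 * x + b2                            # -0.4, so the ReLU input is negative
a2 = max(0.0, z2)                           # 0.0
z3 = w3 * a2 + b3                           # 0.0
h = 1.0 / (1.0 + math.exp(-z3))             # sigmoid(0.0) = 0.5

# Backward pass for E = 0.5 * (y - h) ** 2
dE_dz3 = (h - y) * h * (1.0 - h)            # (h - y) * g'(z3) = 0.125
dE_dw3 = dE_dz3 * a2                        # 0.0 because a2 = 0
dE_dz2 = dE_dz3 * w3 * (1.0 if z2 > 0 else 0.0)   # ReLU derivative is 0 here
dE_dw2 = dE_dz2 * x                         # 0.0, so w2 is not updated this step

w3 -= r * dE_dw3
w2 -= r * dE_dw2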
Yes, the original ReLU function has the problem you describe.
So a change was later made to the formula, called the leaky ReLU.
In essence, the leaky ReLU tilts the horizontal part of the function slightly by a very small amount. For more information, watch this:
An explanation of activation methods, and an improved ReLU, on YouTube
Additionally, here you can find an implementation in the Caffe framework: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/relu_layer.cpp
The negative_slope parameter specifies whether to "leak" the negative part by multiplying it by the slope value rather than setting it to 0. Of course, you should set this parameter to zero to have the classical version.
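For reference, a minimal NumPy sketch of the leaky ReLU and its derivative (0.01 is a common default slope; it plays the same role as Caffe's negative_slope):

import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)      # negative inputs are scaled down, not zeroed

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)        # the gradient never dies completely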