Evaluating Neural Network using accuracy in Keras model

In Keras, when training and evaluating a Neural Network model that classifies two classes (0 and 1), the model returns loss and accuracy for both training and testing:
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
What does this accuracy represent? Is it the mean accuracy for the two classes or the accuracy for one of the two classes?

Accuracy is the number of correctly classified samples divided by the total number of samples. It does not involve any per-class accuracies.
Here, for example, is the code Keras uses to compute binary accuracy:
K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)

Keras will choose from a list of possible metrics in its code. From the metrics source code, there are five possibilities:
def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)

def categorical_accuracy(y_true, y_pred):
    return K.cast(K.equal(K.argmax(y_true, axis=-1),
                          K.argmax(y_pred, axis=-1)),
                  K.floatx())

def sparse_categorical_accuracy(y_true, y_pred):
    return K.cast(K.equal(K.max(y_true, axis=-1),
                          K.cast(K.argmax(y_pred, axis=-1), K.floatx())),
                  K.floatx())

def top_k_categorical_accuracy(y_true, y_pred, k=5):
    return K.mean(K.in_top_k(y_pred, K.argmax(y_true, axis=-1), k), axis=-1)

def sparse_top_k_categorical_accuracy(y_true, y_pred, k=5):
    return K.mean(K.in_top_k(y_pred, K.cast(K.max(y_true, axis=-1), 'int32'), k), axis=-1)
The choice depends on what type of model and loss function you have. In the training module, you see it choosing the accuracy function:
if (output_shape[-1] == 1 or self.loss_functions[i] == losses.binary_crossentropy):
    # case: binary accuracy
    acc_fn = metrics_module.binary_accuracy
elif self.loss_functions[i] == losses.sparse_categorical_crossentropy:
    # case: categorical accuracy with sparse targets
    acc_fn = metrics_module.sparse_categorical_accuracy
else:
    acc_fn = metrics_module.categorical_accuracy
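To make the branching concrete, here is a small illustrative sketch (the layer sizes and input shape are made up) of three compile configurations and the accuracy function Keras would pick for each, following the selection logic above:

from keras.models import Sequential
from keras.layers import Dense

# Two softmax outputs + categorical_crossentropy:
# neither special case matches, so 'accuracy' means categorical_accuracy
m1 = Sequential([Dense(16, activation='relu', input_shape=(10,)),
                 Dense(2, activation='softmax')])
m1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# One sigmoid output + binary_crossentropy:
# output_shape[-1] == 1, so 'accuracy' means binary_accuracy
m2 = Sequential([Dense(16, activation='relu', input_shape=(10,)),
                 Dense(1, activation='sigmoid')])
m2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Two softmax outputs + sparse_categorical_crossentropy (integer labels):
# 'accuracy' means sparse_categorical_accuracy
m3 = Sequential([Dense(16, activation='relu', input_shape=(10,)),
                 Dense(2, activation='softmax')])
m3.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])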
In your model, you have 2 outputs and a categorical_crossentropy loss, so you fall into the third case, and your accuracy will be:
def categorical_accuracy(y_true, y_pred):
    return K.cast(K.equal(K.argmax(y_true, axis=-1),
                          K.argmax(y_pred, axis=-1)),
                  K.floatx())
In plain terms: your model expects exactly one class to be true per sample, and a sample counts as correct when the index of the predicted class with the highest value equals the index of the true class.
Example:
predicted: [0.7 ; 0.3] /// true: [1 ; 0] --- counts as right
predicted: [0.8 ; 0.2] /// true: [0 ; 1] --- counts as wrong
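You can reproduce this by hand with plain NumPy instead of the Keras backend (a sketch of the same argmax comparison, not the actual Keras code):

import numpy as np

y_true = np.array([[1, 0],
                   [0, 1]])
y_pred = np.array([[0.7, 0.3],
                   [0.8, 0.2]])

# same idea as categorical_accuracy: compare argmax of prediction and target per sample
correct = np.equal(np.argmax(y_true, axis=-1), np.argmax(y_pred, axis=-1))
print(correct)         # [ True False] -> first sample right, second wrong
print(correct.mean())  # 0.5, the accuracy reported over these two samples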

Related

Neural Network Always Predicting Average Value

I'm trying to train a neural network to approximate a known scalar function of two variables; however, no matter the training parameters, the network always ends up simply predicting the average value of the true outputs.
I am using an MLP and have tried:
using several network depths and widths
different optimizers (SGD and ADAM)
different activations (ReLU and Sigmoid)
changing the learning rate (several points within the range 0.1 to 0.001)
increasing the data (to 10,000 points)
increasing the number of epochs (to 2,000)
and different random seeds
to no avail.
My loss function is MSE and always plateaus to a value of about 5.14.
Regardless of the changes I make, I get the same result: in my plot, the blue surface is the function to be approximated and the green surface is the MLP approximation, which sits at roughly the average value of the true function over that domain (the true average is 2.15, with a square of 4.64 - not far from the loss plateau value).
I feel like I could be missing something very obvious and have just been looking at it for too long. Any help is greatly appreciated! Thanks
I've attached my code here (I'm using JAX):
import jax.numpy as jnp
from jax import grad, jit, vmap, random, value_and_grad
import flax
import flax.linen as nn
import optax
from typing import Sequence

seed = 2
key, data_key = random.split(random.PRNGKey(seed))
x1, x2, y = generate_data(data_key)  # Data generation function

# Using Flax - define an MLP
class MLP(nn.Module):
    features: Sequence[int]

    @nn.compact
    def __call__(self, x):
        for feat in self.features[:-1]:
            x = nn.relu(nn.Dense(feat)(x))
        x = nn.Dense(self.features[-1])(x)
        return x

# Define function that returns JITted loss function
def make_mlp_loss(input_data, true_y):
    def mlp_loss(params):
        pred_y = model.apply(params, input_data)
        loss_vector = jnp.square(true_y.reshape(-1) - pred_y)
        return jnp.average(loss_vector)
    # Outer scope encapsulation saves the data and true output
    return jit(mlp_loss)

# Concatenate independent variable vectors to be proper input shape
input_data = jnp.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))

# Create loss function with data and true output
mlp_loss = make_mlp_loss(input_data, y)

# Create function that returns loss and gradient
loss_and_grad = value_and_grad(mlp_loss)

# Example architectures I've tried
architectures = [[16, 16, 1], [8, 16, 1], [16, 8, 1], [8, 16, 8, 1], [32, 32, 1]]

# Only using one seed but iterated over several
for seed in [645]:
    for architecture in architectures:
        # Create model
        model = MLP(architecture)

        # Initialize model with random parameters
        key, params_key = random.split(key)
        dummy = jnp.ones((1000, 2))
        params = model.init(params_key, dummy)

        # Create optimizer
        opt = optax.adam(learning_rate=0.01)  # sgd
        opt_state = opt.init(params)

        epochs = 50
        for i in range(epochs):
            # Get loss and gradient
            curr_loss, curr_grad = loss_and_grad(params)
            if i % 5 == 0:
                print(curr_loss)
            # Update
            updates, opt_state = opt.update(curr_grad, opt_state)
            params = optax.apply_updates(params, updates)

        print(f"Architecture: {architecture}\nLoss: {curr_loss}\nSeed: {seed}\n\n")

Fitting a neural network with ReLUs to polynomial functions

Out of curiosity, I am trying to fit a neural network with rectified linear units to polynomial functions.
For example, I would like to see how easy (or difficult) it is for a neural network to come up with an approximation for the function f(x) = x^2 + x. The following code should be able to do it, but seems to not learn anything. When I run
using Base.Iterators: repeated
ENV["JULIA_CUDA_SILENT"] = true
using Flux
using Flux: throttle
using Random

f(x) = x^2 + x

x_train = shuffle(1:1000)
y_train = f.(x_train)
x_train = hcat(x_train...)

m = Chain(
    Dense(1, 45, relu),
    Dense(45, 45, relu),
    Dense(45, 1),
    softmax
)

function loss(x, y)
    Flux.mse(m(x), y)
end

evalcb = () -> @show(loss(x_train, y_train))
opt = ADAM()
@show loss(x_train, y_train)

dataset = repeated((x_train, y_train), 50)
Flux.train!(loss, params(m), dataset, opt, cb = throttle(evalcb, 10))
println("Training finished")
@show m([20])
it returns
loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
Training finished
m([20]) = Float32[1.0]
Anyone here sees how I could make the network fit f(x) = x^2 + x?
There seem to be a couple of things wrong with your attempt, mostly to do with how you use your optimizer and treat your input -- nothing wrong with Julia or Flux. The solution provided below does learn, but it is by no means optimal.
It makes no sense to have softmax output activation on a regression problem. Softmax is used in classification problems where the output(s) of your model represent probabilities and therefore should be on the interval (0,1). It is clear your polynomial has values outside this interval. It is usual to have linear output activation in regression problems like these. This means in Flux no output activation should be defined on the output layer.
The shape of your data matters. train! computes gradients for loss(d...) where d is a batch in your data. In your case a minibatch consists of 1000 samples, and this same batch is repeated 50 times. Neural nets are often trained with smaller batch sizes but a larger sample set. In the code I provided, all batches consist of different data.
For training neural nets, in general, it is advised to normalize your input. Your input takes values from 1 to 1000. My example applies a simple linear transformation to get the input data in the right range.
Normalization can also apply to the output. If the outputs are large, this can result in (too) large gradients and weight updates. Another approach is to lower the learning rate a lot.
using Flux
using Flux: @epochs
using Random

normalize(x) = x / 1000

function generate_data(n)
    f(x) = x^2 + x
    xs = reduce(hcat, rand(n) * 1000)
    ys = f.(xs)
    (normalize(xs), normalize(ys))
end

batch_size = 32
num_batches = 10000

data_train = Iterators.repeated(generate_data(batch_size), num_batches)
data_test = generate_data(100)

model = Chain(Dense(1, 40, relu), Dense(40, 40, relu), Dense(40, 1))
loss(x, y) = Flux.mse(model(x), y)

opt = ADAM()
ps = Flux.params(model)
Flux.train!(loss, ps, data_train, opt, cb = () -> @show loss(data_test...))
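For readers coming from Keras, roughly the same recipe (linear output layer, MSE loss, normalized inputs and targets, small batches) might look like the sketch below. The layer sizes mirror the Julia example; the scaling constants are illustrative (the Julia answer divides both inputs and targets by 1000), and nothing here is the answer's own code:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

f = lambda x: x**2 + x

# raw inputs in [0, 1000); scale inputs and targets into a small range
x = np.random.rand(10000, 1) * 1000
y = f(x)
x_n = x / 1000.0   # inputs -> [0, 1)
y_n = y / 1.0e6    # targets -> roughly [0, 1]

model = Sequential([
    Dense(40, activation='relu', input_shape=(1,)),
    Dense(40, activation='relu'),
    Dense(1),      # linear output: no softmax for regression
])
model.compile(loss='mse', optimizer='adam')
model.fit(x_n, y_n, batch_size=32, epochs=20, verbose=0)

# undo the scaling when reading a prediction back; compare with f(20) = 420
print(model.predict(np.array([[20.0]]) / 1000.0) * 1.0e6)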

Cost Function in keras

How do I implement a cost function for an autoencoder network in Keras that depends on the dataset labels? The examples in this dataset have labels 0 and 1. I wrote the version shown below, but I do not know if it is correct.
def loss_function(x):
    def function(y_true, y_pred):
        for i in range(batch_size):
            if x[i] == 0:
                print('valorA', x[i])
                L1 = K.mean(K.square(y_pred[i] - y_true[i]))
            else:
                print('valorB', x[i])
                L2 = K.mean(K.square(y_pred[i] - y_true[i]))
        return L1 + L2
    return function
Compilation of the autoencoder:
autoencoder.compile(loss=loss_function(labeltr),optimizer='adam')

Flink SVM 90% misclassification

I am trying to do some binary classification with the flink-ml SVM implementation.
When I evaluated the classification I got a ~85% error rate on the training dataset. I plotted the 3D data and it looked like you could separate the data quite well with a hyperplane.
When I tried to get the weight vector out of the SVM, I only saw the option to get the weight vector without the intercept of the hyperplane. So it is just a hyperplane going through (0,0,0).
I don't have any clue where the error could be and appreciate every clue.
val env = ExecutionEnvironment.getExecutionEnvironment

val input: DataSet[(Int, Int, Boolean, Double, Double, Double)] = env.readCsvFile(filepathTraining, ignoreFirstLine = true, fieldDelimiter = ";")

val inputLV = input.map(
  t => { LabeledVector({if (t._3) 1.0 else -1.0}, DenseVector(Array(t._4, t._5, t._6))) }
)

val trainTestDataSet = Splitter.trainTestSplit(inputLV, 0.8, precise = true, seed = 100)
val trainLV = trainTestDataSet.training
val testLV = trainTestDataSet.testing

val svm = SVM()
svm.fit(trainLV)

val testVD = testLV.map(lv => (lv.vector, lv.label))
val evalSet = svm.evaluate(testVD)

// groups the data into false negatives, false positives, true negatives, true positives
evalSet.map(t => (t._1, t._2, 1)).groupBy(0, 1)
  .reduce((x1, x2) => (x1._1, x1._2, x1._3 + x2._3))
  .print()
The plotted data is shown here:
The SVM classifier doesn't give you the distance to the origin (aka. bias or threshold), because that's a parameter of the predictor. Different values of the threshold will result in different precision and recall metrics and the optimum is use-case specific. Usually we use a ROC (Receiver Operating Characteristic) curve to find it.
The related properties on SVM are (from the Flink docs):
ThresholdValue - to set the threshold for testing / predicting. Outputs below are classified as negative and outputs above as positive. Default is 0.
OutputDecisionFunction - set this to true to output the distance to the separating plane instead of the binary classification.
ROC Curve
How to find the optimum threshold is an art in itself. Without knowing anything more about the problem, what you can always do is plot the ROC curve (the True Positive Rate against the False Positive Rate) for different values of the threshold and look for the point with the greatest distance from a random guess (the diagonal line, which corresponds to an AUC of 0.5). But ultimately the choice of threshold also depends on the cost of a false positive vs. the cost of a false negative in your domain. Here is an example ROC curve from Wikipedia for three different classifiers.
To choose the initial threshold you could average it over the training data (or a sample of it):
// weights is a DataSet of size 1
val weights = svm.weightsOption.get.collect().head

val initialThreshold = trainLV.map { lv =>
  (lv.label - (weights dot lv.vector), 1L)
}.reduce { (avg1, avg2) =>
  (avg1._1 + avg2._1, avg1._2 + avg2._2)
}.collect() match { case Seq((sum, len)) =>
  sum / len
}
and then vary it in a loop, measuring the TPR and FPR on the test data.
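Sketched in NumPy for brevity (a language-agnostic illustration rather than Flink code): scores stands for the decision-function outputs you get with OutputDecisionFunction set to true, labels for the +/-1 test labels, and both are assumed to have been collected out of the DataSet already.

import numpy as np

# made-up decision-function outputs and true labels from the test set
scores = np.array([-2.1, -0.4, 0.3, 1.8, 2.5, -1.0, 0.9, -0.2])
labels = np.array([-1,   -1,   1,   1,   1,  -1,   1,  -1])

for threshold in np.linspace(scores.min(), scores.max(), num=9):
    predicted = np.where(scores > threshold, 1, -1)
    tp = np.sum((predicted == 1) & (labels == 1))
    fp = np.sum((predicted == 1) & (labels == -1))
    fn = np.sum((predicted == -1) & (labels == 1))
    tn = np.sum((predicted == -1) & (labels == -1))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    print(f"threshold={threshold:+.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")

Plotting the (FPR, TPR) pairs gives the ROC curve described above.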
Other Hyperparameters
Note that the SVM trainer also has Parameters (those are called hyperparameters) that need to be tuned for optimal prediction performance. There are many techniques to do that and this post would become too long to list them. I just wanted to bring your attention to that. If you're feeling lazy, here's a link on Wikipedia: Hyperparameter optimization.
Other Dimensions?
There is (somewhat of) a hack if you don't want to deal with the threshold right now. You can jam the bias into another dimension of the feature vector like so:
val bias = 10 // choose a large value

val inputLV = input.map { t =>
  LabeledVector(
    if (t._3) 1.0 else -1.0,
    DenseVector(Array(t._4, t._5, t._6, bias)))
}
Here is a nice discussion on why you should NOT do this. Basically the problem is that the bias would participate in regularization. But in machine learning there are no absolute truths.

I am training a keras neural net. I would like to have a custom loss function given by y_true*y_pred. Is this allowed?

This is a snippet of my model:
W1 = create_base_network(latent_dim)
input_a = Input(shape=(1,latent_dim))
input_b = Input(shape=(1,latent_dim))
x_a = encoder(input_a)
x_b = encoder(input_b)
processed_a = W1(x_a)
processed_b = W1(x_b)
del1 = Lambda(Delta1, output_shape=Delta1_output_shape)([processed_a, processed_b])
model = Model(input=[input_a, input_b], output=del1)
# train
rms = RMSprop()
model.compile(loss='kappa_delta_loss', optimizer=rms)
Basically, the neural net gets a (pre-trained) encoder representation of the two inputs and computes the difference in prediction values for the two inputs by passing them through an MLP. This difference is Delta1, which is the y_pred of the network. I want the loss function to be y_pred*y_true. However, when I do that, I get the error 'Invalid objective: kappa_delta_loss'.
What am I doing wrong?
You almost answer the question yourself. Create your objective function like the ones in https://github.com/fchollet/keras/blob/master/keras/objectives.py, like this:
import theano
import theano.tensor as T

epsilon = 1.0e-9

def custom_objective(y_true, y_pred):
    '''Just another crossentropy'''
    y_pred = T.clip(y_pred, epsilon, 1.0 - epsilon)
    y_pred /= y_pred.sum(axis=-1, keepdims=True)
    cce = T.nnet.categorical_crossentropy(y_pred, y_true)
    return cce

and pass it to the compile argument:
model.compile(loss=custom_objective, optimizer='adadelta')
from https://github.com/fchollet/keras/issues/369
So you should create your custom loss function with two arguments, the first being the target and the second your prediction.
Assuming your output (y_pred) is a scalar, your custom objective could be
def custom_objective(y_true, y_pred):
    return K.dot(y_true, y_pred)
Here K is the Keras backend, which makes this more generic than the Theano example above.
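If K.dot complains about the shapes of batched (batch_size, 1) tensors, the same y_true*y_pred objective can be written with an element-wise product reduced by a mean, since Keras averages the per-sample loss values over the batch. A minimal self-contained sketch; the toy model and data are made up, only the loss definition matters:

import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

def kappa_delta_loss(y_true, y_pred):
    # per-sample product of target and prediction, averaged over the last axis
    return K.mean(y_true * y_pred, axis=-1)

# toy single-output model, just to show the custom loss plugs into compile/fit
model = Sequential([Dense(8, activation='relu', input_shape=(4,)), Dense(1)])
model.compile(loss=kappa_delta_loss, optimizer='rmsprop')

x = np.random.rand(32, 4)
y = np.random.rand(32, 1)
model.fit(x, y, epochs=1, verbose=0)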