How do the softmax function and loss function work in a multiple-input Keras model - tf.keras

My model has two input branches: input1 takes 2D grayscale images and input2 takes color images. The two branches are merged using the concatenate method and classified with a softmax function. The model works fine, but I have trouble understanding how softmax operates in a multiple-input model and how the weights are updated in both branches.

The softmax function and the loss function behave the same way in a multiple-input/output model as they do in a single-input/output model. If we pass only a single loss function to the model, that loss function is applied to every output, unless we specify a different loss function (and, if needed, a different activation function) for each output.
Consider the following model, which has an image input of shape (32, 32, 3) (that's (height, width, channels)) and a timeseries input of shape (None, 10) (that's (timesteps, features)). Our model will have two outputs computed from the combination of these inputs: a "score" (of shape (1,)) and a probability distribution over five classes (of shape (5,)).
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
image_input = keras.Input(shape=(32, 32, 3), name='img_input')
timeseries_input = keras.Input(shape=(None, 10), name='ts_input')
x1 = layers.Conv2D(3, 3)(image_input)
x1 = layers.GlobalMaxPooling2D()(x1)
x2 = layers.Conv1D(3, 3)(timeseries_input)
x2 = layers.GlobalMaxPooling1D()(x2)
x = layers.concatenate([x1, x2])
score_output = layers.Dense(1, name='score_output')(x)
class_output = layers.Dense(5, name='class_output')(x)
model = keras.Model(inputs=[image_input, timeseries_input],
                    outputs=[score_output, class_output])
Let's plot this model, so you can clearly see what we're doing here (note that the shapes shown in the plot are batch shapes, rather than per-sample shapes).
keras.utils.plot_model(model, 'multi_input_and_output_model.png', show_shapes=True)
If we only passed a single loss function to the model, the same loss function would be applied to every output, which is not appropriate here.
Passing data to a multi-input or multi-output model in fit works in a similar way as specifying a loss function in compile: you can pass lists of Numpy arrays (with 1:1 mapping to the outputs that received a loss function) or dicts mapping output names to Numpy arrays of training data.
model.compile(
    optimizer=keras.optimizers.RMSprop(1e-3),
    loss=[keras.losses.MeanSquaredError(),
          keras.losses.CategoricalCrossentropy(from_logits=True)])
# Generate dummy Numpy data
img_data = np.random.random_sample(size=(100, 32, 32, 3))
ts_data = np.random.random_sample(size=(100, 20, 10))
score_targets = np.random.random_sample(size=(100, 1))
class_targets = np.random.random_sample(size=(100, 5))
# Fit on lists
model.fit([img_data, ts_data], [score_targets, class_targets],
          batch_size=32,
          epochs=3)
# Alternatively, fit on dicts
model.fit({'img_input': img_data, 'ts_input': ts_data},
          {'score_output': score_targets, 'class_output': class_targets},
          batch_size=32,
          epochs=3)
Output -
Train on 100 samples
Epoch 1/3
100/100 [==============================] - 2s 22ms/sample - loss: 5.2477 - score_output_loss: 0.1809 - class_output_loss: 5.3292
Epoch 2/3
100/100 [==============================] - 0s 191us/sample - loss: 4.8558 - score_output_loss: 0.1235 - class_output_loss: 4.5884
Epoch 3/3
100/100 [==============================] - 0s 202us/sample - loss: 4.7482 - score_output_loss: 0.1421 - class_output_loss: 4.5786
Train on 100 samples
Epoch 1/3
100/100 [==============================] - 0s 260us/sample - loss: 4.6704 - score_output_loss: 0.1377 - class_output_loss: 4.5686
Epoch 2/3
100/100 [==============================] - 0s 204us/sample - loss: 4.6210 - score_output_loss: 0.2038 - class_output_loss: 4.5260
Epoch 3/3
100/100 [==============================] - 0s 188us/sample - loss: 4.6014 - score_output_loss: 0.1678 - class_output_loss: 4.3346
<tensorflow.python.keras.callbacks.History at 0x7f8cc43a9ac8>
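For reference, the per-output losses can also be keyed by output name instead of being passed as a list; a minimal sketch using the same model and losses as above:
# Per-output losses keyed by output name (same model as above).
# class_output produces logits, so from_logits=True lets the loss apply the softmax internally.
model.compile(
    optimizer=keras.optimizers.RMSprop(1e-3),
    loss={'score_output': keras.losses.MeanSquaredError(),
          'class_output': keras.losses.CategoricalCrossentropy(from_logits=True)})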

Related

Why is my val_accuracy stagnant at 0.0000e+00 while my val_loss is increasing from the start?

I am training a classification model to classify cells, and my model is based on this paper: https://www.nature.com/articles/s41598-019-50010-9. As my dataset consists of only 10 images, I performed image augmentation to artificially increase the size of the dataset to 3000 images, which were then split into 2400 training images and 600 validation images.
However, while the training loss and accuracy improved with more iterations, the validation loss increased rapidly while the validation accuracy remained stagnant at 0.0000e+00.
Is my model overfitting severely right from the start?
The code I used is as shown below:
import keras
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model, load_model, Sequential, model_from_json
from tensorflow.keras.layers import Input, BatchNormalization, Activation, Flatten, Dense, LeakyReLU
from tensorflow.python.keras.layers.core import Lambda, Dropout
from tensorflow.python.keras.layers.convolutional import Conv2D, Conv2DTranspose, UpSampling2D
from tensorflow.python.keras.layers.pooling import MaxPooling2D, AveragePooling2D
from tensorflow.python.keras.layers.merge import Concatenate, Add
from tensorflow.keras.callbacks import ModelCheckpoint, LearningRateScheduler, ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import to_categorical
# img_aug (augmented images), img_aug_label (their labels) and no_of_img (the number of classes)
# are assumed to be produced by the augmentation step, which is not shown here.
img_channel = 1
input_size = (512, 512, 1)
inputs = Input(shape = input_size)
initial_input = Lambda(lambda x: x)(inputs)  # identity pass-through; the inputs are scaled to [0, 1] later via np.float32(img_aug)/255
kernel_size = (3,3)
pad = 'same'
model = Sequential()  # unused; the functional Model built below is what is actually compiled and trained
filters = 2
c1 = Conv2D(filters, kernel_size, padding = pad, kernel_initializer = 'he_normal')(initial_input)
b1 = BatchNormalization()(c1)
a1 = Activation('elu')(b1)
p1 = AveragePooling2D()(a1)
c2 = Conv2D(filters, kernel_size, padding = pad, kernel_initializer = 'he_normal')(p1)
b2 = BatchNormalization()(c2)
a2 = Activation('elu')(b2)
p2 = AveragePooling2D()(a2)
c3 = Conv2D(filters, kernel_size, padding = pad, kernel_initializer = 'he_normal')(p2)
b3 = BatchNormalization()(c3)
a3 = Activation('elu')(b3)
p3 = AveragePooling2D()(a3)
c4 = Conv2D(filters, kernel_size, padding = pad, kernel_initializer = 'he_normal')(p3)
b4 = BatchNormalization()(c4)
a4 = Activation('elu')(b4)
p4 = AveragePooling2D()(a4)
c5 = Conv2D(filters, kernel_size, padding = pad, kernel_initializer = 'he_normal')(p4)
b5 = BatchNormalization()(c5)
a5 = Activation('elu')(b5)
p5 = AveragePooling2D()(a5)
f = Flatten()(p5)
d1 = Dense(128, activation = 'elu')(f)
d2 = Dense(no_of_img, activation = 'softmax')(d1)
model = Model(inputs = [inputs], outputs = [d2])
print(model.summary())
learning_rate = 0.001
decay_rate = 0.0001
model.compile(optimizer = SGD(lr = learning_rate, decay = decay_rate, momentum = 0.9, nesterov = False),
              loss = 'categorical_crossentropy', metrics = ['accuracy'])
perf_lr_scheduler = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.9, patience = 3,
                                      verbose = 1, min_delta = 0.01, min_lr = 0.000001)
model_earlystop = EarlyStopping(monitor = 'val_loss', min_delta = 0.001, patience = 10, restore_best_weights = True)
#Convert labels to binary class matrices
img_aug_label = to_categorical(img_aug_label, num_classes = no_of_img)
#Convert images to floats between 0 and 1
img_aug = np.float32(img_aug)/255
plt.imshow(img_aug[0,:,:,0])
plt.show()
#Train on augmented images
model.fit(
    img_aug,
    img_aug_label,
    batch_size = 4,
    epochs = 100,
    validation_split = 0.2,
    shuffle = True,
    callbacks = [perf_lr_scheduler],
    verbose = 2)
The output of my model is as shown below:
Train on 2400 samples, validate on 600 samples
Epoch 1/100
2400/2400 - 12s - loss: 0.6474 - accuracy: 0.8071 - val_loss: 9.8161 - val_accuracy: 0.0000e+00
Epoch 2/100
2400/2400 - 10s - loss: 0.0306 - accuracy: 0.9921 - val_loss: 10.1733 - val_accuracy: 0.0000e+00
Epoch 3/100
2400/2400 - 10s - loss: 0.0058 - accuracy: 0.9996 - val_loss: 10.9820 - val_accuracy: 0.0000e+00
Epoch 4/100
Epoch 00004: ReduceLROnPlateau reducing learning rate to 0.0009000000427477062.
2400/2400 - 10s - loss: 0.0019 - accuracy: 1.0000 - val_loss: 11.3029 - val_accuracy: 0.0000e+00
Epoch 5/100
2400/2400 - 10s - loss: 0.0042 - accuracy: 0.9992 - val_loss: 11.9037 - val_accuracy: 0.0000e+00
Epoch 6/100
2400/2400 - 10s - loss: 0.0024 - accuracy: 0.9996 - val_loss: 11.5218 - val_accuracy: 0.0000e+00
Epoch 7/100
Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.0008100000384729356.
2400/2400 - 10s - loss: 9.9053e-04 - accuracy: 1.0000 - val_loss: 11.7658 - val_accuracy: 0.0000e+00
Epoch 8/100
2400/2400 - 10s - loss: 0.0011 - accuracy: 1.0000 - val_loss: 12.0437 - val_accuracy: 0.0000e+00
Epoch 9/100
I realised the error occurred because I had not shuffled the data manually before using it as training data for the model. I had assumed that the validation_split and shuffle arguments both take effect at training time, but in fact the split happens before any shuffling: the fit function first slices your data into training and validation sets, and only then shuffles, never across the two sets.
For my augmented dataset, the split occurred at a position where the validation set contained types of images not found in the training set. Consequently, the model was validated on types of data it had never seen during training, resulting in the poor validation loss and accuracy. Manually shuffling the data before fitting it to the model solved the problem.
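A minimal sketch of that manual shuffle, assuming img_aug and img_aug_label are the NumPy arrays used above:
perm = np.random.permutation(len(img_aug))  # one permutation shared by images and labels
img_aug = img_aug[perm]
img_aug_label = img_aug_label[perm]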

Lower accuracy of VGG16 using data augmentation in Keras

I have a question regarding feature extraction with data augmentation in Keras. I am building a dog breed classifier.
By feature extraction, I am referring to extending the model, (conv_base, VGG16) by adding Dense layers on top, and running the whole thing end to end on the input data. This will allow me to use data augmentation, because every input image goes through the convolutional base every time it’s seen by the model.
Training Set: 6680 images belonging to 133 classes
Validation Set: 835 images belonging to 133 classes
Test Set: 836 images belonging to 133 classes
I was able to successfully implement data augmentation and feature extraction independently of one another, but when I try combining the two, my accuracy comes out incredibly low for some reason. Why is this? Am I doing something majorly wrong in my approach?
from keras.applications import VGG16
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, GlobalAveragePooling2D
from keras.callbacks import ModelCheckpoint
conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(224, 224, 3))
model = Sequential()
model.add(conv_base)
conv_base.trainable = False
model.add(GlobalAveragePooling2D())
model.add(Dense(133, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
train_datagen_aug = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
test_datagen_aug = ImageDataGenerator(rescale=1./255)
train_generator_aug = train_datagen_aug.flow_from_directory(
    'myImages/train',
    target_size=(224, 224),
    batch_size=50,
    class_mode='categorical')
validation_generator_aug = test_datagen_aug.flow_from_directory(
    'myImages/valid',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical')
checkpointer_aug = ModelCheckpoint(filepath='saved_models/dogs_transfer_aug_model.h5',
                                   save_best_only=True)
history = model.fit_generator(
    train_generator_aug,
    steps_per_epoch=130,
    epochs=20,
    validation_data=validation_generator_aug,
    verbose=1,
    callbacks=[checkpointer_aug],
    validation_steps=26)
Output looks like this:
Epoch 1/20
130/130 [==============================] - 293s - loss: 15.9044 - acc: 0.0083 - val_loss: 16.0019 - val_acc: 0.0072
Epoch 2/20
130/130 [==============================] - 281s - loss: 15.9972 - acc: 0.0075 - val_loss: 15.9977 - val_acc: 0.0075
Epoch 3/20
130/130 [==============================] - 280s - loss: 16.0220 - acc: 0.0060 - val_loss: 15.9977 - val_acc: 0.0075
Epoch 4/20
130/130 [==============================] - 280s - loss: 15.9941 - acc: 0.0077 - val_loss: 16.0019 - val_acc: 0.0072
I suspect this is an overfitting issue, as suggested by the model's loss and accuracy. We can have a go with a smaller version of VGG16 (with fewer layers):
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout, Conv2D, MaxPooling2D, ZeroPadding2D
NUMBER_OF_TRAINING_SAMPLES = 6668
NUMBER_OF_VALIDATION_SAMPLES = 835
batch_size = 32
out_classes = 133
input_shape = (224, 224, 3)
def buildSmallVGG(out_classes, input_shape):
    model = Sequential()
    model.add(ZeroPadding2D((1, 1), input_shape=input_shape))
    model.add(Conv2D(16, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(out_classes, activation='softmax'))
    return model
model = buildSmallVGG(out_classes, input_shape)
# The model must be compiled before fit_generator; reusing the loss and metrics from the question
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit_generator(
    train_generator_aug,
    steps_per_epoch=NUMBER_OF_TRAINING_SAMPLES // batch_size,
    epochs=20,
    validation_data=validation_generator_aug,
    callbacks=[checkpointer_aug],
    validation_steps=NUMBER_OF_VALIDATION_SAMPLES // batch_size)
The above is not tested. It would be good if you could share the results you get for loss, accuracy, etc.

Inconsistent results between Keras and sklearn's MLPClassifier

I have been reading the Keras documentation to build my own MLP network with backpropagation. I am familiar with MLPClassifier in sklearn, but I want to learn Keras for deep learning. The following is my first attempt. The network has 3 layers: one input layer (64 features), one hidden layer, and one output layer, i.e. (64, 64, 1). The input is a numpy matrix X of 125K samples (64-dimensional) and y is a 1D numpy array of binary class labels (1, -1):
# Keras imports
from keras.models import Sequential
from sklearn.model_selection import train_test_split
from keras.layers import Dense, Dropout, Activation
from keras.initializers import RandomNormal, VarianceScaling, RandomUniform
from keras.optimizers import SGD, Adam, Nadam, RMSprop
# System imports
import sys
import os
import numpy as np
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
def train_model(X, y, num_streams, num_stages):
    '''
    STEP1: Initialize the Model
    '''
    tr_X, ts_X, tr_y, ts_y = train_test_split(X, y, train_size=.8)
    model = initialize_model(num_streams, num_stages)
    '''
    STEP2: Train the Model
    '''
    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(lr=1e-3),
                  metrics=['accuracy'])
    model.fit(tr_X, tr_y,
              validation_data=(ts_X, ts_y),
              epochs=3,
              batch_size=200)
def initialize_model(num_streams, num_stages):
    model = Sequential()
    hidden_units = 2 ** (num_streams + 1)
    # init = VarianceScaling(scale=5.0, mode='fan_in', distribution='normal')
    init_bound1 = np.sqrt(3.5 / ((num_stages + 1) + num_stages))
    init_bound2 = np.sqrt(3.5 / ((num_stages + 1) + hidden_units))
    init_bound3 = np.sqrt(3.5 / (hidden_units + 1))
    # drop_out = np.random.uniform(0, 1, 3)
    # This is the input layer (that's why you have to state the input_dim value)
    model.add(Dense(num_stages,
                    input_dim=num_stages,
                    activation='relu',
                    kernel_initializer=RandomUniform(minval=-init_bound1, maxval=init_bound1)))
    model.add(Dense(hidden_units,
                    activation='relu',
                    kernel_initializer=RandomUniform(minval=-init_bound2, maxval=init_bound2)))
    # model.add(Dropout(drop_out[1]))
    # This is the output layer
    model.add(Dense(1,
                    activation='sigmoid',
                    kernel_initializer=RandomUniform(minval=-init_bound3, maxval=init_bound3)))
    return model
The problem is that I get 99% accuracy with the same dataset X and y when using sklearn's MLPClassifier. However, Keras gives poor accuracy, as seen below:
Train on 100000 samples, validate on 25000 samples
Epoch 1/3
100000/100000 [==============================] - 1s - loss: -0.5358 - acc: 0.0022 - val_loss: -0.7322 - val_acc: 0.0000e+00
Epoch 2/3
100000/100000 [==============================] - 1s - loss: -0.6353 - acc: 0.0000e+00 - val_loss: -0.7385 - val_acc: 0.0000e+00
Epoch 3/3
100000/100000 [==============================] - 1s - loss: -0.7720 - acc: 9.0000e-05 - val_loss: -0.9474 - val_acc: 5.2000e-04
I don't understand why. Am I missing something here? Any help is appreciated.
I think the problem is that you are using a sigmoid output layer (bounded to [0, 1]) while your classes are (1, -1); you need to remap your target values or use tanh.
Also, Keras layers may have different default parameters than sklearn, so make sure you take a look at those in the documentation.
One last thing: for the kernel_initializer, try glorot_uniform; it is a good default.
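If you keep the sigmoid output layer, a minimal sketch (assuming y is a NumPy array of -1/1 labels) of remapping the targets to 0/1 before the train/test split:
y = ((y + 1) // 2).astype('float32')  # map -1 -> 0 and 1 -> 1 so the targets match the sigmoid range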
Check by converting your labels to a one-hot encoding before training the model.
For more info on why to one-hot encode, check out https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

Keras giving unexpected output for simple binary classification

Here is a simple keras neural network that attempts to map 1->1 and 2->0 (binary classification)
X = [[1], [2]]
Y = [[1], [0]]
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import History
history = History()
from keras import optimizers
inputDim = len(X[0])
print('input dim', inputDim)
model = Sequential()
model.add(Dense(1, activation='sigmoid', input_dim=inputDim))
model.add(Dense(1, activation='sigmoid'))
sgd = optimizers.SGD(lr=0.009, decay=1e-10, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.fit(X, Y, validation_split=0.1, verbose=2, callbacks=[history], epochs=20, batch_size=32)
Using SGD optimizer :
optimizers.SGD(lr=0.009, decay=1e-10, momentum=0.9, nesterov=True)
Output for epoch 20 :
Epoch 20/20
0s - loss: 0.5973 - acc: 1.0000 - val_loss: 0.4559 - val_acc: 0.0000e+00
If I use the Adam optimizer:
sgd = optimizers.adam(lr=0.009, decay=1e-10)
Output for epoch 20 :
Epoch 20/20
0s - loss: 1.2140 - acc: 0.0000e+00 - val_loss: 0.2930 - val_acc: 1.0000
Switching between the Adam and SGD optimizers appears to reverse the values for acc and val_acc. val_acc = 1 using Adam, but since acc is 0, how can validation accuracy be at its maximum while training accuracy is at its minimum?
Using sigmoid after sigmoid is a really bad idea. E.g. in this paper it's explained why sigmoid suffers from a so-called saturation problem. Moreover, when you stack sigmoid after sigmoid you push the overall saturation of your network sky-high. To understand why, notice that the output of the first layer is always in the interval (0, 1). Since binary_crossentropy tries to push this output (after a linear transformation) as close to +/- inf as possible, the network ends up with extremely large weights. This is probably what causes your instability.
In order to solve your problem, I would simply keep only one layer with a sigmoid, as your problem is linearly separable.
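A minimal sketch of that single-layer variant, reusing inputDim and sgd from the question's code:
model = Sequential()
model.add(Dense(1, activation='sigmoid', input_dim=inputDim))  # one sigmoid layer, no stacked sigmoids
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])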
UPDATE:
As @Daniel mentioned, when you split a dataset containing two examples, you end up with one example in the training set and the other in the validation set. This is what causes the weird behavior.
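With only two samples, a quick sanity check (a sketch, not part of the original answer) is to drop the validation split entirely so both examples are used for training:
model.fit(X, Y, verbose=2, callbacks=[history], epochs=20, batch_size=2)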

Simple tensorflow neural network not increasing accuracy or decreasing loss?

I have the following network for training,
graph = tf.Graph()
with graph.as_default():
    tf_train_dataset = tf.constant(X_train)
    tf_train_labels = tf.constant(y_train)
    tf_valid_dataset = tf.constant(X_test)
    weights = tf.Variable(tf.truncated_normal([X_train.shape[1], 1]))
    biases = tf.Variable(tf.zeros([num_labels]))
    logits = tf.nn.softmax(tf.matmul(tf_train_dataset, weights) + biases)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
and I ran it as follows,
num_steps = 10
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    for step in range(num_steps):
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        print("Loss: ", l)
        print('Training accuracy: %.1f' % sklearn.metrics.accuracy_score(predictions.flatten(), y_train.flatten()))
But it outputs the following:
Initialized
Loss: 0.0
Training accuracy: 0.5
Loss: 0.0
Training accuracy: 0.5
The shape of X_train is (213403, 25) and y_train is (213403, 1), to match the logits. I didn't encode the labels as one-hot because there are only two classes, either 1 or 0. I also tried a quadratic loss function and the same thing happened: the loss doesn't decrease at all. I sense a syntactic mistake here but I am clueless.
You are passing the labels as a single column (without encoding).
The model cannot treat the labels as a categorical (factor) type,
so it considers your labels as continuous values.
Loss: 0.0 means the loss is zero, i.e. the model appears to fit perfectly.
This happens because your labels are continuous (as in regression) while you are using the softmax_cross_entropy_with_logits loss function.
Try passing a one-hot encoding of the labels and check.
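A minimal sketch of that suggestion, assuming y_train holds 0/1 labels of shape (N, 1); the weights and biases then need two output units, and the softmax is left to the loss function:
import numpy as np
# One-hot encode the binary labels: each row becomes a two-class distribution.
y_train_onehot = np.eye(2)[y_train.astype(int).reshape(-1)]
tf_train_labels = tf.constant(y_train_onehot)
# Two output units so the logits match the one-hot labels.
weights = tf.Variable(tf.truncated_normal([X_train.shape[1], 2]))
biases = tf.Variable(tf.zeros([2]))
logits = tf.matmul(tf_train_dataset, weights) + biases  # raw logits; no softmax here
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))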