matconvnet classification training last layer (softmax)? - matlab

I would like to retrain the vgg-imagenet-f network to do classification (rather than direct image comparison, which is what I have done with my own network).
The downloaded network however is a deployment net, and doesn't have a loss layer included. As I've not done classification training before, I'm a bit stumped as to how to design this last layer. I expect it will be something like this:
layer.name = 'loss' ;
layer.type = 'custom' ;
layer.forward = @forward ;
layer.backward = @backward ;
layer.class = [] ;
but I don't know what my @forward and @backward functions should be. Should they be softmax?
Of note, I have an imdb with about 10k images, corresponding labels, and an ID element with unique numbers running from 1 to 10k.
Thanks for any help, or any links to a sample of the way one should construct this layer in matconvnet/matlab!

You could build your own network and adjust the filters accordingly. Since you want to retrain VGG rather than initialize the weights with random numbers, you can initialize your classification network with the trained filters from the downloaded network. The last layer could be softmaxloss:
http://www.vlfeat.org/matconvnet/mfiles/vl_nnsoftmaxloss/
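To make concrete what a softmax loss layer needs in its forward and backward passes, here is a minimal pure-Python sketch. It is not MatConvNet's implementation of vl_nnsoftmaxloss; the function names and the example scores are hypothetical, and it handles a single example rather than a batch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss_forward(scores, label):
    """Forward pass: negative log-probability assigned to the true class."""
    return -math.log(softmax(scores)[label])

def softmax_loss_backward(scores, label):
    """Backward pass: d(loss)/d(scores) = softmax(scores) - one_hot(label)."""
    grad = softmax(scores)
    grad[label] -= 1.0
    return grad

scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the last fully connected layer
label = 0                 # true class index (0-based here; MATLAB labels are 1-based)
loss = softmax_loss_forward(scores, label)
grad = softmax_loss_backward(scores, label)
```

The backward pass is what gets propagated into the preceding layers during training, so a custom layer would need both functions, whereas a built-in softmax loss layer already provides them.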

Related

NSLocalizedDescription = "The size of the output layer 'Identity' in the neural network does not match the number of classes in the classifier."

I just created a model that does binary classification and has a Dense layer of 1 unit at the end, with a sigmoid activation. However, I now get this error when I try to convert it to Core ML.
I tried to change the number of units to 2 and activation to softmax but still didn't work.
import coremltools as ct
from PIL import Image
# 1. Define the image input (scale pixel values by 1/255)
image_input = ct.ImageType(scale=1/255)
# 2. Configure the classifier labels
classifier_config = ct.ClassifierConfig(class_labels=[0, 1])  # ERROR here
# 3. Convert the model
coreml_model = ct.convert("mask_detection_model_surgical_mask.h5",
                          inputs=[image_input], classifier_config=classifier_config)
# 4. Load and resize an example image
example_image = Image.open("Unknown3.jpg").resize((256, 256))
# Make a prediction using Core ML
# (mymodel is not defined in this snippet; presumably it is the original Keras model)
out_dict = coreml_model.predict({mymodel.input_names[0]: example_image})
print(out_dict["classLabels"])
# Save to disk
# coreml_model.save("FINALLY.mlmodel")
I found the answer to my question.
Use a softmax activation with 2 Dense units as the final layer, with either loss='binary_crossentropy' or loss='categorical_crossentropy'.
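A small pure-Python check of why a 2-unit softmax can stand in for a 1-unit sigmoid: it produces one probability per class label, which is what the error message asks for, while carrying the same information. This is only an illustrative sketch; the logit value is made up and nothing here calls coremltools or Keras.

```python
import math

def sigmoid(z):
    """Probability of class 1 from a single logit."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Numerically stable softmax: one probability per class."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.7                        # hypothetical logit
p_sigmoid = sigmoid(z)         # a single number: P(class 1)
p_softmax = softmax([0.0, z])  # two numbers: [P(class 0), P(class 1)]
```

With the class-0 logit fixed at 0, the second softmax output equals the sigmoid output, so the two-unit layer loses nothing while matching the two entries in class_labels.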
Good luck to hundreds of people who posted a similar question but received no answer.

Interclass and Intraclass classification structure of CNN

I am working on a combined inter-class and intra-class classification problem with one CNN. First there are two classes, Cat and Dog; then within Cat there are three different breeds of cats, and within Dog there are five different breeds of dogs.
I haven't tried coding it yet; I am just working out whether it is feasible.
My question is: what would be a feasible design for this kind of problem?
I am thinking to design for the training, first CNN-1 network that will differentiate cat and dog and gather the image data of all the training images. After the separation of cat and dog, CNN-2 and CNN-3 will train these images further for each breed of dog and cat. I am just not sure how the testing will work in this situation.
I have approached a similar problem previously in Python. Hopefully this is helpful and you can come up with an alternative implementation in Matlab if that is what you are using.
After all was said and done, I landed on a single model for all predictions. For your purpose you could have one binary output for dog vs. cat, another multi-class output for the dog breeds, and another multi-class output for the cat breeds.
Using Tensorflow, I created a mask for the irrelevant classes. For example, if the image was of a cat, then all of the dog breeds are irrelevant and they should not impact model training for that example. This required a customized TF Dataset (that converted 0's to -1 for the mask) and a customized loss function that returned 0 error when the mask was present for that example.
Finally, the training process. Specific to your question, you will have to create custom accuracy functions that handle the mask values the way you want, but otherwise this part of the process should be standard. It is best practice to spread the classes evenly across the training data, but they can all be trained together.
If you google "Multi-Task Training" you can find additional resources for this problem.
Here are some code snippets if you are interested:
For the customized TF dataset that masks irrelevant labels...
from multiprocessing import cpu_count  # assumed source of cpu_count
import tensorflow as tf

# Replace 0's with -1 for mask when there aren't any labels
def produce_mask(features):
    for filt, tensor in features.items():
        if "target" in filt:
            condition = tf.equal(tf.math.reduce_sum(tensor), 0)
            features[filt] = tf.where(condition, tf.ones_like(tensor) * -1, tensor)
    return features

def create_dataset(filepath, batch_size=10):
    ...
    # **** This is where the mask was applied to the dataset
    dataset = dataset.map(produce_mask, num_parallel_calls=cpu_count())
    ...
    return parsed_features
Custom loss function. I was using binary-crossentropy because my problem was multi-label. You will likely want to adapt this to categorical-crossentropy.
from tensorflow.keras import backend

# Custom loss function
def masked_binary_crossentropy(y_true, y_pred):
    # 1.0 where a label is present, 0.0 where it is masked (-1)
    mask = backend.cast(backend.not_equal(y_true, -1), backend.floatx())
    return backend.binary_crossentropy(y_true * mask, y_pred * mask)
Then for the custom accuracy metrics. I was using top-k accuracy, you may need to modify for your purposes, but this will give you the general idea. When comparing this to the loss function, instead of converting all to 0, which would over-inflate the accuracy, this function filters those values out entirely. That works because the outputs are measured individually, so each output (binary, cat breed, dog breed) would have a different accuracy measure filtered only to the relevant examples.
backend is keras backend.
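from tensorflow.keras.metrics import top_k_categorical_accuracy

def top_5_acc(y_true, y_pred, k=5):
    # True for every label entry that is not the mask value (-1)
    mask = backend.cast(backend.not_equal(y_true, -1), tf.bool)
    # Keep an example if any of its label entries is unmasked
    mask = tf.math.reduce_any(mask, axis=1)
    masked_true = tf.boolean_mask(y_true, mask)
    masked_pred = tf.boolean_mask(y_pred, mask)
    return top_k_categorical_accuracy(masked_true, masked_pred, k)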
Edit
No, in the scenario I described above there is only one model and it is trained with all of the data together. There are 3 outputs to the single model. The mask is a major part of this as it allows the network to only adjust weights that are relevant to the example. If the image was a cat, then the dog breed prediction does not result in loss.
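The masking idea can also be sketched without TensorFlow. The following is a simplified, hypothetical categorical version for the dog-breed output only; it is not the code used above, and the label and prediction values are invented for illustration.

```python
import math

MASK = -1.0  # sentinel: this output has no label for this example

def masked_categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-example cross-entropy; 0.0 when the whole label row is masked."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        if all(t == MASK for t in true_row):
            # Irrelevant output (e.g. dog breeds for a cat image): no loss
            losses.append(0.0)
            continue
        losses.append(-sum(t * math.log(max(p, eps))
                           for t, p in zip(true_row, pred_row)))
    return losses

# Dog-breed output with 5 breeds: first example is a dog, second is a cat
y_true = [[0.0, 1.0, 0.0, 0.0, 0.0], [MASK] * 5]
y_pred = [[0.1, 0.6, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2]]
losses = masked_categorical_crossentropy(y_true, y_pred)
```

Because the cat example contributes zero loss on this output, it also contributes zero gradient to the dog-breed weights, which is the behaviour described in the edit above.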

implementing a MLP model in keras for timeseries prediction but the model doesn't train well

I'm trying to come up with an MLP model for time series prediction following this blog post. I have 138 time series with lookback_window=28 (split into 50127 time series for training and 24255 time series for validation). I need to predict the next value (timesteps=28, n_features=1). I started from a 3-layer network but it didn't train well. I tried to make the network deeper by adding more layers and more hidden units, but it doesn't improve. In the picture, you can see the prediction result of the following model. Here is my model code:
from keras.layers import Input, Dense
from keras.models import Model

inp = Input(batch_shape=(batch_size, lookback_window))
first_layer = Dense(1000, input_dim=28, activation='relu')(inp)
# The remaining hidden layers have no activation, so they are purely linear
snd_layer = Dense(500)(first_layer)
third_layer = Dense(250)(snd_layer)
tmp = Dense(100)(third_layer)
tmp2 = Dense(50)(tmp)
tmp3 = Dense(25)(tmp2)
out = Dense(1)(tmp3)
model = Model(inp, out)
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(train_data, train_y,
                    epochs=1000,
                    batch_size=539,
                    validation_data=(validation_data, validation_y),
                    verbose=1,
                    shuffle=False)
What am I missing? How can I improve it?
The main thing I noticed is that you are not using non-linearities in most of your layers: only the first Dense layer has an activation. I would use ReLUs for the hidden layers and a linear final layer if you want values outside [-1, 1] to be possible. If you do not want them to be possible, use tanh. By increasing the amount of data you make the problem harder, and therefore your mostly linear model is underfitting severely.
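To see why layers without activations add little capacity, here is a minimal pure-Python sketch showing that two stacked linear layers collapse into one linear map, while a ReLU in between breaks that equivalence. The weights are made-up values and biases are omitted for brevity (with biases the stack is still a single affine map).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]

x = [[1.0, -2.0]]               # one sample, two features
w1 = [[1.0, -1.0], [0.5, 2.0]]  # hypothetical first-layer weights
w2 = [[2.0], [-1.0]]            # hypothetical second-layer weights

stacked = matmul(matmul(x, w1), w2)          # two Dense layers, no activation
collapsed = matmul(x, matmul(w1, w2))        # one equivalent linear layer
with_relu = matmul(relu(matmul(x, w1)), w2)  # non-linearity in between
```

Here stacked and collapsed are identical, so the extra layer buys nothing, whereas with_relu differs because the ReLU zeroes the negative hidden value.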
I managed to get better results by the following changes:
Using RMSprop instead of Adam with lr=0.001 and, as @TommasoPasini mentioned, adding ReLU activations to all Dense layers (except the last one). It improves the results a lot!
Using epochs=3000 instead of 1000.
But now I think it is overfitting. Here are the plots of the results and the validation and train loss:

Weka Text Classification MultilayerPerceptron

My goal is to test how well a Multilayer Perceptron classifies the 20 newsgroups data. I keep getting only 5% accuracy with this method but can obtain ~90% with other classification methods such as Naive Bayes and KNN. I'm sure I am doing it wrong, so here is my code in hopes that someone can point me in the right direction:
newsgroups_data.setClassIndex(newsgroups_data.numAttributes() - 1);
StringToWordVector filter = new StringToWordVector();
FilteredClassifier classifier = new FilteredClassifier();
classifier.setFilter(filter);
MultilayerPerceptron mlp = new MultilayerPerceptron();
mlp.setTrainingTime(300); //This alone takes an hour or more
mlp.setLearningRate(0.01);
mlp.setHiddenLayers("1");
mlp.setReset(false);
classifier.setClassifier(mlp);
classifier.buildClassifier(newsgroups_data);
Evaluation eval = new Evaluation(newsgroups_data);
mlp.setHiddenLayers("1")
means you want to use one hidden layer with one node in it (that is, you're setting up a neural network with a single hidden neuron in total).

Feed Forward - Neural Networks Keras

For my input in the feed-forward neural network that I have implemented in Keras, I just want to check that my understanding is correct.
[[ 25.26000023 26.37000084 24.67000008 23.30999947]
[ 26.37000084 24.67000008 23.30999947 21.36000061]
[ 24.67000008 23.30999947 21.36000061 19.77000046]...]
So in the data above it is a time window of 4 inputs in an array. My input layer is
model.add(Dense(4, input_dim=4, activation='sigmoid'))
model.fit(trainX, trainY, nb_epoch=10000,verbose=2,batch_size=4)
and batch_size is 4, in theory when I call the fit function will the function go over all these inputs in each nb_epoch? and does the batch_size need to be 4 in order for this time window to work?
Thanks John
and batch_size is 4, in theory when I call the fit function will the function go over all these inputs in each nb_epoch?
Yes, each epoch is one pass over all training samples.
and does the batch_size need to be 4 in order for this time window to work?
No, these are completely unrelated things. A batch is simply a subset of your training data that is used to compute an approximation of the true gradient of the cost function. The bigger the batch, the closer you get to the true gradient (and to original Gradient Descent), but training gets slower. The closer the batch size gets to 1, the more stochastic and noisy the approximation becomes (and the closer you get to Stochastic Gradient Descent). The fact that you matched batch_size and data dimensionality is just an odd coincidence and has no meaning.
Let me put this in a more general setting: what you do in gradient descent with an additive loss function (which neural nets usually use) is go against the gradient, which is
grad_theta 1/N SUM_i=1^N loss(x_i, pred(x_i), y_i|theta) =
= 1/N SUM_i=1^N grad_theta loss(x_i, pred(x_i), y_i|theta)
where loss is some loss function over your pred (prediction) as compared to y_i.
In the batch-based scenario, the rough idea is that you do not need to go over all examples, but instead only over some strict subset, like batch = {(x_1, y_1), (x_5, y_5), (x_89, y_89) ... }, and use an approximation of the gradient of the form
1/|batch| SUM_(x_i, y_i) in batch: grad_theta loss(x_i, pred(x_i), y_i|theta)
As you can see, this is not related in any sense to the space where the x_i live, so there is no connection with the dimensionality of your data.
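A minimal pure-Python sketch of the batch approximation, assuming a one-parameter model pred(x) = w * x with squared-error loss. The data points and the value of w are hypothetical and chosen only for illustration.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of pred(x) = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # hypothetical (x_i, y_i)
w = 1.5

# Gradient over all N examples
full_grad = grad_mse(w, data)

# Gradients over two disjoint batches of size 2
batch_grads = [grad_mse(w, data[i:i + 2]) for i in range(0, len(data), 2)]
```

Each batch gradient differs from full_grad, which is the noise mentioned above, but for equally sized batches that partition the data their average recovers the full gradient exactly. Nothing in the computation depends on how many features each x_i has.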
Let me explain this with an example:
When you have 32 training examples and you call model.fit with a batch_size of 4, the neural network will be presented with 4 examples at a time, but one epoch will still be defined as one complete pass over all 32 examples. So in this case the network will go through 4 examples at a time and will, theoretically at least, call the forward pass (and the backward pass) 32 / 4 = 8 times.
In the extreme case where your batch_size is 1, that is plain old stochastic gradient descent. When batch_size is greater than 1 but smaller than the training set, it is usually called mini-batch gradient descent; when the batch is the whole training set, it is batch gradient descent.
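The 32-example case above can be sketched in plain Python. This only illustrates the counting; iter_batches is a hypothetical helper, not a Keras function, and it assumes the last, smaller batch is still used when the batch size does not divide the number of examples.

```python
import math

def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(32))  # stand-in for 32 training examples

# One epoch = one full pass; each batch triggers one forward/backward pass
updates_per_epoch = sum(1 for _ in iter_batches(samples, 4))

# With batch_size 5 the final batch holds only 2 examples
updates_uneven = sum(1 for _ in iter_batches(samples, 5))
```

In general the number of updates per epoch is ceil(N / batch_size) under that assumption, regardless of how many features each example has.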