How can I save the probability of the prediction in caffe? - neural-network

Does anyone know how I can save the output of the predicted class labels for each pixel in FCN semantic segmentation? I would like to see the probability map of the image during inference. Which layer's data should be saved?
Many thanks

As you can see from the code in infer.py the predicted label is argmax of 'score' layer.
out = net.blobs['score'].data[0].argmax(axis=0)
The 'score' is the input to "SoftmaxWithLoss" layer during training. Hence, to get class probabilities from 'score' you need to add "Softmax" on top of 'score':
e_s = np.exp(net.blobs['score'].data[0])
prob = e_s / e_s.sum(axis=0)
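As a minimal sketch of the same softmax step, here is a self-contained NumPy version that subtracts the per-pixel maximum before exponentiating, which avoids overflow on large scores. The `score` array below is a made-up stand-in for `net.blobs['score'].data[0]`, assumed to have shape (num_classes, height, width); the function name is hypothetical.

```python
import numpy as np

def softmax_over_classes(score):
    """Convert a (num_classes, H, W) score array to per-pixel probabilities."""
    # Subtracting the per-pixel maximum does not change the result,
    # but it keeps np.exp from overflowing on large scores.
    shifted = score - score.max(axis=0, keepdims=True)
    e_s = np.exp(shifted)
    return e_s / e_s.sum(axis=0, keepdims=True)

# Stand-in for net.blobs['score'].data[0]: 3 classes on a 2x2 image (made-up values).
score = np.array([[[1.0, 2.0], [0.5, 1000.0]],
                  [[0.0, 2.0], [0.5, 0.0]],
                  [[-1.0, 2.0], [0.5, 0.0]]])

prob = softmax_over_classes(score)   # probability map, one channel per class
label = prob.argmax(axis=0)          # same labels as score.argmax(axis=0)
```

The resulting `prob` array could then be written to disk with something like `np.save('prob.npy', prob)`, where the file name is only an example.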

Related

Keras input explanation: input_shape, units, batch_size, dim, etc

For any Keras layer (Layer class), can someone explain how to understand the difference between input_shape, units, dim, etc.?
For example the doc says units specify the output shape of a layer.
In the image of the neural net below hidden layer1 has 4 units. Does this directly translate to the units attribute of the Layer object? Or does units in Keras equal the shape of every weight in the hidden layer times the number of units?
In short how does one understand/visualize the attributes of the model - in particular the layers - with the image below?
Units:
The number of "neurons", or "cells", or whatever the layer has inside it.
It's a property of each layer, and yes, it's related to the output shape (as we will see later). In your picture, except for the input layer, which is conceptually different from other layers, you have:
Hidden layer 1: 4 units (4 neurons)
Hidden layer 2: 4 units
Last layer: 1 unit
Shapes
Shapes are consequences of the model's configuration. Shapes are tuples representing how many elements an array or tensor has in each dimension.
Ex: a shape (30,4,10) means an array or tensor with 3 dimensions, containing 30 elements in the first dimension, 4 in the second and 10 in the third, totaling 30*4*10 = 1200 elements or numbers.
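A quick way to check this with NumPy (used here only for illustration; Keras tensors report shapes the same way):

```python
import numpy as np

# An array with 3 dimensions: 30 x 4 x 10
x = np.zeros((30, 4, 10))

print(x.shape)  # (30, 4, 10)
print(x.ndim)   # 3
print(x.size)   # 1200
```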
The input shape
What flows between layers are tensors. Tensors can be seen as matrices, with shapes.
In Keras, the input layer itself is not a layer, but a tensor. It's the starting tensor you send to the first hidden layer. This tensor must have the same shape as your training data.
Example: if you have 30 images of 50x50 pixels in RGB (3 channels), the shape of your input data is (30,50,50,3). Then your input layer tensor must have this shape (see details in the "shapes in keras" section).
Each type of layer requires the input with a certain number of dimensions:
Dense layers require inputs as (batch_size, input_size)
or (batch_size, optional,...,optional, input_size)
2D convolutional layers need inputs as:
if using channels_last: (batch_size, imageside1, imageside2, channels)
if using channels_first: (batch_size, channels, imageside1, imageside2)
1D convolutions and recurrent layers use (batch_size, sequence_length, features)
Details on how to prepare data for recurrent layers
Now, the input shape is the only one you must define, because your model cannot know it. Only you know that, based on your training data.
All the other shapes are calculated automatically based on the units and particularities of each layer.
Relation between shapes and units - The output shape
Given the input shape, all other shapes are results of layers calculations.
The "units" of each layer will define the output shape (the shape of the tensor that is produced by the layer and that will be the input of the next layer).
Each type of layer works in a particular way. Dense layers have output shape based on "units", convolutional layers have output shape based on "filters". But it's always based on some layer property. (See the documentation for what each layer outputs)
Let's show what happens with "Dense" layers, which is the type shown in your graph.
A dense layer has an output shape of (batch_size,units). So, yes, units, the property of the layer, also defines the output shape.
Hidden layer 1: 4 units, output shape: (batch_size,4).
Hidden layer 2: 4 units, output shape: (batch_size,4).
Last layer: 1 unit, output shape: (batch_size,1).
Weights
Weights will be entirely automatically calculated based on the input and the output shapes. Again, each type of layer works in a certain way. But the weights will be a matrix capable of transforming the input shape into the output shape by some mathematical operation.
In a dense layer, weights multiply all inputs. It's a matrix with one row per input and one column per unit (Keras stores the Dense kernel with shape (input_size, units)), but this is often not important for basic work.
In the image, if each arrow had a multiplication number on it, all numbers together would form the weight matrix.
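To make the relation between input shape, units and weights concrete, here is a rough NumPy sketch of what a dense layer computes for the first hidden layer in the picture (3 inputs, 4 units). It assumes no activation function, random weights, and a kernel laid out as (input_size, units); the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_size, units = 5, 3, 4

x = rng.normal(size=(batch_size, input_size))  # input tensor: (batch_size, 3)
W = rng.normal(size=(input_size, units))       # one weight per arrow: 3 * 4 = 12
b = np.zeros(units)                            # one bias per unit

out = x @ W + b                                # dense layer without activation
print(out.shape)                               # (5, 4), i.e. (batch_size, units)
```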
Shapes in Keras
Earlier, I gave an example of 30 images, 50x50 pixels and 3 channels, having an input shape of (30,50,50,3).
Since the input shape is the only one you need to define, Keras will demand it in the first layer.
But in this definition, Keras ignores the first dimension, which is the batch size. Your model should be able to deal with any batch size, so you define only the other dimensions:
input_shape = (50,50,3)
#regardless of how many images I have, each image has this shape
Optionally, or when it's required by certain kinds of models, you can pass the shape containing the batch size via batch_input_shape=(30,50,50,3) or batch_shape=(30,50,50,3). This limits your training possibilities to this unique batch size, so it should be used only when really required.
Either way you choose, tensors in the model will have the batch dimension.
So, even if you used input_shape=(50,50,3), when keras sends you messages, or when you print the model summary, it will show (None,50,50,3).
The first dimension is the batch size; it's None because it can vary depending on how many examples you give for training. (If you defined the batch size explicitly, then the number you defined will appear instead of None.)
Also, in advanced works, when you actually operate directly on the tensors (inside Lambda layers or in the loss function, for instance), the batch size dimension will be there.
So, when defining the input shape, you ignore the batch size: input_shape=(50,50,3)
When doing operations directly on tensors, the shape will be again (30,50,50,3)
When keras sends you a message, the shape will be (None,50,50,3) or (30,50,50,3), depending on what type of message it sends you.
Dim
And in the end, what is dim?
If your input shape has only one dimension, you don't need to give it as a tuple, you give input_dim as a scalar number.
So, in your model, where your input layer has 3 elements, you can use any of these two:
input_shape=(3,) -- The comma is necessary when you have only one dimension
input_dim = 3
But when dealing directly with the tensors, often dim will refer to how many dimensions a tensor has. For instance a tensor with shape (25,10909) has 2 dimensions.
Defining your image in Keras
Keras has two ways of doing it: Sequential models, or the functional API Model. I don't like using the Sequential model; later you will have to forget it anyway because you will want models with branches.
PS: here I ignored other aspects, such as activation functions.
With the Sequential model:
from keras.models import Sequential
from keras.layers import *
model = Sequential()
#start from the first hidden layer, since the input is not actually a layer
#but inform the shape of the input, with 3 elements.
model.add(Dense(units=4,input_shape=(3,))) #hidden layer 1 with input
#further layers:
model.add(Dense(units=4)) #hidden layer 2
model.add(Dense(units=1)) #output layer
With the functional API Model:
from keras.models import Model
from keras.layers import *
#Start defining the input tensor:
inpTensor = Input((3,))
#create the layers and pass them the input tensor to get the output tensor:
hidden1Out = Dense(units=4)(inpTensor)
hidden2Out = Dense(units=4)(hidden1Out)
finalOut = Dense(units=1)(hidden2Out)
#define the model's start and end points
model = Model(inpTensor,finalOut)
Shapes of the tensors
Remember you ignore batch sizes when defining layers:
inpTensor: (None,3)
hidden1Out: (None,4)
hidden2Out: (None,4)
finalOut: (None,1)
Input Dimension Clarified:
Not a direct answer, but I just realized that the term "Input Dimension" could be confusing, so be wary:
The word "dimension" alone can refer to:
a) The dimension of the input data (or stream), such as the number N of sensor axes used to beam the time series signal, or the RGB color channels (3). Suggested term: "Input Stream Dimension"
b) The total number / length of input features (or input layer), e.g. 28 x 28 = 784 for an MNIST image, or 3000 for FFT-transformed spectrum values. Suggested term: "Input Layer / Input Feature Dimension"
c) The dimensionality (number of dimensions) of the input, typically 3D as expected in a Keras LSTM: (# of rows of samples, # of sensors, # of values..), so 3 is the answer. Suggested term: "N Dimensionality of Input"
d) The specific input shape, e.g. (30,50,50,3) for the input image data above, or (30, 2500, 3) if unwrapped
Keras:    
In Keras, input_dim refers to the Dimension of Input Layer / Number of Input Features
    model = Sequential()
    model.add(Dense(32, input_dim=784))  #or 3 in the current posted example above
    model.add(Activation('relu'))
In Keras LSTM, it refers to the total Time Steps
The term has been very confusing, we live in a very confusing world!!
I find that one of the challenges in Machine Learning is dealing with different languages, dialects and terminologies (as if you had 5-8 highly different versions of English, and needed very high proficiency to converse with the different speakers). Probably this is the same in programming languages too.
Added this answer to elaborate on the input shape at the first layer.
I created two variations of the same layers
Case 1:
model =Sequential()
model.add(Dense(15, input_shape=(5,3),activation="relu", kernel_initializer="he_uniform", kernel_regularizer=None,kernel_constraint="MaxNorm"))
model.add(Dense(32,activation="relu"))
model.add(Dense(8))
Case 2:
model1=Sequential()
model1.add(Dense(15,input_shape=(15,),kernel_initializer="he_uniform",kernel_constraint="MaxNorm",kernel_regularizer=None,activation="relu"))
model1.add(Dense(32,activation="relu"))
model1.add(Dense(8))
plot_model(model1,show_shapes=True)
Now if we plot these and take the summary:
Case 1
[![Case1 Model Summary][2]][2]
[2]: https://i.stack.imgur.com/WXh9z.png
Case 2
summary
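Now if you look closely, in the first case the input is two-dimensional, (5,3). The Dense layer is applied to each of the 5 rows separately, so the first layer's output has shape (None,5,15): one vector of 15 units per row.
Case two is simpler: the input is a flat vector of 15 values, and each unit produces one output after activation, giving an output shape of (None,15).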
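The difference between the two cases can be sketched in NumPy, since a dense layer acts on the last axis of its input. This is only an illustration with random weights, a batch of 2 examples, and no bias or activation; it is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
units = 15

# Case 1: input_shape=(5, 3) -> kernel (3, 15), applied to each of the 5 rows
x_case1 = rng.normal(size=(2, 5, 3))      # (batch, 5, 3)
W_case1 = rng.normal(size=(3, units))
out_case1 = x_case1 @ W_case1             # (batch, 5, 15)

# Case 2: input_shape=(15,) -> kernel (15, 15), one output vector per example
x_case2 = rng.normal(size=(2, 15))        # (batch, 15)
W_case2 = rng.normal(size=(15, units))
out_case2 = x_case2 @ W_case2             # (batch, 15)

print(out_case1.shape, out_case2.shape)   # (2, 5, 15) (2, 15)
```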

Calculating size of output of a Conv layer in CNN model

In convolutional Neural Networks, How to know the output of a specific conv layer? (I am using keras to build a CNN model)
For example if I am using one dimensional conv layer, where number_of_filters=20, kernel_size=10, and input_shape(500,1)
cnn.add(Conv1D(20,kernel_size=10,strides=1, padding="same",activation="sigmoid",input_shape=(Dimension_of_input,1)))
and if I am using a two-dimensional conv layer, where number_of_filters=64, kernel_size=(5,100), input_shape=(5,720,1) (height, width, channel)
Conv2D(64, (5, 100),
       padding="same",
       activation="sigmoid",
       data_format="channels_last",
       input_shape=(5,720,1))
what is the number of output in the above two conv layers? Is there any equation that can be used to know the number of outputs of a conv layer in convolution neural network?
Yes, there are equations for it; you can find them on the CS231N course website. But as this is a programming site: Keras provides an easy way to get this information programmatically, by using the summary function of a Model.
from keras.models import Sequential
model = Sequential()
# ... fill the model with layers here ...
model.summary()
This will print in terminal/console all the layer information, such as input shapes, output shapes, and number of parameters for each layer.
Actually, the model.summary() function might not be what you are looking for if you want to do more than just look at the model.
If you want to access the layers of your Keras model you can do this by using model.layers, which returns all of the layers (the assignment below stores them as a list). If you then want to look at a specific layer you can simply index the list:
list_of_layers = model.layers
list_of_layers[5] # gives you the 6th layer
What you are working with are still just objects, so you probably want to get specific values. You just have to specify the attribute you want to look at:
list_of_layers[-1].output_shape # returns output_shape of last layer
Gives you back the output_shape tuple of the last layer in the model.
You can even skip the whole list assignment if you already know that you only want to look at the output_shape of a certain layer, and just do:
model.layers[-1].output_shape # equivalent to the above method without storing in a list
This might be useful if you want to use these values while building the model to guide the execution in a certain way (adding a pooling layer or doing the padding etc.).
When I first worked with CNNs in TensorFlow, I found the dimensions very difficult to deal with. Below is the general scheme for calculating them.
Consider an image of dimension (n x n) and a filter of dimension (f x f), with no padding and a stride of 1:
after convolution the dimensions are: (n-f+1, n-f+1)
With an image of dimension (n x n), a filter of dimension (f x f) and padding p:
the output dims are: (n+2p-f+1, n+2p-f+1)
If we use padding = 'SAME', the output dims equal the input dims (for a stride of 1), so the equation becomes: n+2p-f+1 = n
and from here p = (f-1)/2
If we use 'VALID' padding, there is no padding and p = 0
In computer vision f is usually odd; if f is even, (f-1)/2 is not an integer, which means we have asymmetric padding.
With a stride s:
the output dims are: (floor((n+2p-f)/s) + 1, floor((n+2p-f)/s) + 1)
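The formulas above can be sketched as small Python helpers. The function names are made up for illustration; the 'same' branch uses ceil(n / s), which is how TensorFlow/Keras size the output for that padding mode and which reduces to n for a stride of 1.

```python
import math

def conv_output_size(n, f, s=1, padding="valid"):
    """Output length along one spatial dimension, using Keras-style padding names."""
    if padding == "valid":
        # no padding: floor((n - f) / s) + 1
        return (n - f) // s + 1
    if padding == "same":
        # padding is chosen so that only the stride shrinks the output
        return math.ceil(n / s)
    raise ValueError("padding must be 'valid' or 'same'")

def conv_output_size_explicit(n, f, p=0, s=1):
    """General formula with an explicit symmetric padding p on each side."""
    return (n + 2 * p - f) // s + 1

# The two layers from the question (both use padding="same", stride 1)
conv1d_len = conv_output_size(500, 10, s=1, padding="same")
conv2d_hw = (conv_output_size(5, 5, padding="same"),
             conv_output_size(720, 100, padding="same"))
print(conv1d_len, conv2d_hw)  # 500 (5, 720)
```

The number of filters becomes the last dimension, so under these assumptions the Conv1D layer would output (None, 500, 20) and the Conv2D layer (None, 5, 720, 64); model.summary() remains the way to confirm this for a real model.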

Cutoff on Neural Network regression predictions

Context: I have a set of documents, each of them with two associated probability values: the probability of belonging to class A and the probability of belonging to class B. The classes are mutually exclusive, and the probabilities add up to one. So, for instance, document D has probabilities (0.6, 0.4) associated as ground truth.
Each document is represented by the tfidf of the terms that it contains, normalized from 0 to 1. I also tried doc2vec (normalized from -1 to 1) and a couple of other methods.
I built a very simple Neural Network to predict this probability distribution.
Input layer with as many nodes as features
Single hidden layer with one node
Output layer with softmax and two nodes
Cross entropy loss function
I also tried with different update functions and learning rates
This is the code I wrote using nolearn:
net = nolearn.lasagne.NeuralNet(
layers=[('input', layers.InputLayer),
('hidden1', layers.DenseLayer),
('output', layers.DenseLayer)],
input_shape=(None, X_train.shape[1]),
hidden1_num_units=1,
output_num_units=2,
output_nonlinearity=lasagne.nonlinearities.softmax,
objective_loss_function=lasagne.objectives.binary_crossentropy,
max_epochs=50,
on_epoch_finished=[es.EarlyStopping(patience=5, gamma=0.0001)],
regression=True,
update=lasagne.updates.adam,
update_learning_rate=0.001,
verbose=2)
net.fit(X_train, y_train)
y_true, y_pred = y_test, net.predict(X_test)
My problem is: my predictions have a cutoff point and no prediction goes below that point (check the picture to understand what I mean).
This plot shows the difference between the true probability and my predictions. The closer a point is to the red line the better the prediction is. Ideally all the points would lie on the line. How can I solve this and why is this happening?
Edit: actually I solved the problem by simply removing the hidden layer:
net = nolearn.lasagne.NeuralNet(
layers=[('input', layers.InputLayer),
('output', layers.DenseLayer)],
input_shape=(None, X_train.shape[1]),
output_num_units=2,
output_nonlinearity=lasagne.nonlinearities.softmax,
objective_loss_function=lasagne.objectives.binary_crossentropy,
max_epochs=50,
on_epoch_finished=[es.EarlyStopping(patience=5, gamma=0.0001)],
regression=True,
update=lasagne.updates.adam,
update_learning_rate=0.001,
verbose=2)
net.fit(X_train, y_train)
y_true, y_pred = y_test, net.predict(X_test)
But I still fail to understand why I had this problem and why removing the hidden layer solved it. Any ideas?
Here is the new plot:
I think your training set output values should be [0,1] or [1,0];
[0.6,0.4] is not suited for softmax/cross-entropy.

Classification using GMM with MATLAB

I'm trying to classify a testset using GMM. I have a trainset (n*4 matrix) with labels {1,2,3}, n means the number of training examples, which have 4 properties. And I also have a testset (m*4) to be classified.
My goal is to have a probability matrix (m*3) for each testing example giving each label P(x_test|labels). Just like soft clustering.
First, I create a GMM with k=9 components over the whole trainset. I know that in some papers the authors create one GMM for each label in the trainset, but I want to deal with the data from all of the classes together.
GMModel = fitgmdist(trainset,k_component,'RegularizationValue',0.1,'Start','plus');
My problem is that I want to establish the relationship P(component|labels) between components and labels. So I wrote the code below, but I am not sure if it's right:
idx_ex_of_c1 = find(trainset_label==1);
idx_ex_of_c2 = find(trainset_label==2);
idx_ex_of_c3 = find(trainset_label==3);
[~,~,post] = cluster(GMModel,trainset);
cita_c_k = zeros(3,k_component);
for id_k = 1:k_component
cita_c_k(1,id_k) = sum(post(idx_ex_of_c1,id_k))/numel(idx_ex_of_c1);
cita_c_k(2,id_k) = sum(post(idx_ex_of_c2,id_k))/numel(idx_ex_of_c2);
cita_c_k(3,id_k) = sum(post(idx_ex_of_c3,id_k))/numel(idx_ex_of_c3);
end
cita_c_k is a (3*9) matrix to store the relationships. idx_ex_of_c1 is the index of examples, whose label is '1' in the trainset.
For the testing process. I first apply the GMModel to testset
[P,~] = posterior(GMModel,testset); % P is a m*9 matrix
And then, sum all components,
P_testset = P*cita_c_k';
[a,b] = max(P_testset,[],2); % row-wise maximum: b is the most likely label for each test example
imagesc(b);
The result is OK, but not good enough. Can anyone give me some tips?
Thanks!
You can take following steps:
Increase target error and/or use optimal network size in training, but over-training and network size increase usually won't help
Most important, shuffle training data while training and use only important data points for a label to train (ignore data points that may belong to more than one labels)
SEPARABILITY
Verify separability of data using properties using correlation.
Correlation of all data in a label (X) should be high (near to one)
Cross-correlation of all data in label (X) with data in label (!=X) should be low (near zero).
If you observe that data points within a label have low correlation and data points across labels have high correlation, that calls the selection of properties into question (there could be properties which don't actually make the data separable). If so, do the following:
Add more relevant properties to data points and remove less relevant properties (technique to use this is PCA)
Use derived parameters like top frequency component etc. from data points to train rather than direct points
Use a time delay network to train time series (always)

how to calculate roc curves?

I wrote a classifier (Gaussian Mixture Model) to classify five human actions. For every observation the classifier computes the posterior probability of belonging to a cluster.
I want to evaluate the performance of my system parameterized with a threshold, with values from 0 to 100. For every threshold value and every observation, if the probability of belonging to one of the clusters is greater than the threshold I accept the result of the classifier; otherwise I discard it.
For every threshold value I compute the number of true positives, true negatives, false positives and false negatives.
Then I compute the two functions, sensitivity and specificity, as
sensitivity = TP/(TP+FN);
specificity=TN/(TN+FP);
In matlab:
plot(1-specificity,sensitivity);
to have the ROC curve. But the result isn't what I expect.
This is the plot of the functions of discards, errors, corrects, sensitivity and specificity varying the threshold of one action.
This is the plot of ROC curve of one action
This is the stem of ROC curve for the same action
I am doing something wrong, but I don't know where. Perhaps I am calculating FP, FN, TP and TN incorrectly, especially when the result of the classifier is below the threshold, so that I have a discard. What should I increment when there is a discard?
Background
I am answering this because I need to work through the content, and a question like this is a great excuse. Thank you for the good opportunity.
I use data from the built-in fisher iris data:
http://archive.ics.uci.edu/ml/datasets/Iris
I also use code snippets from the Mathworks tutorial on the classification, and for plotroc
http://www.mathworks.com/products/demos/statistics/classdemo.html
http://www.mathworks.com/help/nnet/ref/plotroc.html?searchHighlight=plotroc
Problem Description
There is a clearer boundary within the domain to classify "setosa", but there is overlap for "versicolor" vs. "virginica". This is a two-dimensional plot, and some of the other information has been discarded to produce it. The ambiguity in the classification boundaries is a useful thing in this case.
%load data
load fisheriris
%show raw data
figure(1); clf
gscatter(meas(:,1), meas(:,2), species,'rgb','osd');
xlabel('Sepal length');
ylabel('Sepal width');
axis equal
axis tight
title('Raw Data')
Analysis
Let's say that we want to determine the bounds for a linear classifier that defines "virginica" versus "non-virginica". We could look at "self vs. not-self" for other classes, but they would have their own
So now we make some linear discriminants and plot the ROC for them:
%load data
load fisheriris
load iris_dataset
irisInputs=meas(:,1:2)';
irisTargets=irisTargets(3,:);
ldaClass1 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'linear')';
ldaClass2 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'diaglinear')';
ldaClass3 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'quadratic')';
ldaClass4 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'diagquadratic')';
ldaClass5 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'mahalanobis')';
myinput=repmat(irisTargets,5,1);
myoutput=[ldaClass1;ldaClass2;ldaClass3;ldaClass4;ldaClass5];
whos
plotroc(myinput,myoutput)
The result is shown in the following, though it took deleting repeat copies of the diagonal:
You can note in the code that I stack "myinput" and "myoutput" and feed them as inputs into the "plotroc" function. You should take the results of your classifier as targets and actuals and you can get similar results. This compares the actual output of your classifier versus the ideal output of your target values. Those are the input to plotroc.
So this will give you "built-in" ROC, which is useful for quick work, but does not make you learn every step in detail.
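For readers who do want to see every step, here is a rough Python/NumPy sketch (not MATLAB, and not what plotroc does internally) of building ROC points by hand for one class against the rest. The labels, scores and thresholds are made-up values. Note the assumption that a score below the threshold counts as a negative prediction rather than being discarded, so every observation lands in exactly one of TP, FN, FP or TN.

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t                 # below the threshold = predicted negative
        tp = np.sum(pred & y_true)
        fn = np.sum(~pred & y_true)
        fp = np.sum(pred & ~y_true)
        tn = np.sum(~pred & ~y_true)
        tpr.append(tp / (tp + fn))         # sensitivity
        fpr.append(fp / (fp + tn))         # 1 - specificity
    return np.array(fpr), np.array(tpr)

# Made-up example: 1 = belongs to the class of interest, 0 = any other class
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
thresholds = [1.1, 0.85, 0.75, 0.5, 0.35, 0.2, 0.0]

fpr, tpr = roc_points(y_true, scores, thresholds)
```

Plotting `tpr` against `fpr` gives a curve that runs from (0, 0) at the highest threshold to (1, 1) at the lowest.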
Questions you can ask at this point include:
which classifier is best? How do I determine what best is in this case?
What is the convex hull of the classifiers? Is there some mixture of classifiers that is more informative than any pure method? Bagging perhaps?
You are trying to draw the curves of precision vs recall, depending on the classifier threshold parameter. The definition of precision and recall are:
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
You can check the definition of these parameters in:
http://en.wikipedia.org/wiki/Precision_and_recall
There are some curves here:
http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
Are you dividing your dataset in training set, cross validation set and test set? (if you do not divide the data, it is normal that your precision-recall curve seems weird)
EDITED: I think that there are two possible sources for your problem:
When you train a classifier for 5 classes, usually you have to train 5 distinct classifiers. One classifier for (class A = class 1, class B = class 2, 3, 4 or 5), then a second classifier for (class A = class 2, class B = class 1, 3, 4 or 5), ... and the fifth for (class A = class 5, class B = class 1, 2, 3 or 4).
As you said to select the output for your "compound" classifier, you have to pass your new (test) datapoint through the five classifiers, and you choose the one with the biggest probability.
Then, you should have 5 thresholds to define weighting values that may prioritize selecting one classifier over the others. You should check how the MATLAB implementation uses the thresholds, but their effect is that you don't choose the class with the highest probability, but the class with the best weighted probability.
As you say, maybe you are not calculating well TP, TN, FP, FN. Your test data should have datapoints belonging to all the classes. Then you have testdata(i,:) and classtestdata(i) are the feature vector and "ground truth" class of datapoint i. When you evaluate the classifier you obtain classifierOutput(i) = 1 or 2 or 3 or 4 or 5. Then you should calculate the "confusion matrix", which is the way to calculate TP, TN, FP, FN when you have multiple classes (> 2):
http://en.wikipedia.org/wiki/Confusion_matrix
http://www.mathworks.com/help/stats/confusionmat.html
(note the relation between TP, TN, FP, FN that you are calculating for the multiclass problem)
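As a sketch of that relation, the following Python/NumPy code builds a confusion matrix and derives one-vs-rest TP, FP, FN and TN for each class from it. The labels are made up and zero-based (0..2 rather than 1..5), and both function names are hypothetical helpers written for this example.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def one_vs_rest_counts(cm):
    """Per-class TP, FP, FN, TN, treating each class as 'positive' in turn."""
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp      # true class i, predicted as something else
    fp = cm.sum(axis=0) - tp      # predicted class i, actually something else
    tn = cm.sum() - tp - fn - fp  # everything not involving class i
    return tp, fp, fn, tn

# Made-up ground truth and classifier output for 3 classes
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
tp, fp, fn, tn = one_vs_rest_counts(cm)

precision = tp / (tp + fp)        # per class
recall = tp / (tp + fn)           # per class
```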
I think that you can obtain the TP, TN, FP, FN data of each subclassifier (remember that you are calculating 5 separate classifiers, even if you do not realize it) from the confusion matrix. I am not sure but you can draw the precision recall curve for each subclassifier.
Also check these slides: http://www.slideserve.com/MikeCarlo/multi-class-and-structured-classification
I don't know what the ROC curve is, I will check it because machine learning is a really interesting subject for me.
Hope this helps,