How to determine accuracy with triplet loss in a convolutional neural network - neural-network

A triplet network (inspired by the "Siamese network") consists of 3 instances of the same feed-forward network (with shared parameters). When fed 3 samples, the network outputs 2 intermediate values: the L2 (Euclidean) distances between the embedded representations of two of its inputs and the representation of the third.
I'm feeding the network triplets of images: x = anchor image, x+ = positive image (an image of the same class as x), and x- = negative image (an image of a different class than x).
I'm using the triplet loss cost function described here.
How do I determine the network's accuracy?

I am assuming that you are working on image retrieval or a similar task.
You should first generate some triplets, either randomly or using a hard (or semi-hard) negative mining method. Then split your triplets into a training set and a validation set.
If you do it this way, you can define your validation accuracy as the proportion of validation triplets in which the feature distance between anchor and positive is smaller than that between anchor and negative. You can see an example here which is written in PyTorch.
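The triplet accuracy described above can be sketched as follows. This is a minimal illustration, not the linked example: the function name triplet_accuracy and the toy tensors are assumptions, and the inputs are expected to be embeddings already produced by the network.

```python
import torch
import torch.nn.functional as F

def triplet_accuracy(anchor, positive, negative):
    """Fraction of triplets with d(anchor, positive) < d(anchor, negative)."""
    d_pos = F.pairwise_distance(anchor, positive)  # L2 distance per row
    d_neg = F.pairwise_distance(anchor, negative)
    return (d_pos < d_neg).float().mean().item()

# Toy 2-D embeddings for three validation triplets (hypothetical values)
anchor = torch.tensor([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
positive = torch.tensor([[0.1, 0.0], [1.0, 1.2], [5.0, 5.0]])
negative = torch.tensor([[3.0, 3.0], [4.0, 4.0], [2.5, 2.0]])

acc = triplet_accuracy(anchor, positive, negative)
print(acc)  # two of the three triplets are ordered correctly
```

In practice you would run the three images of each validation triplet through the embedding network under torch.no_grad() and accumulate this fraction over all validation batches.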
Alternatively, you can measure performance directly in terms of your final testing metric. For image retrieval, for example, we typically measure the performance of a model on the test set using mean average precision. If you use this metric, you should first define some queries on your validation set and their corresponding ground-truth images.
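As a rough sketch of the retrieval metric, mean average precision can be computed from ranked relevance lists like this. The function names and the toy queries are assumptions, and this variant assumes every relevant image appears somewhere in the ranked list of its query.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query; entry i is 1 if the i-th retrieved item is relevant."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)                               # relevant items seen so far
    precision_at_k = hits / np.arange(1, len(rel) + 1)  # precision at each rank
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(all_ranked_relevance):
    """Mean of the per-query average precisions."""
    return float(np.mean([average_precision(r) for r in all_ranked_relevance]))

# Two hypothetical queries; each list is ordered by increasing embedding distance
queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
map_score = mean_average_precision(queries)
print(map_score)
```

Each relevance list would come from sorting the gallery images by their embedding distance to the query and marking which ones share the query's ground truth.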
Either of the above two metrics is fine. Choose whichever you think fits your case.

So I am performing a similar task of using Triplet loss for classification. Here is how I used the novel loss method with a classifier.
First, train your model using the standard triplet loss function for N epochs. Once you are sure that the model (we shall refer to this as the embedding generator) is trained, save its weights, as we shall reuse them below.
Let's say that your embedding generator is defined as:
import torch
import torch.nn as nn

class EmbeddingNetwork(nn.Module):
    def __init__(self):
        super(EmbeddingNetwork, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 64, (7,7), stride=(2,2), padding=(3,3)),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.001),
            nn.MaxPool2d((3, 3), 2, padding=(1,1))
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(64,64,(1,1), stride=(1,1)),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.001),
            nn.Conv2d(64,192, (3,3), stride=(1,1), padding=(1,1)),
            nn.BatchNorm2d(192),
            nn.LeakyReLU(0.001),
            nn.MaxPool2d((3,3),2, padding=(1,1))
        )
        self.fullyConnected = nn.Sequential(
            # in_features must equal the flattened size of conv2's output
            # (192 channels * height * width); adjust 7*7*256 to your input size.
            nn.Linear(7*7*256,32*128),
            nn.BatchNorm1d(32*128),
            nn.LeakyReLU(0.001),
            nn.Linear(32*128,128)
        )
    def forward(self,x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.flatten(start_dim=1)  # flatten the feature maps before the linear layers
        x = self.fullyConnected(x)
        # L2-normalise so that every embedding lies on the unit hypersphere
        return torch.nn.functional.normalize(x, p=2, dim=-1)
Now we shall use this embedding generator inside a classifier: load the weights we saved before into this part of the network and then freeze it, so that training the classifier does not interfere with the triplet model. The classifier can be defined as:
import torch.nn.functional as F

class classifierNet(nn.Module):
    def __init__(self, EmbeddingNet):
        super(classifierNet, self).__init__()
        self.embeddingLayer = EmbeddingNet
        self.classifierLayer = nn.Linear(128,62)
        self.dropout = nn.Dropout(0.5)
    def forward(self, x):
        x = self.dropout(self.embeddingLayer(x))
        x = self.classifierLayer(x)
        # log-probabilities over the 62 classes
        return F.log_softmax(x, dim=1)
Now we shall load the weights we saved before and freeze them using:
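embeddingNetwork = EmbeddingNetwork().to(device)
embeddingNetwork.load_state_dict(torch.load('embeddingNetwork.pt'))
# Freeze the embedding weights so that only the classifier layer is trained
for param in embeddingNetwork.parameters():
    param.requires_grad = False
classifierNetwork = classifierNet(embeddingNetwork).to(device)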
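Now train this classifier network using a standard classification loss. Since the forward pass already returns log-probabilities (log_softmax), nn.NLLLoss is the matching choice; it is equivalent to applying CrossEntropyLoss to raw logits.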
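A single training step and an accuracy measurement for this setup might look like the sketch below. It is only an illustration under stated assumptions: the small nn.Sequential embedding is a hypothetical stand-in for the trained embedding generator, and the random images and labels stand in for a real batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the trained, saved embedding generator
embedding = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128))
for p in embedding.parameters():
    p.requires_grad = False          # freeze the embedding weights
embedding.eval()                     # keep the frozen part in eval mode

classifier = nn.Sequential(nn.Dropout(0.5), nn.Linear(128, 62), nn.LogSoftmax(dim=1))
model = nn.Sequential(embedding, classifier)

# Only hand the trainable parameters to the optimizer
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
criterion = nn.NLLLoss()             # expects log-probabilities

images = torch.randn(16, 1, 28, 28)  # random stand-in batch
labels = torch.randint(0, 62, (16,))
frozen_before = embedding[1].weight.clone()

classifier.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()

# Classification accuracy: fraction of samples whose arg-max class is correct
classifier.eval()
with torch.no_grad():
    accuracy = (model(images).argmax(dim=1) == labels).float().mean().item()
print(loss.item(), accuracy)
```

With the classifier on top, accuracy is the usual fraction of correctly predicted labels, measured on held-out data rather than on the training batch used here.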

Related

How to get the gradient of the specific output of the neural network to the network parameters

I am building a Bayesian neural network, and I need to manually calculate the gradient of each neural network output and update the network parameters.
For example, in the following network, how can I get the gradients of the network outputs ag and bg with respect to the network parameters phi, i.e. ∂ag/∂phi and ∂bg/∂phi, and update the parameters with each of them separately?
class encoder(torch.nn.Module):
    def __init__(self, _l_dim, _hidden_dim, _fg_dim):
        super(encoder, self).__init__()
        self.hidden_nn = nn.Linear(_l_dim, _hidden_dim)
        self.ag_nn = nn.Linear(_hidden_dim, _fg_dim)
        self.bg_nn = nn.Linear(_hidden_dim, _fg_dim)
    def forward(self, _lg):
        ag = self.ag_nn(self.hidden_nn(_lg))
        bg = self.bg_nn(self.hidden_nn(_lg))
        return ag, bg
If you want to compute dx/dW, you can use autograd for that: torch.autograd.grad(x, W, grad_outputs=torch.ones_like(x), retain_graph=True). Does that actually accomplish what you're trying to do?
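Applied to the encoder from the question, that suggestion could look roughly like this. It is a sketch: the dimensions and the random input are assumptions, and passing grad_outputs of ones means each result is the gradient of the sum of the output's entries, not a full Jacobian.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, l_dim, hidden_dim, fg_dim):
        super().__init__()
        self.hidden_nn = nn.Linear(l_dim, hidden_dim)
        self.ag_nn = nn.Linear(hidden_dim, fg_dim)
        self.bg_nn = nn.Linear(hidden_dim, fg_dim)

    def forward(self, lg):
        h = self.hidden_nn(lg)
        return self.ag_nn(h), self.bg_nn(h)

torch.manual_seed(0)
model = Encoder(4, 8, 3)          # hypothetical sizes
x = torch.randn(5, 4)
ag, bg = model(x)

# Order: hidden weight, hidden bias, ag weight, ag bias, bg weight, bg bias
params = list(model.parameters())

# allow_unused=True returns None for parameters an output does not depend on
grads_ag = torch.autograd.grad(ag, params, grad_outputs=torch.ones_like(ag),
                               retain_graph=True, allow_unused=True)
grads_bg = torch.autograd.grad(bg, params, grad_outputs=torch.ones_like(bg),
                               allow_unused=True)

for name, g_a, g_b in zip([n for n, _ in model.named_parameters()], grads_ag, grads_bg):
    print(name, None if g_a is None else tuple(g_a.shape),
          None if g_b is None else tuple(g_b.shape))
```

Unlike backward(), torch.autograd.grad returns the gradients instead of accumulating them into .grad, so the two results do not mix and nothing has to be zeroed in between.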
Problem statement
You are looking to compute the gradients of the parameters corresponding to each loss term. Given a model f, parametrized by θ_ag and θ_bg. These two parameter sets might overlap: that's the case here since you have a shared hidden layer. Then f(x; θ_ag, θ_bg) will output a pair of elements ag and bg. Your loss function is defined as L = L_ag + L_bg.
The terms you want to compute are dL_ag/dθ_ag and dL_bg/dθ_bg, which is different from what you would typically get with a single backward call on L, which gives dL/dθ_ag and dL/dθ_bg.
Implementation
In order to compute those terms, you will need two backward passes; after each of them we will collect the respective term. Before starting, here are a couple of things you need to do:
It will be useful to make θ_ag and θ_bg available to us. You can, for example, add those two functions in your model definition:
def ag_params(self):
    return [*self.hidden_nn.parameters(), *self.ag_nn.parameters()]
def bg_params(self):
    return [*self.hidden_nn.parameters(), *self.bg_nn.parameters()]
Assume you have a loss function loss_fn which outputs two scalar values L_ag and L_bg. Here is a mockup for loss_fn:
def loss_fn(ag, bg):
    return ag.mean(), bg.mean()
We will need an optimizer to zero the gradient out, here SGD:
optim = torch.optim.SGD(model.parameters(), lr=1e-3)
Then we can start applying the following method:
Do an inference to compute ag, and bg as well as L_ag, and L_bg:
>>> ag, bg = model(x)
>>> L_ag, L_bg = loss_fn(ag, bg)
Backpropagate once on L_ag, while retaining the graph:
>>> L_ag.backward(retain_graph=True)
At this point, we can collect dL_ag/dθ_ag on the parameters contained in θ_ag. For example, you could pick the norm of the different parameter gradients using the ag_params function:
>>> pgrad_ag = torch.stack([p.grad.norm()
...     for p in model.ag_params() if p.grad is not None])
Next we can proceed with a second backpropagation, this time on L_bg. But before that, we need to clear the gradients so dL_ag/dθ_ag doesn't pollute the next computation:
>>> optim.zero_grad()
Backpropagation on L_bg:
>>> L_bg.backward(retain_graph=True)
Here again, we collect the gradient norms, i.e. the norms of dL_bg/dθ_bg, this time using the bg_params function:
>>> pgrad_bg = torch.stack([p.grad.norm()
...     for p in model.bg_params() if p.grad is not None])
Now you have pgrad_ag and pgrad_bg, which correspond to the gradient norms of dL_ag/dθ_ag and dL_bg/dθ_bg respectively.
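To see why the two separate passes differ from a single backward call on L, the following self-contained sketch compares the gradient on the shared hidden layer in both cases. The layer sizes, the random input, and the mean-based losses are assumptions that mirror the mockup above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_nn = nn.Linear(4, 8)   # shared layer
ag_nn = nn.Linear(8, 3)
bg_nn = nn.Linear(8, 3)
params = [*hidden_nn.parameters(), *ag_nn.parameters(), *bg_nn.parameters()]
optim = torch.optim.SGD(params, lr=1e-3)

x = torch.randn(5, 4)
h = hidden_nn(x)
L_ag, L_bg = ag_nn(h).mean(), bg_nn(h).mean()

# First pass: only L_ag contributes to the gradients
optim.zero_grad()
L_ag.backward(retain_graph=True)
shared_from_ag = hidden_nn.weight.grad.clone()
bg_untouched = bg_nn.weight.grad is None   # bg head received no gradient

# Second pass: clear, then only L_bg contributes
optim.zero_grad()
L_bg.backward(retain_graph=True)
shared_from_bg = hidden_nn.weight.grad.clone()

# Single backward on the total loss, for comparison
optim.zero_grad()
(L_ag + L_bg).backward()
shared_total = hidden_nn.weight.grad.clone()

print(torch.allclose(shared_from_ag + shared_from_bg, shared_total, atol=1e-6))
```

On the shared layer, the single call yields the sum of the two per-loss gradients, which is why they have to be collected separately if you want to treat them individually.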

Multiple matrix multiplication loses weight updates

When in forward method I only do one set of torch.add(torch.bmm(x, exp_w), self.b) then my model is back propagating correctly. When I add another layer - torch.add(torch.bmm(out, exp_w2), self.b2) - then the gradients are not updated and the model isn't learning. If I change the activation function from nn.Sigmoid to nn.ReLU then it works with two layers.
Been thinking about this a day now, and not figuring out why it's not working with nn.Sigmoid.
I've tried different learning rates, Loss functions and optimization functions, but no combination seems to work. When I add the weights together before and after training they are the same.
Code:
class MyModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        torch.manual_seed(1)
        super(MyModel, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        hidden_1_dimentsions = 20
        self.w = torch.nn.Parameter(torch.empty(input_dim, hidden_1_dimentsions).uniform_(0, 1))
        self.b = torch.nn.Parameter(torch.empty(hidden_1_dimentsions).uniform_(0, 1))
        self.w2 = torch.nn.Parameter(torch.empty(hidden_1_dimentsions, output_dim).uniform_(0, 1))
        self.b2 = torch.nn.Parameter(torch.empty(output_dim).uniform_(0, 1))
    def activation(self):
        return torch.nn.Sigmoid()
    def forward(self, x):
        x = x.view((x.shape[0], 1, self.input_dim))
        exp_w = self.w.expand(x.shape[0], self.w.size(0), self.w.size(1))
        out = torch.add(torch.bmm(x, exp_w), self.b)
        exp_w2 = self.w2.expand(out.shape[0], self.w2.size(0), self.w2.size(1))
        out = torch.add(torch.bmm(out, exp_w2), self.b2)
        out = self.activation()(out)
        return out.view(x.shape[0])
Besides loss functions, activation functions and learning rates, your parameter initialisation is also important. I suggest you take a look at Xavier initialisation: https://pytorch.org/docs/stable/nn.html#torch.nn.init.xavier_uniform_
Furthermore, for a wide range of problems and network architectures, Batch Normalization, which normalises your activations to zero mean and unit standard deviation, helps: https://pytorch.org/docs/stable/nn.html#torch.nn.BatchNorm1d
If you are interested to know more about the reason for this, it's mostly due to the vanishing gradient problem, which means that your gradients get so small that your weights don't get updated. It's so common that it has its own page on Wikipedia: https://en.wikipedia.org/wiki/Vanishing_gradient_problem
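The effect of the initialisation on the gradients can be illustrated with a small sketch. The layer sizes, the random inputs in [0, 1) and the helper name first_layer_grad_norm are assumptions, and biases are left out for brevity; the two stacked linear maps followed by a sigmoid mirror the structure of the model in the question.

```python
import torch

torch.manual_seed(0)
x = torch.rand(32, 100)  # 32 samples, 100 features in [0, 1)

def first_layer_grad_norm(init):
    """Norm of the gradient reaching the first weight matrix for a given init."""
    w1 = torch.empty(100, 20)
    w2 = torch.empty(20, 1)
    if init == "uniform01":
        w1.uniform_(0, 1)   # all-positive weights, as in the question
        w2.uniform_(0, 1)
    else:
        torch.nn.init.xavier_uniform_(w1)
        torch.nn.init.xavier_uniform_(w2)
    w1.requires_grad_(True)
    w2.requires_grad_(True)
    out = torch.sigmoid((x @ w1) @ w2)
    out.sum().backward()
    return w1.grad.norm().item()

g_uniform = first_layer_grad_norm("uniform01")
g_xavier = first_layer_grad_norm("xavier")
print(g_uniform, g_xavier)
```

With all-positive weights the pre-activation grows with the layer width, the sigmoid saturates at 1, and its derivative (and therefore the weight gradient) collapses towards zero, whereas Xavier initialisation keeps the pre-activation in a range where the sigmoid still has a usable slope.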

Fitting a sine wave with Keras and PYMC3 yields unexpected results

I've been trying to fit a sine curve with a keras (theano backend) model using pymc3. I've been using this [http://twiecki.github.io/blog/2016/07/05/bayesian-deep-learning/] as a reference point.
A Keras implementation alone fit using optimization does a good job, however Hamiltonian Monte Carlo and Variational sampling from pymc3 is not fitting the data. The trace is stuck at where the prior is initiated. When I move the prior the posterior moves to the same spot. The posterior predictive of the bayesian model in cell 59 is barely getting the sine wave, whereas the non-bayesian fit model gets it near perfect in cell 63. I created a notebook here: https://gist.github.com/tomc4yt/d2fb694247984b1f8e89cfd80aff8706 which shows the code and the results.
Here is a snippet of the model below...
class GaussWeights(object):
    def __init__(self):
        self.count = 0
    def __call__(self, shape, name='w'):
        return pm.Normal(
            name, mu=0, sd=.1,
            testval=np.random.normal(size=shape).astype(np.float32),
            shape=shape)

def build_ann(x, y, init):
    with pm.Model() as m:
        i = Input(tensor=x, shape=x.get_value().shape[1:])
        m = i
        m = Dense(4, init=init, activation='tanh')(m)
        m = Dense(1, init=init, activation='tanh')(m)
        sigma = pm.Normal('sigma', 0, 1, transform=None)
        out = pm.Normal('out',
                        m, 1,
                        observed=y, transform=None)
    return out

with pm.Model() as neural_network:
    likelihood = build_ann(input_var, target_var, GaussWeights())
    # v_params = pm.variational.advi(
    #     n=300, learning_rate=.4
    # )
    # trace = pm.variational.sample_vp(v_params, draws=2000)
    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.HamiltonianMC(scaling=start)
    trace = pm.sample(1000, step, progressbar=True)
The model contains normal noise with a fixed std of 1:
out = pm.Normal('out', m, 1, observed=y)
but the dataset does not. It is only natural that the posterior predictive does not match the dataset: the two were generated in very different ways. To make it more realistic you could add noise to your dataset, and then estimate sigma:
mu = pm.Deterministic('mu', m)
sigma = pm.HalfCauchy('sigma', beta=1)
pm.Normal('y', mu=mu, sd=sigma, observed=y)
What you are doing right now is similar to taking the output from the network and adding standard normal noise.
A couple of unrelated comments:
out is not the likelihood, it is just the dataset again.
If you use HamiltonianMC instead of NUTS, you need to set the step size and the integration time yourself. The defaults are not usually useful.
Seems like keras changed in 2.0 and this way of combining pymc3 and keras does not seem to work anymore.

In Tensorflow, What kind of neural network should I use?

I am working through the TensorFlow tutorial to learn what TF is, but I am confused about which neural network I should use for my work.
I am looking at single-layer neural networks, CNNs, RNNs, and LSTM RNNs.
There is a sensor which measures something and reports the result as one of two boolean values. Here, they are Blue and Red, like this:
The sensor gives a result value every 5 minutes. If we pile up the values for each color, we can see some patterns:
The number inside each circle represents the position of the result value in the sequence given by the sensor (107 was given right after 106). Looking at 122 to 138, you can see a decalcomania-like (mirror-image) pattern.
I want to predict the next boolean value before the sensor reports it. I could do supervised learning using past results, but I'm not sure which neural network or method is suitable. Since this task needs to find patterns in past results (it has to see context) and memorize them, maybe an LSTM RNN (long short-term memory recurrent neural network) would be suitable. Could you tell me which one is right?
So it sounds like you need to process a sequence of images. You could actually use a CNN and an RNN together. I did this a month ago when I was training a network to swipe left or right on Tinder using the sequence of profile pictures. What you would do is pass all of the images through a CNN and then into the RNN. Below is part of the code for my Tinder bot. See how I distribute the convolutions over the sequence and then push it through the RNN. Finally I put a softmax classifier on the last time step to make the prediction; in your case, however, I think you will distribute the prediction in time, since you want the next item in the sequence.
self.input_tensor = tf.placeholder(tf.float32, (None, self.max_seq_len, self.img_height, self.img_width, 3), 'input_tensor')
self.expected_classes = tf.placeholder(tf.int64, (None,))
self.is_training = tf.placeholder_with_default(False, None, 'is_training')
self.learning_rate = tf.placeholder(tf.float32, None, 'learning_rate')
self.tensors = {}
activation = tf.nn.elu
rnn = tf.nn.rnn_cell.LSTMCell(256)
with tf.variable_scope('series') as scope:
    state = rnn.zero_state(tf.shape(self.input_tensor)[0], tf.float32)
    for t, img in enumerate(reversed(tf.unpack(self.input_tensor, axis = 1))):
        y = tf.map_fn(tf.image.per_image_whitening, img)
        features = 48
        for c_layer in range(3):
            with tf.variable_scope('pool_layer_%d' % c_layer):
                with tf.variable_scope('conv_1'):
                    filter = tf.get_variable('filter', (3, 3, y.get_shape()[-1].value, features))
                    b = tf.get_variable('b', (features,))
                    y = tf.nn.conv2d(y, filter, (1, 1, 1, 1), 'SAME') + b
                    y = activation(y)
                    self.tensors['img_%d_conv_%d' % (t, 2 * c_layer)] = y
                with tf.variable_scope('conv_2'):
                    filter = tf.get_variable('filter', (3, 3, y.get_shape()[-1].value, features))
                    b = tf.get_variable('b', (features,))
                    y = tf.nn.conv2d(y, filter, (1, 1, 1, 1), 'SAME') + b
                    y = activation(y)
                    self.tensors['img_%d_conv_%d' % (t, 2 * c_layer + 1)] = y
                y = tf.nn.max_pool(y, (1, 3, 3, 1), (1, 3, 3, 1), 'SAME')
                self.tensors['pool_%d' % c_layer] = y
                features *= 2
        print(y.get_shape())
        with tf.variable_scope('rnn'):
            y = tf.reshape(y, (-1, np.prod(y.get_shape().as_list()[1:])))
            y, state = rnn(y, state)
            self.tensors['rnn_%d' % t] = y
        scope.reuse_variables()
with tf.variable_scope('output_classifier'):
    W = tf.get_variable('W', (y.get_shape()[-1].value, 2))
    b = tf.get_variable('b', (2,))
    y = tf.nn.dropout(y, tf.select(self.is_training, 0.5, 1.0))
    y = tf.matmul(y, W) + b
    self.tensors['classifier'] = y
Yes, an RNN (recurrent neural network) fits the task of accumulating state along a sequence in order to predict its next element. LSTM (long short-term memory) is a particular design for the recurrent pieces of the network that has turned out to be very successful in avoiding numerical challenges from long-lasting recurrences; see colah's much-cited blogpost for more. (Alternatives to the LSTM cell design exist, but I would only fine-tune that much later, possibly never.)
The TensorFlow RNN codelab explains LSTM RNNs for the case of language models, which predict the (n+1)-st word of a sentence from the preceding n words, for each n (like for each timestep in your series of measurements). Your case is simpler than language models in that you only have two words (red and blue), so if you read anything about embeddings of words, ignore it.
You also mentioned other types of neural networks. These are not aimed at accumulating state along a sequence, such as your boolean sequence of red/blue inputs. However, your second image suggests that there might be a pattern in the sequence of counts of successive red/blue values. You could try using the past k counts as input to a plain feed-forward (i.e., non-recurrent) neural network that predicts the probability of the next measurement having the same color as the current one. Maybe that works with a single layer, or maybe two or even three work better; experimentation will tell. This is a less fancy approach than an RNN, but if it works well enough, it gives you a simpler solution with fewer technicalities to worry about.
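Preparing training data for that feed-forward idea could be sketched as follows. The function name make_samples, the "R"/"B" encoding and the short example sequence are assumptions; each sample pairs the last k run lengths (including the run still in progress) with a 0/1 label saying whether the next measurement kept the same color.

```python
def make_samples(colors, k):
    """Build (last k run lengths, next-is-same-color) pairs from a color sequence."""
    samples = []
    runs = []  # lengths of completed runs plus the current, still-open run
    for t, color in enumerate(colors[:-1]):
        if t > 0 and color == colors[t - 1]:
            runs[-1] += 1        # the current run continues
        else:
            runs.append(1)       # a new run starts
        if len(runs) >= k:
            label = int(colors[t + 1] == color)
            samples.append((runs[-k:], label))
    return samples

colors = list("RRBBBRB")  # hypothetical sensor readings
samples = make_samples(colors, k=2)
for features, label in samples:
    print(features, label)
```

The feature lists can then be fed to any small feed-forward classifier with a sigmoid output; how large k should be is something to determine by experiment.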
CNNs (convolutional neural networks) would not be my first choice here. These aim to discover a set of fixed-scale features at various places in the input, for example, some texture or curved edge anywhere in an image. But you only want to predict one next item that extends your input sequence. A plain neural network (see above) may discover useful patterns on the k previous values, and training it with all earlier partial sequences will help it find those patterns. The CNN approach would help to discover them during prediction at long-gone parts of the input; I have no intuition why that would help.

Bitcoin price prediction using spark and scala [duplicate]

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
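from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def sanitize(value):
    return float(value.strip('"'))

def parsePoint(line):
    # build a list (not a lazy map object) so that .pop() also works on Python 3
    split = [sanitize(value) for value in line.split(',')]
    rev = split.pop(-2)  # the second-to-last column is the label
    return LabeledPoint(rev, split)

parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print(model.predict(parsedData.first().features))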
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
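Since you're not providing an explicit step size value, MLlib assumes stepSize = 1. This does not seem to work for your use case. I'd recommend trying different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:
LinearRegressionWithSGD.train(parsedData, numIterations = 10, stepSize = 0.001)
In the Python API used in the question, the corresponding keyword arguments are iterations and step:
LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)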
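The sensitivity to the step size in the update rule w(i+1) = w(i) - s * g can be reproduced on a toy problem. This is a plain NumPy sketch, not MLlib code: it uses full-batch gradient descent rather than SGD for determinism, and the data, function name and step values are assumptions.

```python
import numpy as np

def gradient_descent(X, y, step, iterations):
    """Repeat w <- w - step * g, with g the gradient of half the mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        g = X.T @ (X @ w - y) / len(y)
        w = w - step * g
    return w

# Toy data generated by y = 2 * x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

w_large = gradient_descent(X, y, step=1.0, iterations=10)   # overshoots more each step
w_small = gradient_descent(X, y, step=0.1, iterations=100)  # settles near 2
print(w_large, w_small)
```

With step 1.0 every update overshoots the minimum by more than the previous error, so the weight grows without bound, which is the same kind of blow-up as the huge prediction in the question; with step 0.1 the error shrinks on every iteration.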