How do I get the weighted sum of multiple losses & accuracy (caffe) - neural-network

I have trained a network on two different modalities of the same image. I pass the data together in one layer, but after that it is pretty much two networks in parallel: they don't share any layers, and the two tasks have different sets of labels, so I have two separate loss and accuracy layers.
I have read that Caffe averages multiple losses and accuracies (following this question: How can I have multiple losses in a network in Caffe?). Is that the case only when at least one layer is shared? I intended to create an ensemble, but now it seems I simply have two different networks. I wanted to average the losses and accuracies so that both branches contribute to one accuracy, yet during training I see two separate losses and accuracies. How do I get this averaged loss and accuracy while testing on a new image pair?
By forwarding the network, is it possible to get both predictions at all? If so, how?

Multiple losses can be combined in one network using the Caffe parameter loss_weight. For example, you can give one of your loss layers a weight of 0.5:
...
layer {
  name: "loss_a"
  type: "SigmoidCrossEntropyLoss"
  bottom: "fc8_a"
  bottom: "attributes_a"
  top: "loss_a"
  loss_weight: 0.5
}
layer {
  name: "loss_b"
  type: "SigmoidCrossEntropyLoss"
  bottom: "fc8_b"
  bottom: "attributes_b"
  top: "loss_b"
}
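To get the two predictions from a single image pair (the second part of the question), one forward pass through the deployed net already exposes both branches' output blobs. Here is a rough pycaffe sketch; the file and blob names ("deploy.prototxt", "data", "prob_a", "prob_b") are placeholders for whatever your own model uses.

import numpy as np
import caffe

# Load the deployed two-branch net in TEST phase (paths are placeholders).
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# Fill the input blob with your preprocessed image pair (random data here
# just to keep the sketch self-contained).
net.blobs['data'].data[...] = np.random.rand(*net.blobs['data'].data.shape)

out = net.forward()
pred_a = out['prob_a'].argmax(axis=1)  # prediction for task A's label set
pred_b = out['prob_b'].argmax(axis=1)  # prediction for task B's label set

# With ground-truth labels for a test set you can then average the two
# branch accuracies yourself, e.g.:
# acc = 0.5 * (pred_a == labels_a).mean() + 0.5 * (pred_b == labels_b).mean()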

Related

How to calculate the time complexity of a sequential neural network in Tensorflow?

I have a sample neural network and am trying to see how much it would cost me to run it on a server and how long it would take to train if, for example, I add 3 more layers with around 4000, 3000, and 2000 nodes respectively.
I understand that from a high-level perspective the network needs to
Feed the inputs and get the results (which in turn will run Sigmoid) from the network, which I guess happens in constant time (even though the output may not be constant or even linear!)
Run Adam to optimize weights/biases which I guess also happens in linear time since it is like Gradient descent and is different in how it manages the learning rate!
Update the weights/biases which is constant!
I can't find a calculator to use and estimate the computation needed and I'm thinking of making one if I can get a good understanding of different variables in a neural network!
This is the code for my Tensorflow model:
const model = tf.sequential();
model.add(tf.layers.flatten({inputShape: [4317, 5]}));
model.add(tf.layers.dense({units: 1000, activation: 'sigmoid'}));
model.add(tf.layers.dense({units: 4316, activation: 'sigmoid'}));
const optimizer = tf.train.adam();
model.compile({
  optimizer: optimizer,
  loss: 'meanSquaredError'
});
And here is the network summary printed by Tensorflow
_________________________________________________________________
Layer (type) Output shape Param #
=================================================================
flatten_Flatten1 (Flatten) [null,21585] 0
_________________________________________________________________
dense_Dense1 (Dense) [null,1000] 21586000
_________________________________________________________________
dense_Dense2 (Dense) [null,4316] 4320316
=================================================================
Total params: 25906316
Trainable params: 25906316
Non-trainable params: 0
What if I change the activation functions to linear or ReLU?
I have a laptop with 16 GB of memory and 3.2 GHz 8-core ARMv8-A (M1 chip) and it looks like the laptop is taking about a minute to train a batch of 32 inputs.
With N inputs, each weight is used O(N) times per round of training, so assuming M weights you have roughly O(N*M) training time per round. It doesn't really matter where those weights are in your network; even for recurrent layers (GRU, RNN, LSTM) this stays true.
Where things break down is that you can't let M go to infinity (which is how big-O works) because in that case your network training won't converge anymore. Effectively, it would be O(infinity).
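For a concrete feel, here is a back-of-the-envelope sketch (my own estimate, not something TensorFlow reports): a dense layer costs roughly 2 multiply-adds per parameter per sample on the forward pass, and training (forward plus backward) costs roughly three times the forward pass.

# Rough cost estimate for the model summarized above (assumptions as stated).
params = 25_906_316        # total params from the TensorFlow summary
batch_size = 32
forward_flops = 2 * params * batch_size
train_flops = 3 * forward_flops
print(f"~{train_flops / 1e9:.1f} GFLOPs per training step")  # ~5.0 GFLOPs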

Character Recognition Using Back Propagation Algorithm Testing

Recently I've been working on character recognition using the back-propagation algorithm. I've taken the images and reduced them to 5x7 size, so I get 35 pixels, and trained the network on those pixels with 35 input neurons, 35 hidden nodes, and 10 output nodes. The training completed successfully and I got the weights I needed. And I've got stuck here. I have my test set and I know I should feed it forward through the network, but I don't know exactly what to do. My test set will be 4 samples of 1x35. My output layer has 10 neurons. How exactly do I distinguish the characters from the output I will get? I want to know how this testing works. Please guide me through this stage. Thanks in advance.
One vs All
A common approach for testing these types of neural networks is the "one-vs-all" approach. We view each of the output nodes as its own classifier that gives the probability of the sample being that class vs. not being that class.
For instance, if your network outputs [1, 0, ..., 0], then class 1 has a high probability of being class 1 vs. not being class 1, class 2 has a low probability of being class 2 vs. not being class 2, and so on.
Ties
In the case of a tie, it is common (in research) to have a random function break the tie. If you get [1, 1, 1, ..., 1] then the function would pick a number from 1-10 and that is your prediction. In practice, sometimes an expert system is used to break ties; perhaps class 1 is more expensive than class 2, so we break the tie in favor of class 2.
Steps
So the steps are:
Split dataset into test/train set
Train weights on train set
Pass test set forward through the neural network
For each sample, choose the argmax (the output with highest value) as your prediction
In case of tie, choose randomly between all tying classes
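A minimal sketch of steps 4 and 5, assuming each test sample produces a 10-element output vector (one value per character class):

import numpy as np

rng = np.random.default_rng(0)

def predict(output):
    # Steps 4-5: argmax over the 10 output nodes, breaking ties at random.
    output = np.asarray(output)
    best = np.flatnonzero(output == output.max())  # indices of all tying classes
    return int(rng.choice(best))

print(predict([0.1, 0.9, 0.9, 0.2, 0, 0, 0, 0, 0, 0]))  # 1 or 2, whichever the tie-break picks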
Aside
In your particular case, I imagine implementing this strategy will result in a network that barely beats random performance (10% accuracy).
I would suggest some reconsidering of the network architecture.
If you look at your 5x7 images, can you tell what number each image originally was? It seems likely that scaling the images down to this size loses so much information that the network cannot distinguish between classes.
Debugging
From what you've described I would look at the following when debugging your network.
Is your data preprocessing (down-scaling) leaching out too much information? Check this by manually inspecting a few of the images and seeing if you can tell what each image should be.
Does your one-hot algorithm work? When you convert your targets for training, does it successfully convert 1 -> [1, 0, 0, ..., 0]?
Is your back-prop / gradient descent algorithm correct? You should see (roughly) a monotonic decrease in your loss function while training. Try at every step (or every few steps) printing the loss that you are optimizing. Or even for a very simple gut check, print mean squared error: (P-Y)^2
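If it helps, here is a small sketch of the last two checks (a one-hot encoder and the mean-squared-error gut check); it assumes 0-indexed class labels, so shift by one if your labels start at 1:

import numpy as np

def one_hot(label, num_classes=10):
    # e.g. one_hot(0) -> [1, 0, 0, ..., 0]
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def mse(P, Y):
    # Mean squared error between predictions P and targets Y; it should
    # (roughly) decrease while training.
    return float(np.mean((np.asarray(P) - np.asarray(Y)) ** 2))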

How to fine tune an FCN-32s for interactive object segmentation

I'm trying to implement the proposed model in a CVPR paper (Deep Interactive Object Selection) in which the data set contains 5 channels for each input sample:
1. Red
2. Blue
3. Green
4. Euclidean distance map associated with positive clicks
5. Euclidean distance map associated with negative clicks
To do so, I should fine-tune the FCN-32s network using "object binary masks" as labels.
As you can see, in the first conv layer I have 2 extra channels, so I did net surgery to use the pretrained parameters for the first 3 channels and Xavier initialization for the 2 extra ones.
For the rest of the FCN architecture, I have these questions:
Should I freeze all the layers before "fc6" (except the first conv layer)? If so, how will the extra channels of the first conv layer be learned? Are the gradients strong enough to reach the first conv layer during training?
What should the kernel size of "fc6" be? Should I keep 7? I saw in the Caffe "net_surgery" notebook that it depends on the output size of the last layer ("pool5").
The main problem is the number of outputs of the "score_fr" and "upscore" layers. Since I'm not doing class segmentation (which uses 21 outputs for 20 classes plus background), how should I change it? Should it be 2 (one for the object and one for the non-object/background area)?
Should I change "crop" layer "offset" to 32 to have center crops?
In case of changing each of these layers, what is the best initialization strategy for them? "bilinear" for "upscore" and "Xavier" for the rest?
Should I convert my binary label matrix values to be zero-centered ({-0.5, 0.5}), or is it OK to use them with values in {0, 1}?
Any useful idea will be appreciated.
PS:
I'm using Euclidean loss, with "1" as the number of outputs for the "score_fr" and "upscore" layers. If I use 2 instead, I guess it should be softmax.
I can answer some of your questions.
The gradients will reach the first layer so it should be possible to learn the weights even if you freeze the other layers.
Change the num_output to 2 and finetune. You should get a good output.
I think you'll need to experiment with each of the options and see how the accuracy is.
You can use the values 0,1.
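For the first-conv net surgery mentioned in the question, a rough pycaffe sketch looks like the following. The layer name ("conv1_1") and file paths are placeholders; the idea is simply to copy the pretrained RGB filters into the first 3 input channels of the new 5-channel conv and randomly initialize the 2 extra channels.

import numpy as np
import caffe

old = caffe.Net('fcn32s_deploy.prototxt', 'fcn32s.caffemodel', caffe.TEST)
new = caffe.Net('fcn32s_5ch_deploy.prototxt', caffe.TEST)

w_old = old.params['conv1_1'][0].data   # shape (64, 3, 3, 3) in VGG-based FCN
w_new = new.params['conv1_1'][0].data   # shape (64, 5, 3, 3) after adding channels

w_new[:, :3, :, :] = w_old              # reuse the pretrained RGB filters
fan_in = np.prod(w_new.shape[1:])
limit = np.sqrt(3.0 / fan_in)           # Xavier-style uniform range
w_new[:, 3:, :, :] = np.random.uniform(-limit, limit, w_new[:, 3:, :, :].shape)

new.params['conv1_1'][1].data[...] = old.params['conv1_1'][1].data  # copy biases
new.save('fcn32s_5ch_init.caffemodel')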

Newbie to Neural Networks

Just starting to play around with neural networks for fun after playing with some basic linear regression. I am an English teacher, so I don't have a math background, and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel add-in called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to round up or round down to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the activation function? Will log-sigmoid do the trick, or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid)?
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!
First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
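One simple way to get that kind of coverage (a sketch of my own, not tied to NEURO XL) is to stratify the train/test split by binning the output score, so every score range is represented in the training data:

import random
from collections import defaultdict

def stratified_split(records, score_of, train_fraction=0.8, bucket_width=50, seed=0):
    # records: your student records; score_of(r) returns the SAT verbal score.
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for r in records:
        buckets[score_of(r) // bucket_width].append(r)  # group by 50-point score bucket

    train, test = [], []
    for bucket in buckets.values():
        rng.shuffle(bucket)
        cut = int(len(bucket) * train_fraction)
        train.extend(bucket[:cut])
        test.extend(bucket[cut:])
    return train, test

The same idea can be extended to the inputs (e.g. also stratifying by gender or GPA band), but as noted above, keep it simple to avoid introducing new biases.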
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Neural Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARSITY_SPORTS ...
The coefficients for each value are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found by the reciprocal of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
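To make the reciprocal-scaling point concrete, here is a tiny illustration of my own (a plain least-squares fit rather than a neural network, but the same principle applies to the coefficients a node searches for):

import numpy as np

rng = np.random.default_rng(0)
act = rng.uniform(1, 36, size=200)             # ACT-like scores on a 1-36 scale
y = 15.0 * act + rng.normal(0, 5, size=200)    # synthetic target

def fit_slope(x, y):
    X = np.column_stack([x, np.ones_like(x)])  # slope + intercept
    return np.linalg.lstsq(X, y, rcond=None)[0][0]

w_raw = fit_slope(act, y)          # coefficient for the raw scores
w_scaled = fit_slope(act / 36, y)  # coefficient after rescaling scores to [0, 1]
print(w_raw, w_scaled / w_raw)     # the ratio is ~36, the reciprocal of the 1/36 scaling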
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

Neural networks, how do they look in coding?

Basically I know the concept of a neural network and what it is, but I can't figure out how it looks when you code it or how you store the data. I went through many tutorials that I found on Google, but couldn't find any piece of code, just concepts and algorithms.
Can anyone give me a piece of code of a simple neural network something like "Hello World!"?
What you mainly need is an object representing a single neuron, together with its associations with other neurons (which represent the synapses) and their weights.
A single neuron in a typical OOP language will look something like
class Synapse
{
    Neuron sending;
    Neuron receiving;
    float weight;
}

class Neuron
{
    ArrayList<Synapse> toSynapses;
    ArrayList<Synapse> fromSynapses;
    Function threshold;
}
where threshold represents the function that is applied to the weighted sum of inputs to decide whether the neuron activates and propagates the signal.
Of course you will then need the specific algorithms to feed-forward the net or back-propagate the learning, operating on this data structure.
The simplest thing you could start implementing would be a simple perceptron; you can find some info here.
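As a rough illustration (my own sketch, not code from the linked page), here is a single perceptron learning the logical AND function with the classic perceptron update rule:

def step(x):
    # threshold activation: fire if the weighted sum is positive
    return 1 if x > 0 else 0

# training data for AND: ((inputs), target)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w1, w2, bias = 0.0, 0.0, 0.0
lr = 0.1

for _ in range(100):                       # plenty of epochs for AND to converge
    for (x1, x2), target in data:
        out = step(w1 * x1 + w2 * x2 + bias)
        err = target - out                 # perceptron learning rule
        w1 += lr * err * x1
        w2 += lr * err * x2
        bias += lr * err

for (x1, x2), _ in data:
    print((x1, x2), "->", step(w1 * x1 + w2 * x2 + bias))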
You said you're already familiar with neural networks, but since there are many different types of neural networks of differing complexity (convolutional, hebbian, kohonen maps, etc.), I'll go over a simple Feed-forward neural network again, just to make sure we're on the same page.
A basic neural network consists of the following things
Neurons
Input Neuron(s)
Hidden Neurons (optional)
Output Neuron(s)
Links between Neurons (sometimes called synapses in analogy to biology)
An activation function
The Neurons have an activation value. When you evaluate a network, the input nodes' activation is set to the actual input. The links from the input nodes lead to nodes closer to the output, usually to one or more layers of hidden nodes. At each neuron, the input activation is processed using an activation function. Different activation functions can be used, and sometimes they even vary within the neurons of a single network.
The activation function processes the activation of the neuron into its output. Early experiments usually used a simple threshold function (i.e. activation > 0.5 ? 1 : 0); nowadays a sigmoid function is often used.
The output of the activation function is then propagated over the links to the next nodes. Each link has an associated weight it applies to its input.
Finally, the output of the network is extracted from the activation of the output neuron(s).
I've put together a very simple (and very verbose...) example here. It's written in Ruby and computes AND, which is about as simple as it gets.
A much trickier question is how to actually create a network that does something useful. The trivial network of the example was created manually, but that is infeasible with more complex problems. There are two approaches I am aware of, with the most common being backpropagation. Less used is neuroevolution, where the weights of the links are determined using a genetic algorithm.
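To make the forward pass above concrete, here is a small sketch of my own (not the Ruby example mentioned): a 2-2-1 feed-forward net with hand-picked weights and sigmoid activations that approximates AND.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_hidden = np.array([[10.0, 0.0],    # weights on the links from the 2 inputs
                     [0.0, 10.0]])   # to the 2 hidden neurons
b_hidden = np.array([-5.0, -5.0])
W_out = np.array([10.0, 10.0])       # weights from the hidden neurons to the output
b_out = -15.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden = sigmoid(W_hidden @ np.array(x) + b_hidden)  # hidden activations
    out = sigmoid(W_out @ hidden + b_out)                 # output activation
    print(x, round(float(out), 3))    # close to 1 only for (1, 1)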
AI-Junkie has a great tutorial on (A)NNs and they have the code posted there.
Here is a neuron (from ai-junkie):
struct SNeuron
{
    // the number of inputs into the neuron
    int m_NumInputs;

    // the weights for each input
    vector<double> m_vecWeight;

    // ctor
    SNeuron(int NumInputs);
};
Here is a neuron layer (ai-junkie):
struct SNeuronLayer
{
    // the number of neurons in this layer
    int m_NumNeurons;

    // the layer of neurons
    vector<SNeuron> m_vecNeurons;

    SNeuronLayer(int NumNeurons, int NumInputsPerNeuron);
};
Like I mentioned before... you can find all of the code with the ai-junkie (A)NN tutorial.
This is the NuPIC programmer's guide. NuPIC is the framework for implementing their theory (HTM), which is based on the structure and operation of the neocortex.
This is how they define HTM:
HTM technology has the potential to solve many difficult problems in machine learning, inference, and prediction. Some of the application areas we are exploring with our customers include recognizing objects in images, recognizing behaviors in videos, identifying the gender of a speaker, predicting traffic patterns, doing optical character recognition on messy text, evaluating medical images, and predicting click through patterns on the web.
This is a simple net with NuPIC 1.5:
from nupic.network import *
from nupic.network.simpledatainterface import WideDataInterface

def TheNet():
    net = SimpleHTM(
        levelParams=[
            {  # Level 0
            },
            {  # Level 1
                'levelSize': 8, 'bottomUpOut': 8,
                'spatialPoolerAlgorithm': 'gaussian',
                'sigma': 0.4, 'maxDistance': 0.05,
                'symmetricTime': True, 'transitionMemory': 1,
                'topNeighbors': 2, 'maxGroupSize': 1024,
                'temporalPoolerAlgorithm': 'sumProp'
            },
            {  # Level 2
                'levelSize': 4, 'bottomUpOut': 4,
                'spatialPoolerAlgorithm': 'product',
                'symmetricTime': True, 'transitionMemory': 1,
                'topNeighbors': 2, 'maxGroupSize': 1024,
                'temporalPoolerAlgorithm': 'sumProp'
            },
            {  # Level 3
                'levelSize': 1,
                'spatialPoolerAlgorithm': 'product',
                'mapperAlgorithm': 'sumProp'
            },
        ],
    )

    Data = WideDataInterface('Datos/__Categorias__.txt',
                             'Datos/Datos_Entrenamiento%d.txt', numDataFiles=8)
    net.createNetwork(Data)
    net.train(Data)

if __name__ == '__main__':
    print "Creating HTM Net..."
    TheNet()