Between these two setups, is there any difference in performance?
There is a difference in performance. The fully connected network (the right image) can generalize on inputs the left one cannot.
The DeepFace paper from Facebook uses a Siamese network to learn a metric. They say that the DNN that extracts the 4096-dimensional face embedding has to be duplicated in a Siamese network, but both duplicates share weights. But if they share weights, every update to one of them will also change the other. So why do we need to duplicate them?
Why can't we just apply one DNN to two faces and then do backpropagation using the metric loss? Do they maybe mean this and just talk about duplicated networks for "better" understanding?
Quote from the paper:
We have also tested an end-to-end metric learning approach, known as Siamese network [8]: once learned, the face recognition network (without the top layer) is replicated twice (one for each input image) and the features are used to directly predict whether the two input images belong to the same person. This is accomplished by: a) taking the absolute difference between the features, followed by b) a top fully connected layer that maps into a single logistic unit (same/not same). The network has roughly the same number of parameters as the original one, since much of it is shared between the two replicas, but requires twice the computation. Notice that in order to prevent overfitting on the face verification task, we enable training for only the two topmost layers.
Paper: https://research.fb.com/wp-content/uploads/2016/11/deepface-closing-the-gap-to-human-level-performance-in-face-verification.pdf
The short answer is yes, and I think that looking at the architecture of the network will help you understand what is going on. You have two networks that are "joined at the hip", i.e. sharing weights. That's what makes it a "Siamese network". The trick is that you want the two images you feed into the network to pass through the same embedding function. To ensure that this happens, both branches of the network need to share weights.
Then we combine the two embeddings into a metric loss (called "contrastive loss" in the image below). We can back-propagate as normal; we just have two input branches available so that we can feed in two images at a time.
I think a picture is worth a thousand words. So check out how a Siamese network is constructed (at least conceptually) below.
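The gradients depend on the activation values, so each branch produces different gradients for the shared weights. The final update combines the contributions of both branches (for example by summing or averaging them), which is what keeps the weights identical in both branches.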
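To make the weight-sharing point concrete, here is a minimal sketch in plain Python. It is not DeepFace's architecture: the one-parameter linear "network" and the names `embed`, `pair_loss`, `branch_gradients` and `numeric_gradient` are assumptions made up for illustration. A single function with a single weight is applied to both inputs, and the gradient with respect to that weight is the sum of the contributions from the two branches.

```python
def embed(w, x):
    """Toy one-parameter 'network'; the same w is used for both inputs."""
    return w * x


def pair_loss(w, x1, x2):
    """Squared distance between the two embeddings (a simplified metric
    loss for a pair that should end up close together)."""
    d = embed(w, x1) - embed(w, x2)
    return d * d


def branch_gradients(w, x1, x2):
    """Gradient of the loss w.r.t. the shared weight, split per branch."""
    d = embed(w, x1) - embed(w, x2)
    grad_branch1 = 2.0 * d * x1       # path through embed(w, x1)
    grad_branch2 = 2.0 * d * (-x2)    # path through embed(w, x2)
    return grad_branch1, grad_branch2


def numeric_gradient(w, x1, x2, eps=1e-6):
    """Central-difference check of the total gradient."""
    return (pair_loss(w + eps, x1, x2) - pair_loss(w - eps, x1, x2)) / (2 * eps)


w, x1, x2 = 0.5, 3.0, 1.0
g1, g2 = branch_gradients(w, x1, x2)
print(g1, g2, g1 + g2)  # 6.0 -2.0 4.0

# One gradient step on the single shared weight, using the summed gradient.
w_new = w - 0.1 * (g1 + g2)
```

There is only one `w` in memory here, so "duplicating" the network just means calling the same function twice within one forward pass.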
I have some general questions about NNs and their training in hope that you can answer them:
Let's suppose that I've got an untrained NN with n hidden layers and m neurons in it. I want to train the network to, e.g., recognize voice and thus words. How can I make this possible when my sound input doesn't always have the same length (e.g. one clip is 1 second long and another is 5)? How many layers should my NN have, and of what type (recurrent, LSTM, CNN, etc.)? Are there any training algorithms other than normal backpropagation (I thought about having a NN with just one neuron in each layer and then letting it grow new ones until the problem can be solved)? And finally, is it recommended or helpful to make connections between the neurons of, e.g., layer 2 and layer 4?
Thank you for your help!
This is a perfectly valid question, for the record.
You should definitely use a recurrent network for voice recognition. That means you feed the input in slices of, say, 1/100 of a second, one at a time. So for one second of data, you activate the network 100 times.
Using an LSTM will make sure that patterns over large time lags are remembered, so the network will essentially remember (useful) parts from previous inputs.
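As a rough illustration of how slicing handles clips of different lengths, here is a sketch in plain Python. The 16 kHz sample rate, the use of frame energy as the only feature, and the names `frames`, `frame_energy`, `rnn_step` and `encode` are all assumptions for the example. It also uses a plain recurrent layer rather than an LSTM, to keep it short.

```python
import math
import random

SAMPLE_RATE = 16000               # assumed sampling rate in Hz
FRAME_LEN = SAMPLE_RATE // 100    # 1/100 of a second = 160 samples


def frames(samples):
    """Split a clip into consecutive 10 ms frames (a trailing partial frame is dropped)."""
    n = len(samples) // FRAME_LEN
    return [samples[i * FRAME_LEN:(i + 1) * FRAME_LEN] for i in range(n)]


def frame_energy(frame):
    """One scalar feature per frame: mean squared amplitude."""
    return sum(s * s for s in frame) / len(frame)


def rnn_step(h, x, w_in, w_rec):
    """One activation of a plain recurrent layer with scalar input x."""
    size = len(h)
    return [
        math.tanh(w_in[i] * x + sum(w_rec[i][j] * h[j] for j in range(size)))
        for i in range(size)
    ]


def encode(samples, w_in, w_rec):
    """Run the recurrent layer once per frame; return final state and step count."""
    h = [0.0] * len(w_in)
    steps = 0
    for frame in frames(samples):
        h = rnn_step(h, frame_energy(frame), w_in, w_rec)
        steps += 1
    return h, steps


rng = random.Random(0)
HIDDEN = 4
w_in = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
w_rec = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

one_second = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
five_seconds = one_second * 5

h1, steps1 = encode(one_second, w_in, w_rec)
h5, steps5 = encode(five_seconds, w_in, w_rec)
print(steps1, len(h1))  # 100 4
print(steps5, len(h5))  # 500 4
```

The number of activations grows with the clip length, but the state that comes out has the same size either way, which is why the input length does not need to be fixed.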
How many layers you should use depends on what exactly you want to recognize. But because voice recognition is not one of the easiest classification tasks, it will have to be a large, deep network (combine convolutional layers with LSTM).
What you proposed, evolving the network one node at a time, is basically called neuroevolution. Libraries such as Neataptic support the evolution of networks towards a certain solution.
Yes, connections that skip layers (e.g. from layer 2 to layer 4) could definitely help. But whether they do can only be found out by trial and error.
PS: I strongly recommend starting with an easier task to develop an understanding of neural networks.
If I've understood correctly, when training neural networks to recognize objects in images it's common to map each pixel to a single input layer node. However, sometimes we might have a large picture with only a small area of interest. For example, if we're training a neural net to recognize traffic signs, we might have images where the traffic sign covers only a small portion of the image, while the rest is taken up by the road, trees, sky etc. Creating a neural net which tries to find a traffic sign at every position seems extremely expensive.
My question is, are there any specific strategies to handle this sort of situation with neural networks, apart from preprocessing the image?
Thanks.
Using 1 pixel per input node is usually not done. What enters your network is the feature vector, and as such you should input actual features, not raw data. Inputting raw data (with all its noise) will not only lead to bad classification, but training will also take longer than necessary.
In short: preprocessing is unavoidable. You need a more abstract representation of your data. There are hundreds of ways to deal with the problem you're asking about. Let me give you some popular approaches.
1) Image processing to find regions of interest. When detecting traffic signs, a common strategy is to use edge detection (i.e. convolution with some filter), apply some heuristics, use a threshold filter and isolate regions of interest (blobs, connected components etc.), which are taken as input to the network.
2) Applying features without any prior knowledge or image processing. Viola/Jones use a specific image representation, from which they can compute features in a very fast way. Their framework has been shown to work in real time. (I know their original work doesn't mention NNs, but I applied their features to Multilayer Perceptrons in my thesis, so you can use it with any classifier, really.)
3) Deep Learning.
Learning better representations of the data can be incorporated into the neural network itself. These approaches are among the most actively researched at the moment. Since this is a very large topic, I can only give you some keywords so that you can research it on your own. Autoencoders are networks that learn efficient representations. It is possible to use them with conventional ANNs. Convolutional Neural Networks seem a bit sophisticated at first sight but they are worth checking out. Before the actual classification stage, they have alternating layers of subwindow convolution (which acts like edge detection) and resampling. CNNs are currently able to achieve some of the best results in OCR.
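To show what one "convolution, then resampling" stage does, here is a small sketch in plain Python. The function names `correlate2d` and `max_pool`, the hand-written edge filter and the toy 6x6 image are assumptions for illustration. In a real CNN the filter values are learned, and the operation is usually implemented as cross-correlation, as done here.

```python
def correlate2d(image, kernel):
    """Slide the kernel over the image ('valid' positions only) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(fmap, size=2):
    """Resample by keeping the maximum of each non-overlapping size x size block."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


# Toy image: dark on the left, bright in the last two columns.
image = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

# Hand-written vertical-edge filter (a learned filter would take its place).
edge_kernel = [[-1, 0, 1] for _ in range(3)]

feature_map = correlate2d(image, edge_kernel)
pooled = max_pool(feature_map)

print(feature_map[0])  # [0, 0, 3, 3]
print(pooled)          # [[0, 3], [0, 3]]
```

The pooled map is a quarter of the size of the feature map but still says where the edge is, which is the more abstract representation that the later classification layers work on.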
In every scenario you have to ask yourself: Am I 1) giving my ANN a representation that has all the data it needs to do the job (a representation that is not too abstract) and 2) keeping out as much noise as possible (and thus staying abstract enough)?
We usually don't use a fully connected network to deal with images, because the number of units in the input layer would be huge. There is a type of neural network designed specifically for images: the convolutional neural network (CNN).
However, a CNN only plays the role of a feature extractor. The encoded features are finally fed into a fully connected network, which acts as a classifier. In your case, I don't know how small your object is compared to the full image. But if the object of interest is really small, then even with a CNN the performance for image classification won't be very good. In that case we probably need to use object detection (which classically uses a sliding window) to deal with it.
If you want to recognize small objects in a large image, you should use a "scanning window".
For the "scanning window" you can apply dimensionality reduction methods:
DCT (http://en.wikipedia.org/wiki/Discrete_cosine_transform)
PCA (http://en.wikipedia.org/wiki/Principal_component_analysis)
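A scanning window can be sketched in a few lines of plain Python. The names `scanning_windows` and `score` are made up for this example, and `score` is only a stand-in (mean brightness) for whatever trained classifier would be applied to each patch, possibly after reducing the patch with DCT or PCA.

```python
def scanning_windows(image, win_h, win_w, stride):
    """Yield (row, col, patch) for every window position inside the image."""
    for r in range(0, len(image) - win_h + 1, stride):
        for c in range(0, len(image[0]) - win_w + 1, stride):
            patch = [row[c:c + win_w] for row in image[r:r + win_h]]
            yield r, c, patch


def score(patch):
    """Hypothetical stand-in for a classifier: mean brightness of the patch."""
    return sum(sum(row) for row in patch) / (len(patch) * len(patch[0]))


# Toy 6x8 image with a small bright object at rows 2-3, columns 4-5.
image = [[0] * 8 for _ in range(6)]
for r in (2, 3):
    for c in (4, 5):
        image[r][c] = 1

windows = list(scanning_windows(image, win_h=2, win_w=2, stride=2))
best_row, best_col, _ = max(windows, key=lambda w: score(w[2]))
print(len(windows), (best_row, best_col))  # 12 (2, 4)
```

The classifier only ever sees small fixed-size patches, so its input layer stays small no matter how large the full image is. The cost moves into the number of windows, which the stride controls.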
How is the initial structure (Neurons and connections between them) chosen? My book only states that we give the connection random weights in the beginning before we train the network.
I think that we would add neurons during the training like this:
Start with a completely empty network
The first value I generate during the training will not exist
Add a neuron to correspond to this value, with a random weight
What you are after is a self-organizing ANN. Usually, the way the connections are organized is man-made: the developer picks a model that they think will have sufficient power to perform the necessary computation. You can of course start with a random selection of nodes with random connections, but the evolution of such a network will probably take a lot longer than training a standard two- or three-layer network.
So, yes, you are right in that you would use a similar approach when doing a self-organizing network. Keep track of two sets of genetic algorithms, one for the structure and one for the weights (or combine the two in some devious way) and evolve as you please.
I do not believe the question is about self-organising or GA-evolved ANNs. It sounds more like it is about the most common ANN: a perceptron (single or multi-layer), in which case the structure of the network (the number of layers and the size of each layer) must be chosen by hand at the beginning. A simple initial rule of thumb for initialising the weights is to pick uniformly random values between -1.0 and 1.0.
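As a sketch of that rule of thumb in plain Python: the structure is fixed by hand as a list of layer sizes, and only the weight values are random. The function name `init_mlp`, the example sizes and the extra bias weight per neuron are assumptions for illustration.

```python
import random


def init_mlp(layer_sizes, seed=None):
    """Create weights for a multi-layer perceptron with a hand-chosen structure.

    layer_sizes lists the number of nodes per layer, e.g. [3, 4, 2] means
    3 inputs, one hidden layer of 4 neurons, and 2 outputs. Each neuron
    gets one weight per incoming connection plus one bias weight, all
    drawn uniformly from [-1.0, 1.0].
    """
    rng = random.Random(seed)
    weights = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        layer = [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)] for _ in range(n_out)]
        weights.append(layer)
    return weights


net = init_mlp([3, 4, 2], seed=42)
shapes = [(len(layer), len(layer[0])) for layer in net]
print(shapes)  # [(4, 4), (2, 5)]
```

Training then changes the numbers inside `net`, but never the shapes: no neurons or connections are added along the way.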
I'm trying to test the efficiency of neural networks as function approximators.
The function I need to approximate has 5 inputs and 1 output. Which structure should I use?
I have no idea what criteria should be applied in order to decide the number of hidden layers and the number of nodes in each layer.
Thank you in advance,
Regards
Giuseppe.
I always use a single hidden layer. Theoretically, there are no continuous functions which can be approximated by 2 or more hidden layers that cannot be approximated with one, given enough hidden nodes (this is the universal approximation theorem). To make a single hidden layer more complex, add more hidden nodes.
Typically, the number of hidden nodes is varied to observe the effect on model performance (as measured by accuracy or whatever). Too few hidden nodes result in a worse fit due to underfitting (the neural network's output function is too simple, and misses important details in the data). Too many hidden nodes result in a worse fit due to overfitting (the neural network becomes so flexible that it chases every bit of noise in the data).
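A sweep over the number of hidden nodes might look like the sketch below, in plain Python. Everything specific in it is an assumption for illustration: the `target` function stands in for the unknown 5-input function, and the learning rate, epoch count, data sizes and candidate hidden sizes are arbitrary. The network has one tanh hidden layer and a linear output, trained by stochastic gradient descent on squared error.

```python
import math
import random


def target(x):
    """Hypothetical stand-in for the 5-input function being approximated."""
    return math.sin(sum(x))


def make_data(n, rng):
    xs = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(n)]
    return [(x, target(x)) for x in xs]


def predict(net, x):
    """Forward pass; the last weight in each row is the bias."""
    w1, w2 = net
    hidden = [math.tanh(row[-1] + sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return w2[-1] + sum(w * h for w, h in zip(w2, hidden)), hidden


def train(n_hidden, data, rng, epochs=30, lr=0.02):
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    net = (w1, w2)
    for _ in range(epochs):
        for x, y in data:
            out, hidden = predict(net, x)
            err = out - y
            for j, h in enumerate(hidden):
                # Backpropagate through tanh before touching w2[j].
                grad_h = err * w2[j] * (1.0 - h * h)
                w2[j] -= lr * err * h
                for i in range(5):
                    w1[j][i] -= lr * grad_h * x[i]
                w1[j][5] -= lr * grad_h
            w2[-1] -= lr * err
    return net


def mse(net, data):
    return sum((predict(net, x)[0] - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
train_set = make_data(100, rng)
val_set = make_data(50, rng)

results = {}
for n_hidden in (1, 4, 16):
    net = train(n_hidden, train_set, rng)
    results[n_hidden] = (mse(net, train_set), mse(net, val_set))

for n_hidden, (train_err, val_err) in results.items():
    print(n_hidden, round(train_err, 4), round(val_err, 4))
```

The number to watch is the error on `val_set`, which the network never trained on. A hidden size where the training error keeps falling while the validation error rises is a sign of the overfitting described above.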
Note that for classification problems you need at least 2 hidden layers if you want to separate concave polygons.
I'm not sure how the number of hidden layers affects function approximation.