CNN: Why stack the same activation maps on top of each other? - neural-network

I am wondering why we stack basically identical activation maps on top of each other. Since it is always the same filter applied to the same input, wouldn't the result always be the same activation map?
If that were the case, we wouldn't even need to recompute the activation map; we could just copy it N times. What additional information does this give us? Yes, we again create a layer with depth (an output volume), but if it holds the same values, what is the rationale behind it?
Src: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture5.pdf

It is not the same convolution: you have a separate, independent kernel (filter) for each activation map (independent weights), so the maps are completely different. Without this, after convolution we would only have one "type of feature" extracted, say edges, while for CNNs to work we need plenty of them.
In the example provided, the green 5x5x3 filter produces one, green activation map, then a separate blue 5x5x3 filter produces the blue activation map, and so on.

Actually we don't stack the same activation map on top of itself. Only the shape is shared: each 5x5x3 filter produces an activation map of the same size, but the weights of the different filters are completely different.
1. Within a single activation map the same weights are shared, i.e. one activation map detects a specific kind of feature wherever it appears in the input.
2. Different activation maps are used to detect different kinds of features.
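To make this concrete, here is a minimal sketch (assuming TensorFlow/Keras; the layer size and input shape are illustrative, not taken from the question) showing that a convolutional layer with 6 filters holds 6 independent 5x5x3 kernels and therefore produces 6 different activation maps:

```python
import numpy as np
import tensorflow as tf

# A single conv layer with 6 filters; each filter has shape 5x5x3 (one per output map).
layer = tf.keras.layers.Conv2D(filters=6, kernel_size=5)

x = np.random.rand(1, 32, 32, 3).astype("float32")  # one 32x32 RGB "image"
maps = layer(x)                                      # shape: (1, 28, 28, 6)

kernels, biases = layer.get_weights()
print(kernels.shape)   # (5, 5, 3, 6): six independent 5x5x3 kernels
# Because each kernel has its own weights, the 6 activation maps differ:
print(np.allclose(maps[..., 0], maps[..., 1]))       # False (almost surely)
```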

Related

Computer Vision: Neural Network to Generate Points on a Map

I want to design a neural network / ConvNet to generate a set of points on a given map, which correspond to possible positions of a robot. The map contains a lot of empty space as well as walls, and the robots can't be in the wall positions. Therefore, the network should take in the map and generate pairs of numbers (x, y) corresponding to places on the map that are not walls. What would be an appropriate choice of neural network structure for this task?
The approach you take will depend on whether you wish to generalize to new, unseen maps and segment each map into feasible (available for robot navigation) and infeasible (walls/other objects/obstacles) regions. Be aware that you will need to generate these maps dynamically if your environment changes over time (moving obstacles, other robots, objects). If you have a good amount of annotated training data, i.e. maps with the wall regions marked (segmented out), you could use a standard neural-network-based segmentation algorithm such as Mask R-CNN (https://github.com/matterport/Mask_RCNN) on your dataset. Alternatively, if you do not have a lot of annotated data and you just want a general-purpose path-planning algorithm that can plan a path from point A to B without running into obstacles, you could use an MPC-based obstacle avoidance algorithm such as the ones described in https://arxiv.org/abs/1805.09633 or https://www.tandfonline.com/doi/full/10.1080/00423114.2018.1492141
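As a rough illustration of the "segment first, then read off feasible positions" idea (a sketch only; the mask here is hand-made, whereas in practice it would be predicted by a segmentation model such as Mask R-CNN):

```python
import numpy as np

# Hypothetical binary occupancy mask: 1 = wall, 0 = free space.
# In practice this would come from the segmentation network's output.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[:, 0] = 1          # left wall
mask[3, 2:6] = 1        # an obstacle in the middle

# All feasible (x, y) positions are simply the zero entries of the mask.
ys, xs = np.where(mask == 0)
feasible = list(zip(xs.tolist(), ys.tolist()))

# Sample a few candidate robot positions uniformly from the free cells.
rng = np.random.default_rng(0)
idx = rng.choice(len(feasible), size=5, replace=False)
print([feasible[i] for i in idx])
```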

Usage of indicator functions as features in Sequential Models

I am currently using Mallet to train a sequential model with a CRF. I have understood how to provide features (that depend solely on the input sequence) to the Mallet package. Based on my understanding, in Mallet we have to compute all the values of the feature functions upfront. Now, I would like to use indicator functions that depend on the label of a token. The value of these functions depends on the output label sequence, and during training I can compute them because the output label sequence is known. But when I apply the trained CRF model to a new input (whose output label sequence is unknown), how should I calculate the values of such features?
It would be very helpful if anyone could provide tips or relevant documents.
As you've phrased it, the question doesn't make sense: if you don't know the hidden labels, you can't set anything based on those unknown labels. An example might help.
You may not need to explicitly record these relationships. At training time the algorithm sets the parameters of the CRF to represent the relationship between the observed features and the unobserved state. Different CRF architectures can allow you to add dependencies between multiple hidden states.
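To illustrate the point that you only supply observation-based features and let the model learn the label-label structure, here is a minimal sketch assuming the Python sklearn-crfsuite package rather than Mallet (the toy sentences, tags, and feature names are made up):

```python
import sklearn_crfsuite

def token_features(sentence, i):
    # Features depend only on the observed input, never on the labels.
    word = sentence[i]
    return {
        "lower": word.lower(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        "prev_lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
    }

# Toy training data: two tagged sentences.
sentences = [["John", "lives", "in", "Paris"], ["Mary", "visited", "London"]]
labels = [["PER", "O", "O", "LOC"], ["PER", "O", "LOC"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)

# At prediction time no labels are needed; the label-dependent structure
# (transition weights between tags) was learned during training.
print(crf.predict([[token_features(["She", "left", "Berlin"], i) for i in range(3)]]))
```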

Siamese networks: Why does the network need to be duplicated?

The DeepFace paper from Facebook uses a Siamese network to learn a metric. They say that the DNN that extracts the 4096-dimensional face embedding has to be duplicated in a Siamese network, but both duplicates share weights. But if they share weights, every update to one of them will also change the other. So why do we need to duplicate them?
Why can't we just apply one DNN to two faces and then do backpropagation using the metric loss? Do they maybe mean this and just talk about duplicated networks for "better" understanding?
Quote from the paper:
We have also tested an end-to-end metric learning approach, known as Siamese network [8]: once learned, the face recognition network (without the top layer) is replicated twice (one for each input image) and the features are used to directly predict whether the two input images belong to the same person. This is accomplished by: a) taking the absolute difference between the features, followed by b) a top fully connected layer that maps into a single logistic unit (same/not same). The network has roughly the same number of parameters as the original one, since much of it is shared between the two replicas, but requires twice the computation. Notice that in order to prevent overfitting on the face verification task, we enable training for only the two topmost layers.
Paper: https://research.fb.com/wp-content/uploads/2016/11/deepface-closing-the-gap-to-human-level-performance-in-face-verification.pdf
The short answer is that yes, I think that looking at the architecture of the network will help you understand what is going on. You have two networks that are "joined at the hip" i.e. sharing weights. That's what makes it a "Siamese network". The trick is that you want the two images you feed into the network to pass through the same embedding function. So to ensure that this happens both branches of the network need to share weights.
Then we combine the two embeddings into a metric loss (often called a "contrastive loss"). We can back-propagate as normal; we just have two input branches available, so we can feed in two images at a time.
I think a picture is worth a thousand words, so it helps to look at how a Siamese network is constructed, at least conceptually.
The gradients depend on the activation values, so the gradients will differ between the two branches; the final update to the shared weights can then be based on averaging (or summing) the gradients from both branches.
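To make the shared-weights construction explicit, here is a minimal sketch (assuming TensorFlow/Keras; the layer sizes are illustrative and not DeepFace's actual architecture). There is literally only one embedding network; it is simply called on both inputs, which is what "replicated with shared weights" amounts to:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# One embedding network, defined once (toy sizes, not DeepFace's 4096-d model).
inp = layers.Input(shape=(64, 64, 3))
x = layers.Conv2D(16, 3, activation="relu")(inp)
x = layers.GlobalAveragePooling2D()(x)
embedding = Model(inp, layers.Dense(32)(x), name="shared_embedding")

# "Two replicas" = the same model object applied to two inputs.
img_a = layers.Input(shape=(64, 64, 3))
img_b = layers.Input(shape=(64, 64, 3))
emb_a = embedding(img_a)      # same weights
emb_b = embedding(img_b)      # same weights

# a) absolute difference of the features, b) a single logistic unit (same / not same).
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_a, emb_b])
out = layers.Dense(1, activation="sigmoid")(diff)

siamese = Model([img_a, img_b], out)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
# Any gradient update flows through both calls into the single set of weights.
```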

Episodic Semi-gradient Sarsa with Neural Network

While trying to implement Episodic Semi-gradient Sarsa with a neural network as the approximator, I wondered how to choose the optimal action based on the currently learned weights of the network. If the action space is discrete, I can just calculate the estimated value of each action in the current state and choose the one that gives the maximum. But this does not seem to be the best way of solving the problem, and it does not work if the action space is continuous (like the acceleration of a self-driving car, for example).
So, basically I am wondering how to implement the 10th line, "Choose A' as a function of q(S', ·, w)", in Sutton's pseudo-code.
How are these problems typically solved? Can anyone recommend a good example of this algorithm using Keras?
Edit: Do I need to modify the pseudo-code when using a network as the approximator? So that I simply minimize the MSE between the network's prediction and the reward R, for example?
I wondered how I choose the optimal action based on the currently learned weights of the network
You have three basic choices:
Run the network multiple times, once for each possible value of A' to go with the S' value that you are considering. Take the action with the maximum value as the predicted optimum action (with probability 1-ε; otherwise choose randomly for the ε-greedy policy typically used in SARSA).
Design the network to estimate all action values at once, i.e. to have |A(s)| outputs (perhaps padded to cover "impossible" actions that you need to filter out). This alters the gradient calculations slightly: zero gradient should be applied to the inactive outputs of the last layer (i.e. anything not matching the A of (S,A)). Again, just take the maximum valid output as the estimated optimum action. This can be more efficient than running the network multiple times, and it is the approach used by the DQN Atari-game-playing bot and AlphaGo's policy networks (see the sketch after this list).
Use a policy-gradient method, which works by using samples to estimate the gradient that would improve a policy estimator. See chapter 13 of Sutton and Barto's second edition of Reinforcement Learning: An Introduction for details. Policy-gradient methods become attractive when there are large numbers of possible actions, and they can cope with continuous action spaces (by estimating parameters of a distribution over optimal actions, e.g. the mean and standard deviation of a normal distribution, which you then sample from to take your action). You can also combine policy-gradient with a state-value approach in actor-critic methods, which can be more efficient learners than pure policy-gradient approaches.
Note that if your action space is continuous, you don't have to use a policy-gradient method; you could just quantise the actions. Also, in some cases, even when actions are in theory continuous, you may find the optimal policy only uses extreme values (the classic mountain car example falls into this category: the only useful actions are maximum forward and maximum backward acceleration).
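For the second option above, here is a minimal sketch (assuming a discrete action space and TensorFlow/Keras; the state size and layer sizes are made up) of choosing A' ε-greedily from a network with one output per action:

```python
import numpy as np
import tensorflow as tf

n_actions = 3
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),  # 4 state features
    tf.keras.layers.Dense(n_actions)          # one q-value output per action
])

def choose_action(state, epsilon=0.1):
    # Epsilon-greedy: explore with probability epsilon, otherwise take the argmax.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    q_values = q_net(state[None, :]).numpy()[0]   # one forward pass gives all action values
    return int(np.argmax(q_values))

state = np.zeros(4, dtype=np.float32)
print(choose_action(state))
```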
Do I need to modify the pseudo-code when using a network as the approximator? So, that I simply minimize the MSE of the prediction of the network and the reward R for example?
No. There is no separate loss function in the pseudocode, such as the MSE you would see used in supervised learning. The error term (often called the TD error) is the part in square brackets, and it achieves a similar effect. Literally the term ∇q(S,A,w) (sorry for the missing hat, no LaTeX on SO) means the gradient of the estimator itself, not the gradient of any loss function.
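As a rough illustration of how the square-bracket term drives the update when q is a Keras network (a sketch under the same assumptions as the previous snippet): training on the bootstrapped target with a squared-error loss reproduces the semi-gradient step, because the target is treated as a constant.

```python
import numpy as np
import tensorflow as tf

# q_net as in the previous sketch: a network with one output per action.
n_actions = 3
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(n_actions)
])

alpha, gamma = 0.01, 0.99
optimizer = tf.keras.optimizers.SGD(learning_rate=alpha)

def sarsa_update(s, a, r, s_next, a_next, done):
    # Bootstrapped target R + gamma * q(S', A', w), treated as a constant (no gradient through it).
    q_next = 0.0 if done else float(q_net(s_next[None, :])[0, a_next])
    target = r + gamma * q_next
    with tf.GradientTape() as tape:
        q_sa = q_net(s[None, :])[0, a]
        # 0.5 * (target - q)^2 has gradient -(target - q) * grad q, so one SGD step is
        # w <- w + alpha * [R + gamma*q(S',A',w) - q(S,A,w)] * grad q(S,A,w): the semi-gradient update.
        loss = 0.5 * tf.square(target - q_sa)
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

# Example call with dummy transition data:
s = np.zeros(4, dtype=np.float32)
sarsa_update(s, a=1, r=1.0, s_next=s, a_next=2, done=False)
```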

Self-organized map understanding

I have a self-organized map created with Som_pak-3.1 here.
I have three different types of elements, and they are different. Why are the elements not in different parts of the map? Why are "A", "B" and "C" so often together in the same hexagon? Why are "B" and "C" never alone in a hexagon?
Thanks in advance!
I feel that this is a normal result for a SOM. The unsupervised SOM algorithm is not aware of the element labels. Using the distance metric, the neurons learn the input vectors, and the elements are then placed as labels at their best matching neuron.
One possible reason for two elements appearing on the same node is if they have the same values for each of the features. Otherwise, they have different values for each feature, but the values still seem similar according to the distance metric.
The spatial resolution can be increased by increasing the map size. This may allow the classes to be separable. However, the trade-off is that statistical significance of each neuron goes down when it is associated with fewer data points. So what I would suggest is that you can try different sizes of maps to find the one that is appropriate for your data set and goals.
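As an illustration of labels being placed at the best matching neuron, here is a minimal sketch assuming the Python MiniSom package rather than SOM_PAK (the data and class overlap are made up): samples with similar feature values often land on the same node even though their labels differ.

```python
import numpy as np
from collections import defaultdict
from minisom import MiniSom

# Toy data: three "classes" A, B, C whose feature values overlap.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 4)),   # class A
    rng.normal(0.2, 0.3, size=(20, 4)),   # class B (overlaps A)
    rng.normal(0.2, 0.3, size=(20, 4)),   # class C (overlaps B)
])
labels = ["A"] * 20 + ["B"] * 20 + ["C"] * 20

# A larger map (more nodes) gives higher spatial resolution,
# but fewer samples per node and thus less statistical support per neuron.
som = MiniSom(6, 6, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(data)
som.train_random(data, 500)

# Each label is attached to the best matching unit of its sample;
# overlapping classes therefore often share the same hexagon/node.
node_labels = defaultdict(list)
for sample, label in zip(data, labels):
    node_labels[som.winner(sample)].append(label)
print(dict(node_labels))
```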
Actually I was just reading about this exact point, p. 19 in Kohonen's book "MATLAB Implementations and Applications of the Self-Organizing Map" available at http://docs.unigrafia.fi/publications/kohonen_teuvo/. It covers the MATLAB SOM-Toolkit that was created after SOM-PAK. The book only briefly covers SOM-PAK but I believe that the theory from the book would help out.