In the Wikipedia article about the MNIST database, it is said that the lowest error rate is achieved by a "committee of 35 convolutional networks" with the scheme 1-20-P-40-P-150-10.
What does this scheme mean?
The numbers are probably neuron counts. But what does the 1 mean then?
What do the P letters mean?

In this particular scheme, 'P' means a pooling layer.
So the basic structure is as follows:
One grayscale input image
20 images after the convolution layer (20 different filters)
Pooling layer
40 outputs from the next convolution
Pooling layer
150: this can be either 150 small convolution outputs or simply 150 fully-connected neurons
10 fully-connected output neurons
That is why it is written 1-20-P-40-P-150-10. It is not the best notation, but it is still fairly clear if you are familiar with CNNs.
You can read more about the internal structure of CNNs in Yann LeCun's foundational paper "Gradient-Based Learning Applied to Document Recognition".
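To make the notation concrete, here is a rough sketch that walks through the scheme and prints the shape after each layer. The scheme itself does not give the input size, kernel size or pooling window, so the 28x28 input, 5x5 kernels (stride 1, no padding) and 2x2 pooling below are assumptions, and the 150 is treated as a fully-connected layer. The function names are made up for this illustration.

```python
# Shape walk-through for the 1-20-P-40-P-150-10 scheme.
# Assumed (not stated in the scheme): 28x28 input, 5x5 kernels,
# stride 1, no padding, and 2x2 non-overlapping pooling.

def conv_out(size, kernel):
    """Side length after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, window):
    """Side length after non-overlapping pooling."""
    return size // window

def walk(scheme, side=28, kernel=5, window=2):
    """Return a list of (layer kind, output shape) tuples."""
    tokens = scheme.split("-")
    maps = int(tokens[0])                 # leading 1 = one grayscale input image
    shapes = [("input", (maps, side, side))]
    for tok in tokens[1:-2]:              # convolution / pooling part
        if tok == "P":
            side = pool_out(side, window)
            shapes.append(("pool", (maps, side, side)))
        else:
            maps = int(tok)
            side = conv_out(side, kernel)
            shapes.append(("conv", (maps, side, side)))
    for tok in tokens[-2:]:               # last two numbers treated as dense layers
        shapes.append(("dense", (int(tok),)))
    return shapes

for kind, shape in walk("1-20-P-40-P-150-10"):
    print(kind, shape)
```

With different kernel or pooling sizes the spatial dimensions change, but the number of feature maps per layer (1, 20, 40) is what the scheme actually encodes.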


Why is tanh performing better than ReLU in a simple neural network?

Here is my scenario.
I have used the EMNIST database of capital letters of the English language.
My neural network is as follows:
The input layer has 784 neurons, which are the pixel values of a 28x28 grayscale image divided by 255, so each value is in the range [0,1].
The hidden layer has 49 neurons, fully connected to the previous 784.
The output layer has 9 neurons denoting the class of the image.
The loss function is defined as the cross entropy of the softmax of the output layer.
All weights are initialized as random real numbers from [-1,+1].
I then trained with 500 fixed samples for each class.
Put simply, I passed 500x9 images to a train function which uses backpropagation and runs 100 iterations, changing each weight by learning_rate*derivative_of_loss_wrt_corresponding_weight.
I found that when I use tanh activation on the neurons, the network learns faster than with relu, at a learning rate of 0.0001.
I concluded this because accuracy on a fixed test dataset was higher for tanh than for relu. Also, the loss value after 100 epochs was slightly lower for tanh.
Isn't relu expected to perform better?
In general, no. ReLU will perform better on many problems, but not on all problems.
Furthermore, if you use an architecture and set of parameters that is optimized to perform well with one activation function, you may get worse results after swapping in a different activation function.
Often you will need to adjust the architecture and parameters such as the learning rate to get comparable results. In your example, this may mean changing the number of hidden nodes and/or the learning rate.
One final note: In the MNIST example architectures I have seen, hidden layers with ReLU activations are typically followed by Dropout layers, whereas hidden layers with sigmoid or tanh activations are not. Try adding dropout after the hidden layer and see if that improves your results with ReLU. See the Keras MNIST example here.
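For anyone who wants to repeat the comparison, here is a minimal sketch of the 784-49-9 forward pass from the question with a swappable hidden activation. It uses only the Python standard library. The bias initialization, the random stand-in input and all function names are assumptions for illustration, not the asker's actual code. It only shows how to hold everything else fixed while swapping the activation and says nothing about which one wins.

```python
import math
import random

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def softmax(zs):
    m = max(zs)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Weights drawn uniformly from [-1, 1] as in the question (biases too: an assumption)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return weights, biases

def dense(x, layer, activation=None):
    """Fully-connected layer; applies `activation` element-wise if given."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(z) for z in out] if activation else out

def forward(x, hidden, output, activation):
    """784 -> 49 -> 9 forward pass; returns class probabilities."""
    return softmax(dense(dense(x, hidden, activation), output))

def cross_entropy(probs, label):
    return -math.log(max(probs[label], 1e-12))   # clamp to avoid log(0)

rng = random.Random(0)
hidden = init_layer(784, 49, rng)
output = init_layer(49, 9, rng)
x = [rng.random() for _ in range(784)]           # stand-in for one scaled image

# Same weights, same input; only the hidden activation differs.
for name, act in (("tanh", tanh), ("relu", relu)):
    probs = forward(x, hidden, output, act)
    print(name, round(cross_entropy(probs, label=3), 3))
```

Keeping the seed, weights and data identical across both runs is the point of the design: any difference in the loss then comes from the activation alone, which makes it easier to see whether a different learning rate or initialization is needed for one of them.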

Can a convolutional neural network be built with perceptrons?

I was reading this interesting article on convolutional neural networks. It showed this image, explaining that for every receptive field of 5x5 pixels/neurons, a value for a hidden value is calculated.
We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information.
So max-pooling is applied.
With multiple convolutional layers, it looks something like this:
But my question is: this whole architecture could be built with perceptrons, right?
For every convolutional layer, one perceptron is needed, with layers:
input_size = 5x5;
hidden_size = 10; e.g.
output_size = 1;
Then, for every receptive field in the original image, the 5x5 area is fed into the perceptron to output the value of one neuron in the hidden layer. So basically this is done for every receptive field:
So the same perceptron is used 24x24 times to construct the hidden layer, because:
is that we're going to use the same weights and bias for each of the 24×24 hidden neurons.
And this works from the hidden layer to the pooling layer as well, with input_size = 2x2; output_size = 1;. In the case of a max-pool layer, it's just a max() function on an array.
and then finally:
The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the max-pooled
layer to every one of the 10 output neurons.
which is a perceptron again.
So my final architecture looks like this:
-> 1 perceptron for every convolutional layer/feature map
-> run this perceptron for every receptive field to create feature map
-> 1 perceptron for every pooling layer
-> run this perceptron for every field in the feature map to create a pooling layer
-> finally input the values of the pooling layer in a regular ALL to ALL perceptron
Or am I overlooking something? Or is this already how they are programmed?
The answer very much depends on what exactly you call a Perceptron. Common options are:
Complete architecture. Then no, simply because it's by definition a different NN.
A model of a single neuron, specifically y = 1 if (w.x + b) > 0 else 0, where x is the input of the neuron, w and b are its trainable parameters and w.x denotes the dot product. Then yes, you can force a bunch of these perceptrons to share weights and call it a CNN. You'll find variants of this idea being used in binary neural networks.
A training algorithm, typically associated with the Perceptron architecture. This reading makes little sense for the question, because the learning algorithm is in principle orthogonal to the architecture. That said, you cannot really use the Perceptron algorithm for anything with hidden layers, which would suggest no as the answer in this case.
Loss function associated with the original Perceptron. This notion of Perceptron is orthogonal to the problem at hand; your loss function with a CNN is given by whatever you try to do with your whole model. You can eventually use it, but it is non-differentiable, so good luck :-)
A sidenote rant: You can see people refer to feed-forward, fully-connected NNs with hidden layers as "Multilayer Perceptrons" (MLPs). This is a misnomer; there are no Perceptrons in MLPs, see e.g. this discussion on Wikipedia -- unless you go explore some really weird ideas. It would make sense to call these networks Multilayer Linear Logistic Regression, because that's what they used to be composed of. Up till like 6 years ago.
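To illustrate the "single neuron with shared weights" reading, here is a small sketch in plain Python: one threshold unit is applied to every 5x5 receptive field of a 28x28 image, giving a 24x24 feature map, followed by 2x2 max-pooling. The toy checkerboard image, the weight values and the function names are all made up for the example. Note that a hard threshold like this cannot be trained with backpropagation as-is.

```python
def perceptron(inputs, weights, bias):
    """Single threshold unit: 1 if w.x + b > 0 else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def feature_map(image, weights, bias, field=5):
    """Apply the SAME unit to every field x field receptive field (stride 1)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - field + 1):
        row = []
        for c in range(cols - field + 1):
            patch = [image[r + i][c + j] for i in range(field) for j in range(field)]
            row.append(perceptron(patch, weights, bias))
        out.append(row)
    return out

def max_pool(fmap, window=2):
    """Non-overlapping max-pooling: just max() over each window."""
    rows, cols = len(fmap), len(fmap[0])
    return [[max(fmap[r + i][c + j] for i in range(window) for j in range(window))
             for c in range(0, cols - window + 1, window)]
            for r in range(0, rows - window + 1, window)]

# Toy 28x28 checkerboard image and arbitrary (untrained) shared parameters.
image = [[(r + c) % 2 for c in range(28)] for r in range(28)]
weights = [1.0] * 25
bias = -12.5

fmap = feature_map(image, weights, bias)
pooled = max_pool(fmap)
print(len(fmap), len(fmap[0]))       # 24 24
print(len(pooled), len(pooled[0]))   # 12 12
```

One feature map corresponds to one (weights, bias) pair; a layer with several feature maps would simply repeat `feature_map` with several independent pairs.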

How to enforce feature vector representing label probability with Caffe siamese CNN?

Related to: How to Create CaffeDB training data for siamese networks out of image directory
Suppose I have N labels. How can I enforce that the feature vector of size N right before the contrastive loss layer represents some kind of probability for each class? Or does that come automatically with the siamese net design?
If you only use contrastive loss in a Siamese network, there is no way of forcing the net to classify into the correct label - because the net is only trained using "same/not same" information and does not know the semantics of the different classes.
What you can do is train with multiple loss layers.
You should aim at training a feature representation that is rich enough for your domain, so that by looking at the trained feature vector of some input (in some high dimension) you can easily classify that input into the correct class. Moreover, given the feature representations of two inputs, one should be able to easily say whether they are "same" or "not same".
Therefore, I recommend that you train your deep network with two loss layers, both with the output of one of the "InnerProduct" layers as "bottom". One loss is the contrastive loss. The other loss should have another "InnerProduct" layer with num_output: N and a "SoftmaxWithLoss" layer.
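As a rough illustration of what training with two losses means numerically, here is a plain-Python sketch of the combined objective for a single pair. It is not Caffe code. It uses a common form of the contrastive loss, with squared distance for "same" pairs and a squared margin hinge for "not same" pairs, plus a softmax cross-entropy on the N class scores of one branch. The function names, the `weight` factor and the default margin are assumptions for the example.

```python
import math

def contrastive_loss(feat_a, feat_b, same, margin=1.0):
    """Contrastive loss for one pair; `same` is 1 for matching pairs, else 0."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
    if same:
        return 0.5 * dist ** 2                      # pull matching pairs together
    return 0.5 * max(margin - dist, 0.0) ** 2       # push others beyond the margin

def softmax_loss(scores, label):
    """Cross-entropy of softmax(scores) against the true class index."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

def joint_loss(feat_a, feat_b, same, scores_a, label_a, weight=1.0):
    """Verification (contrastive) term plus weighted identification (softmax) term."""
    return contrastive_loss(feat_a, feat_b, same) + weight * softmax_loss(scores_a, label_a)

# Toy pair: two 2-D features of the same class, and N = 3 class scores for branch a.
print(joint_loss([0.1, 0.2], [0.1, 0.3], same=1, scores_a=[2.0, 0.5, -1.0], label_a=0))
```

Only the softmax term knows about class semantics; the contrastive term only shapes distances between features. That is why the N-way output only behaves like class probabilities when the second loss is present.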
A similar concept was used in this work:
Sun, Chen, Wang and Tang, "Deep Learning Face Representation by Joint Identification-Verification", NIPS 2014.

Number of feature maps in convolution neural network

I've read this article, http://www.codeproject.com/Articles/143059/Neural-Network-for-Recognition-of-Handwritten-Di, and got stuck at this part:
Layer #0: is the gray scale image of the handwritten character in the MNIST database which is padded to 29x29 pixel. There are 29x29= 841 neurons in the input layer.
Layer #1: is a convolutional layer with six (6) feature maps. There are 13x13x6 = 1014 neurons, (5x5+1)x6 = 156 weights, and 1014x26 = 26364 connections from layer #1 to the previous layer.
How can we get six (6) feature maps just from a convolution on the image?
I think we would get only one feature map. Or am I wrong?
I'm doing research on convolutional neural networks.
Six different kernels (or filters) are convolved with the same image to generate six feature maps.
Layer #0: The input image has 29x29 pixels and thus 29*29 = 841 neurons (input neurons).
Layer #1: The convolutional layer uses 6 different kernels (or filters) of size 5x5 pixels with a stride of 2 (the shift applied while sliding a kernel over the input). Convolving them with the 29x29 input image generates 6 different 13x13 feature maps, hence 13x13x6 = 1014 neurons.
Each filter has 5x5 weights plus one bias, i.e. (5x5)+1 = 26 parameters, and as we have 6 kernels (or filters), this gives 6*[(5x5)+1] = 156 weights.
Each of the 1014 neurons is connected to its 5x5 receptive field and to a bias, i.e. 26 connections per neuron, which finally gives 1014*26 = 26364 connections from Layer #0 to Layer #1.
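The numbers for Layer #1 can be checked with a few lines of Python. This is only a sketch of the arithmetic, with a made-up function name. It assumes no padding and takes the factor 26 in the connection count to be the 5x5 kernel weights plus one bias per neuron.

```python
def conv_layer_counts(input_side, kernel, stride, maps):
    """Return (feature-map side, neurons, shared weights, connections)."""
    out_side = (input_side - kernel) // stride + 1   # (29 - 5) / 2 + 1 = 13
    neurons = out_side * out_side * maps             # 13 * 13 * 6
    per_neuron = kernel * kernel + 1                 # 25 weights + 1 bias = 26
    weights = per_neuron * maps                      # one kernel (+ bias) per map
    connections = neurons * per_neuron               # every neuron uses all 26
    return out_side, neurons, weights, connections

print(conv_layer_counts(29, 5, 2, 6))   # (13, 1014, 156, 26364)
```

The gap between 156 weights and 26364 connections is the effect of weight sharing: all 169 neurons of one feature map reuse the same 26 parameters.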
You should go through Section II of the research paper by Y. LeCun, L. Bottou, Y. Bengio, "Gradient-Based Learning Applied to Document Recognition", to understand convolutional neural networks (I recommend reading the whole paper).
Another place where you can find a detailed explanation and a Python implementation of a CNN is here. If you have time, I recommend going through this site for more details about deep learning.
Thank you.
You get six feature maps by convolving six different kernels with the same image.

Characters Recognition for Matlab Neural Network

I am working on my final project. I chose to implement a NN for character recognition.
My plan is to take 26 images containing the 26 English letters as training data, but I have no idea how to convert these images into inputs for my neural network.
Let's say I have a backpropagation neural network with 2 layers: a hidden layer and an output layer. The output layer has 26 neurons that produce the 26 letters. I created 26 images myself (100*100 pixels, 24-bit BMP format), each containing one English letter. I don't need to do image segmentation. Because I am new to image processing, can you give me some suggestions on how to convert the images into input vectors in Matlab (or do I need to do edge detection, morphology or other image pre-processing)?
Thanks a lot.
Your NN will work only if the letters are the same (the position of the pixels is fixed). You need to convert the images to grayscale and pixelize them. In other words, use a grid that splits each image into squares. The squares have to be small enough to capture the letter details but large enough that you don't use too many neurons. Each pixel (in grayscale) is an input to the NN. What is left is to determine how to connect the neurons, i.e. the NN topology. A two-layer NN should be enough. Most probably you should connect each input "pixel" to each neuron in the first layer, and each neuron in the first layer to each neuron in the second layer.
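The grid idea can be sketched as follows. This is Python rather than Matlab and is meant only to show the steps: convert each RGB pixel to a gray value in [0,1], average the pixels inside each grid square, and flatten the result into one input vector. The nested-list image layout, the 10x10 grid and the function names are assumptions for the example. The gray conversion uses the usual luminance weights 0.299, 0.587 and 0.114.

```python
def to_gray(pixel):
    """Convert one (R, G, B) pixel with 0..255 channels to a gray value in [0, 1]."""
    r, g, b = pixel
    return (0.299 * r + 0.587 * g + 0.114 * b) / 255.0

def image_to_vector(image, grid=10):
    """Average each grid square of the image into one value and flatten row by row."""
    rows, cols = len(image), len(image[0])
    bh, bw = rows // grid, cols // grid          # square height and width in pixels
    vector = []
    for gr in range(grid):
        for gc in range(grid):
            block = [to_gray(image[r][c])
                     for r in range(gr * bh, (gr + 1) * bh)
                     for c in range(gc * bw, (gc + 1) * bw)]
            vector.append(sum(block) / len(block))
    return vector

# Toy 100x100 image: left half black, right half white.
white, black = (255, 255, 255), (0, 0, 0)
image = [[black if c < 50 else white for c in range(100)] for r in range(100)]

vec = image_to_vector(image, grid=10)
print(len(vec))   # 100 inputs for the network instead of 10000
```

The grid size is the trade-off mentioned above: a finer grid keeps more letter detail but needs more input neurons.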
This doesn't directly answer the questions you asked, but might be useful:
1) You'll want more training data. Much more, if I understand you correctly (only one sample for each letter??)
2) This is a pretty common project, and if it's allowed, you might want to try to find already-processed data sets on the internet so you can focus on the NN component.
Since you will be doing character recognition, I suggest you use a SOM neural network, which does not require labeled training data. You will have 26 map (output) neurons, one neuron for each letter. For the image-processing part, Ross has a useful suggestion for isolating each letter.