I am working on my final project. I chose to implement a neural network (NN) for character recognition.
My plan is to take 26 images containing the 26 English letters as training data, but I have no idea how to convert these images into inputs for my neural network.
Let's say I have a backpropagation neural network with 2 layers - a hidden layer and an output layer. The output layer has 26 neurons, one for each of the 26 letters. I created 26 images myself (each 100*100 pixels, in 24-bit BMP format), each containing a single English letter, so I don't need to do image segmentation. Since I am new to image processing, can you give me some suggestions on how to convert the images into input vectors in Matlab? (Or do I need to do edge detection, morphology, or other image pre-processing first?)
Thanks a lot.
Your NN will work only if the letters are registered consistently (the pixel positions are fixed). You need to convert the images to grayscale and pixelize them. In other words, use a grid that splits each image into squares. The squares have to be small enough to capture letter details but large enough that you don't use too many neurons. Each grid square (in grayscale) is one input to the NN. What is left is to determine how to connect the neurons, i.e., the NN topology. A two-layer NN should be enough. Most probably you should connect each input "pixel" to each neuron in the first layer, and each neuron in the first layer to each neuron in the second layer.
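The grid idea above can be sketched concretely. The question is MATLAB-oriented, but here is a minimal Python/NumPy illustration of the same technique: block-average a grayscale image onto a coarse grid and flatten it into an input vector (the grid size 10x10 is an arbitrary choice for illustration):

```python
import numpy as np

def image_to_input_vector(img, grid=(10, 10)):
    """Block-average a grayscale image onto a coarse grid and flatten it
    into a 1-D input vector. `img` is a 2-D array of gray values in [0, 255]."""
    h, w = img.shape
    gh, gw = grid
    img = img[:h - h % gh, :w - w % gw]       # trim so the grid divides evenly
    blocks = img.reshape(gh, h // gh, gw, w // gw)
    means = blocks.mean(axis=(1, 3))          # average gray value per grid square
    return (means / 255.0).ravel()            # normalized vector of length gh*gw

# A 100x100 letter image becomes a 100-element input vector:
img = np.random.randint(0, 256, (100, 100)).astype(float)
vec = image_to_input_vector(img, grid=(10, 10))
print(vec.shape)  # (100,)
```

Each element of the returned vector is the normalized mean intensity of one grid square, which is exactly the "one input per square" scheme described above.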
This doesn't directly answer the questions you asked, but might be useful:
1) You'll want more training data. Much more, if I understand you correctly (only one sample for each letter??)
2) This is a pretty common project, and if it's allowed, you might want to try to find already-processed data sets on the internet so you can focus on the NN component.
Since you will be doing character recognition, I suggest you use a SOM (self-organizing map) neural network, which is trained without labeled data. You would have 26 output neurons, one for each letter. For the image-processing part, Ross has a useful suggestion for isolating each letter.
Related
Recently, I've been playing with MATLAB's R-CNN deep learning example here. In this example, MATLAB designed a basic 15-layer CNN with an input size of 32x32. They use the CIFAR10 dataset, whose training images are also 32x32, to pre-train this CNN. Later they fine-tune this CNN on a small dataset of stop signs; this dataset has only 41 images, so they use these 41 images to fine-tune the CNN, i.e., to train an R-CNN network. This is how they detect a stop sign:
As you see, the bounding box covers almost the whole stop sign except a small part at the top.
Playing with the code I decided to fine tune the same network pre-trained on the CIFAR10 dataset with the PASCAL VOC dataset but only for the "aeroplane" class.
These are some results I get:
As you see, the detected bounding boxes barely cover the whole airplane, so the precision comes out as 0 when I evaluate them later. I understand that in the original R-CNN paper mentioned in the MATLAB example, the input size is 227x227 and the CNN has 25 layers. Could this be why the detections are not accurate? How does the input size of a CNN affect the end result?
Almost surely yes!
When you pass an image through a net, the net progressively reduces the data taken from the image until only the most relevant information remains; during this process, the spatial resolution of the input shrinks again and again (through convolution and pooling). If, for example, you feed the net an image smaller than the size it expects, much of the image's information may be lost during the pass through the net.
In your case, one possible reason for your results is that the net "looks for" features at a limited resolution, and a large airplane may have too high a resolution relative to what the net was trained on.
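The "shrinks again and again" point can be made concrete by tracing the spatial output size layer by layer with the standard formula floor((n + 2p - k)/s) + 1. The layer configuration below is a hypothetical small conv/pool stack, not the exact MATLAB example architecture:

```python
def conv_output_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k)/s) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

def trace(size, layers):
    """Print and return the spatial size after each layer."""
    for name, k, s, p in layers:
        size = conv_output_size(size, k, s, p)
        print(f"{name:5s} -> {size}x{size}")
    return size

# Hypothetical stack of three conv (5x5, pad 2) + pool (3x3, stride 2) pairs:
layers = [("conv", 5, 1, 2), ("pool", 3, 2, 0)] * 3
print("32x32 input:")
trace(32, layers)   # shrinks 32 -> 15 -> 7 -> 3
```

With a 32x32 input the feature map is only 3x3 after three pooling stages, which shows how little spatial detail survives compared to starting from a 227x227 input.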
So what I am trying to do is segment cursive handwritten English words into individual characters. I have applied a simple heuristic approach with artificial intelligence to do a basic over-segmentation of the words, something like this:
I am coding this in Matlab. The approach involves preprocessing, slant correction, size normalization, etc., then thinning the pen strokes to 1 pixel width and identifying the ligatures present in the image using the column sums of the image's pixels. Every column with a pixel sum lower than a threshold is a possible segmentation point. The problem is that open characters like 'u', 'v', 'm', 'n' and 'w' also have low column pixel sums and get segmented.
The approach I have used is a modified version of what is presented in this paper:
cursive script segmentation using neural networks.
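The column-sum heuristic described above can be sketched in a few lines. The work is in MATLAB, but here is an equivalent Python/NumPy illustration (the threshold value is an arbitrary example):

```python
import numpy as np

def candidate_segmentation_columns(binary_img, threshold=2):
    """Return indices of columns whose pixel sum falls below `threshold`.
    `binary_img` is a 2-D 0/1 array of a thinned word image; low column
    sums suggest a ligature, i.e., a possible segmentation point."""
    col_sums = binary_img.sum(axis=0)
    return np.where(col_sums < threshold)[0]

# Tiny example: two vertical "strokes" joined by a thin ligature.
img = np.zeros((5, 9), dtype=int)
img[:, 1] = 1      # first stroke: column sum 5
img[:, 7] = 1      # second stroke: column sum 5
img[2, 4] = 1      # ligature pixel: column 4 has sum 1 < threshold
print(candidate_segmentation_columns(img, threshold=2))
```

As the question notes, this is exactly why open characters over-segment: their interior columns also have low sums, so the heuristic flags them just like true ligatures.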
Now, to improve this arrangement, I have to use a neural network to examine these over-segmented points and recognize the bad segmentations. I will build the network with Matlab's 'newff' function and label the segments as good or bad manually, but I fail to understand what the input to that neural network should be.
My guess is that we have to give some image data along with the column number at which each possible segmentation is made (one segmentation point per training sample; the given image has about 40 segmentation points, so it will yield 40 training samples) and label it as a good or bad segment for training.
There will be just one output neuron telling us if the segmentation point is good or bad.
Can I give the column sums of all the columns as input to the input layer? How do I tell it which segmentation point this training instance refers to? Won't the actual column number we have to classify as a good or bad segment, which is the most important value here, drown in the sea of this n-dimensional input? (n being the pixel width of the image)
Since this was last asked: I am now using image features in the vicinity of each segmented column my heuristic algorithm has returned. These features (like the column-sum pixel density close to the segmented column) are my input to the neural network, which has a single output neuron. The target is 1 for a good segmentation point and 0 for a bad one.
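The per-column feature extraction described in the update could look like the following. This is a hypothetical Python/NumPy sketch (the window size and the density feature are illustrative assumptions, not the asker's exact feature set):

```python
import numpy as np

def segmentation_point_features(binary_img, col, window=5):
    """Hypothetical feature vector for one candidate segmentation column:
    column-sum pixel densities in a window centred on `col`, zero-padded
    where the window falls outside the image."""
    h, w = binary_img.shape
    lo, hi = col - window // 2, col + window // 2 + 1
    feats = []
    for c in range(lo, hi):
        if 0 <= c < w:
            feats.append(binary_img[:, c].sum() / h)  # density in [0, 1]
        else:
            feats.append(0.0)                         # pad outside the image
    return np.array(feats)

img = np.ones((4, 10), dtype=int)
print(segmentation_point_features(img, col=0, window=5))  # [0. 0. 1. 1. 1.]
```

Each candidate column then produces one fixed-length training sample, paired with a 0/1 target, which sidesteps the "drowning in an n-dimensional input" problem from the original question.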
I am designing an algorithm for OCR using a neural network. I have 100 images (each a [40x20] matrix) of each character, so my input should be 2600x800. I have some questions regarding the inputs and targets.
1) Is my input correct? And can all 2600 images be used in random order?
2) What should the target be? Do I have to define the target for all 2600 inputs?
3) As the target for the same character is a single vector, what is the final target matrix?
(26x800) or (2600x800)?
Your input looks correct. You have (I am guessing) 26 characters and 100 images of size 800 for each, so the matrix looks good. As a side note, that is a pretty big input size; you may want to consider doing PCA and training on the projections onto the leading principal components, or just reducing the size of the images. I have been able to train NNs with 10x10 images, but bigger == more difficult. Try it, and if it doesn't work, try PCA.
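The suggested PCA reduction can be sketched as below. This is a minimal Python/NumPy illustration (the choice of 50 components is arbitrary), not a prescription for the asker's MATLAB pipeline:

```python
import numpy as np

def pca_reduce(X, n_components=50):
    """Project the rows of X onto the leading principal components.
    X has shape (samples, features); returns (samples, n_components)."""
    Xc = X - X.mean(axis=0)                        # center each feature
    # SVD of the centered data: rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                # component scores

# The 2600x800 input matrix shrinks to 2600x50 before training:
X = np.random.rand(2600, 800)
X_small = pca_reduce(X, n_components=50)
print(X_small.shape)  # (2600, 50)
```

Training on 50 features instead of 800 raw pixels per sample usually makes a plain feed-forward NN much easier to fit.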
2) and 3) Of course you have to define targets: if you want to train a NN you need to give it inputs with outputs; how else are you going to train it? Your output should be of size 26x1 for each image, so the target matrix for training should be 2600x26. In each output row you should have a 1 at the index of the character the image belongs to and zeros everywhere else.
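Constructing that 2600x26 one-hot target matrix is a one-liner once you have a label per sample. A Python/NumPy sketch, assuming the samples are ordered 100 per letter (if you shuffle the inputs, apply the same permutation to the labels):

```python
import numpy as np

num_classes = 26
labels = np.repeat(np.arange(num_classes), 100)   # 100 images per letter -> 2600 labels
targets = np.zeros((labels.size, num_classes))
targets[np.arange(labels.size), labels] = 1.0     # one-hot: 1 at the class index

print(targets.shape)   # (2600, 26)
print(targets[0])      # 1 at index 0 (first letter), zeros elsewhere
```

Each row sums to exactly 1, so the targets can be used directly with a softmax/cross-entropy output layer or with 26 sigmoid output neurons.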
I read some books but still cannot figure out how I should organize the network. For example, I have a pgm image of size 120*100; what should the input look like (a one-dimensional array of size 120*100)? And how many nodes should I use?
It's typically best to organize your input image as a 2D matrix. The reason is that the layers at the lower levels of the neural networks used in machine perception tasks are typically locally connected. For example, each neuron of the first layer of such a neural net will only process the pixels of a small NxN patch of the input image. This naturally leads to a 2D structure which can be more easily described with 2D matrices.
For a detailed explanation I'll refer you to the DeepFace paper, which describes the state of the art in face recognition systems.
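The local-connectivity idea above — each first-layer neuron seeing only an NxN patch of the image — can be illustrated by enumerating those patches. A Python/NumPy sketch (patch size and stride are arbitrary illustration values):

```python
import numpy as np

def extract_patches(img, n=5, stride=5):
    """Extract the NxN patches a locally connected first layer would see.
    Returns an array of shape (num_patches, n, n)."""
    h, w = img.shape
    patches = [img[r:r + n, c:c + n]
               for r in range(0, h - n + 1, stride)
               for c in range(0, w - n + 1, stride)]
    return np.stack(patches)

# For the 120x100 image in the question, non-overlapping 5x5 patches:
img = np.zeros((120, 100))
patches = extract_patches(img, n=5, stride=5)
print(patches.shape)  # (480, 5, 5): 24 patch rows x 20 patch columns
```

Keeping the input as a 2D matrix makes this patch structure trivial to express, whereas a flattened vector would scatter each patch's pixels across distant indices.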
A 120*100 one-dimensional vector is fine. The locations of the pixel values in that vector do not matter, because all input nodes are fully connected to the nodes in the next layer anyway. But you must keep their locations consistent between training, validation, and testing.
The most successful approach so far has been a convolutional neural network with 2D input, just as @benoitsteiner stated. For a far simpler example I'd refer you to LeNet-5, a small neural network developed for MNIST handwritten digit recognition. It is used in EBLearn for face recognition with quite good results.
How are the weights assigned between the input neurons and the hidden neurons, and likewise between the hidden neurons and the output neurons? I am aware that the weights are given randomly at the beginning.
Secondly, I'm doing character recognition, and let's say that I have a character of size 8x8 pixels, meaning 64 input neurons. Does that mean I need to have 64 output neurons as well?
For the output layer size see my answer to the same question here.
I'm unsure what you mean by "how weights are given". Do you mean "trained"? If yes: usually by backpropagation. If you mean "how they are represented": usually as an array or a matrix.
If you want to read more about fine-tuning for backpropagation, read this paper by LeCun.
On another note: 1 pixel per input node is something you would rarely do. You generally don't feed raw data into the network, because it contains noise and irrelevant information. Find a representation, a model, an encoding or something similar before you feed it into the network. To understand how this is done, you have no choice but to do some research; there are too many possibilities to give a single clear answer.
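As one concrete example of such an encoding (this is just one of the "too many possibilities", not the answerer's prescription): projection histograms, i.e., the normalized pixel counts per row and per column. A Python/NumPy sketch:

```python
import numpy as np

def projection_features(img):
    """One simple encoding: row and column projection histograms
    (normalized pixel counts), a compact model of the character
    instead of its raw pixels."""
    img = (img > 0).astype(float)
    rows = img.sum(axis=1) / img.shape[1]   # one value per row
    cols = img.sum(axis=0) / img.shape[0]   # one value per column
    return np.concatenate([rows, cols])

# For the 8x8 character in the question this gives 16 features
# instead of 64 raw pixel inputs:
img = np.eye(8)   # a diagonal stroke as a toy "character"
print(projection_features(img).shape)  # (16,)
```

Such a representation is also more tolerant of small pixel-level noise than feeding the 64 raw pixels directly.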