Handwritten Word Segmentation using neural network - matlab

So what I am trying to do is to segment cursive handwritten English words into individual characters. I have applied a simple heuristic approach with artificial intelligence to do a basic over-segmentation of the words something like this:
I am coding this in Matlab. The approach involves preprocessing, slant correction size normalization etc and then thinning pen strokes to 1 pixel width and identify the ligatures present in the image using column sum of pixels of the image. Every column with pixel sum lower than a threshold is a possible segmentation point. Problem is open characters like 'u', 'v, 'm' 'n' and 'w' also have low column sum of pixels and gets segmented.
The approach I have used is a modified version of what is presented in this paper:
cursive script segmentation using neural networks.
Now to improve this arrangement I have to use a neural network to correct these over segmented points and recognize them as bad segmentations. I will write a 'newff' function for that and label the segments as good and bas manually but I fail to understand what should be the input to that neural network?
My guess is that we have to give some image data along with the column number at which possible segments are made(one segmentation point per training sample. The given image has about 40 segmentation points so it will lead to 40 training samples) and have it label as good or bad segment for training.
There will be just one output neuron telling us if the segmentation point is good or bad.
Can I give column sums of all the columns as input to the input layer? How do I tell it what the segmentation point for this training instance is? Won't the actual column number we have to classify as good or bad segment which is the most important value here drown in the sea of this n-dimensional input? (n being width of the image pixel-wise)

Since it has been last asked, I am now using image features in vicinity of each segmented column my heuristic algorithm has returned. These features (like column sum pixel density close to the segmented column) is my input to the neural network with a single output neuron. Target vectors are 1 for a good segmentation point and 0 for a bad one.

Related

Trying to find object coordinates (x,y) in image, my neural network seems to optimize error without learning [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I generate images of a single coin pasted over a white background of size 200x200. The coin is randomly chosen among 8 euro coin images (one for each coin) and has :
random rotation ;
random size (bewteen fixed bounds) ;
random position (so that the coin is not cropped).
Here are two examples (center markers added): Two dataset examples
I am using Python + Lasagne. I feed the color image into the neural network that has an output layer of 2 linear neurons fully connected, one for x and one for y.
The targets associated to the generated coin images are the coordinates (x,y) of the coin center.
I have tried (from Using convolutional neural nets to detect facial keypoints tutorial):
Dense layer architecture with various number of layers and number of units (500 max) ;
Convolution architecture (with 2 dense layers before output) ;
Sum or mean of squared difference (MSE) as loss function ;
Target coordinates in the original range [0,199] or normalized [0,1] ;
Dropout layers between layers, with dropout probability of 0.2.
I always used simple SGD, tuning the learning rate trying to have a nice decreasing error curve.
I found that as I train the network, the error decreases until a point where the output is always the center of the image. It looks like the output is independent of the input. It seems that the network output is the average of the targets I give. This behavior looks like a simple minimization of the error since the positions of the coins are uniformly distributed on the image. This is not the wanted behavior.
I have the feeling that the network is not learning but is just trying to optimize the output coordinates to minimize the mean error against the targets. Am I right? How can I prevent this? I tried to remove the bias of the output neurons because I thought maybe I'm just modifying the bias and all others parameters are being set to zero but this didn't work.
Is it possible for a neural network alone to perform well at this task?
I have read that one can also train a net for present/not present binary classification and then scan the image to find possible locations of objects. But I just wondered if it was possible just using the forward computation of a neural net.
Question : How can I prevent this [overfitting without improvement to test scores]?
What needs to be done is to re-architect your neural net. A neural net just isn't going to do a good job at predicting an X and Y coordinate. It can through create a heat map of where it detects a coin, or said another way, you could have it turn your color picture into a "coin-here" probability map.
Why? Neurons have a good ability to be used to measure probability, not coordinates. Neural nets are not the magic machines they are sold to be but instead really do follow the program laid out by their architecture. You'd have to lay out a pretty fancy architecture to have the neural net first create an internal space representation of where the coins are, then another internal representation of their center of mass, then another to use the center of mass and the original image size to somehow learn to scale the X coordinate, then repeat the whole thing for Y.
Easier, much easier, is to create a coin detector Convolution that converts your color image to a black and white image of probability-a-coin-is-here matrix. Then use that output for your custom hand written code that turns that probability matrix into an X/Y coordinate.
Question : Is it possible for a neural network alone to perform well at this task?
A resounding YES, so long as you set up the right neural net architecture (like the above), but it would probably be much easier to implement and faster to train if you broke the task into steps and only applied the Neural Net to the coin detection step.

Convolutional Neural Networks, Matrix of Convolutional (Kernel)

Good afternoon! In the first stage where on input of Convolutional Neural Network (input layer) we recieve a source image (hence an image of handwritten English letter). First of all we are using an nxn window which goes from left to right for scanning image and multiplication on kernel (convolutional matrix) to build Feature maps? But nowhere written about what exact values a kernel should be have (In other words on what Kernel values I should multiply data retrieved from n*n window ). Is it suitable to multiply data on this Convolutional Kernel intended for edge detection? There a numerous Convolutional Kernels (Emboss, Gaussian Filter, Edge detection, Angle detection, etc.)? But nowhere is written to what exact kernel it is need to multiply data for detecting hand written symbols.
Sample of Edge detection 3*3 kernel
Convolutional operation for multiplication on kernel
In addition, if size of entire image is 30*30, than is it possible to use window of 5*5 for building feature maps? Would it be enough sufficient for reaching optimal precision of letter detection?
On what exact kernel it is best to multiply area of entire image for the maximum precision of letter recognition? Or initially all values in kernel is equaled to 0? Could i also ask, what formula or rule is applied to detect overall needed amount of to be built feature maps? Or if the task is in letter recognition of English Language, than in each stage of Feature maps building process there must be exact 25 feature maps? Thank you for reply!
In a CNN, the convolutional kernel is a shared weight matrix, and is learned in a similar way to other weights. It is initialized in the same way, with small random values, and the weight deltas from back propagation are summed across all the features that receive its output (i.e. usually all "pixels" in the output of the convolutional layer)
A typical random kernel will perform a little like an edge detector.
After training, the first CNN layer can be displayed and will often have learned some kernels that can be interpreted if you are familiar with image processing
There is a nice animated view of kernel features being learned here: http://cs.nyu.edu/~yann/research/sparse/
In short your answer is this: There is no need to look for correct kernels to use. Instead look for a CNN library where you set params such as number of convolutional layers, and research the way to view the kernels as they learn - most CNN libraries will have a documented way to visualise them.

Regarding Assignment of Input and Target in Neural Network

I am designing an algorithm for OCR using a neural network. I have 100 images([40x20] matrix) of each character so my input should be 2600x800. I have some question regarding the inputs and target.
1) is my input correct? and can all the 2600 images used in random order?
2) what should be the target? do I have to define the target for all 2600 inputs?
3) as the target for the same character is single, what is the final target vector?
(26x800) or (2600x800)?
Your input should be correct. You have (I am guessing) 26 characters and 100 images of size 800 for each, therefore the matrix looks good. As a side note, that looks pretty big input size, you may want to consider doing PCA and using the eigenvalues for training or just reduce the size of the images. I have been able to train NN with 10x10 images, but bigger== more difficult. Try, and if it doesn't work try doing PCA.
(and 3) Of course, if you want to train a NN you need to give it inputs with outputs, how else re you going to train it? You ourput should be of size 26x1 for each of the images, therefore the output for training should be 2600x26. In each of the outputs you should have 1 for the character index it belongs and zero in the rest.

KNN Classifier using cross validation

I am trying to implement KNN classifier using the cross validation approach where I have different images of a certain character for training(e.g 5 images) and another two for testing. Now I get the idea of the cross validation by simply choosing the K with the least error value when training & then using it with the test data to find how accurate my results are.
My question is how do I train images in matlab to get my K value? Do I compare them and try to find mismatch or what?!
Any help would be really appreciated.
First of you need to define your task precisely. F.ex Given an image I in R^(MxN) we wish to classify I as an image containing faces or an image without faces.
I often work with pixel classifiers, where the task is something like: For an image I decide if each pixel is a face pixel or a non-face pixel.
An important part of defining the task is to make a hypotheses that can be used as basis for training a classifier. F.ex We believe that the distribution of pixel intensities can be used to discriminate images of faces from images not containing faces.
Then you need to select some features that define your image. This can be done in many ways and you should search for what other people do when they analyse the same type of images you are working with.
One widely used method in pixel classification is to use pixel intensity values and do a multi-scale analysis of the image. The idea in multi-scale analysis is that different structures are most evident at different level of blurring called scales. As an illustration consider an image of a tree. Without blurring we notice the fine structure, such as small branches and leafs. When we blur the image we notice the trunk and major branches. This is often used as part of segmentation methods.
When you know your task and the features, you can train a classifier. If you use kNN and cross-validation to find the best k, you should split you dataset in train/testing and then split the training set in train/validate sets. You then train using the reduced training set and use the validation set to decide which k is the best. In the case of binary classification e.g face vs non-face the error rate is often used as a measure of performance.
Finally you use the parameters to train the classifier on the full dataset and estimate its performance on the test set.
A classification example: With or without milk?
As a full example, consider images of a cup of coffee taken from above so it shows the rim of the cup surrounding a brownly colored disk. Further assume that all images are scaled and cropped so the diameter of the disk is the same and dimensions of the image are the same. To simplify the task, we convert the color image to grayscale and scale the pixel intensities to the range [0,1].
We want to train a classifier so it can distinguish coffee with milk from coffee without milk. From inspection of histograms of some of the coffee images, we see that each image has two "bumps" in the histogram that are clearly separated. We believe that these bumps correspond to foreground (coffee) and background. Now we make the hypothesis that the average intensity of the foreground can be used to distinguish between coffee+milk/coffee.
To find the foreground pixels we observe that because the foreground/background ratio is the same (by design) we can just find the intensity value that gives us that ratio for each image. Then we calculate the average intensity of the foreground pixels and use this value as a feature for each image.
If we have N images that we have manually labeled, we split this into training and test set. We then calculate the average foreground intensity for each image in the training set, giving us a set of (average foreground intensity, label) values. We want to use kNN where an image is assigned the same class as the majority class of the k closest images. We measure the distance as the absolute value of the difference in average foreground pixel intensity.
We search for the optimal k with cross validation. We use 2-fold cross validation (aka holdout) to find the best k. We test k = {1,3,5} and select the k that gives the least prediction error on the validation set.

Characters Recognition for Matlab Neural Network

I am working on my final project. I chose to implement a NN for characters recognition.
My plan is to take 26 images containg 26 English letters as training data, but I have no idea how to convert these images as inputs to my neural network.
Let's say I have a backpropagation neural network with 2 layers - a hidden layer and an output layer. The output layer has 26 neurons that produces 26 letters. I self created 26 images (size is 100*100 pixels in 24bit bmp format) that each of them contains a English letter. I don't need to do image segmentation, Because I am new to the image processing, so can you guys give me some suggestions on how to convert images into input vectors in Matlab (or do I need to do edge, morphology or other image pre-processing stuffs?).
Thanks a lot.
You NN will work only if the letters are the same (position of pixels is fixed). You need to convert images to gray-scale and pixelize them. In other words, use grid that split images on squares. Squares have to be small enough to get letter details but large enough so you don't use too much neurons. Each pixel (in gray scale) is a input for the NN. What is left is to determine the way to connect neurons e.g NN topology. Two layers NN should be enough. Most probably you should connect each input "pixel" to each neuron at first layer and each neuron at first layer to each neuron at second layer
This doesn't directly answer the questions you asked, but might be useful:
1) You'll want more training data. Much more, if I understand you correctly (only one sample for each letter??)
2) This is a pretty common project, and if it's allowed, you might want to try to find already-processed data sets on the internet so you can focus on the NN component.
Since you will be doing character recognition I suggest you use a SOM neural network which does not require any training data; You will have 26 input neurons one neuron for each letter. For the image processing bit Ross has a usefull suggestion for isolating each letter.