Tesseract OCR: How to find the read-error magnitude of each returned character? - iPhone

I'm using the Tesseract OCR engine in an iPhone application to read specific numeric fields from photos of bill invoices.
With a lot of photo pre-processing (adaptive thresholding, artifact cleaning, etc.), the results are now fairly accurate, but there are still some cases I want to improve.
If the user takes a photo in low-light conditions and there is some noise or artifacts in the picture, the OCR engine interprets these artifacts as additional digits. In some rare cases it can read, e.g., a numeric amount of "32,15" EUR as "5432,15" EUR, and that is very bad for the end user's confidence in the product.
I assume that if there is an internal read-error value associated with each character read, it will be higher for the "54" digits in my example, since they are recognized from small noise pixels; with access to these reading-error values I could easily discard the erroneous digits.
Do you know of any method to get a reading-error magnitude (or any "accuracy factor" value) for each individual character returned by the Tesseract OCR engine?

It is called a "confidence" value in Tesseract terminology. Searching for that term in the tesseract-ocr group turns up many answers that mention a TesseractExtractResult method.
The hOCR output also contains this value.
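For a quick look at those values, here is a minimal sketch using the pytesseract Python wrapper (an illustration only; an iPhone app would call the underlying C++ API, whose ResultIterator can report confidences down to the single-symbol level). The image name and the cutoff mentioned in the comments are assumptions:

    # Sketch: per-word confidence values via pytesseract ("invoice.png" is a placeholder).
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(Image.open("invoice.png"),
                                     output_type=pytesseract.Output.DICT)

    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)              # conf is -1 for non-text blocks
        if word.strip() and conf >= 0:
            print(word, conf)

    # Digits hallucinated from noise tend to come back with low confidence,
    # so a cutoff (e.g. discarding words below 60) can filter them out.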

Related

NVIDIA DIGITS: Get accuracy for segmented images

I am trying to use NVIDIA DIGITS to perform image segmentation (by following this article). After training the model, I was able to perform the segmentation on individual images.
Now I would like to give DIGITS a list of test images and labeled images and have it calculate the accuracy for those inputs. Is this possible with DIGITS, and if so, what are the steps (which options to select in the framework while training/testing the model, the input file format, the folder structure, etc.)?
Also, if DIGITS can't do it, are there other frameworks that can? (I read this post, but it would be better if I could start with a plug-and-play framework.)
Thanks!

Tesseract mixing up "1" and "7" despite training on exact font

I am using Tesseract to get text from an image; I am only interested in numbers. I have trained Tesseract and created a new language for the exact font in the image, and the training data included only numbers. The training data also covered every possible value that could appear in an image (1-5000, to be specific), and I created a wordlist of those same values. However, it still mixes up 1 and 7, and sometimes 3 and 8. Does anybody have any recommendations on whether I should retrain differently or do some preprocessing on the image before giving it to Tesseract?
1. Make sure there are at least 20 instances of every character in the training texts you provide to Tesseract. I give at least 6 pages of the same font to get a decent training sample size.
2. Tesseract's text recognition also depends on the image quality. Check out the possible preprocessing algorithms you can use: Improve Quality of Tesseract.
3. Take a look at the number_dawg file. Modifying it can help with recognising digits.
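Before retraining, it is also worth telling Tesseract that the image can only contain digits. A minimal sketch using the pytesseract Python wrapper (tessedit_char_whitelist is a real Tesseract variable, although some Tesseract 4.x LSTM builds ignore it; the file name and page-segmentation mode are assumptions):

    # Sketch: restrict recognition to digits via tessedit_char_whitelist.
    # "--psm 7" treats the image as a single text line; "meter.png" is a placeholder.
    import pytesseract
    from PIL import Image

    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    print(pytesseract.image_to_string(Image.open("meter.png"), config=config).strip())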

How to extract digits (numbers) using MATLAB

At work I have to record a lot of data from PNG images. Every time, I have to manually type the digits (e.g. mean\SD 101.1\11) into an Excel sheet and then read them with MATLAB. Would it be possible for MATLAB to read the digits directly from the PNG image, so that a lot of work could be saved?
I know it might involve pattern recognition, but I still hope that someone here has done this before.
You can make use of Optical Character Recognition (OCR). The code for it is available here.
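If MATLAB itself is not a hard requirement, the same idea takes a few lines in Python; a sketch with pytesseract plus a regex (the file name and the number pattern are assumptions):

    # Sketch: OCR a stats screenshot and pull out the numbers with a regex.
    # "stats.png" stands in for one of the PNG files mentioned above.
    import re
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(Image.open("stats.png"))
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    print(numbers)    # e.g. ['101.1', '11'] for a "mean\SD 101.1\11" field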

What is the relation between OCR and Artificial Neural Network?

I have seen various articles about OCR form recognition (data extraction) that say they used a neural network to do the form recognition, so what is the relation between an artificial neural network (ANN) and form recognition? If I want to extract fields from a business card, is an ANN required or optional? In other words, when do I need to use an ANN and when don't I?
It's a little different. An ANN is just one "expert" in OCR, but OCR engines contain many experts. When you study ANNs you will build a simple OCR engine using just an ANN, but that does not compare to modern engines, which use an ANN in conjunction with tri-grams, morphology, data types (very important for BCR and forms), dictionaries, connected-components algorithms, etc. So look at the ANN as just one of the tools in the bag of tricks used to extract quality results. A good engine will incorporate an ANN and all the others. BCR has additional considerations and should lean heavily on connected components and dictionaries first, then use an ANN and pattern matching for the actual recognition.
An ANN is one way to perform OCR. There are others. Hence, if you want to extract fields from a business card, using an ANN is only optional.
Good question. I recently spent some time playing with OCRopus, a Google project that does OCR - you can get it for free and play with it yourself. I'm pretty sure that it has an ANN as one of the modules behind it. However, the whole process of Optical Character Recognition can have many steps (lots of different little modules that each do something and pass the results to the next module).
So, here are some of the things I remember as being done by modules in that project:
There was a module that turned the image into black and white - this makes it easier for later modules to deal with.
Getting rid of speckles (despeckling).
Straightening out the lines of text.
Breaking lines of text into individual words (it's been a few weeks, not sure about this one)
Basically, you can do the above with little bits of code that don't involve a neural net, so it's simpler to handle those steps that way.
The neural net, I think, is used just to recognize the individual characters: which character out of a group of possible characters it is.
There's a training command in OCRopus that I had running for over a week on end; it kept sending line samples to the map, slowly changing the map as it went. I think it was training the ANN part.

How to segment text images using MATLAB?

It's part of the OCR process, namely:
How do you segment sentences into words, and then into characters?
What are candidate algorithms for this task?
As a first pass:
process the text into lines
process a line into segments (connected parts)
find the largest white band that can be placed between each pair of segments.
look at the sequence of widths and select "large" widths as white space.
everything between white space is a word.
Now all you need is a good enough definition of "large"; a rough sketch of the whole recipe follows.
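Here is that recipe in Python with OpenCV/NumPy rather than MATLAB, purely for illustration; the projection-based line split and the mean-gap definition of "large" are my own assumptions, not a tested recipe:

    # Sketch of the recipe above: lines via row projection, words via column-gap widths.
    # Python/OpenCV stand-ins for MATLAB toolbox calls; thresholds are guesses to tune.
    import cv2
    import numpy as np

    binary = cv2.threshold(cv2.imread("page.png", cv2.IMREAD_GRAYSCALE),
                           0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    def runs(indices):
        # Group consecutive indices into runs, e.g. [1,2,3, 7,8] -> two runs.
        return [r for r in np.split(indices, np.where(np.diff(indices) > 1)[0] + 1)
                if len(r)]

    # 1. Rows that contain ink, grouped into text lines.
    for line_rows in runs(np.where(binary.sum(axis=1) > 0)[0]):
        line = binary[line_rows[0]:line_rows[-1] + 1]
        cols = np.where(line.sum(axis=0) > 0)[0]
        line = line[:, cols[0]:cols[-1] + 1]   # trim margins so only interior gaps count

        # 2./3. Empty-column runs inside the line are the white bands.
        widths = [len(b) for b in runs(np.where(line.sum(axis=0) == 0)[0])]

        # 4./5. Call a band "large" (a word break) if it is wider than the mean band.
        n_breaks = sum(w >= np.mean(widths) for w in widths) if widths else 0
        print(f"line height {line.shape[0]}px: about {n_breaks + 1} word(s)")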
First, NIST (the National Institute of Standards and Technology) published a protocol known as the NIST Form-Based Handwriting Recognition System about 15 years ago for this exact question, i.e., extracting and preparing text-as-image data for input to machine learning algorithms for OCR. Members of this group at NIST also published a number of papers on the system.
The performance of their classifier was demonstrated by data published along with the algorithm (the "NIST Handwriting Sample Forms").
Each of the half-dozen or so OCR data sets I have downloaded and used references the data extraction/preparation protocol used by NIST to prepare the data for input to their algorithm. In particular, I am pretty sure this is the methodology relied on to prepare the Boston University Handwritten Digit Database, which is regarded as benchmark reference data for OCR.
So even if the NIST protocol is not a genuine standard, at least it's a proven methodology to prepare text-as-image data for input to an OCR algorithm. I would suggest starting there and using that protocol to prepare your data unless you have a good reason not to.
In sum, the NIST data was prepared by extracting 32 x 32 normalized binary bitmaps directly from a pre-printed form.
Here's an example:
00000000000001100111100000000000
00000000000111111111111111000000
00000000011111111111111111110000
00000000011111111111111111110000
00000000011111111101000001100000
00000000011111110000000000000000
00000000111100000000000000000000
00000001111100000000000000000000
00000001111100011110000000000000
00000001111100011111000000000000
00000001111111111111111000000000
00000001111111111111111000000000
00000001111111111111111110000000
00000001111111111111111100000000
00000001111111100011111110000000
00000001111110000001111110000000
00000001111100000000111110000000
00000001111000000000111110000000
00000000000000000000001111000000
00000000000000000000001111000000
00000000000000000000011110000000
00000000000000000000011110000000
00000000000000000000111110000000
00000000000000000001111100000000
00000000001110000001111100000000
00000000001110000011111100000000
00000000001111101111111000000000
00000000011111111111100000000000
00000000011111111111000000000000
00000000011111111110000000000000
00000000001111111000000000000000
00000000000010000000000000000000
I believe the BU data-prep technique subsumes the NIST technique but adds a few steps at the end, not with higher fidelity in mind but to reduce file size. In particular, the BU group:
began with the 32 x 32 bitmaps;
divided each 32 x 32 bitmap into non-overlapping 4 x 4 blocks;
counted the number of activated pixels in each block ("1" is activated, "0" is not);
took as the result an 8 x 8 input matrix in which each element is an integer from 0 to 16.
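That block-counting reduction is a one-liner in NumPy; a sketch, assuming the bitmap is already a 32 x 32 array of 0s and 1s like the example above:

    # Sketch: reduce a 32 x 32 binary bitmap to the 8 x 8 matrix of 4 x 4 block counts.
    import numpy as np

    bitmap = np.random.randint(0, 2, (32, 32))   # stand-in for a real 0/1 bitmap

    # Split rows and columns into 8 groups of 4, then count the 1s per 4 x 4 block.
    features = bitmap.reshape(8, 4, 8, 4).sum(axis=(1, 3))

    assert features.shape == (8, 8)
    assert features.max() <= 16                  # each entry is an integer in 0..16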
For finding a binary sequence like 101000000000000000010000001, detect the subsequences 0000, 0001, 001, 01, 1.
I am assuming you are using the Image Processing Toolbox in MATLAB.
To distinguish text in an image, you might want to follow these steps:
1. Convert to grayscale (speeds things up greatly).
2. Enhance the contrast.
3. Erode the image lightly to remove noise (scratches/blips).
4. Dilate heavily.
5. Run edge detection (or ROI calculation).
With trial and error, you'll find the proper coefficients such that the image you obtain after the 5th step contains convex regions surrounding each letter/word/line/paragraph.
NOTE:
Essentially, the more you dilate, the larger the elements you get; i.e., light dilation is useful for identifying letters, whereas comparatively heavy dilation is needed to identify lines and paragraphs.
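The same pipeline sketched in Python with OpenCV rather than the MATLAB toolbox (kernel sizes and iteration counts are placeholders for the trial-and-error coefficients mentioned above):

    # Sketch of the five steps above; "scan.png" and all kernel sizes are placeholders.
    import cv2
    import numpy as np

    img = cv2.imread("scan.png")

    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)            # 1. grayscale
    contrast = cv2.equalizeHist(gray)                       # 2. contrast enhancement
    # Binarize and invert so the text is white before the morphology steps:
    binary = cv2.threshold(contrast, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    eroded = cv2.erode(binary, np.ones((2, 2), np.uint8))   # 3. light erosion (denoise)
    dilated = cv2.dilate(eroded, np.ones((9, 9), np.uint8), # 4. heavy dilation merges
                         iterations=2)                      #    letters into word blobs
    edges = cv2.Canny(dilated, 50, 150)                     # 5. edge detection / ROIs

    # Bigger dilation kernels merge letters into words, then words into lines/paragraphs.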
Online ImgProc MATLAB docs
Check out the "Examples in Documentation" section in the online docs, or refer to the Image Processing Toolbox documentation in the MATLAB Help menu.
The examples given there will guide you to the proper functions to call and their various formats.
Sample CODE (not mine)