Tesseract-OCR only detects words but not single characters - tesseract

As the title says, I've been trying to do live recognition of a single character from an image taken with a webcam (the image contains only that character, no other words or characters), but I couldn't get any results. The machine is an RPi3 with 1GB of RAM running Raspberry Pi OS 64-bit, so I can't use libraries such as easyocr due to their heavy load on RAM and CPU.
What I've tried:
I've always applied a threshold to the image (using Otsu and a binary mask) (example image containing all 3 characters I have to recognize), but got no results. I also tried passing the image to tesseract with the Page Segmentation Mode (--psm) set to 10 in order to treat the image as a single character, still with no results.
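For reference, here is a minimal sketch of that kind of pipeline, assuming OpenCV and pytesseract are installed on the Pi; the file name and the character whitelist are placeholders, not values from the question:

import cv2
import pytesseract

# Placeholder input: a webcam frame containing only the single character
img = cv2.imread("webcam_frame.png", cv2.IMREAD_GRAYSCALE)

# Otsu binarization (invert if the character is lighter than the background)
_, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# --psm 10 treats the image as a single character; the whitelist (assumed
# here to be the 3 expected characters) restricts the possible output
config = "--psm 10 --oem 1 -c tessedit_char_whitelist=ABC"
print(pytesseract.image_to_string(thresh, config=config).strip())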

Related

JBIG2 encoder giving lower compression ratio as compared to JBIG1?

I am currently working on the compression of binary images. There is an algorithm that I am trying to test and compare its compression ratio to JBIG2 encoded images.
I want to compute the compression ratio of specific images using my algorithm and compare them to the JBIG1 and JBIG2 standards. For JBIG1, I am using JBIG-KIT (from Markus K), and for JBIG2 compression, agl's jbig2enc. However, the size of the encoded JBIG2 file (.jb2) is similar to that of the JBIG1 (.jbg) output. This should not happen, as JBIG2 should give results at least 2-3 times smaller.
I am giving a .pbm image as input to both encoders, and in some cases the JBIG1 encoder gives me a smaller file.
For generating the jb2 encoded file I am using the command:
$ jbig2 -s a.pbm > a.jb2
The version that my computer has currently installed is jbig2enc 0.28.
For compression ratio, I am directly using the sizes of .jbg and .jb2 files.
So, please let me know if I am doing something wrong here.
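For what it's worth, the ratio computation itself can be made explicit with a short Python sketch that compares both encoded files against the original .pbm (file names are the same placeholders as above; the .jbg file is assumed to come from JBIG-KIT's pbmtojbg):

import os

original = os.path.getsize("a.pbm")
jbig1 = os.path.getsize("a.jbg")  # output of JBIG-KIT (pbmtojbg a.pbm a.jbg)
jbig2 = os.path.getsize("a.jb2")  # output of: jbig2 -s a.pbm > a.jb2

print("JBIG1 compression ratio:", original / jbig1)
print("JBIG2 compression ratio:", original / jbig2)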

Are characters made up of pixels?

On a computer screen, are the characters made up of pixels? If so, it means that characters are images!
And if the characters are made up of pixels, then why are there ASCII, UNICODE and other standards that associate binary digits with different character formats, but no standards that associate binary digits with image formats? Because if both (characters and images) are made up of pixels, what is the difference between them?
No, #1: Characters are not "on a computer screen". What goes on the screen is the result of all kinds of rendering and painting and combining onto a 2-D grid of pixels.
No, #2: Unicode characters are independent of the specific fonts used to present them graphically. So, with one font, a character will end up producing certain pixels, and with another font - other pixels altogether.
No, #3: Character strings are held in your computer's memory as sequences of bytes, i.e. numeric values (with each character typically occupying one byte, or two, or a variable number of bytes).
On a computer screen, are the characters made up of pixels? if so, it means that characters are images!
On a typical modern screen, yes, the graphical representation of a character is a group of pixels. But no, computers don't always have a screen.
For example, in the past people used to interact with computers via many types of terminals, such as mechanical terminals where the text is printed directly onto paper. Sometimes a vector screen or a 16-segment/14-segment display is used, where the representation of the text involves no pixels at all. Many computers don't even have a screen or any way to display characters, and interact with humans via switches, LEDs, punched cards, a network or a serial port...
So the premise of the question is already wrong. Characters have nothing to do with pixels. Even when characters are displayed on a screen, the pixels representing a character vary depending on the font face and font size.
Character traditionally means a symbol or a glyph representing something. In computing, a character means a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language. None of these definitions says anything about pixels.
Each language has a known set of symbols, so logically they're grouped together and each is assigned a number. The whole set of those numbers and their mappings is called a character set. You can see that it makes sense to associate numbers with characters, but doing the same for images makes no sense. What common property of images could we map?
In the past there was no need to cooperate with people using other languages, so each group of people chose a small set that worked for their own language. However, with the advent of portable devices and the internet, that doesn't work anymore. It would be extremely awkward to receive a message that you can't read, or to send an email that the customer sees as a bunch of garbage. That's why a bigger character set called Unicode was invented.
However, a character set is just a way to map numbers to glyphs in computers. To deal with characters we also need a way to encode those numbers, which is called a character encoding. For example, in a variable-length encoding a larger number may be encoded using more bytes. Unicode has multiple encodings such as UTF-1, UTF-7, UTF-8, UTF-16 and UTF-32.
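A small Python illustration of that distinction between the character set (numbers assigned to characters) and the encoding (bytes used to store those numbers):

ch = "é"
print(ord(ch))                 # 233: the code point assigned by the character set
print(ch.encode("utf-8"))      # b'\xc3\xa9'  -> two bytes in UTF-8
print(ch.encode("utf-16-le"))  # b'\xe9\x00'  -> two bytes in UTF-16 (little-endian)
print(ch.encode("latin-1"))    # b'\xe9'      -> one byte in a legacy 8-bit character set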

Tesseract 4 : is there a maximum resolution that an input image can have when page segmentation is enabled?

My Java project deals with OCRing pdfs to index them. Each pdf page is converted into a png which is then piped to tesseract 4.
The pdf->png conversion uses renderImageWithDPI from PDFBox PdfRenderer:
buffImage = pdfRenderer.renderImageWithDPI(currentPage,
                                           PNG_DENSITY,
                                           ImageType.GRAY);
with PNG_DENSITY = 300 as advised on tesseract's wiki to get best results.
The command used for tesseract is
tesseract input.png output -l fra --psm 1 --oem 1
I also tried --psm 2 or 3, which also involve page segmentation, i.e.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
With a scanned PDF of 146 pages (producer/creator is Adobe Acrobat 7.0; it contains copyrighted content so I can't share it), tesseract computes endlessly (the process never ends) on a given page (85).
As it was too long to test (i.e. wait until page 85 gets OCRed), I decided to generate an extract of this pdf with Evince's "print to file" feature.
Tesseract handles the pdf produced by Evince (producer/creator is cairo 1.14.8) successfully (i.e. the image gets OCRed).
The difference is the image resolution. The image that fails is 4991x3508 pixels whereas the one that succeeds is only 3507x2480 pixels.
Please note: tesseract in "Sparse text with OSD" mode (i.e. --psm 12) handles the page "successfully", although the text (in 2 columns) is not understandable (i.e. the 2 columns are mixed up).
EDIT after several rounds of trial and error
It looks like the input image has to have a width strictly less than 4000 pixels to work with page segmentation. Looking at the Tesseract source code, in a class called "pgedit" the canvas size seems limited to 4000 x 4000, as the constructor of a "ScrollView" (whatever purpose it serves) is:
ScrollView::ScrollView(const char* name, int x_pos, int y_pos, int x_size,
                       int y_size, int x_canvas_size, int y_canvas_size,
                       bool y_axis_reversed)
So my question now is: why is there a limit of 4000 pixels in width/height when using page segmentation, and what should I do if a pdf page converted to png at 300dpi exceeds 4000 pixels (in width, height, or both)?
Any help appreciated,
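One possible workaround, sketched under the assumption that the ~4000-pixel limit observed above is real: downscale the rendered PNG below that size before invoking tesseract. This is Python/Pillow rather than the question's Java/PDFBox code, and the file names are placeholders:

from PIL import Image
import subprocess

MAX_DIM = 4000  # assumed limit, taken from the observation above
src = "page_85.png"  # placeholder for the rendered page

img = Image.open(src)
if max(img.size) >= MAX_DIM:
    scale = (MAX_DIM - 1) / max(img.size)
    img = img.resize((int(img.width * scale), int(img.height * scale)), Image.LANCZOS)
    src = "page_85_small.png"
    img.save(src)

# Same tesseract invocation as above, on the (possibly downscaled) image
subprocess.run(["tesseract", src, "output", "-l", "fra", "--psm", "1", "--oem", "1"], check=True)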

How big an image can be read and displayed in Matlab?

I have a problem with image reading. I want to find out how big an image can be read and displayed in Matlab. Is it possible to display huge images like (12689, 4562, 7)? If not, how can I check whether this image loaded correctly in Matlab?
Thanks a lot
There are two questions here:
Is it possible to load a large image from the disk to RAM?
Is it possible to show a large image?
The answer to the first question is that it depends on your amount of RAM and your operating system. The answer to the second question is that Matlab (or any program) downscales the image before showing it, since the screen doesn't have that many pixels. So it depends on the internal algorithm, and again, on your amount of RAM.
The number of MB of RAM required for such an image would be (assuming 8 bits/pixel (uint8)):
12689*4562*7 / 1e6 = 405.2 MB
The number of elements a single matrix can contain in your version of Matlab:
[~, numEls] = computer;
which is 2.147483647e+09 on my 32-bit R2010b. This is much more than 12689*4562*7, so in principle, if you have 406 MB of unused RAM, you should be able to load the image in its entirety into RAM. And in principle, displaying said image will involve some additional RAM (and probably take a looong time), but should nevertheless be possible (aside from the fact that displaying an image with 7 colour layers is not very standard AFAIK).
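The same back-of-the-envelope figure can be checked programmatically; here in Python/NumPy rather than Matlab, purely as an illustration of the arithmetic above:

import numpy as np

arr = np.zeros((12689, 4562, 7), dtype=np.uint8)  # 8 bits per pixel
print(arr.nbytes / 1e6, "MB")  # ~405.2 MB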

How to segment text images using MATLAB?

It's part of the OCR process, which is:
How to segment the sentences into words, and then into characters?
What's a candidate algorithm for this task?
As a first pass:
process the text into lines
process a line into segments (connected parts)
find the largest white band that can be placed between each pair of segments.
look at the sequence of widths and select "large" widths as white space.
everything between white space is a word.
Now all you need is a good enough definition of "large".
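A rough Python/OpenCV sketch of that first pass, applied to a single text line: measure the white bands between inked columns and call the wide ones word gaps. The gap threshold here (twice the median band width) is just one possible definition of "large":

import cv2
import numpy as np

line = cv2.imread("text_line.png", cv2.IMREAD_GRAYSCALE)  # placeholder input
_, binary = cv2.threshold(line, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

is_white = binary.sum(axis=0) == 0  # columns with no ink form the white bands

bands, start = [], None
for x, white in enumerate(is_white):
    if white and start is None:
        start = x
    elif not white and start is not None:
        bands.append((start, x - start))
        start = None

widths = [w for _, w in bands]
gap_threshold = 2 * np.median(widths) if widths else 0
word_gaps = [x for x, w in bands if w > gap_threshold]
print(word_gaps)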
First, NIST (the National Institute of Standards and Technology) published a protocol known as the NIST Form-Based Handwriting Recognition System about 15 years ago for this exact question, i.e., extracting and preparing text-as-image data for input to machine learning algorithms for OCR. Members of this group at NIST also published a number of papers on this System.
The performance of their classifier was demonstrated by data also published with the algorithm (the "NIST Handwriting Sample Forms.")
Each of the half-dozen or so OCR data sets I have downloaded and used has referenced the data extraction/preparation protocol used by NIST to prepare the data for input to their algorithm. In particular, I am pretty sure this is the methodology relied on to prepare the Boston University Handwritten Digit Database, which is regarded as benchmark reference data for OCR.
So if the NIST protocol is not a genuine standard, it is at least a proven methodology for preparing text-as-image data for input to an OCR algorithm. I would suggest starting there, and using that protocol to prepare your data unless you have a good reason not to.
In sum, the NIST data was prepared by extracting 32 x 32 normalized bitmaps directly from a pre-printed form.
Here's an example:
00000000000001100111100000000000
00000000000111111111111111000000
00000000011111111111111111110000
00000000011111111111111111110000
00000000011111111101000001100000
00000000011111110000000000000000
00000000111100000000000000000000
00000001111100000000000000000000
00000001111100011110000000000000
00000001111100011111000000000000
00000001111111111111111000000000
00000001111111111111111000000000
00000001111111111111111110000000
00000001111111111111111100000000
00000001111111100011111110000000
00000001111110000001111110000000
00000001111100000000111110000000
00000001111000000000111110000000
00000000000000000000001111000000
00000000000000000000001111000000
00000000000000000000011110000000
00000000000000000000011110000000
00000000000000000000111110000000
00000000000000000001111100000000
00000000001110000001111100000000
00000000001110000011111100000000
00000000001111101111111000000000
00000000011111111111100000000000
00000000011111111111000000000000
00000000011111111110000000000000
00000000001111111000000000000000
00000000000010000000000000000000
I believe that the BU data-prep technique subsumes the NIST technique but added a few steps at the end, not with higher fidelity in mind but to reduce file size. In particular, the BU group:
began with the 32 x 32 bitmaps; then
divided each 32 x 32 bitmap into non-overlapping blocks of 4 x 4;
next, they counted the number of activated pixels in each block ("1" is activated, "0" is not);
the result is an 8 x 8 input matrix in which each element is an integer (0-16).
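That reduction step is easy to express in code; a short NumPy sketch (not the BU group's original tooling), assuming the 32 x 32 bitmap is already loaded as a 0/1 array:

import numpy as np

bitmap = np.random.randint(0, 2, size=(32, 32))  # placeholder for a real NIST bitmap
blocks = bitmap.reshape(8, 4, 8, 4)              # split rows and columns into 4 x 4 blocks
counts = blocks.sum(axis=(1, 3))                 # 8 x 8 matrix of integers in 0..16
print(counts)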
For finding a binary sequence like 101000000000000000010000001, detect the sub-sequences 0000, 0001, 001, 01, 1.
I am assuming you are using the Image Processing Toolbox in Matlab.
To distinguish text in an image, you might want to follow these steps:
Grayscale (speeds up things greatly).
Contrast enhancement.
Erode the image lightly to remove noise (scratches/blips).
Dilation (heavy).
Edge-Detection (or ROI calculation).
With trial and error, you'll get the proper coefficients such that the image you obtain after the 5th step will contain convex regions surrounding each letter/word/line/paragraph.
NOTE:
Essentially, the more you dilate, the larger the elements you get; i.e. the least dilation would be useful in identifying letters, whereas comparatively heavy dilation would be needed to identify lines and paragraphs.
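A hedged OpenCV sketch of that pipeline (Python; the kernel size and iteration counts stand in for the trial-and-error coefficients mentioned above):

import cv2
import numpy as np

img = cv2.imread("page.png")                         # placeholder input
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)         # 1. grayscale
enhanced = cv2.equalizeHist(gray)                    # 2. contrast enhancement
kernel = np.ones((3, 3), np.uint8)
eroded = cv2.erode(enhanced, kernel, iterations=1)   # 3. light erosion (remove blips)
dilated = cv2.dilate(eroded, kernel, iterations=5)   # 4. heavy dilation (merge strokes)
edges = cv2.Canny(dilated, 50, 150)                  # 5. edge detection / ROI candidates
cv2.imwrite("regions.png", edges)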
Online ImgProc MATLAB docs
Check out the "Examples in Documentation" section in the online docs or refer to the Image Processing Toolbox documentation in the Matlab Help menu.
The examples given there will guide you to the proper functions to call and their various formats.
Sample CODE (not mine)