Tesseract does not recognize an image containing a single number

I am using Tesseract with Python. It recognizes almost all of my images that contain two or more digits or characters.
But Tesseract can't recognize an image that contains only one digit.
I tried the command line, and it gives me "empty page" as the response.
I don't want to train Tesseract with "only digits" because I am recognizing characters too.
What is the problem?
Below is the image that is not recognized by Tesseract.
Code:
#getPng(pathImg, '3') -> creates the path to the figure.
pytesseract.image_to_string(Image.open(getPng(pathImg, '3')))

If you add the parameter --psm 13, it should work, because Tesseract will then treat the image as a raw text line instead of searching for pages and paragraphs.
So try:
pytesseract.image_to_string(PATH, config="--psm 13")

Try converting the image to grayscale and then to a binary image; it will most probably be read then.
If not, duplicate the image so that there are two characters to read, and then simply extract a single character from the result.
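The two suggestions above can be sketched in plain Python on a tiny in-memory image. This is only an illustration under stated assumptions: the image is a list of rows of (r, g, b) tuples, the threshold of 128 is an arbitrary choice, and the function names are hypothetical. In practice an imaging library would do this work on the real file.

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]


def binarize(gray_rows, threshold=128):
    """Map each pixel to 0 (black) or 255 (white) using a fixed threshold."""
    return [[255 if p >= threshold else 0 for p in row] for row in gray_rows]


def duplicate_side_by_side(rows, gap=2, background=255):
    """Place two copies of the image next to each other, separated by a gap."""
    return [row + [background] * gap + row for row in rows]


if __name__ == "__main__":
    # Hypothetical 2x2 test image: light and dark pixels.
    tiny = [[(250, 250, 250), (20, 20, 20)],
            [(30, 30, 30), (240, 240, 240)]]
    binary = binarize(to_gray(tiny))
    doubled = duplicate_side_by_side(binary, gap=1)
    print(binary)
    print(doubled)
```

After OCR on the doubled image, keeping only the first recognized character would give the single character that was wanted.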

Based on ceccoemi's answer, you could try other page segmentation modes (the --psm flag).
For this special case I suggest using --psm 7 (single text line) or --psm 10 (single character):
psm7 = pytesseract.image_to_string(Image.open(getPng(pathImg, '3')), config='--psm 7')
psm10 = pytesseract.image_to_string(Image.open(getPng(pathImg, '3')), config='--psm 10')
More information about these modes can be found in the tesseract wiki.
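If it is not clear in advance which mode suits an image, one option is to try several modes in turn and keep the first non-empty result. The sketch below assumes the OCR function accepts a config keyword, as pytesseract.image_to_string does; the helper names psm_config and first_nonempty, and the order of modes, are hypothetical choices for illustration.

```python
def psm_config(psm):
    """Build a Tesseract config string for a page segmentation mode."""
    return f"--psm {psm}"


def first_nonempty(ocr, image, modes=(7, 10, 13)):
    """Try each mode in order; return (mode, text) for the first non-empty result.

    `ocr` is any callable taking (image, config=...), for example
    pytesseract.image_to_string. Returns (None, "") if every mode fails.
    """
    for psm in modes:
        text = ocr(image, config=psm_config(psm)).strip()
        if text:
            return psm, text
    return None, ""
```

With pytesseract installed, this could be called as first_nonempty(pytesseract.image_to_string, Image.open(getPng(pathImg, '3'))).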

You can use -l osd for a single digit, like this:
tesseract VYO0C.png stdout -l osd --oem 3 --psm 6
2


How to implement Unicode (UTF-8) support for a CID-keyed font (Adobe's Type 0 Composite font) converted from ttf?

This post is a sequel to Conversion from ttf to type 2 CID font (type 42 base font)
It is futile to have a CID-keyed font (containing a CIDMap that enforces identity mapping, i.e. glyph index = CID) that does not inherently offer Unicode support. How, then, can application software provide UTF-8 support for such a CID-keyed font externally?
Note: The application program that uses the CID-keyed font can be written in C, C++, Postscript or any other language.
The CID-keyed font NotoSansTamil-Regular.t42 has been converted from Google's Tamil ttf font.
You need this conversion because, without it, a PostScript program can't access a TrueType font!
Refer to the post Conversion from ttf to type 2 CID font (type 42 base font) for the conversion.
The CIDMap of t42 font enforces an identity mapping as follows:
Character code 0 maps to Glyph index 0
Character code 1 maps to Glyph index 1
Character code 2 maps to Glyph index 2
......
......
Character code NumGlyphs-1 maps to Glyph index NumGlyphs-1
It is clearly evident that there is no Unicode involved in this mapping inherently.
To understand this concretely, create the following PostScript program tamil.ps, which accesses the t42 font through PostScript's findfont operator.
%!PS-Adobe-3.0
/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def
13 myNoTo
100 600 moveto
% தமிழ் தங்களை வரவேற்கிறது!
<0019001d002a005e00030019004e00120030002200030024001f002f0024005b0012002a0020007a00aa> show
100 550 moveto
% Tamil Welcomes You!
<0155017201aa019801a500030163018801a5017f01b101aa018801c20003016901b101cb00aa00b5> show
showpage
Issue the following Ghostscript command to execute the PostScript program tamil.ps.
gswin64c.exe "D:\cidfonts\NotoSansTamil-Regular.t42" "D:\cidfonts\tamil.ps" (on Windows).
gs ~/cidfonts/NotoSansTamil-Regular.t42 ~/cidfonts/tamil.ps (on Linux).
This will display the two strings தமிழ் தங்களை வரவேற்கிறது! and Tamil Welcomes You! on successive lines.
Note that the strings for the show operator are in hexadecimal format, enclosed in angle brackets. The show operator extracts 2 bytes at a time and maps this CID (a 16-bit value) to a glyph.
For example, the first 4 hex digits of the 1st string are 0019, whose decimal equivalent is 25. This maps to the glyph த.
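The 2-bytes-at-a-time reading described above can be sketched in a few lines. This is a simplified illustration, not Ghostscript's actual parser: it assumes the hex string has no embedded whitespace and a length that is a multiple of 4 digits, and the function name cids_from_hex is hypothetical.

```python
def cids_from_hex(hex_string):
    """Split a PostScript hex string into 16-bit CIDs (2 bytes each, big-endian)."""
    digits = hex_string.strip().strip("<>")
    if len(digits) % 4 != 0:
        raise ValueError("expected a multiple of 4 hex digits")
    return [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]


# First four CIDs of the Tamil string in tamil.ps.
print(cids_from_hex("<0019001d002a005e>"))  # [25, 29, 42, 94]
```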
To use this t42 font as it stands, each string (built from the character set of the ttf) would have to be converted into a hexadecimal string by hand, which is impractical, and so the font on its own is futile.
Now consider the following C++ code that generates a postscript program called myNotoTamil.ps that accesses the same t42 font through postscript's findfont operator.
const short lcCharCodeBufSize = 200; // Character Code buffer size.
char bufCharCode[lcCharCodeBufSize]; // Character Code buffer
FILE *fps = fopen ("D:\\cidfonts\\myNotoTamil.ps", "w");
fprintf (fps, "%%!PS-Adobe-3.0\n");
fprintf (fps, "/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def\n");
fprintf (fps, "13 myNoTo\n");
fprintf (fps, "100 600 moveto\n");
fprintf (fps, u8"%% தமிழ் தங்களை வரவேற்கிறது!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"தமிழ் தங்களை வரவேற்கிறது!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "%% Tamil Welcomes You!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"Tamil Welcomes You!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "showpage\n");
fclose (fps);
Although the contents of tamil.ps and myNotoTamil.ps are identical, the way the two files are produced could hardly be more different!
Observe that, unlike tamil.ps with its handmade hexadecimal strings, myNotoTamil.ps is generated by a C++ program that uses UTF-8 encoded strings directly and hides the hex strings completely. The function strps produces hex strings from UTF-8 encoded strings, and these are identical to the strings in tamil.ps.
The futile t42 font has suddenly become fruitful thanks to the strps function's ability to map UTF-8 to CIDs (every 2 bytes of a hex string map to one CID)!
The strps function consults a mapping table, aNotoSansTamilMap (implemented as a one-dimensional array constructed with the help of Unicode blocks), to map Unicode code points (extracted from the UTF-8 encoded string) to character identifiers (CIDs).
The buffer bufCharCode (the 4th parameter of strps) receives the hex string corresponding to the UTF-8 encoded string, which is then passed to PostScript's show operator.
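The idea behind such a mapping function can be illustrated with a rough Python sketch. It is not the author's strps: the name strps_sketch and the dictionary toy_map are hypothetical, the table holds only two entries (த to CID 0x0019 as stated earlier, and space to 0x0003 as inferred from the hex strings in tamil.ps), and it maps one code point at a time, so it ignores the combined glyphs that have no Unicode code point of their own.

```python
def strps_sketch(text, cid_map):
    """Return the hex string for `text`, mapping each code point via cid_map.

    Unmapped code points fall back to CID 0, the .notdef glyph.
    """
    out = []
    for ch in text:
        cid = cid_map.get(ord(ch), 0)
        out.append(f"{cid:04x}")  # 4 hex digits = 2 bytes per CID
    return "".join(out)


# Toy table (assumption): U+0BA4 TAMIL LETTER TA and U+0020 SPACE only.
toy_map = {0x0BA4: 0x0019, 0x0020: 0x0003}

print("<%s> show" % strps_sketch("த த", toy_map))  # <001900030019> show
```

A real table would also have to cover vowel-sign reordering and ligatures, which is presumably where most of the work in the actual mapping tables lies.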
For the benefit of others, I released this UTF8Map program through GitHub for the following platforms.
Windows 10 Platform (Github Public Repository for UTF8Map Program on Windows 10)
Open a command prompt and issue the following clone command to download the source code:
git clone https://github.com/marmayogi/UTF8Map-Win
Or execute the following curl command to download source code release in zip form:
curl -o UTF8Map-Win-2.0.zip -L https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
Or execute the following wget command to download source code release in zip form:
wget -O UTF8Map-Win-2.0.zip https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
Linux Platform (Github Public Repository for UTF8Map Program on Linux)
Issue the following clone command to download source code:
git clone https://github.com/marmayogi/UTF8Map-Linux
Or execute the following curl command to download source code release in tar form:
curl -o UTF8Map-Linux-1.0.tar.gz -L https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
Or execute the following wget command to download source code release in tar form:
wget -O UTF8Map-Linux-1.0.tar.gz https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
Note:
This program uses the t42 file to generate a ps file (a PostScript program) which will display the following on a single page:
A welcome message in Tamil and English.
List of Vowels (12 + 1 Glyphs). All of them are associated with Unicode Points.
List of Consonants (18 + 6 = 24 Glyphs). No association of Unicode Points.
List of combined glyphs (Combination of Vowels + Consonants) in 24 lines. Each line displays 12 glyphs. Out of 288 Glyphs, 24 are associated with Unicode Points and the rest are not.
List of Numbers in two lines. All 13 Glyphs for Tamil numbers are associated with Unicode Points.
A footnote.
The two program files (main.cpp and mapunicode.h) are 100% portable, i.e. the contents of the two files are identical across platforms.
The two mapping tables aNotoSansTamilMap and aLathaTamilMap are given in mapunicode.h file.
A README Document in Markdown format has been included with the release.
This software has been tested for t42 fonts converted from the following ttf files.
Google's Noto Tamil ttf
Microsoft's Latha Tamil ttf

Why can't Tesseract recognize the English words in this image?

I am using Tesseract 4.0 to recognize English words, but it fails on this one image: no words are recognized at all.
Can anyone give me a tip? Thanks.
r=pytesseract.image_to_string('6.jpg', lang='eng')
print(r)
Failing image:
Update:
I tried OCR with an online service,
https://www.newocr.com/
and it works, but why?
How can I use Tesseract to recognize it?
The problem is that pytesseract is not rotation-invariant. Therefore, you need to do additional pre-processing.
My first idea is to rotate the image with a small angle
img = imutils.rotate_bound(cv2.imread("YD90o.png"), 4)
Result:
Now I apply an adaptive threshold; the thresholded image is called thr in the code below.
To read with pytesseract you need to set additional configuration:
pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
Page segmentation mode (PSM) 6 means "Assume a single uniform block of text."
Result:
You want to get the last sentence of the image.
txt = pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
txt = txt.replace('\f', '').split('\n')
print(txt[len(txt)-2])
Result:
Continue Setub ie Gene
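Indexing with len(txt)-2 relies on the OCR output ending in exactly one empty entry. A slightly more robust variant is sketched below; the helper name last_nonempty_line is hypothetical, and it only post-processes the string returned by pytesseract.

```python
def last_nonempty_line(ocr_text):
    """Return the last line of OCR output that contains visible characters."""
    lines = [line.strip() for line in ocr_text.replace("\f", "").splitlines()]
    lines = [line for line in lines if line]
    return lines[-1] if lines else ""


# Example with a made-up OCR result ending in a newline and a form feed.
print(last_nonempty_line("first line\nContinue Setub ie Gene\n\f"))
```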
The website might use a deep-learning method to detect the words in the image. But when I use newocr.com the result is:
oy Eee a
setuP me -
continve ae

Change the color of specific letter in console

I am building a specific string using several strcat calls and displaying it in the console. The string contains characters such as 1,2,3,4,5,6,7,8,9,0,#,*,E, and I am using fprintf('%s') for this purpose.
For instance:
2E4137E65922#
is a possible outcome of the code.
Is there any way I could make the letter E stand out in my output? Like making it red?
Unfortunately, there is no official way of doing this. However, you could use Yair Altman's cprintf(). It exploits undocumented features of Matlab to do exactly what you want.
You can read more in the famous Undocumented Matlab blog he runs.
The example image in FEX looks like this:
EDIT: Theoretically, if cprintf would work as expected, the following should work:
C=strsplit(s,'E');
cprintf('black',C{1});
for ii=2:size(C,2)
cprintf('err','E');
cprintf('black',C{ii});
end
cprintf('black','\n');
However, in Matlab 2014b it doesn't give good results. I found that it doesn't work properly when there is a single character to format.
If you substitute 'EE' for 'E', it works.
EDIT2: I left a comment for Yair Altman. Hopefully he will fix this, if he can.
You can use the HTML tags <strong>, </strong> to type specific letters in bold:
str = '2E4137E65922#'; %// input string
letter = 'E'; %// letter that should be made bold
strBold = regexprep(str, letter, ['<strong>' letter '</strong>']); %// output string
disp(str)
disp(strBold)
Thanks @Dev-iL for this information!
While it seems that cprintf() from my other answer does not work for single characters, if there is only one color that you want to use, and that color is orange, then this trick, which cprintf uses for warning-style text, can be used:
disp(['this is [' 8 'orange]' 8 ' text'])
Read more at: http://undocumentedmatlab.com/blog/another-command-window-text-color-hack
Thus, your code would look like:
s='2E4137E65922#';
C=strsplit(s,'E');
str=C{1};
for ii=2:size(C,2)
str=[str ['[' 8 'E]' 8 ]];
str=[str C{ii}];
end
disp(str);

Is there a way to use tesseract for single digit numbers?

TL;DR It appears that tesseract cannot recognize images consisting of a single digit. Is there a workaround/reason for this?
I am using (the digits only version of) tesseract to automate inputting invoices to the system. However, I noticed that tesseract seems to be unable to recognize single digit numbers such as the following:
The raw scan after crop is:
After I did some image enhancing:
It works fine if it has at least two digits:
I've tested on a couple of other figures:
Not working: (example images not shown)
Working: (example images not shown)
If it helps, for my purpose all inputs to tesseract have been cropped and rotated like above. I am using pyocr as a bridge between my project and tesseract.
Here's how you can configure pyocr to recognize individual digits:
from PIL import Image
import sys
import pyocr
import pyocr.builders
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]
im = Image.open('digit.png')
builder = pyocr.builders.DigitBuilder()
# Set Page Segmentation mode to Single Char :
builder.tesseract_layout = 10 # If tool = tesseract
builder.tesseract_flags = ['-psm', '10'] # If tool = libtesseract
result = tool.image_to_string(im, lang="eng", builder=builder)
Individual digits are handled the same way as other characters, so changing the page segmentation mode should help to pick up the digits correctly.
See also:
Tesseract does not recognize single characters
Set PageSegMode to PSM_SINGLE_CHAR

Zebra Programming Language (ZPL) II using ^FB or ^TB truncates text at specific lengths

I am writing code to print labels for botanic gardens. Each label is printed individually but with different information on each label. Each label contains a scientific name which can vary greatly in size and thus can go over 2 lines (our label size is 10cm wide by 2.5cm high).
Our problem occurs mainly with the name when we go over 24 characters (See line with **).
If we choose a name that has 24 characters or less then it prints fine.
Anything more it will not print.
If we take all the other "items" off the label and just leave the "name" element then it prints only the first 24 characters and truncates the rest (we did this to test whether a possible overlap between our ^FB block and another element could be causing this problem).
We tried this with other elements that use a ^FB and we found that they displayed the same behaviour but varied in the length at which this issue occurred: for example "cc" (short for country code) had a limit of 21 characters.
For added information: we compile this code within a BASIC environment and use variables such as ":name:" and ":accdt:", as seen below. Our database provides this information, and we have checked for any internal routines that might truncate long names. Our code was working fine in ZPL, but we recently had to move to ZPL II (we purchased a newer model, the GX430t) and had to modify our ZPL code, at which point this problem started to occur.
Here is our code:
^XA
^LH40,40
^MMT
^PW1200
^LL1200
^FO16,05^A0N,35,^FDAcc. num.^FS
^FO170,05^A0,35,^FV":accnum:"^FS
^FO360,05^A0,35,^FV":qual:"^FS
^FO350,35^A0N,30,^FDAcc.dt.^FS
^FO450,35^A0N,30,^FB790,3,0,L,
^FH\^FV":accdt:"^FS
^FO430,70^^A0N,25,^FB790,3,0,L,
^FH\^FDProv. type^FS
^FO560,70^A0N,25,^FV":provtype:"^FS
^FO800,225^A0N,30,^FB790,3,0,L,
^FV":cc:"^FS
**^FO10,100^A0N,40,^FB790,3,0,L,
^FV":name:"^FS**
^FO1000,05^A0,35,^FV":proptype:"^FS
^FO5,225^A0,25^FVColl.^FS
^FO55,225^A0,25^FV":coll:"^FS
^FO375,225^A0,25,^FV":consstat:"^FS
^FO1000,70^A0,25,^FV":reqby:"^FS
^FO535,180^BCN,55,N,N,N^FV":qual:"^FS
^FO60,45^BCN,35,N,N,N^FV":accnum:"^FS
^PQ1,0,1,Y
^XZ
Here is what we have tried to fix this (apologies if some seem like wild cards):
Changing font type, size, and location on label;
Changing ^FO to ^FT;
Checking our internal database logic;
Taking away ^FH\;
Changing the values within the ^FB line (we tried nearly all possible permutations);
Manually typing in a name longer than 24 characters (using notepad - no database/compiler) - same issue.
Any thoughts on this would be greatly appreciated
Kerry
I've had this issue before, and across printer manufacturers, firmwares and languages.
First, some paraphrased explanations straight out of the 2014 ZPL II Programming Guide (P1012728-009 Rev. A).
"The ^TB command prints a text block with defined width and height. The text block has an automatic word-wrap function. If the text exceeds the block height, the text is truncated."
"The ^FB (Field Block) command allows you to print text into a defined block type format. It can format a ^FD (Field Data) string into a block of text using the origin, font, and rotation specified for the text string, and it contains an automatic word-wrap function."
Technically, the difference between a text block and a field block is that height is in dots for the former and in lines for the latter.
Also notice that, although the guide does not mention it, the ^FB command also truncates text that does not fit in the specified number of lines. This is where the font size given to the A0 command and the line spacing given to the FB command play an important role in determining whether the second or third line is shown or truncated.
Incidentally, in other languages such as TSPL there is no truncation of text blocks--if you tell the block to be 3 lines in height but there's enough text for 4 lines, line 4 overlaps line 3 to indicate this--which may seem awful, but it is better than the data loss of truncation, which is not obvious.
For both commands:
"Using ^FT (Field Typeset) for your data takes the baseline origin of the last possible line of text, meaning that the field block will be filled from bottom to top."
"Using ^FO (Field Origin) means that the field block will be filled from top to bottom."
In reality, I have only been able to make the ^FB command work as expected, but that may be because ^TB is not implemented in the firmware I've worked with (ZPL II "compliant" Bluetooth printers).
You can test the following snippet for a 2x2 label in the Labelary Viewer:
^XA
~TA0
^MTD
^MNW
^MMT
^MFN
~SD15
^PR6
^PON
^PMN
^PW406
^LS0
^LRN
^LL406
^LT0
^LH0,0
^CI0
^XZ
^XA
^FO324,10,0^FB386,2,0,C,0^A0R,36,28.8^FH^FD"The King" Cupcake^FS
^FO278,10,0^FB386,1,0,C,0^A0R,28,22.4^FH^FDUse By 11/24/2015 02:45 PM^FS
^FO152,10,0^FB386,1,0,C,0^A0R,24,19.2^FH^FD11/24/2015 02:45 PM^FS
^FO62,140,0^FB250,1,0,R,0^A0R,24,19.2^FH^FDSL: 4 hours^FS
^FO38,10,0^FB386,1,0,L,0^A0R,18,14.4^FH^FDPREP DATE:^FS
^FO8,10,0^FB386,1,0,L,0^A0R,28,22.4^FH^FD11/24/2015 10:45 AM^FS
^FO62,10,0^FB50,1,0,L,0^A0R,24,19.2^FH^FDEMP:^FS
^FO92,10,0^FB376,3,0,J,0^A0R,18,14.4^FH^FDIngredients: 1 1/2 cups all-purpose flour, 1 teaspoon baking powder, 1/2 teaspoon salt, 8 tablespoons (1 stick) unsalted butter, room temperature, 1 cup sugar, 3 large eggs, 1 1/2 teaspoons pure vanilla extract, 3/4 cup milk.^FS
^PQ3,,,Y
^XZ
In particular, I've preceded the A0 and FD commands with FB. Using the viewer, you can quickly test the effects of switching between FT and FO in the ingredients line, of changing the A0 font sizes, and of changing the FB number of lines from, say, 3 to 2 (note that the viewer does not truncate text).
Of course, there is no substitute for actually printing a label, because your ZPL II "compliant" printer may or may not truncate text, depending on its manufacturer and firmware version.
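For anyone generating such fields from a program, the ordering used in the snippet above (^FO, then ^FB, then ^A0, then ^FD) could be produced by a small helper like the following. This is only a sketch: the function name field_block and its parameters are hypothetical, it always uses font 0 in normal orientation, and it does not escape ^ or ~ characters inside the text.

```python
def field_block(x, y, text, width, max_lines, font_height,
                justify="L", line_spacing=0, hanging_indent=0):
    """Build one ZPL II text field: ^FO, then ^FB, then ^A0, then ^FD ... ^FS.

    width and font_height are in dots; max_lines is the number of lines
    the field block may use.
    """
    if justify not in ("L", "C", "R", "J"):
        raise ValueError("justify must be one of L, C, R, J")
    return (f"^FO{x},{y}"
            f"^FB{width},{max_lines},{line_spacing},{justify},{hanging_indent}"
            f"^A0N,{font_height}"
            f"^FD{text}^FS")


# The name field from the question, with a placeholder string as the data.
print(field_block(10, 100, "Example name", 790, 3, 40))
```

The printed line can be pasted between ^XA and ^XZ in the viewer to compare different values of max_lines and font_height.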
I hope that helps!