Why can't tesseract recognize the English words in this image? - tesseract

I am using tesseract 4.0 to recognize English words, but it fails only on this image, with no words recognized at all.
Can anyone give a tip? Thanks.
r=pytesseract.image_to_string('6.jpg', lang='eng')
print(r)
Fail image
Update:
I tried OCR with an online website,
https://www.newocr.com/
and it works, but why?
How can I use tesseract to recognize it?

The problem is that pytesseract is not rotation-invariant. Therefore, you need to do additional pre-processing. source
My first idea is to rotate the image by a small angle:
img = imutils.rotate_bound(cv2.imread("YD90o.png"), 4)
Result:
Now if I apply an adaptive threshold:
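The answer does not show the thresholding code itself, so here is a minimal sketch of how the thr image used below could be produced with OpenCV; the block size and constant are my own guesses, not values from the original answer:
import cv2
import imutils

# Rotate as above, then convert to grayscale
img = imutils.rotate_bound(cv2.imread("YD90o.png"), 4)
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Adaptive mean thresholding; blockSize=31 and C=10 are assumptions
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY, 31, 10)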
To read with pytesseract you need to set additional configuration:
pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
PSM (page segmentation mode) 6 means "Assume a single uniform block of text." source
Result:
You want to get the last sentence of the image.
txt = pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
txt = txt.replace('\f', '').split('\n')
print(txt[len(txt)-2])
Result:
Continue Setub ie Gene
The website might use a deep-learning method to detect the words in the image. But when I use newocr.com, the result is:
oy Eee a
setuP me -
continve ae

Related

How can I manage display and spacing on a Crystal Report where I have to display images between the text field?

I have a field that I'm displaying on a report that is a combination of text and codes that represent an image. Some of those icons have ASCII equivalents, so I've used a replace formula to display them as their ASCII version. For two or three of the images, I have no luck and have to display a mini picture as the representation.
The codes being sent are something like:
^he^ = ♥ ^st^ = ⭐ ^cl^ = 🍀 etc...
So for the clover leaf, there is no emoji support for clover leaves in my version of Crystal, and the ASCII icon I found online for it just shows the empty square icon where an emoji isn't supported.
My workaround for this is to have a formula that converts all my icons to the appropriate ascii where supported, and to leave two blank spaces for the unsupported icons.
stringvar gift_msg;
gift_msg := {DataTable1.gift_field};
gift_msg := replace(gift_msg,"^CL^"," ");
gift_msg := replace(gift_msg,"^HE^","♥");
gift_msg := replace(gift_msg,"^ST^","★");
gift_msg
I then put a suppression formula on each image that looks like this:
mid({DataTable1.gift_field},2,4)<>"^CL^"
So I duplicated the image along the length of the field and incremented the mid formula to match the field. I also set the font to Consolas so that it's fixed width to remove any surprises in spacing. My issue is that this still creates very strange spacing, and I'm almost certain there's a much easier way to do this.
One option is to use a free service such as Calligraphr.com to convert your image to a font.
Given that your image relies on several colors, the font option might not work.
Another option is to build the expression as html with image source directives where you need them. You would then need to create or use a 3rd-party UFL to convert the full expression to an image that you can load on the fly using the Graphic Location expression. At least one of the UFLs listed by Ken Hamady here provides such a function.

How to extract text geometry using PyPDF2?

I have pdf documents.
And it's clear to me how to extract text from them.
But I need to extract not only the text but also the coordinates associated with this text.
Here is my code:
from PyPDF2 import PdfReader
pdf_path = 'docs/doc_3.pdf'
pdf = PdfReader(pdf_path)
page_1_object = pdf.getPage(1)
page_1_object.extractText().split("\n")
The result is:
['Creating value for all stakeholders',
'Anglo\xa0American is re-imagining mining to improve people’s lives.']
I need geometries associated with extracted paragraphs.
Might be something like this for example:
['Creating value for all stakeholders', [1,2,3,4,]]
['Anglo\xa0American is re-imagining mining to improve people’s lives.', [7,8,9,10]]
How can I accomplish it?
Thanks.
Currently that ability is not a PyPDF2 feature: it can parse the content, as you show with extractText(), but it does not hold the separate glyph x/y positions nor output the line coordinates.
There are other means in Python to extract single letters or groups of letters that form words.
Using shell commands such as poppler's tools in conjunction with a text "word" from PyPDF2 is possible; however, the norm would be to use another Python PDF library such as PyMuPDF, and here is such an article, https://pyquestions.com/find-text-position-in-pdf-file, on highlighting with PyMuPDF input.
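For reference, a rough sketch of the PyMuPDF route, reusing the file name and page index from the question; get_text("words") returns one bounding box per word:
import fitz  # PyMuPDF

doc = fitz.open("docs/doc_3.pdf")
page = doc[1]

# Each entry is (x0, y0, x1, y1, word, block_no, line_no, word_no)
for x0, y0, x1, y1, word, *_ in page.get_text("words"):
    print(word, [x0, y0, x1, y1])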
The most common means to your goal is probably as described here: How to extract text and text coordinates from a PDF file?
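That linked question is commonly answered with pdfminer.six; a minimal sketch of that approach, assuming the same file, where each layout element exposes a bbox of (x0, y0, x1, y1) coordinates:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages("docs/doc_3.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            # element.bbox is (x0, y0, x1, y1) in PDF points
            print(element.get_text().strip(), list(element.bbox))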

tesseract does not recognize a one-number image

I am using tesseract with Python. It recognizes almost all of my images with 2 or more numbers or characters.
But tesseract can't recognize images with only one number.
I tried to use the command line, and it gives me "empty page" as the response.
I don't want to train tesseract with "only digits" because I am recognizing characters too.
What is the problem?
Below is the image that is not recognized by tesseract.
Code:
#getPng(pathImg, '3') -> creates the path to the figure.
pytesseract.image_to_string(Image.open(getPng(pathImg, '3')))
If you add the parameter --psm 13 it should work, because it will consider it as a raw text line, without searching for pages and paragraphs.
So try:
pytesseract.image_to_string(PATH, config="--psm 13")
Try converting the image to grayscale and then to a binary image; then it will most probably be read.
If not, duplicate the image so that you have 2 characters to read, then simply extract the single character from the result.
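A minimal sketch of the grayscale-plus-binarization suggestion, assuming OpenCV and Otsu thresholding; the file name and the --psm 10 flag (borrowed from the other answers here) are assumptions:
import cv2
import pytesseract

# Hypothetical path to the single-character crop
img = cv2.imread("digit.png")

# Grayscale, then binarize with Otsu's threshold
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Read the crop as a single character
print(pytesseract.image_to_string(binary, config="--psm 10"))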
Based on ceccoemi's answer you could try other page segmentation modes (--psm flag).
For this special case I suggest using --psm 7 (single text line) or --psm 10 (single character):
psm7 = pytesseract.image_to_string(Image.open(getPng(pathImg, '3')), config='--psm 7')
psm10 = pytesseract.image_to_string(Image.open(getPng(pathImg, '3')), config='--psm 10')
More information about these modes can be found in the tesseract wiki.
You can use -l osd for a single digit, like this:
tesseract VYO0C.png stdout -l osd --oem 3 --psm 6
2

Is there a way to use tesseract for single digit numbers?

TL;DR It appears that tesseract cannot recognize images consisting of a single digit. Is there a workaround/reason for this?
I am using (the digits only version of) tesseract to automate inputting invoices to the system. However, I noticed that tesseract seems to be unable to recognize single digit numbers such as the following:
The raw scan after crop is:
After I did some image enhancing:
It works fine if it has at least two digits:
I've tested on a couple of other figures:
Not working: (example images omitted)
Working: (example images omitted)
If it helps, for my purpose all inputs to tesseract have been cropped and rotated like above. I am using pyocr as a bridge between my project and tesseract.
Here's how you can configure pyocr to recognize individual digits:
from PIL import Image
import sys
import pyocr
import pyocr.builders
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]
im = Image.open('digit.png')
builder = pyocr.builders.DigitBuilder()
# Set Page Segmentation mode to Single Char :
builder.tesseract_layout = 10 # If tool = tesseract
builder.tesseract_flags = ['-psm', '10'] # If tool = libtesseract
result = tool.image_to_string(im, lang="eng", builder=builder)
Individual digits are handled the same way as other characters, so changing the page segmentation mode should help to pick up the digits correctly.
See also:
Tesseract does not recognize single characters
Set PageSegMode to PSM_SINGLE_CHAR

tesseract-ocr - To extract table

I am new to tesseract OCR. I am making use of the Google API to extract words and lines from an image. I want to extract tables / horizontal & vertical lines. I tried the FindLinesCreateBlockList method. It returns a BLOCK_LIST type. I am not aware of how to print values from BLOCK_LIST.
Is FindLinesCreateBlockList the right method to extract tables/lines?