Is there a way to use tesseract for single digit numbers? - tesseract

TL;DR It appears that tesseract cannot recognize images consisting of a single digit. Is there a workaround/reason for this?
I am using (the digits only version of) tesseract to automate inputting invoices to the system. However, I noticed that tesseract seems to be unable to recognize single digit numbers such as the following:
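[Images: the raw scan after cropping, and the same scan after some image enhancement]
It works fine if the image has at least two digits. [Image: a multi-digit example]
I've tested on a couple of other figures; some were not recognized and others worked. [Images: not-working and working examples]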
If it helps, for my purpose all inputs to tesseract have been cropped and rotated like the ones above. I am using pyocr as a bridge between my project and tesseract.

Here's how you can configure pyocr to recognize individual digits:
from PIL import Image
import sys
import pyocr
import pyocr.builders
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]
im = Image.open('digit.png')
builder = pyocr.builders.DigitBuilder()
# Set Page Segmentation mode to Single Char :
builder.tesseract_layout = 10 # If tool = tesseract
builder.tesseract_flags = ['-psm', '10'] # If tool = libtesseract
result = tool.image_to_string(im, lang="eng", builder=builder)

Individual digits are handled the same way as other characters, so changing the page segmentation mode should help to pick up the digits correctly.
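For readers who use pytesseract rather than pyocr, the same idea can be sketched as follows. This is an assumption-laden illustration, not part of the answer above: the helper names are hypothetical, and it relies on Tesseract's tessedit_char_whitelist variable, which some Tesseract 4.0 builds reportedly ignore in LSTM mode.

```python
def single_digit_config(psm=10, whitelist="0123456789"):
    """Build a Tesseract config string: single-character mode, digits only."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def read_single_digit(path):
    """Hypothetical helper: OCR one cropped digit image with pytesseract."""
    # Third-party imports are done lazily so the config helper has no dependencies.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(Image.open(path), config=single_digit_config())
    return text.strip()

# Usage (assumes digit.png exists and Tesseract is installed):
#     print(read_single_digit("digit.png"))
```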
See also:
Tesseract does not recognize single characters

Set PageSegMode to PSM_SINGLE_CHAR

Related

How extract text geometry using PyPDF2?

I have pdf documents.
And it's clear to me how to extract text from it.
I need to extract not only text but also coordinates associated with this text.
It's my code:
from PyPDF2 import PdfReader
pdf_path = 'docs/doc_3.pdf'
pdf = PdfReader(pdf_path)
page_1_object = pdf.getPage(1)
page_1_object.extractText().split("\n")
The result is:
['Creating value for all stakeholders',
'Anglo\xa0American is re-imagining mining to improve people’s lives.']
I need geometries associated with extracted paragraphs.
Might be something like this for example:
[['Creating value for all stakeholders', [1,2,3,4]],
['Anglo\xa0American is re-imagining mining to improve people’s lives.', [7,8,9,10]]]
How can I accomplish it?
Thanks,
Currently PyPDF2 does not offer this. It can parse the content, as your extractText() call shows, but it neither keeps the x/y position of each glyph nor outputs line coordinates.
There are other means in Python to extract single or multiple groups of letters that form words.
It is possible to combine shell tools such as poppler with a text "word" obtained from PyPDF2, but the norm would be to use another Python PDF library such as PyMuPDF. This article, https://pyquestions.com/find-text-position-in-pdf-file, covers highlighting text with PyMuPDF.
The most common route to your goal is probably the one described in "How to extract text and text coordinates from a PDF file?"
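As a rough sketch of the PyMuPDF route, the following assumes PyMuPDF is installed (imported as fitz) and uses page.get_text("blocks"), which returns one tuple per block in the form (x0, y0, x1, y1, text, block_no, block_type). The helper names are hypothetical, and PyMuPDF's blocks will not necessarily line up with the paragraphs PyPDF2 returns.

```python
def blocks_to_pairs(blocks):
    """Turn PyMuPDF block tuples into [text, [x0, y0, x1, y1]] pairs."""
    pairs = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        text = text.strip()
        if text:
            pairs.append([text, [x0, y0, x1, y1]])
    return pairs


def extract_text_with_boxes(pdf_path, page_index=0):
    """Hypothetical helper: text blocks of one page plus their bounding boxes."""
    import fitz  # PyMuPDF; imported lazily so blocks_to_pairs has no dependency

    with fitz.open(pdf_path) as doc:
        return blocks_to_pairs(doc[page_index].get_text("blocks"))

# Usage (assumes the file from the question exists):
#     for text, box in extract_text_with_boxes("docs/doc_3.pdf", page_index=1):
#         print(box, repr(text))
```

The coordinates are in PDF points with the origin at the top-left of the page, which is PyMuPDF's convention.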

why tesseract can't recognize the english words on this image?

I am using tesseract 4.0 to recognize English words, but it fails on this one image: no words are recognized at all.
Can anyone give me a tip? Thanks.
r=pytesseract.image_to_string('6.jpg', lang='eng')
print(r)
[Image: the failing image]
Update:
I tried the online OCR service at
https://www.newocr.com/
and it works, but why?
How can I use tesseract to recognize it?
The problem is that pytesseract is not rotation-invariant, so you need to do additional pre-processing.
My first idea is to rotate the image by a small angle:
img = imutils.rotate_bound(cv2.imread("YD90o.png"), 4)
Next, apply an adaptive threshold to the rotated image.
To read the thresholded image (thr below) with pytesseract, you need to set an additional configuration option:
pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
PSM (page segmentation mode) 6 means "Assume a single uniform block of text".
If you only want the last sentence of the image:
txt = pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
txt = txt.replace('\f', '').split('\n')
print(txt[len(txt)-2])
Result:
Continue Setub ie Gene
The website might use a deep-learning method to detect the words in the image. But when I use newocr.com, the result is:
oy Eee a
setuP me -
continve ae
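The pre-processing steps described in this answer (rotate, adaptive threshold, then --psm 6) can be sketched end to end as below. The threshold method, block_size and c are assumptions, since the answer's own thresholding code is not shown; last_nonempty_line is a hypothetical, slightly more defensive replacement for txt[len(txt)-2].

```python
def last_nonempty_line(ocr_text):
    """Return the last non-blank line of Tesseract output ('' if there is none)."""
    lines = [ln.strip() for ln in ocr_text.replace("\f", "").splitlines()]
    lines = [ln for ln in lines if ln]
    return lines[-1] if lines else ""


def ocr_rotated(path, angle=4, block_size=31, c=10):
    """Rotate, adaptive-threshold and OCR an image; block_size and c are guesses."""
    import cv2, imutils, pytesseract  # third-party, imported lazily

    img = imutils.rotate_bound(cv2.imread(path), angle)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # block_size must be odd; c is subtracted from the local weighted mean.
    thr = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, block_size, c)
    return pytesseract.image_to_string(thr, lang="eng", config="--psm 6")

# Usage (assumes the image from the answer is available locally):
#     print(last_nonempty_line(ocr_rotated("YD90o.png")))
```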

tesseract not recognize one number image

I am using tesseract with Python. It recognizes almost all of my images that contain two or more digits or characters,
but it can't recognize an image with only one digit.
I tried the command line, and it gives me "empty page" as the response.
I don't want to train tesseract on "only digits" because I am recognizing characters too.
What is the problem?
The image below is the one that tesseract does not recognize.
Code:
#getPng(pathImg, '3') -> creates the path to the figure.
pytesseract.image_to_string(Image.open(getPng(pathImg, '3')))
If you add the parameter --psm 13 it should work, because tesseract will then treat the image as a raw text line, without searching for pages and paragraphs.
So try:
pytesseract.image_to_string(PATH, config="--psm 13")
Try converting the image to grayscale and then to a binary image; it will most probably be read then.
If not, duplicate the image so that there are two characters to read; you can then simply extract the single character from the result.
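A minimal sketch of both suggestions, assuming Pillow is installed and the image shows dark text on a light background. The helper names, the threshold of 128 and the 10-pixel gap are illustrative assumptions.

```python
def binarize_and_double(path, threshold=128, gap=10):
    """Grayscale + binarize an image, then paste two copies side by side."""
    from PIL import Image  # third-party (Pillow), imported lazily

    gray = Image.open(path).convert("L")
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    w, h = binary.size
    doubled = Image.new("L", (2 * w + gap, h), 255)  # white canvas
    doubled.paste(binary, (0, 0))
    doubled.paste(binary, (w + gap, 0))
    return doubled


def undouble(text):
    """If OCR output is the same string twice (e.g. '33'), return one copy."""
    text = "".join(text.split())  # drop spaces and newlines
    half = len(text) // 2
    if text and len(text) % 2 == 0 and text[:half] == text[half:]:
        return text[:half]
    return text

# Usage (assumes pytesseract is installed and digit.png exists):
#     import pytesseract
#     print(undouble(pytesseract.image_to_string(binarize_and_double("digit.png"))))
```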
Based on ceccoemi's answer, you could try other page segmentation modes (the --psm flag).
For this special case I suggest using --psm 7 (single text line) or --psm 10 (single character):
psm7 = pytesseract.image_to_string(Image.open(getPng(pathImg, '3')), config='--psm 7')
psm10 = pytesseract.image_to_string(Image.open(getPng(pathImg, '3')), config='--psm 10')
More information about these modes can be found in the tesseract wiki.
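Since the answers above suggest different modes (7, 10 and 13), one option is to try them in turn and keep the first non-empty result. The following is only a sketch: first_nonempty_ocr is a hypothetical helper, and the ocr parameter exists so the loop can be exercised without Tesseract installed.

```python
def first_nonempty_ocr(image, psm_modes=(7, 10, 13), ocr=None):
    """Try each page segmentation mode in turn; return (psm, text) for the first hit."""
    if ocr is None:
        import pytesseract  # third-party, imported lazily
        ocr = pytesseract.image_to_string
    for psm in psm_modes:
        text = ocr(image, config=f"--psm {psm}").strip()
        if text:
            return psm, text
    return None, ""

# Usage (assumes Pillow and pytesseract are installed and digit.png exists):
#     from PIL import Image
#     print(first_nonempty_ocr(Image.open("digit.png")))
```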
You can use -l osd for a single digit, like this:
tesseract VYO0C.png stdout -l osd --oem 3 --psm 6
2

How to extract articles from Hindi Web pages with Goose?

I'm using Python Goose to extract articles from Web pages. It works fine for many languages, but fails for Hindi. I have tried adding Hindi stop words as stopwords-hi.txt and setting target_language to hi, without success.
Thanks, Eran
Yeah I had the same problem. I've been working on extracting articles in all Indian regional languages and I couldn't extract the content alone with Goose.
If you can work with the article description alone, the meta_description works perfectly. You can use that instead of cleaned_text which doesn't return anything.
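A small sketch of that fallback, assuming the python-goose package and its Goose(...).extract(url=...) call. The helper names are hypothetical, and the configuration values simply mirror the target_language setting from the question.

```python
def article_text(article):
    """Prefer Goose's cleaned_text; fall back to meta_description when it is empty."""
    cleaned = (getattr(article, "cleaned_text", "") or "").strip()
    if cleaned:
        return cleaned
    return (getattr(article, "meta_description", "") or "").strip()


def fetch_article_text(url):
    """Hypothetical wrapper: extract a page with Goose and apply the fallback."""
    from goose import Goose  # third-party (python-goose), imported lazily

    goose = Goose({"target_language": "hi", "use_meta_language": False})
    return article_text(goose.extract(url=url))
```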
Another alternative, but more lines of code:
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen
from bs4 import BeautifulSoup
url = "http://www.jagran.com/news/national-this-pay-scale-calculator-will-tell-your-new-salary-after-7th-pay-commission-14132357.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
# Remove all script, style and link elements to keep only the article content
for script in soup(["script", "style", "a", "href", "formfield"]):
    script.extract()
text = soup.get_text()
# Strip leading and trailing whitespace from each line
lines = (line.strip() for line in text.splitlines())
# Break multi-headlines (separated by double spaces) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# Drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Open disclosure: I actually got the original code somewhere on stack overflow only. Modified it a tiny bit.

Printing Umlauts in Matlab

I am trying to create a PDF file from a MATLAB figure using CMYK colors, but I am facing a problem with umlauts and some other special characters. Is there any way to handle this other than LaTeX? The following example demonstrates the issue.
plot(rand(199,1))
title_string = ['Some text:äö' char(228) ':2005' char(150) '2008:end text'];
title(title_string);
print(gcf,'-dpdf','cmykfile.pdf','-r600','-cmyk');
print(gcf,'-dpdf','rgbfile.pdf','-r600');
As you can see from the pdf-files the RGB-version handles umlauts, but not en-dash, and CMYK skips them all.
PDF is generated in Matlab using Ghostscript, but I have not found how to configure character encoding for GS.
I am using Windows and Matlab R2014.
I'm not completely sure this is the solution you were looking for.
Anyway, if you create an EPS first and then convert it to PDF, the output file has no issue with the special characters in the title, provided that you don't build your title string using char.
plot(rand(199,1))
title_string = 'Some text:äöä:2005—2008æ:end text';
title(title_string);
print(gcf,'-depsc','cmykfile.eps','-r600','-cmyk');
!ps2pdf cmykfile.eps cmykfile.pdf
The code above works if the ps2pdf utility is on your system path. You already have ps2pdf on your computer if you have MiKTeX installed, but you might need to update your system path. ps2pdf is essentially a wrapper around gs, so if you have only gs installed and not MiKTeX, you should be able to achieve the same result.
EDIT
On my machine (Windows 7, MATLAB R2014b), the following code also works well, without the need for ps2pdf:
plot(rand(199,1))
title_string = 'Some text:äöä:2005—2008æ:end text';
title(title_string);
print(gcf,'-dpdf','cmykfile.pdf','-r600','-cmyk');
It seems that the issue happens when you build the title string using char.