OCR recognizes strange characters. Why?

I'm developing an Android application that does OCR with the Tesseract libraries, via the tess-two project, following this tutorial: http://gaut.am/making-an-ocr-android-app-using-tesseract/
The app works, but I've noticed that the string returned for the content of a photo sometimes contains strange characters. Example: I'm reading this: www.caelum.com.br and receiving something like this: r ' . ,wlñzf . 94' kzl 5. vsmNs/.caelumcombr
After some searching, I configured this: baseApi.setVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
But I think that made it worse.
I want to read texts in Portuguese and English, so I downloaded the traineddata for each language and I'm using them as needed. Could these strange characters have something to do with the project's encoding?
Thanks for the help :)
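One detail about the whitelist shown above: it contains only unaccented ASCII letters, so characters such as '.', digits, or Portuguese accented letters cannot appear in the result at all, which would affect a string like www.caelum.com.br. Below is a minimal sketch of building a broader whitelist. The class name WhitelistBuilder and the exact character set are assumptions for illustration, and a wider whitelist is not guaranteed to improve accuracy.

```java
/** Hypothetical helper: builds a whitelist for Portuguese and English text. */
final class WhitelistBuilder {
    static String build() {
        StringBuilder sb = new StringBuilder();
        for (char c = 'A'; c <= 'Z'; c++) sb.append(c);
        for (char c = 'a'; c <= 'z'; c++) sb.append(c);
        for (char c = '0'; c <= '9'; c++) sb.append(c);
        // Portuguese accented capitals (A-acute, A-grave, ..., C-cedilla),
        // written as Unicode escapes so the file compiles under any source encoding.
        sb.append("\u00C1\u00C0\u00C2\u00C3\u00C9\u00CA\u00CD\u00D3\u00D4\u00D5\u00DA\u00C7");
        // The same letters in lowercase.
        sb.append("\u00E1\u00E0\u00E2\u00E3\u00E9\u00EA\u00ED\u00F3\u00F4\u00F5\u00FA\u00E7");
        // Punctuation commonly found in URLs and plain text (an assumed selection).
        sb.append(".,:;/-_@?!()");
        return sb.toString();
    }
}

// Usage with the call from the question (baseApi is the question's TessBaseAPI instance):
// baseApi.setVariable("tessedit_char_whitelist", WhitelistBuilder.build());
```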

Tesseract recognizes text reliably only in images that contain text and nothing else. For such images you can get good accuracy.
However, Tesseract gives garbled output when an image mixes text with other content such as pictures or graphics.
I haven't worked on that kind of recognition, so I can't help further.
So your question should really be how to crop the image so that only the text region remains. That way Tesseract can recognize it properly and return the desired text in the output.
Thanks.
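A very rough sketch of the cropping idea, under strong assumptions: the image has already been converted to a grayscale array (0 = black, 255 = white), the text is dark on a light background, and anything darker than a chosen threshold counts as text. The class name TextBounds and the threshold are hypothetical. A real app would need something smarter to separate text from pictures, since this only trims light margins.

```java
/** Hypothetical helper: bounding box of all pixels darker than a threshold. */
final class TextBounds {
    /**
     * @param gray      grayscale image as gray[y][x], values 0..255
     * @param threshold pixels strictly below this value are treated as text
     * @return {left, top, width, height}, or null if no dark pixel was found
     */
    static int[] find(int[][] gray, int threshold) {
        int left = Integer.MAX_VALUE, top = Integer.MAX_VALUE;
        int right = -1, bottom = -1;
        for (int y = 0; y < gray.length; y++) {
            for (int x = 0; x < gray[y].length; x++) {
                if (gray[y][x] < threshold) {
                    left = Math.min(left, x);
                    top = Math.min(top, y);
                    right = Math.max(right, x);
                    bottom = Math.max(bottom, y);
                }
            }
        }
        if (right < 0) {
            return null; // nothing darker than the threshold
        }
        return new int[] {left, top, right - left + 1, bottom - top + 1};
    }
}
```

The resulting rectangle can then be used to crop the bitmap before handing it to Tesseract.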

Related

Ephesoft: tesseract isn't working

I'm trying to learn files, but for some reason some PDFs that come from exactly the same scanner, and are visually of similar quality, don't work.
I don't get an error and everything runs, but it isn't able to actually OCR anything. The OCR file is just empty. It's as if Tesseract isn't able to find any words in it.

Convert text from image into text file

I have an image and I want to convert it to a text file to use in word processing software. 1) Can this be done with any existing software? 2) Is it possible to write a program in Matlab or any other language that can convert it to text? The font in the image file is really poor.
You're talking about OCR, and there are existing libraries that can be used for this. I suggest you take a look at Leadtools OCR. I used it in a .NET environment and it can convert images to text.
Yes, it can be converted into text using software such as Microsoft OneNote, among others. You can also write an OCR program in most programming languages.

Integrating xPDF in an iOS app? (feasibility check)

I am developing an app that needs PDF text search and highlighting. I found that it is very difficult to highlight in a PDF, so I thought I would convert the PDF to HTML and then use JavaScript to search for the string and highlight it. I have in fact succeeded in searching and highlighting HTML text with JavaScript. If anyone needs the source code, send me your email address.
But my obstacle is the PDF to HTML conversion. I know it is very hard, because PDF is a rich format and HTML doesn't support all of its features. Along the way I found some Python source code, PDFMiner, but without jailbreaking it's hard to use Python on iOS, so I dropped that idea as well.
Now I'm looking at xPDF, which is C++ code that converts PDF to HTML. Has anyone succeeded in integrating xPDF into an iOS app? I want to know whether this is feasible.
Thanks in advance for your thoughtful replies,
Naveen Thunga.
Here you can find an example. It still has some problems, but it is a good start:
https://github.com/KurtCode/PDFKitten

get text coordinates from pdf on iphone

Is there a way to retrieve text coordinates from PDF file on iPhone?
Thanks,
Nava.
More details: I'm trying to get the words from a PDF file and highlight them. While that is a pretty simple task on Mac OS X, which has PDFKit, it's not that trivial on the iPhone, which has only the Quartz set of functions for presenting and getting information from a PDF file. So far I have succeeded in the following: getting the list of words from a PDF file by scanning its content and using the Tj and TJ operators (see how to search text in pdf). Tj gives a string, and I can get words from it. TJ gives an array, probably of glyphs, since most of its members come as single characters, but concatenating them still gives a string that I can get words from.
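A minimal sketch of how a TJ array might be flattened into text. It assumes the array has already been parsed into Java strings and numbers (parsing and font decoding are not shown). It relies on the PDF convention that the numbers in a TJ array are position adjustments in thousandths of a text space unit, where a large negative value pushes the next glyph further along and so often marks a word gap. The class name TjJoiner and the threshold value are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: joins the elements of a parsed TJ array into a string. */
final class TjJoiner {
    /**
     * @param tjArray      elements of the TJ array: String pieces and Number adjustments
     * @param gapThreshold adjustments at or below -gapThreshold are treated as a word gap
     */
    static String join(List<?> tjArray, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        for (Object item : tjArray) {
            if (item instanceof String) {
                out.append((String) item);
            } else if (item instanceof Number) {
                // Small adjustments are kerning; large negative ones are likely spaces.
                if (((Number) item).doubleValue() <= -gapThreshold) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }
}
```

For example, `TjJoiner.join(Arrays.asList("H", 20, "ello", -300, "W", 15.5, "orld"), 200)` yields "Hello World". The threshold needs tuning per font, so this is a heuristic and not a reliable word splitter.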
My problem now is highlighting the words I find. That can perhaps be done by finding the TD/Td operators and calculating the character boxes myself, but for that I probably need the font, style, and other glyph characteristics to compute the glyph boxes properly, and probably also need to build a transformation matrix or something like that... Can anybody shed some light?
Solved with the open source Poppler library.
I have been trying to do the same, but building a parser myself is too technical. Then I recently found the FastPDFKit open source SDK. There is a free version with a sample iOS project that includes search and highlight.
http://mobfarm.eu/fastpdfkit
After reading the other answers I will start exploring Poppler too. If someone has a sample project please let me know :)

Converting hebrew text to an image using imagemagick

I need to convert text to an image. Using imagemagick I can get this done.
However, part or all of the text could be in Hebrew (an RTL language).
This means the words in Hebrew are rendered backwards.
If I could be sure that the text was only Hebrew, I would simply reverse the text before sending it to ImageMagick. However, this solution won't work if part of the text is in English.
Does anyone have any idea how this can be done?
P.S. I'm not committed to using ImageMagick, if a better way comes up.
However, the solution should work for both Linux and Windows (I might be able to live with a non-windows solution, but a multi OS solution is preferable).
Thanks,
Niv
I found this link:
http://www.experts-exchange.com/Software/Photos_Graphics/Web_Graphics/Q_21766928.html
They suggest:
Maybe Unifier (http://www.melody-soft.com/html/unifier.html) or Encoding Master (http://www.elfdata.com/encodingmaster/index.html)
It sounds like your real issue is reordering the bidirectional text for ImageMagick. That is a job for the Unicode bidirectional algorithm; see http://unicode.org/reports/tr9/. That report lists two reference implementations. Or see this one: http://fribidi.org/
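As a rough illustration of what the reordering involves, the sketch below uses java.text.Bidi from the Java standard library, which implements the Unicode bidirectional algorithm. It turns a logical-order string into the visual order needed by a renderer that draws characters strictly left to right, in the order it receives them. The class name VisualOrder is hypothetical. The sketch assumes a single line of text and does not mirror brackets or handle combining marks such as Hebrew vowel points.

```java
import java.text.Bidi;

/** Hypothetical helper: converts a logical-order line to visual order. */
final class VisualOrder {
    static String toVisual(String logical) {
        // The base direction is taken from the first strong character, defaulting to LTR.
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int runCount = bidi.getRunCount();
        byte[] levels = new byte[runCount];
        String[] runs = new String[runCount];
        for (int i = 0; i < runCount; i++) {
            levels[i] = (byte) bidi.getRunLevel(i);
            String run = logical.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            // Odd embedding levels are right-to-left: reverse the characters of the run.
            runs[i] = (levels[i] % 2 == 1)
                    ? new StringBuilder(run).reverse().toString()
                    : run;
        }
        // Put the runs themselves into visual order according to their levels.
        Bidi.reorderVisually(levels, 0, runs, 0, runCount);
        return String.join("", runs);
    }
}
```

With this, mixed text keeps its English parts untouched while only the Hebrew runs are reversed, and the resulting string can be passed to the rendering tool in place of the original.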