Is there a way to retrieve text coordinates from a PDF file on the iPhone?
Thanks,
Nava.
More details: I'm trying to get words from a PDF file and highlight them. While that's a pretty simple task on Mac OS X, which has PDFKit, it's not that trivial on the iPhone, which only has the Quartz set of functions for presenting and getting information from a PDF file. So far I have succeeded in the following: getting the word list from the PDF file by scanning its content and using the Tj and TJ operators (see how to search text in pdf). Tj gives a string, and I can get words from it. TJ gives an array, probably of glyphs, since most of its members come as single characters, but connecting them together still gives a string and I can get words from there.
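For illustration, here is a minimal Python sketch of joining a TJ array into text. In a TJ array the numbers are position adjustments in thousandths of a text-space unit, subtracted from the current horizontal position, so a large negative number usually means a visible gap. The function name `text_from_tj` and the `gap_threshold` value are assumptions made for this sketch, and real PDF strings are bytes that still need decoding through the font's encoding.

```python
def text_from_tj(elements, gap_threshold=200):
    """Join the operands of a TJ array into one string.

    elements holds strings (glyph runs) and numbers (adjustments in
    thousandths of a text-space unit). A number is subtracted from the
    horizontal position, so a large negative value pushes the next glyph
    to the right and often marks a word gap.
    gap_threshold is a hypothetical heuristic, not a value from the spec.
    """
    parts = []
    for element in elements:
        if isinstance(element, str):
            parts.append(element)
        elif -element > gap_threshold:
            # Wide rightward shift: treat it as a space between words.
            parts.append(" ")
    return "".join(parts)


# Kerned glyph runs with one wide gap between two words.
print(text_from_tj(["H", 20, "ello", -350, "W", 40, "orld"]))
```

Small adjustments are kerning and are ignored here; only shifts wider than the threshold are turned into spaces.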
My problem now is highlighting the found words. Maybe it can be done by finding the TD/Td operators and calculating the character boxes myself, but for that I probably need the font, style and other characteristics of the glyphs to calculate the glyph boxes properly, and probably also have to build a transformation matrix or something like that... Can anybody shed some light?
Solved with the open source Poppler library.
I have been trying to do the same, but building a parser myself is too technical. Then I recently found the FastPDFKit open source SDK. There is a free version with a sample iOS project that includes search and highlight.
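To give an idea of what that calculation involves, here is a rough Python sketch of the matrix arithmetic, not a complete solution. It assumes the glyph width, ascent and descent (in thousandths of an em) have already been read from the font's dictionaries, and it ignores word spacing, horizontal scaling and text rise. The helper names `multiply`, `apply`, `glyph_box` and `advance` are made up for this example.

```python
def multiply(m, n):
    """Multiply two PDF matrices given as (a, b, c, d, e, f), m first."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def apply(m, x, y):
    """Transform the point (x, y) by the PDF matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def glyph_box(text_matrix, ctm, font_size, width, ascent, descent):
    """Axis-aligned box (x0, y0, x1, y1) of one glyph in page space.

    width, ascent and descent are in thousandths of an em (assumed to
    come from the font's Widths array and font descriptor).
    """
    scale = (font_size, 0, 0, font_size, 0, 0)
    rendering = multiply(multiply(scale, text_matrix), ctm)
    corners = [
        apply(rendering, x / 1000.0, y / 1000.0)
        for x in (0, width)
        for y in (descent, ascent)
    ]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def advance(text_matrix, font_size, width, char_spacing=0.0):
    """Move the text matrix past a glyph that has just been shown."""
    tx = width / 1000.0 * font_size + char_spacing
    return multiply((1, 0, 0, 1, tx, 0), text_matrix)


identity = (1, 0, 0, 1, 0, 0)
tm = multiply((1, 0, 0, 1, 72, 700), identity)  # effect of "72 700 Td"
print(glyph_box(tm, identity, 12, 500, 750, -250))
tm = advance(tm, 12, 500)  # origin of the next glyph
```

Walking a string glyph by glyph with `glyph_box` and `advance`, and merging the boxes of a matched word, gives a rectangle that could be drawn as a highlight.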
http://mobfarm.eu/fastpdfkit
After reading the other answers I will start exploring Poppler too. If someone has a sample project please let me know :)
I'm developing an Android application that does OCR using the Tesseract libraries, with the tess-two project, as I saw here: http://gaut.am/making-an-ocr-android-app-using-tesseract/
The app works fine, but I'm noticing that the string returned with the content of a photo sometimes comes with strange characters. Example: I'm reading this: www.caelum.com.br and receiving something like this: r ' . ,wlñzf . 94' kzl 5. vsmNs/.caelumcombr
After searching, I configured this: baseApi.setVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
But I think that made it worse.
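One possible reason the whitelist makes things worse: it contains only unaccented letters, so the dots in a URL, digits and Portuguese accented letters can no longer be recognized at all. Below is a small Python sketch of building a wider whitelist string. The set of accented letters and the punctuation are assumptions to adjust, and `build_whitelist` is a made-up helper; the resulting string would be passed to the same `setVariable` call.

```python
import string

# Accented lower-case letters common in Portuguese (an assumed, adjustable set).
PORTUGUESE_LOWER = "áàâãéêíóôõúç"


def build_whitelist(extra_punctuation=".:/-_@"):
    """Return letters (plain and accented), digits and some punctuation."""
    letters = string.ascii_letters + PORTUGUESE_LOWER + PORTUGUESE_LOWER.upper()
    return letters + string.digits + extra_punctuation


whitelist = build_whitelist()
# Every character of the example URL is now allowed.
print(all(ch in whitelist for ch in "www.caelum.com.br"))
```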
I want to read texts in Portuguese and English, so I downloaded the traineddata for each language and I'm using them as needed. Do these strange characters have something to do with the project's encoding?
Thanks for the help :)
Tesseract recognizes text reliably only in images that contain text and nothing else. Such images will be recognized with good accuracy.
However, Tesseract gives garbled output for images that mix pictures and text.
I haven't worked on that kind of recognition, so I can't help further.
So your question should really be how to crop the image so that only the text part remains. That way Tesseract can recognize it fine and give the desired text in the output.
Thanks.
I have been looking for a way to present an interactive PDF file (created with InDesign) on the iPhone. I read a bunch of questions here, but none says how to do it. The PDF file contains text, and in the middle it contains a 3D module, but when I present it on the iPhone it shows only the text and an empty white box where the module should appear.
Is it even possible to do it?
I'll be glad for any assistance on this subject, or even a pointer on where to look.
Thanks in advance,
Shahar.
Apple's PDF parser does not support 3D content. You're better off implementing the 3D part yourself and just adding it as a UIView on top of the PDF. There are several PDF frameworks that help with that (see https://stackoverflow.com/questions/3801358/pdf-parsing-library-for-ios).
Another alternative might be licensing Adobe's iOS rendering engine, but I doubt that they have already added 3D support (or that they will). Also, from what my sources tell me, pricing is rather high and the framework is apparently not very developer friendly. (But I haven't used it myself.)
I am developing an app that needs PDF text searching and highlighting. I found that highlighting in a PDF is very difficult, so I thought of converting the PDF to HTML and then using JavaScript to search for the string and highlight it. I actually succeeded in searching and highlighting HTML text using JavaScript. If anyone needs the source code, send me your email address.
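The search-and-highlight step itself is small. As a rough illustration (in Python rather than the JavaScript mentioned above, and working on plain extracted text rather than on existing HTML markup), each match can be wrapped in a `<mark>` element. The function name `highlight` is made up for this sketch.

```python
import html
import re


def highlight(text, query):
    """Return HTML with each case-insensitive match of query in <mark>."""
    if not query:
        return html.escape(text)
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then wrap the match itself.
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)


print(highlight("PDF text & pdf search", "pdf"))
# prints: <mark>PDF</mark> text &amp; <mark>pdf</mark> search
```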
But my obstacle is the PDF to HTML conversion. I know it is very hard, because PDF is rich text and HTML doesn't support all of its features. In the meantime I found some Python source code, PDFMiner, but without jailbreaking it's hard to use Python on iOS, so I dropped that idea too.
Now I'm looking at xPDF, which is C++-based code that converts PDF to HTML. Has anyone succeeded in integrating xPDF into an iOS app? I want to know whether this is feasible.
Thanks in advance for your thoughtful replies,
Naveen Thunga.
Here you can find an example. It still has some problems, but it is a good start:
https://github.com/KurtCode/PDFKitten
I need to convert text to an image. Using ImageMagick I can get this done.
However, part or all of the text could be in Hebrew (an RTL language).
This means the words in Hebrew are rendered backwards.
If I was assured that the text was only Hebrew, I would have just reversed the text before sending it to ImageMagick. However, this solution won't work if part of the text is in English.
Does anyone have any idea how this can be done?
P.S. I'm not committed to using ImageMagick, if a better way comes up.
However, the solution should work for both Linux and Windows (I might be able to live with a non-windows solution, but a multi OS solution is preferable).
Thanks,
Niv
I found this link:
http://www.experts-exchange.com/Software/Photos_Graphics/Web_Graphics/Q_21766928.html
They suggest:
Maybe Unifier (http://www.melody-soft.com/html/unifier.html) or Encoding Master (http://www.elfdata.com/encodingmaster/index.html)
Sounds like your real issue is re-ordering the bidirectional text for ImageMagick. That is a job for the Unicode bidirectional algorithm; see http://unicode.org/reports/tr9/ (that report lists two reference implementations), or see this one: http://fribidi.org/
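To show the idea only, here is a much-simplified Python sketch that reverses runs of right-to-left characters inside an otherwise left-to-right line. It is not an implementation of the Unicode bidirectional algorithm: digits inside Hebrew text, combining marks, mirrored brackets and explicit embeddings are not handled, so a full implementation such as FriBidi is still the right tool for real input. The function name `to_visual_order` is made up for this example.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}
# Classes that end a right-to-left run in this simplified sketch.
RUN_BREAKERS = {"L", "EN", "AN"}


def to_visual_order(line):
    """Reverse each run of RTL characters in a left-to-right line."""
    chars = list(line)
    out = []
    i = 0
    while i < len(chars):
        if unicodedata.bidirectional(chars[i]) not in RTL_CLASSES:
            out.append(chars[i])
            i += 1
            continue
        # Extend over RTL characters and the neutrals between them,
        # remembering where the last RTL character was.
        j = i
        last_rtl = i
        while j < len(chars):
            bidi_class = unicodedata.bidirectional(chars[j])
            if bidi_class in RUN_BREAKERS:
                break
            if bidi_class in RTL_CLASSES:
                last_rtl = j
            j += 1
        run_end = last_rtl + 1  # trailing neutrals stay in place
        out.extend(reversed(chars[i:run_end]))
        i = run_end
    return "".join(out)


# "abc", three Hebrew letters (alef, bet, gimel), "def"
logical = "abc \u05d0\u05d1\u05d2 def"
visual = to_visual_order(logical)
```

The English words keep their order, while the Hebrew run comes out reversed, which is what a renderer that draws characters strictly left to right needs.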
Is there a test suite for PDFs, preferably in Perl? What I want is some function to test the positioning and existence of some text (and if possible the name of a graphic) in a PDF file. Is this theoretically possible with PDF markup?
Thank you for your help.
As jrockway said, there's not a 100% solution available today. With my CAM::PDF library, you can compute positions for any element in the document. See my answer to "How do I get character offset information from a pdf document?" which shows how to extract coordinates for all text on a page.
I don't think there is anything pre-built on the CPAN, but Test::Builder and CAM::PDF should allow you to write what you want.
Once you get it working, upload it to the CPAN... and then there will be a way to test PDFs on the CPAN :)