Matlab access PDF as an array of images - matlab

I am building a system that searches for a specific region in a picture and saves it. Everything works fine. Mostly I am going to extract these regions from PDF books.
So I am looking for a way to treat a PDF file in MATLAB as an array of images (each page is an image). So far the only thing I have found is how to open PDF files in MATLAB.
The best solution I have come up with is to export the PDF as a set of PNG images and iterate through them. There is nothing wrong with this idea, but I am wondering whether I am missing something.

Judging from this page, it appears to be impossible to import a PDF directly into MATLAB:
A quick File Exchange search for 'pdf import' turns up only an attempt to extract text, not images.
So all in all, your approach of saving the PDF as images and then importing them seems to be the way to go.

I agree with Salvador Dali and Dennis. To convert each page of the PDF to a png image, I downloaded imagemagick and followed the commands here:
https://aleksandarjakovljevic.com/convert-pdf-images-using-imagemagick/
Specifically:
convert -density 150 -antialias "input_file_name.pdf" -resize 1024x -quality 100 "output_file_name-%03d.png"
Of course, there are other discussions about using ImageMagick for this purpose:
Converting a PDF to PNG and
Convert PDF to PNG using ImageMagick
This is an old thread, but it's the one I found when I asked the same question, so I thought I would elaborate in case it's helpful to future users who also land on this thread.
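The export-then-iterate workflow described above can be sketched as a small script. This is only an illustration under stated assumptions: it uses Python rather than MATLAB, the helper names `build_convert_command` and `sort_page_files` are hypothetical, and it assumes ImageMagick is installed and exposes the `convert` command used in the answer (newer ImageMagick releases may expect `magick` instead).

```python
import re


def build_convert_command(pdf_path, output_prefix, density=150, width=1024):
    """Build the ImageMagick argument list quoted in the answer above.

    The flags mirror the original command; only the file names vary.
    """
    return [
        "convert",
        "-density", str(density),
        "-antialias",
        str(pdf_path),
        "-resize", f"{width}x",
        "-quality", "100",
        f"{output_prefix}-%03d.png",  # ImageMagick fills in the page index
    ]


def sort_page_files(filenames, output_prefix):
    """Return the exported page images ordered by their page index.

    Files that do not match '<prefix>-<digits>.png' are ignored.
    """
    pattern = re.compile(re.escape(output_prefix) + r"-(\d+)\.png$")
    pages = []
    for name in filenames:
        match = pattern.search(str(name))
        if match:
            pages.append((int(match.group(1)), str(name)))
    return [name for _, name in sorted(pages)]
```

The argument list can be passed to `subprocess.run(..., check=True)` to perform the conversion, and the sorted file names can then be read one by one (with `imread` in MATLAB, for example), which gives the "array of pages" the question asks for.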

Related

How to read pdf table content data?

I need to read a PDF file that contains only tabular data, laid out like an Excel sheet, and extract the cell values from it.
Is this possible with the iText API? If you have an approach to share, with iText or any other tool, please share it.
The PDF format is just a canvas where text and graphics are placed without any structure information. As such there aren't any iText-objects in a PDF file. In each page there will probably be a number of Strings, but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines.
In short: parsing the content of a PDF-file is NOT POSSIBLE with iText.
You can try this! This lets you read PDF pages.
I recently ran into this problem. I wasn't able to make it work with itext.
An alternative solution I found was to open the PDF document in Adobe and export it to XML. At least with my PDFs, the export preserved the table information, and I was then able to work with the XML programmatically to generate tabular files such as Excel spreadsheets.
The other issue I ran into was that Adobe only lets you export one file at a time and I had lots of files. Luckily Adobe also has a merge function. I ended up merging all the files together and then exporting them as one big XML file and working with that file to generate what I needed.
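Working with the exported XML might look roughly like the sketch below. The element names `Table`, `TR`, and `TD` are assumptions made for illustration; the answer does not say which tags the export produces, so they are parameters here and should be adjusted to match the real file. The function name `table_rows` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def table_rows(xml_text, table_tag="Table", row_tag="TR", cell_tag="TD"):
    """Collect every table in the XML as a list of rows of cell strings.

    The tag names are assumed, not taken from the original answer.
    Nested tables are not handled specially.
    """
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter(table_tag):
        rows = []
        for row in table.iter(row_tag):
            # itertext() gathers text from nested elements inside a cell
            cells = ["".join(cell.itertext()).strip() for cell in row.iter(cell_tag)]
            rows.append(cells)
        tables.append(rows)
    return tables
```

Each returned table is a plain list of lists, which can be written out with the `csv` module or any spreadsheet library.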

Images in OOXML (Office Open XML) standard documents are damaged. Where can I find a good copy?

We are working on a project that deals with the OOXML format, specifically DOCX. We downloaded the PDFs from the ISO site (http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html) but found that all the images in them are black. Some images have colored lines, but none of them has text.
Has anyone read the standard?
Where can I get a copy of the document with intact images?
Thanks
You can take a look at the ECMA-376 version of the standard at the following link. I would download the third edition set of PDFs, as they are the most recent to date.

Change in resolution in pdf after save by itextsharp

I am new to itextsharp. I'm not sure if this is the right forum, because three programs are involved in the project I'll describe below: Silverlight 4, Amyuni's PDF for Silverlight, and itextsharp 4. Add to that the fact that I'm using code I got from a project off the web to translate the Silverlight inkpresenter into an image. This includes an "editableimage" class that calls a PNG encoder class. As you can see, in my rush to get this working I've picked up many tools, any one of which may be causing my problem (translate that to mean that I am using one or more of them incorrectly :-).
I have a feeling it's something in the way I'm using itextsharp to save a PDF, though it occurred to me that the pngencoder may have something to do with it. At the very least I can see it doesn't compress the PNG that it creates.
I have a project where I am loading a PDF from a file into a Silverlight inkpresenter using Amyuni's PDF for Silverlight. As a proof of concept I brought the first page of a PDF into the inkpresenter using Amyuni, created a bitmap using a writeablebitmap, and passed that to the editableimage object and the PNG encoder mentioned above. The PNG is then streamed to an httphandler where itextsharp converts it to a PDF. This PDF is saved in a database table. I made sure the rectangle for the PDF had the same dimensions as the bitmap created by the writeablebitmap and editableimage.
I then used Amyuni PDF for Silverlight to read the PDF saved in the database back into the inkpresenter. For some reason the loaded PDF is bigger than the original page from the PDF file. The font is larger, and less of the PDF fits into the same inkpresenter. I'm not sure, but it seems like the dimensions of the PDF page saved to the database are larger than they were when the page was loaded into the same inkpresenter from the file. I suspect that it's some mistake I'm making when saving the PDF page using itextsharp. I have seen posts here on Stack Overflow where other people have experienced the same thing. I've done my best to figure this out by googling but, unfortunately, it's hard to pin the issue down considering all the different kinds of software I'm using.
Any advice would be appreciated.
Fig000
If you are already using Amyuni PDF Creator for showing PDF files in Silverlight, you could also use it to generate your PDF files with the png image, at server side.
The code will look like this:
PDFCreactiveX pdfdoc = new PDFCreactiveXClass();
pdfdoc.CreateObject(ObjectTypeConstants.acObjectTypePicture, "Picture1");
pdfdoc.set_ObjectAttribute("Picture1", "FileName", "C:\\mytemppicture.png");
pdfdoc.set_ObjectAttribute("Picture1", "Left", 0);
pdfdoc.set_ObjectAttribute("Picture1", "Top", 0);
pdfdoc.Save("c:\\mytemppdf.pdf", FileSaveOptionConstants.acFileSaveDefault);
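One possible cause of the size change described in the question, offered as an assumption because the thread does not confirm it, is a unit mismatch: PDF page dimensions are expressed in points (1/72 inch), while a screen bitmap is usually measured in pixels at some other density (96 pixels per inch is assumed below). If a page rectangle is given the bitmap's pixel dimensions directly, the page comes out larger than intended. The helper names are hypothetical and the sketch is independent of itextsharp and Amyuni.

```python
POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch


def pixels_to_points(pixels, dpi):
    """Convert a pixel length at the given density to PDF points."""
    return pixels * POINTS_PER_INCH / dpi


def page_size_for_bitmap(width_px, height_px, dpi=96.0):
    """Page size in points that shows the bitmap at its physical size.

    The 96 DPI default is an assumption about the source bitmap.
    """
    return (pixels_to_points(width_px, dpi), pixels_to_points(height_px, dpi))
```

Under that assumption, an 816 x 1056 pixel bitmap corresponds to a 612 x 792 point page. Using 816 x 1056 as the page size in points instead would make the page about 1.33 times larger in each direction, which would match the "bigger than the original" symptom.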

get text coordinates from pdf on iphone

Is there a way to retrieve text coordinates from a PDF file on the iPhone?
Thanks,
Nava.
More details: I'm trying to get words from a PDF file and highlight them. While that is a pretty simple task on Mac OS X, which has PDFKit, it is not that trivial on the iPhone, which only has the Quartz set of functions for presenting a PDF file and getting information from it. So far I have succeeded in the following: getting a word list from the PDF file by scanning its content and using the Tj and TJ operators (see how to search text in pdf). Tj gives a string, and I can get words from it. TJ is probably an array of glyphs, since most of its members come as single characters, but joining them together still gives a string that I can get words from.
My problem now is highlighting the words I find. This can perhaps be done by finding the TD/Td operators and calculating the character boxes myself, but for that I probably need the font, style, and other characteristics of the glyphs to be able to calculate the glyph boxes properly, and probably also need to build a transformation matrix or something like it. Can anybody shed some light?
Solved with the open source Poppler library.
I have been trying to do the same, but building a parser myself is too technical. I recently found the FastPDFKit open source SDK. There is a free version with a sample iOS project that includes search and highlight.
http://mobfarm.eu/fastpdfkit
After reading the other answers I will start exploring Poppler too. If someone has a sample project please let me know :)
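For readers who do want to attempt the manual route the question describes, the matrix bookkeeping behind the Td operator can be sketched as follows. PDF matrices are written as six numbers (a, b, c, d, e, f), and Td offsets the text line matrix by (tx, ty) expressed in text space. This is only a partial sketch: a real glyph box also needs the font size, glyph widths, character and word spacing, and the current transformation matrix, all of which are omitted here. The function names are hypothetical.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def multiply(m1, m2):
    """Return m1 x m2 for PDF-style matrices (m1 is applied first)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply_td(line_matrix, tx, ty):
    """Effect of 'tx ty Td': prepend a translation to the text line matrix."""
    return multiply((1.0, 0.0, 0.0, 1.0, tx, ty), line_matrix)


def transform_point(matrix, x, y):
    """Map a point through a PDF matrix: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)
```

Tracking the matrix this way gives the origin of each text run; the highlight rectangle would then be built from that origin plus the measured width and height of the glyphs.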

Is there a test suite for PDF files?

Is there a test suite for PDFs, preferably in Perl? What I want is a function to test the position and existence of some text (and, if possible, the name of a graphic) in a PDF file. Is this theoretically possible with PDF markup?
Thank you for your help.
As jrockway said, there's not a 100% solution available today. With my CAM::PDF library, you can compute positions for any element in the document. See my answer to "How do I get character offset information from a pdf document?" which shows how to extract coordinates for all text on a page.
I don't think there is anything pre-built on the CPAN, but Test::Builder and CAM::PDF should allow you to write what you want.
Once you get it working, upload it to the CPAN... and then there will be a way to test PDFs on the CPAN :)
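The kind of check the question asks for can be sketched independently of the extraction step. The example below assumes the text has already been pulled out of the PDF by some other tool (CAM::PDF, for instance) into (text, x, y) tuples. It is written in Python purely for illustration, although the thread is about Perl, and the helper names are hypothetical.

```python
def find_text(items, needle):
    """Return the (x, y) positions of every item whose text contains needle."""
    return [(x, y) for text, x, y in items if needle in text]


def text_is_near(items, needle, expected_x, expected_y, tolerance=2.0):
    """True if needle occurs within tolerance of the expected position.

    Positions and tolerance are in the same units as the extracted
    coordinates (typically PDF points).
    """
    return any(
        abs(x - expected_x) <= tolerance and abs(y - expected_y) <= tolerance
        for x, y in find_text(items, needle)
    )
```

A small tolerance is used because extracted coordinates often differ slightly between PDF producers even when the layout looks identical.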