Extracting the cover image from a word document

Extracting the cover image from a word document - ms-word

My task involves extracting the cover image in a word document.The heuristics that I am following is "if the first page of the document contains only an image then it is a cover image otherwise there is no cover image ".So i need to get the content of the first page alone and check if it has only an image.How can i do it ?
I tried a bunch of wordprocessing API like POI,docx4j,etc . But these API doesnt have any provisions to identify a particular page's content.I have also tried to write my own XML parsing.I understand that word document is reflowable and openxml of the docxfile doesnt have any clue regarding implicit page breaks.
I have posted a [question regarding this]: Finding implicit page break in word document using xml parsing
and there were no helpful answer.
So if this cannot be done via xml parsing of the openxml of the word document, what is the best way to do it ? Is there any helpful API in Java for this task ?

Related

generate word file through open xml

I am using openxml to generate my word file that contains user input messages and attachment if there are any. Now, I am stuck in a situation where I don't know how to display PDF /JPEG/JPG if user attached such things with the inputted message.
Is there any way I can show the above attached in my generated word file.
Thanks

MSDN has the specific example of adding the images to the document. The sample code you could use is provided here.
http://msdn.microsoft.com/en-us/library/bb497430%28v=office.14%29.aspx

Creating PDF from HTML using iTextSharp with TextField and editing the same later

I have a requirement to create a PDF file from HTML. The resulting PDF needs to have iTextSharp TextField or something similar. I need to update the PDF document with appropriate text in the text field.
Points to note:
1. The PDF length (page numbers) may vary.
2. Due to this, I may have to only know the name of the field to set value to.
OR
I could create a PDF from HTML. As the content of the PDF may vary, I do not know the exact location of a block that I need to edit. I need to stamp text exactly over the block irrespective of the location of the block (i.e. the block may exist in any page).
Example Scenario:
Create a PDF from HTML.
It is sent for approval process. Once it is approved, the name of the approver is printed at a specific place (however the signature area, mentioned as block above, may come at any page, as it depends on the content of the HTML).

Two resources which may help you:
This article details how to use asp.net & itextsharp to create a pdf.
The whole article is pretty useful for a beginner like me, but the section detailing how to create a pdf from HTML may be useful for you as a start on your problem.
https://web.archive.org/web/20211020001758/https://www.4guysfromrolla.com/articles/030911-1.aspx
I would especially pay attention to how he replaces placeholders in the HTML template he is loading. As it seems you may want to head in that direction.
Now to answer your question more directly, have you thought about using a fillable form?
Here is a related Stack Overflow post.
Creating a fillable PDF form with ITextSharp
As I said I am a beginner, so I can't do much to help you from here. But with any luck you can put those together and accomplish what you are looking for.
Let me know if that helps Good luck!

How to read PDF file from right to left on iOS

I am working on a basic project that reads pdf files from a server and show them on the screen.
The issue is that i want to read that files from right to left as a page.

Like Massimo Cafaro say :
If you want to extract some content from a pdf file, then you may want to read the following:
Parsing PDF Content
from the Quartz 2D programming guide.
Basically, you will use a CGPDFScanner object to parse the contents, which works as follows. You register a few callbacks that will be automatically invoked by Quartz 2D upon encountering some pdf operators in the pdf stream. After this initial step, you then actually start parsing the pdf stream.
Taking a brief look at your code, it appears that you are not following the steps required to parse the pdf content of the page you get through CGPDFDocumentGetPage(). You need first to setup the callbacks using CGPDFOperatorTableCreate() and CGPDFOperatorTableSetCallback(), then you get the page, you need to create a content stream using that page (using CGPDFContentStreamCreateWithPage()) and then instantiate a CGPDFScanner through CGPDFScannerCreate() and actually start scanning through CGPDFScannerScan().
The "Parsing PDF Content" section of the document pointed out by the above URL gives you all of the information required to implement pdf parsing.
if you don't try anything you can start with this project link

PDF to Jasper XML

I have a PDF file (softcopy) which was created using iText. Now my company decided to use JasperReports for new release. I need to use that PDF file (softcopy) and need to design JasperReports template and need to populate data.
Do we have any plugin in JasperReports that can convert from PDF to JasperReports JRXML or what do I need to do? Any suggestions?

A PDF is a description of how to render a document on a page. Things
like "draw a vertical line here", "write 'foo bar baz' here in
Courier". It does not contain any information about the format or
organisation of the stuff it is rendering. You won't be able to tell
that you're looking at a table, or a list of bullet points, or a
paragraph, or anything like that.
The PDF format does contain information on a page-by-page basis.
Therefore, page breaks are the one piece of format/organisation
information that you can find.
If you want anything more than a raw stream of completely unformatted,
disorganised text, one per page, you are out of luck. It's virtually
impossible.
from javaranch

You can use http://xmlprinter.com/ and then use a xslt to transform the resulted xml to the desired jrxml.
I'm working in it. If I finish it, i will post the result on github or any other public and open place.
Good Luck

Converting large amounts of text and dynamic data into PDF

I have a three page Word document that needs to be converted into PDF. This Word document was given to me as a template to show me what the PDF output should look like. I tried converting this document into PDF, created a PDF form and used iTextSharp to open the form, populate it with data and return it back to the client. This is all great but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden.
My second attempt was to create an MVC 2 View without master page, pass the model to the view, take the HTML representation of the View, pass it over to iTextSharp and render the PDF. The problem here was that iTextSharp failed on some tags (one of them was <hr> tag). I managed to get rid of the problematic tag, but then tables were not rendered properly. Namely, the border attribute was ignored so I ended up with borderless tables. That attempt failed.
I need a suggestion or advice on the most efficient way to create a PDF document in MVC 2 which would be maintainable in the long run. I really don't want my actions to be 200+ lines long. Working directly with the Word document is not the best solution as I have never worked with VSTO so I don't quite know what it would look like to open Word and manipulate text inside of it and add dynamic data and then convert that dynamically into PDF.
Any suggestion is highly welcome.
Best regards!

One thing that I've done in the past is to save the Word file as a DOCX and unzip it since DOCX is just a renamed zip file. Within the archive open up /word/document.xml and you'll see your document. There's a lot of weird XML tags in there but overall you should get a pretty good idea of where your content is. Then just add placeholder text like {FIRST_NAME}, save the file and re-zip.
Then from code you can just perform the same steps, unzipping with something like SharpZipLib or DotNetZip, swapping placeholder copy, re-zipping and then using very simple Word automation to Save-As a PDF.
The other route is to fully utilize iTextSharp and actually write Paragraphs and PdfPTable and everything else. It takes a lot longer to setup but would give you the most control.

Q: you say "... but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden"
How do you end up having to much data ? If the word template can "hold" the data in 3 pages, they should fit in 3 PDF pages.
I used to use iTextSharp to create my PDF's, but I also almost always ended up building the PDF document from scratch myself.(not really a <200 line solution) Have you considerate another library, I recently switched to MigraDoc's PDFSharp.Way simpler to use then iText, lotsa examples / docus
Just my two cents

Word documents object model is quite easy to understand. It will either contain series of Paragraphs or Tables. Using the Open XML SDK, you can iterate through each paragraph/table in the word document and retrieve it's content and styles. Then you can generate PDF document on the fly using those retrieved information. This will work under MVC too.
But if your word document contains complex elements, then it will take some more time for you to implement based on this approach. Also, this approach would only work with (Word 2007 and 2010) files.
Also, HTML to PDF options currently available in the ITextSharp library would work with only known set of tags, as far as I know.
Another suggestion is to make use of commercially available .NET components. There are lot of good solution available. For ex: Syncfusion