So I'm brand spanking new to iTextSharp and I know I have quite a bit of reading ahead of me, but in an attempt to shave a bunch of time off a relatively trivial task I thought I'd reach out to the Stack brain trust.
I have a very simple goal: starting with a template PDF, I need to create a new PDF with a few of the characters changed. We're talking single characters on each page. I don't need a detailed answer complete with code (although that'd be awesome) so much as a general list of the tools and APIs I'm going to need.
The data I need will already be in a db which I could output to xml files if need be.
So far it looks like my template will need the "editable" characters tagged somehow (not sure how to do that yet), and then I can modify the copy using PdfStamper. Is that the right path or is there a better way?
Thanks for any insight.
I think a student of mine renamed a PNG as a Word document and intentionally submitted a corrupted file to buy more time (or something) on an assignment. The student denies everything and claims it was a computer malfunction. Before I submit an honor code violation I want to be sure that there's no explanation that does not involve cheating that I'm somehow overlooking.
Basically, I'm a TA and a student submitted a paper, let's say it was Smith.docx. When I was working on grading and went to open Smith.docx, Word wouldn't open it and said that it was corrupted. I eventually had the idea of opening it in a text editor, and there it was: a massive jumbled file of all sorts of odd characters (total file size: 180 KB for what was supposed to be a 5-page paper).
I noticed, though, that the first few characters of the file were:
‰PNG
I renamed the file Smith.png and it opened. Bizarrely, it was an image of the first page of a Word document. More specifically, it looks like a screenshot of a Word doc cropped so as to show just the page. What makes it seem like a screenshot is that the cursor thingy (the vertical bar marking where you're typing) shows up next to the title.
An additional interesting bit of data is that if I scroll further down in the file (opened in notepad) I come to this:
XML:com.adobe.xmp <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 5.4.0">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:exif="http://ns.adobe.com/exif/1.0/">
<exif:PixelXDimension>996</exif:PixelXDimension>
<exif:PixelYDimension>1286</exif:PixelYDimension>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
I'm not sure what all that means, but 996x1286 (width x height, matching the PixelXDimension and PixelYDimension values) are the dimensions of the PNG image. The rest suggests to me that the file was created in some Adobe program, but I'm not sure if that's right or how to figure out more about it.
So, my actual question: Is there any conceivable explanation of any kind for how I would come to have a file called Smith.docx that is a perfectly functioning PNG of what sure looks like a screenshot of the first page of a Word document, other than that the student did it on purpose? The student claimed that their computer was "corrupting" files and that they had to take it in to Apple for service. I find this incredibly implausible (the student has also not provided the receipt for this, which I requested).
Additionally, other than the case I laid out here, is there any positive evidence for my theory (that it was a straightforward case of cheating) that I can present to strengthen my case? E.g., is the data from the file that I posted above a smoking gun that it was created in an Adobe program, or is there any conceivable way that could come out of a Word document or some other sort of corrupted file?
Also, is there anything else I can look for in the PNG file that would be a smoking gun?
Thanks in advance for any help you might be able to offer!
Just rename the file with .png at the end instead of .docx. If it really is a PNG, it should open just fine as a PNG.
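If you'd rather not rely on the extension at all, you can check the first bytes of the file instead. Below is a minimal Python sketch of that idea, not a forensic tool. The function names `sniff_png` and `list_chunks` are made up for illustration. It tests for the 8-byte PNG signature (the "‰PNG" seen in the text editor), reads the width and height from the IHDR chunk, and lists the chunk types, which is where metadata such as the XMP block would show up. A genuine .docx is a zip archive, so it would start with `PK` and be rejected here.

```python
import struct
import sys

# Every PNG file starts with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_png(data):
    """Return (width, height) if data looks like a PNG, else None."""
    if not data.startswith(PNG_SIGNATURE):
        return None
    # The first chunk must be IHDR: 4-byte length, 4-byte type, 13 data bytes.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def list_chunks(data):
    """Return [(chunk_type, data_length), ...] without validating CRCs."""
    chunks = []
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        chunks.append((chunk_type.decode("latin-1"), length))
        pos += 12 + length  # length + type + data + CRC
    return chunks


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as fh:
        raw = fh.read()
    print("PNG dimensions:", sniff_png(raw))
    print("Chunks:", list_chunks(raw))
```

Run it with the suspect file as the only argument. A text chunk such as `iTXt` in the listing is where the XMP metadata quoted in the question would live.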
The key is that you see the cursor in the screenshot. There is no way Word would (somehow) export a docx file as a PNG AND draw the typing cursor. Also, any tool that could do that would save the file as .png, not .docx; only the user could deliberately change the file extension.
Also, does the screenshot show an empty document? Or does it look like the final document your student delivered in the end?
Short answer:
The student is lying and is in fact a cheater (in my opinion).
Also, even if they were telling the truth, it is still their responsibility to have their work done, ready, and fully functional on time. Your computer is corrupting your files? Tough cookies. No one cares. You should have done your work on another computer. In the real world, excuses don't get you anywhere and they shouldn't get you anywhere in school either.
Lastly, it is very easy to re-name an extension of another file type and claim it's corrupt and very unlikely that a computer is just creating corrupted files. If their computer would otherwise create corrupted files, I would imagine it would be nearly impossible to get the computer to boot. In other words, they probably wouldn't have been able to turn on their "corrupted" computer to create "corrupted" files in the first place.
So, as the title says, I would like to make an automated script that is going to take all the text from one PDF page, copy it, paste it into Google Translate and then copy the translated text into another Microsoft Word document.
Since that PDF has a lot of pages (150+), I thought it may be easier to make an automated script to do that.
What language would I have to use, and would it be complicated to do? And in the end, will I actually save time with this script, given that I have to learn the language first? I have some programming experience (I know C++, JavaScript, PHP), but I do not have a strong grasp of algorithms (like flood fill, ...).
Thanks in advance!
EDIT: I found that I could use AutoIt for scripting... but I don't know whether I'd be better off using AutoIt or PowerShell... I also want to learn something that would enable me to create other scripts (for example, to automate some processes I do in Camtasia Studio)... So, AutoIt or PowerShell?
As an AutoIt user I would say AutoIt.
Copying text out of PDFs is not quite as simple as you might imagine. Mileage will vary on how the PDF was created, and there are several methods you can use:
Most PDFs will have most of the text in the file itself, allowing you to get the text using a simple method like this
This method uses zlib to do something to the pdf. Not sure what as I've never tried it.
There are a variety of examples of using third party programs to do this, which may be better. There is one using Debenu and another using XPDF
Automating other programs such as Acrobat should be possible. In Acrobat's case there is an API that can be used, though I'm not aware of it already being wrapped in AutoIt.
As to the rest of the requirements, there is a UDF to translate with Google Translate here, and the Word UDF is a standard one that comes with the AutoIt installation.
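For what it's worth, my guess at what the zlib-based method does is inflate the PDF's compressed (FlateDecode) content streams so the text operators become readable. Here is a rough Python sketch of that guess, using only the standard library. The names `inflate_streams` and `crude_text` are hypothetical, and the text matching is deliberately crude: it only picks up simple literal strings followed by the `Tj` operator, so PDFs that use custom font encodings, `TJ` arrays, or escaped parentheses will come out garbled or empty. That is the "mileage will vary" part.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" in each PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Simple literal strings shown with the Tj operator, e.g. "(Hello) Tj".
TEXT_RE = re.compile(rb"\((.*?)\)\s*Tj")


def inflate_streams(pdf_bytes):
    """Return the decompressed contents of every zlib-compressed stream."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        inflater = zlib.decompressobj()
        try:
            data = inflater.decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data (image, other filter, ...): skip it
        if inflater.eof:  # keep only streams that decompressed completely
            results.append(data)
    return results


def crude_text(pdf_bytes):
    """Return the literal strings found next to Tj operators, in order."""
    pieces = []
    for content in inflate_streams(pdf_bytes):
        pieces.extend(s.decode("latin-1") for s in TEXT_RE.findall(content))
    return pieces
```

For anything beyond simple PDFs, one of the third-party tools mentioned above is likely the more reliable route.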
I have a Word document (in some template format) containing placeholders for data to be filled in, and there are several Word documents like this in a directory. When data comes in, I will choose a template (based on some criteria), fill in the data, and then the documents have to be converted to PDF format.
I have been investigating Apache POI for this. If anyone has a good suggestion, it would be much appreciated.
As mbeckish mentioned, you should indicate how you are going to run/automate this. For example, is it a one-off, run by hand, or part of another program (and if so, what programming languages do you use)?
If you are trying to automate it, JODReports and Docmosis are tools that can use templates like the ones you describe and can produce PDF. JODReports is free. Docmosis is not, but it has several APIs. Please note I work for the company that develops Docmosis.
Hope that helps.
I've just uploaded this presentation, which presents three approaches for doing this.
Why not use any of the existing PDF virtual printers?
I have a three page Word document that needs to be converted into PDF. This Word document was given to me as a template to show me what the PDF output should look like. I tried converting this document into PDF, created a PDF form and used iTextSharp to open the form, populate it with data and return it back to the client. This is all great but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden.
My second attempt was to create an MVC 2 View without master page, pass the model to the view, take the HTML representation of the View, pass it over to iTextSharp and render the PDF. The problem here was that iTextSharp failed on some tags (one of them was <hr> tag). I managed to get rid of the problematic tag, but then tables were not rendered properly. Namely, the border attribute was ignored so I ended up with borderless tables. That attempt failed.
I need a suggestion or advice on the most efficient way to create a PDF document in MVC 2 which would be maintainable in the long run. I really don't want my actions to be 200+ lines long. Working directly with the Word document is not the best solution as I have never worked with VSTO so I don't quite know what it would look like to open Word and manipulate text inside of it and add dynamic data and then convert that dynamically into PDF.
Any suggestion is highly welcome.
Best regards!
One thing that I've done in the past is to save the Word file as a DOCX and unzip it, since DOCX is just a renamed zip file. Within the archive, open up /word/document.xml and you'll see your document. There are a lot of weird XML tags in there, but overall you should get a pretty good idea of where your content is. Then just add placeholder text like {FIRST_NAME}, save the file and re-zip.
Then from code you can just perform the same steps, unzipping with something like SharpZipLib or DotNetZip, swapping placeholder copy, re-zipping and then using very simple Word automation to Save-As a PDF.
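To make the unzip, swap, re-zip steps concrete, here is a small sketch of them. It is in Python with the standard `zipfile` module rather than SharpZipLib or DotNetZip, purely to show the shape of the approach. The function name `fill_docx_template` and the `values` dictionary are illustrative assumptions. One caveat: Word sometimes splits typed text across several runs, so a placeholder may not appear as one contiguous string in document.xml, in which case a plain string replace will miss it. The final Save-As PDF step is not covered here.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_docx_template(template, output, values):
    """Copy a .docx, replacing {KEY} placeholders in word/document.xml.

    template and output may be file paths or binary file-like objects.
    values maps placeholder names (without braces) to replacement text.
    """
    with zipfile.ZipFile(template) as src, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the result is still valid XML.
                    text = text.replace("{" + key + "}", escape(value))
                data = text.encode("utf-8")
            # Every other part of the archive is copied through unchanged.
            dst.writestr(item.filename, data)
```

Usage would look something like `fill_docx_template("template.docx", "filled.docx", {"FIRST_NAME": "Ada"})`, where the file names are placeholders of my own.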
The other route is to fully utilize iTextSharp and actually write Paragraphs and PdfPTable and everything else. It takes a lot longer to set up but would give you the most control.
Q: You say "... but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden"
How do you end up having too much data? If the Word template can "hold" the data in 3 pages, it should fit in 3 PDF pages.
I used to use iTextSharp to create my PDFs, but I almost always ended up building the PDF document from scratch myself (not really a <200-line solution). Have you considered another library? I recently switched to MigraDoc's PDFSharp. Way simpler to use than iText, lots of examples and docs.
Just my two cents
The Word document object model is quite easy to understand. The body contains a series of Paragraphs and Tables. Using the Open XML SDK, you can iterate through each paragraph/table in the Word document and retrieve its content and styles. Then you can generate the PDF document on the fly using that retrieved information. This will work under MVC too.
But if your Word document contains complex elements, then it will take more time to implement this approach. Also, this approach only works with Word 2007 and 2010 (.docx) files.
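To give a feel for that paragraph/table structure, here is a short sketch that walks the body of a .docx. Note that it does not use the Open XML SDK; it reads the same word/document.xml part with Python's standard library, and the function name `iter_body_blocks` is made up. It only pulls out plain text and ignores styles, which is the part that takes the real effort.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text_of(element):
    """Concatenate the text of all w:t nodes below an element."""
    return "".join(t.text or "" for t in element.iter(W + "t"))


def iter_body_blocks(docx):
    """Yield ("paragraph", text) or ("table", rows) for each body element."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    if body is None:
        return
    for child in body:
        if child.tag == W + "p":
            yield "paragraph", _text_of(child)
        elif child.tag == W + "tbl":
            rows = [
                [_text_of(cell) for cell in row.findall(W + "tc")]
                for row in child.findall(W + "tr")
            ]
            yield "table", rows
        # Other body elements (section properties etc.) are skipped.
```

Each yielded block could then be handed to whatever PDF library you use to emit the matching paragraph or table.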
Also, the HTML-to-PDF options currently available in the iTextSharp library only work with a known set of tags, as far as I know.
Another suggestion is to make use of commercially available .NET components. There are a lot of good solutions available, for example Syncfusion.
I'm trying to generate Word documents using the Open XML SDK. When the documents are small this is no problem (and rather easy). When the documents become larger (500+ pages) I notice the performance (duration, memory usage, ...) goes down significantly.
Googling this problem, I came across some posts that point out the same problem. For Excel there is a solution with SpreadsheetGear.
I would like to know if there is a Word alternative to this, or if there are other solutions for generating Word documents?
Thanks,
Jelle
I've written a blog post series on generating Open XML WordprocessingML documents. The approach that I take is that you create a template Word document, insert content controls, and then write XPath expressions in those content controls to specify the XML to pull from a source XML data file. I've also explored another approach where you write C# code in Open XML content controls. That approach also works.
http://ericwhite.com/blog/map/generating-open-xml-wordprocessingml-documents-blog-post-series/
-Eric
You might look at http://docx.codeplex.com/
On Java, you could use docx4j. If you were brave, you could create DLLs for it via IKVM...
I decided to go with Aspose Words. It is really fast and not very demanding on resources (CPU, memory). It has the disadvantage that it is quite expensive. I also investigated SoftArtisans OfficeWriter. The possibilities are the same, but because the company I'm currently working for already used other Aspose components, we decided to go with Aspose Words.