I am trying to split a multi-paged PDF document that contains form fields (check boxes, text boxes). In case it is important to know, the PDF's form fields are initially all blank/empty/un-checked. I have tried using pdftk and pdfsam, but the form fields in the split pages do not work; they appear, but I am unable to use them. I can merge multiple PDF documents that contain form fields, and the final product has all the form fields functioning, so it works with merging.
I read the man page for pdftk, searched forums, and have begun reviewing the source (it seems to be written in C++ and Java, which are currently beyond my knowledge), but I have not found an answer to the question: How can I split a multi-page PDF document that contains form fields AND keep the form fields functioning in the split documents?
I would prefer a solution that uses pdftk, and/or any open-source solution.
Related
I want to merge two or more docx files (append them one after another) or move one part of a document (an XWPFParagraph) to another place.
The problem is that lists always break after such an operation. Say one document has a list with sequence numbers and another document has a list with bullets or letters. After the copy, all of the bullets become numbers (or worse, numbers that continue from where the previous list ended).
I have tried several solutions:
-traversing BodyElements and copying Paragraphs and Tables by hand, like here.
-attaching a new body to an existing one, like here.
Aside from page-scoped styles, both work well, but the lists never survive. Does that mean the list symbols are stored as page-scoped information (otherwise they would be copied successfully with the XWPFParagraph)? If so, why and where?
I have dug into the Javadoc: https://poi.apache.org/apidocs/dev/org/apache/poi/xwpf/usermodel/XWPFDocument.html
But I couldn't find anything about lists.
Word numberings (numbered lists, but also bulleted lists) in the Office Open XML file format are stored in /word/numbering.xml of the *.docx ZIP archive. There are abstractNum elements describing the list format and num elements referencing an abstractNum. The numId of a num element is referenced in paragraphs of /word/document.xml to set which numbering format is used in that paragraph. Paragraphs referencing the same numId are in the same list.
Paragraphs referencing different numId are in different lists.
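To make the num/abstractNum relationship concrete, here is a minimal Python sketch using only the standard library. The XML is a hypothetical, heavily trimmed stand-in for /word/numbering.xml (a real part has many levels and more attributes), and list_formats is an illustrative helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, trimmed-down stand-in for /word/numbering.xml
NUMBERING_XML = f"""
<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0"><w:numFmt w:val="decimal"/></w:lvl>
  </w:abstractNum>
  <w:abstractNum w:abstractNumId="1">
    <w:lvl w:ilvl="0"><w:numFmt w:val="bullet"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
  <w:num w:numId="2"><w:abstractNumId w:val="1"/></w:num>
</w:numbering>
""".strip()


def list_formats(numbering_xml):
    """Map each numId to the first-level number format of its abstractNum."""
    root = ET.fromstring(numbering_xml)
    ns = {"w": W}

    def attr(name):
        # Attributes are namespaced too, so build the {namespace}name key.
        return f"{{{W}}}{name}"

    # abstractNumId -> format of the first level found
    abstract_formats = {}
    for abstract in root.findall("w:abstractNum", ns):
        fmt = abstract.find("w:lvl/w:numFmt", ns)
        abstract_formats[abstract.get(attr("abstractNumId"))] = fmt.get(attr("val"))

    # numId -> format, resolved through the abstractNumId reference
    formats = {}
    for num in root.findall("w:num", ns):
        ref = num.find("w:abstractNumId", ns).get(attr("val"))
        formats[num.get(attr("numId"))] = abstract_formats[ref]
    return formats


print(list_formats(NUMBERING_XML))
```

A paragraph that references numId 1 would pick up the decimal format here, and one that references numId 2 the bullet format, which is why copying a paragraph without its numbering definitions changes how the list looks.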
In Apache POI, XWPFNumbering represents the document part /word/numbering.xml and XWPFAbstractNum represents an abstractNum.
So far there is no way to create an XWPFAbstractNum from scratch without using the low-level ooxml-schemas classes.
Also, as far as I know, there is no simple way to merge the /word/numbering.xml parts of different Word documents, because the differing Ids in /word/numbering.xml have to be reconciled along with every occurrence of them in /word/document.xml. This is very complex, and I do not know of any free library that does it properly.
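The Id handling mentioned above boils down to renumbering. The following Python sketch shows only that idea on plain data, not a working merge: the ids, the paragraph tuples, and the helper name remap_ids are all hypothetical, and a real merge would have to apply the same treatment to abstractNumId values, styles, and level overrides as well.

```python
def remap_ids(target_ids, source_ids):
    """Give every source id a fresh id that does not collide with a target id."""
    next_free = max(target_ids, default=0) + 1
    mapping = {}
    for old_id in sorted(source_ids):
        mapping[old_id] = next_free
        next_free += 1
    return mapping


# Hypothetical numIds already used by the document we append to
target_num_ids = {1, 2}
# Hypothetical numIds defined in the document being appended
source_num_ids = {1, 2, 3}

mapping = remap_ids(target_num_ids, source_num_ids)

# Paragraphs of the appended document as (text, numId) pairs;
# each numId reference must be rewritten with the same mapping.
source_paragraphs = [("First", 1), ("Second", 1), ("Bullet", 2)]
remapped = [(text, mapping[num_id]) for text, num_id in source_paragraphs]

print(mapping)
print(remapped)
```

Paragraphs that shared a numId before the remap still share one afterwards, so they stay in the same list, while no longer colliding with the lists of the target document.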
In general, as far as I know, there is no simple way to merge different Word documents, because of the complex storage in the Word file formats. All the approaches available in free code are only halfway useful (traversing and copying), if not outright wrong (simply attaching multiple document bodies one after the other).
I am creating a database in MS-Access to manage product specifications. Some products have overlapping specifications and some are unique.
My initial approach to this problem was to create multiple forms. For example, an extension cord specification needs copper, compound and plug information. This information is also needed for a power strip specification, along with other information that is ONLY pertinent to power strips. Currently I have created multiple forms for the different products. Is it possible to save multiple forms into one PDF for the same part number?
I know I can create one long form that is split across different pages. That could work if I could also skip saving or printing certain pages depending on the product. Any help is greatly appreciated.
I have shown in the picture below a portion of the forms and how some information is relevant only to the product class.
I know I could create a base form for every category of inventory, but then I would have to create a full specification form for over 10 products. I am trying to streamline and reduce the amount of front-end work required to generate specs from our database.
TLDR/Question
How can I best assign unique IDs to (ideally all) of the elements in the XML that describes a Word document such that I can read/write those unique IDs from a Word (2013) Add-In?
Additionally, solutions describing ways I can get a good diff of two Word documents might be helpful but this is not the primary question.
Background
I'm creating an application-level add-in for Word (2013) using VSTO. Part of my task involves diffing an original Word document W with a modified W' so that I can then process the diff for another task. While Word clearly has the capability for diffs/merges (available in the "Review" panel in Word 2013), thus far I have not been able to find a way to programmatically extract the diffs.
Therefore, I plan to get the XML for the documents (e.g. using Range.WordOpenXML) and diff them. There are a number of published algorithms for diffing XML documents (i.e. Diff(W.XML, W'.XML)) where the accuracy of the diff is largely dependent on being able to properly match the XML elements from the two documents.
Proposed Solution and Its Problems
Therefore, I'd like to be able to assign a unique ID for every element in the XML of the Word document that I can access from my Add-In. In this case a solution would be something like importing a custom namespace into the package called mynamespace and adding the attribute mynamespace:ID=*** for every element in the DOCX package. The attribute would then be accessible via Range.WordOpenXML.
However, simply using mce:Ignorable, mce:ProcessContent, and mce:PreserveAttributes as detailed at http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2012/09/21/markup-compatibility-and-extensibility.aspx does not work. The modified Word document loads without any issues, but I cannot find any of the attributes afterwards. In addition, saving the document removes all of the added markup.
From http://openxmldeveloper.org/discussions/formats/f/13/p/8078/163573.aspx it appears that this process of using custom xml via the Markup Compatibility and Extensibility (MCE) portion of the Office Open XML standard has become complicated over the years (patent issues, etc.). Therefore I'm guessing that my issues arise because Word's XML processor just removes all of the markup that it cannot natively process (maybe there is a way to hook into Word's XML processor and give it custom commands?).
For the future viewers:
1) There is no way to set any kind of id on most elements that will survive in Word (you can add custom tags or attributes, but once MS Word opens the document, they are gone).
2) Only two elements can be used as an id: content controls, which have ids, and bookmarks, whose name can serve as an id (a bookmark can be hidden by starting its name with an underscore; this works only from code).
3) If change tracking is enabled in Word, it is possible to see the diffs in the XML by using Range.WordOpenXML and getting the actual OpenXml from it, as explained here, for example.
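As a rough illustration of point 2, here is a Python sketch (standard library only) that pulls underscore-prefixed bookmark names out of WordprocessingML. The XML is a hypothetical, trimmed document body and hidden_bookmarks is an illustrative helper name. In the add-in itself the XML would come from Range.WordOpenXML, and the same lookup would be done with whatever XML API the add-in uses.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, trimmed-down WordprocessingML with one hidden and one visible bookmark
DOCUMENT_XML = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:bookmarkStart w:id="0" w:name="_id_intro"/>
      <w:r><w:t>Intro</w:t></w:r>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
    <w:p>
      <w:bookmarkStart w:id="1" w:name="Visible"/>
      <w:r><w:t>Body</w:t></w:r>
      <w:bookmarkEnd w:id="1"/>
    </w:p>
  </w:body>
</w:document>
""".strip()


def hidden_bookmarks(xml_text):
    """Return the names of bookmarks whose name starts with an underscore."""
    root = ET.fromstring(xml_text)
    names = []
    # iter() walks the whole tree, so the nesting depth does not matter.
    for start in root.iter(f"{{{W}}}bookmarkStart"):
        name = start.get(f"{{{W}}}name", "")
        if name.startswith("_"):
            names.append(name)
    return names


print(hidden_bookmarks(DOCUMENT_XML))
```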
I am generating and storing PDFs in a database.
The pdf data is stored in a text field using Convert.ToBase64String(pdf.ByteArray)
If I generate the same exact PDF that already exists in the database, and compare the 2 base64strings, they are not the same. A big portion is the same, but it appears about 5-10% of the text is different each time.
What would make 2 pdfs different if both were generated using the same method?
This is a problem because I can't tell if the PDF was modified since it was last saved to the db.
Edit: The 2 PDFs look exactly the same when viewing the actual PDF, but the base64 strings of their bytes are different.
Two PDFs that look 100% the same visually can be completely different under the covers. PDF producing programs are free to write the word "hello" as a single word or as five individual letters written in any order. They are also free to draw the lines of a table first followed by the cell contents, or the cell contents first, or any combination of these such as one cell at a time.
If you are actually creating the PDFs programmatically and you create two PDFs using completely identical code, you still won't get files that are 100% identical. There are a couple of reasons for this. The most obvious is that PDFs support creation and modification dates, which will change depending on when the file is created. You can override these (and confuse everyone else, so I don't recommend this) using something like this:
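// Assumes `writer` is an already-created iText (iTextSharp) PdfWriter.
var info = writer.Info;
// Pin both dates to a fixed value so they no longer vary between runs.
info.Put(PdfName.CREATIONDATE, new PdfDate(new DateTime(2001,01,01)));
info.Put(PdfName.MODDATE, new PdfDate(new DateTime(2001,01,01)));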
However, PDFs also support a unique identifier in the trailer's /ID entry. To the best of my knowledge iText has no support for overriding this parameter. You could duplicate your PDF, change this manually and then calculate your differences and you might get closer to a comparison.
Then there's fonts. When subsetting fonts, producers create a unique internal name based on the original name and an arbitrary selection of six uppercase ASCII letters. So for the font Calibri the font's name could be JLXWHD+Calibri one time and SDGDJT+Calibri another time. iText doesn't support overriding of this because you'd probably do more harm than good. These internal names are used to avoid font subset collisions.
So the short answer is that unless you are comparing two files that are physical duplicates of each other you can't perform a direct comparison on their binary contents. The long answer is that you can tweak some of the PDF entries to remove unique parts for comparison only but you'd probably be doing more work than it would take to just re-store the file in the database.
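For what the "long answer" could look like, here is a rough Python sketch that blanks the volatile parts named above (dates, the trailer /ID, subset font prefixes) before hashing. It is a sketch under strong assumptions, not a reliable comparison: it assumes those entries appear uncompressed in the raw bytes and in the simple forms the patterns expect, it ignores the embedded font programs themselves, and the sample byte strings are made-up fragments rather than real PDFs. The helper name fingerprint is hypothetical.

```python
import hashlib
import re

# (pattern, replacement) pairs for the parts that vary between otherwise equal files
VOLATILE = [
    # /CreationDate (D:...) and /ModDate (D:...)
    (re.compile(rb"/(CreationDate|ModDate)\s*\([^)]*\)"), rb"/\1()"),
    # trailer /ID [<hex><hex>]
    (re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"), rb"/ID[<><>]"),
    # six-letter subset prefix, e.g. /JLXWHD+Calibri
    (re.compile(rb"/[A-Z]{6}\+"), rb"/XXXXXX+"),
]


def fingerprint(pdf_bytes):
    """Hash the bytes after blanking dates, the /ID entry and subset prefixes."""
    for pattern, replacement in VOLATILE:
        pdf_bytes = pattern.sub(replacement, pdf_bytes)
    return hashlib.sha256(pdf_bytes).hexdigest()


# Made-up fragments standing in for two runs of the same generator
run_a = (b"<< /CreationDate (D:20010101000000) /ModDate (D:20010101000000) >> "
         b"/BaseFont /JLXWHD+Calibri trailer << /ID [<aa11><bb22>] >>")
run_b = (b"<< /CreationDate (D:20240305101500) /ModDate (D:20240305101500) >> "
         b"/BaseFont /SDGDJT+Calibri trailer << /ID [<cc33><dd44>] >>")
# Same as run_b but with a genuinely different font
run_c = run_b.replace(b"Calibri", b"Arial")

print(fingerprint(run_a) == fingerprint(run_b))
print(fingerprint(run_a) == fingerprint(run_c))
```

In practice content streams are usually compressed and a producer may vary in other ways too, so a real implementation would need a PDF library to parse the file first. That extra work is exactly why simply re-storing the file may be the cheaper option.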
!!! UPDATED !!!
We have spreadsheets of complex product data coming in from multiple sources (internal, customers, vendors).
Since the authorship is so diverse, it's impractical to try governing formatting details such as column order and the number of header-rows.
These CSV spreadsheets will be uploaded to our DB via an existing form.
(My first Zend_Form ... I'm almost done with it)
The user needs to see a sample from a given spreadsheet so they can Map the columns and start-row.
To achieve that, I need to generate an html table of that dynamic content, and weave the form elements in and around the table data.
The user would select which values are to be found in each column, and identify the first row of data (after any header rows).
CLICK HERE to see an example.
(NOTE: Most of my work here is under an NDA, so contrived examples are the best we can get :)
In this example, I'd expect the output to be:
$_POST = array('first_row' => 2, 'column0' => 'mi', 'column1' => 'lName', 'column2' => 'fName', 'column3' => 'gender')
With all those specifics mapped/defined, the uploaded spreadsheet can then be parsed and accurate data can be added to the product_history database.
Is ZF a good tool for this particular problem, or should I just write something from scratch?
How would you approach this?
I am finally JUST BARELY starting to get this ZF stuff straight in my head, and this one has got me totally lost :)
Any and All advice appreciated.
~ Mo
I think Zend_Form would be helpful in your case.
The tricky part to it is of course that your forms are going to be largely dynamically generated on-the-fly based on the header and first row content of the CSV file.
Whether you use Zend_Form, pure PHP, or some other solution, a lot of what you will be doing is the same (analyzing the CSV, providing dynamic inputs based on the CSV, and then error checking the selections). I think Zend_Form has the advantage of letting you do something like this very cleanly.
Given Zend_Form's nature, e.g. how it validates existing forms based on the elements added to the Zend_Form itself, you need to take a special approach with the form. Basically, after the user uploads the CSV once, you will create a Zend_Form object based on the number of columns, their positions in the CSV, and the name of the column.
Since you don't want to make the user upload the CSV multiple times if they make incorrect selections, I would parse the CSV into some sort of structure, maybe a simple object or array, and then build your Zend_Form based on that data. This way, you can save that structure to the session, so you can continue to regenerate the form based on the parsed data without having to read the file each time. This matters because the main challenge with Zend_Form and dynamic forms is that the form needs all of its elements and their properties not only when you display it, but also when you validate it and re-display the validated form.
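To illustrate the "parse once, keep the structure, build everything from it" idea independent of Zend_Form, here is a small Python sketch. The function names, the structure keys, and the sample data are all hypothetical. The mapping mirrors the POST data from the question, and treating first_row as a 1-based row number is an assumption.

```python
import csv
import io


def parse_upload(csv_text, sample_size=5):
    """Parse the CSV once into a plain structure that could live in the session."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    column_count = max((len(row) for row in rows), default=0)
    return {
        "rows": rows,
        "column_count": column_count,    # drives how many column selectors to render
        "sample": rows[:sample_size],    # rows shown to the user in the preview table
    }


def apply_mapping(parsed, post):
    """Turn the data rows into records using the user's column/start-row choices."""
    first_row = int(post["first_row"])   # assumed 1-based, as in the question's example
    fields = [post[f"column{i}"] for i in range(parsed["column_count"])]
    return [dict(zip(fields, row)) for row in parsed["rows"][first_row - 1:]]


# Contrived upload with one header row
upload = "Middle,Last,First,Gender\nQ,Public,John,M\nA,Doe,Jane,F\n"
post = {"first_row": 2, "column0": "mi", "column1": "lName",
        "column2": "fName", "column3": "gender"}

parsed = parse_upload(upload)
records = apply_mapping(parsed, post)
print(records)
```

Because both the preview and the final import read from the same parsed structure, a failed validation only needs the structure from the session to rebuild the form, not a second upload.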
I remember seeing this functionality many years ago in a PHP script, which I found is still available. Perhaps you could look at it for ideas. I won't post the link here since the screenshots and script are mostly adult website related and the site is NSFW for the most part, but it is called TGPX by JMBSoft. The 7th of the 8 screenshots on the main product page shows the import process where it lets the user map fields to data, which is exactly what you are doing.
Hope my advice is helpful, feel free to comment with any questions.