Cannot retrieve the field names from a PDF form using pdftk

When I run
pdftk my.pdf dump_data_fields >result.txt
the result file is empty.

Your file my.pdf may not be compatible with pdftk. Convert the file first using the following command:
>pdftk my.pdf output my_converted.pdf
Then try:
>pdftk my_converted.pdf dump_data_fields > result.txt
I took this from http://www.fpdf.org/en/script/script93.php, which suggests the conversion for cases where field values won't write to the PDF file, so converting before dumping the fields may not help.
If your PDF has fields, it should be fillable in your PDF viewer. If it isn't fillable, it most likely has no fields.

This is most likely because the PDF you are using doesn't have any data fields to dump. Open the PDF in a tool like Adobe Acrobat, go to its Edit Fields mode, and add fields wherever you need them to show up. Make sure they are named so that you can refer to them through the attributes[] call in pdftk.
I recommend snake case for the names (e.g. a text box named 'first_name'); you should then have access to it using attributes[:first_name] = 'your text'.
Hope this helps, let me know if you have any other questions/issues.

Related

Manipulating Microsoft Word DOCX files that have links and track changes using Python

I have been using the excellent python-docx package to read, modify, and write Microsoft Word files. The package supports extracting the text from each paragraph. It also allows accessing a paragraph a "run" at a time, where the run is a set of characters that have the same font information. Unfortunately, when you access a paragraph by runs, you lose the links, because the package does not support links. The package also does not support accessing change tracking information.
My problem is that I need to access change tracking information. Or, more specifically, I need to copy paragraphs that have change tracking indicated from one document to another.
I've tried doing this at the XML level. For example, this code snippet appends the contents of file1.docx to file2.docx:
from docx import Document
doc1 = Document("file1.docx")
doc2 = Document("file2.docx")
doc2.element.body.append(doc1.element.body)
doc2.save("file2-appended.docx")
When I try to open the resulting file on my Mac, I get this error for complicated files:
But if I click OK, the contents are there. The manipulation also works without problem for very simple files.
What am I missing?
The .element attribute is really an "internal" interface and should be named ._element. In most other places I have named it that. What you're getting there is the root element of the document part. You can see what it is by calling:
print(doc2.element.xml)
That element has one and only one w:body element below it, which is what you get with doc2.element.body (.xml will work on that too, btw, if you want to inspect that element).
What your code is doing is appending one w:body element as the last child of another w:body element, thereby forming invalid XML. The WordprocessingML vocabulary is quite strict about which elements can follow one another, how many of each are allowed, and so forth. The only surprise for me is that it actually sometimes works for you, I take it :)
If you want to manipulate the XML directly, which is what the ._element attribute is there for, you need to do it carefully, in view of the (complex) WordprocessingML XML Schema.
Unlike when you stick to the published API, there's no safety net once ._element (or .element) appears in your code.
The body XML can also contain relationships to other document parts, such as images and hyperlinks. These are only valid within the document in which they appear. This might explain why some files can be repaired.
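To illustrate the point about not nesting one w:body inside another, here is a minimal sketch that copies the children of one body into the other instead. It uses only the standard library's xml.etree.ElementTree on hand-built XML rather than python-docx, and make_body is a hypothetical helper for the demo. The assumption is that the same loop would carry over to the lxml elements behind ._element, which offer the same iteration and insert interface. It does not deal with the relationship problem mentioned above, so copied images or hyperlinks would still point at parts that only exist in the source document.

```python
import copy
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)
SECT_PR = f"{{{W}}}sectPr"


def append_body_content(target_body, source_body):
    """Copy the children of source_body into target_body.

    Children are inserted before target_body's trailing w:sectPr (if any),
    and the source's own w:sectPr is skipped, so the target keeps a single
    body-level w:sectPr as its last child.
    """
    children = list(target_body)
    if children and children[-1].tag == SECT_PR:
        position = len(children) - 1
    else:
        position = len(children)
    for child in source_body:
        if child.tag == SECT_PR:
            continue  # keep only the target's section properties
        target_body.insert(position, copy.deepcopy(child))
        position += 1
    return target_body


def make_body(text):
    # Hypothetical demo helper: a tiny w:body with one paragraph and a w:sectPr.
    xml = (
        f'<w:body xmlns:w="{W}">'
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>"
        "<w:sectPr/></w:body>"
    )
    return ET.fromstring(xml)


merged = append_body_content(make_body("first"), make_body("second"))
print([child.tag.split("}")[1] for child in merged])  # ['p', 'p', 'sectPr']
```

Because each paragraph is deep-copied as a whole, whatever markup sits inside it travels with it, which is why this route looks plausible for paragraphs carrying change tracking. Whether Word accepts the result still depends on the rest of the schema and on the relationships being valid.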

Visually identify name of field in PDF form

I know some similar issues exist (Find the field names of inputtable form fields in a PDF document?) but my question is different:
I have all the field names (in fdf file).
I would like to identify each field visually, directly on the PDF.
With Acrobat I would expect to be able to right-click a field and select something like "display the name of the field", but I can find no such option.
Can someone help me?
OK, I have found a PDF editor where this is possible (Acrobat Pro can probably do it too):
http://www.pdfescape.com/
Right-click the field and choose unlock, then right-click again and choose get properties.
If you're using Apache PDFBox to fill the form automatically, you can use it to fill all text fields with their name:
// Assumes PDFBox 2.x; "in" is the input file or stream.
final PDDocument document = PDDocument.load(in);
final PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
// getFieldTree() also visits nested fields; getFields() returns only the top level.
for (PDField f : acroForm.getFieldTree()) {
    System.out.println(f.toString());
    if (f instanceof PDTextField) {
        // Use each text field's own name as its visible value.
        f.setValue(f.getFullyQualifiedName());
    }
}
document.save(...);
When you open the generated PDF, you'll be able to identify each field immediately like you asked.
There's a free tool that does this job quite well.
sudo apt install pdftk
You can use pdftk's dump_data_fields to get all fields like this:
pdftk sample.pdf dump_data_fields output fields.txt
Dump would look something like this:
FieldType: Text
FieldName: CRIssue
FieldFlags: 8388608
FieldValue: Issue with something ---- if it is filled
FieldJustification: Center
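If you want to work with the dump programmatically, a minimal sketch of a parser could look like the following. It assumes that the dump is made of "Key: Value" lines, that records are separated by lines containing only ---, and that FieldStateOption may repeat within a record. The function name parse_dump_data_fields and the second sample field (Approved) are made up for this illustration.

```python
def parse_dump_data_fields(text):
    """Parse the text of a pdftk dump_data_fields dump into a list of dicts."""
    fields = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "---":
            # Assumed record separator: close the current record, if any.
            if current:
                fields.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: Value" line
        key = key.strip()
        value = value.strip()
        if key == "FieldStateOption":
            # A field can list several states, so collect them all.
            current.setdefault(key, []).append(value)
        else:
            current[key] = value
    if current:
        fields.append(current)
    return fields


# Sample dump; the second field is invented for the example.
sample = """---
FieldType: Text
FieldName: CRIssue
FieldFlags: 8388608
FieldJustification: Center
---
FieldType: Button
FieldName: Approved
FieldFlags: 0
FieldStateOption: Off
FieldStateOption: Yes
"""

fields = parse_dump_data_fields(sample)
print([f["FieldName"] for f in fields])  # ['CRIssue', 'Approved']
```

An empty list from such a parser would match the case discussed earlier, where the PDF simply has no form fields to dump.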
If you're looking for a tool where you can load your editable PDF file, click on an input field (text or checkbox) and get basic info such as its name, try Master PDF Editor: https://code-industry.net/masterpdfeditor/. It's available cross-platform.

Purpose for Word Open XML and content controls binding

For Word report generation, I am looking at binding XML to content controls to see whether it is any easier than using Word Interop with hardcoded index references to the content controls to assign values to them.
However, I don't really understand how to do it.
My workflow is to enter information in Excel and then generate an XML file that populates the content controls. However, what I have read describes the other way round: the Word Content Control Toolkit, and descriptions in which the XML is populated by the user entering information in Word, after which a programmer unzips the docx file to retrieve the XML file.
How can I populate content controls with XML?
There are samples on generating Word documents from Word templates, XML and data-bound content controls at http://worddocgenerator.codeplex.com/
Set up the mapped content controls in the 'template' docx using the content control toolkit or similar. Do this using a sample XML file containing your Excel data.
Now that you have that template document, at run time you can inject your XML file into it (i.e. replace the custom XML part it contains with your instance data), in C# or Java or whatever.
When the user opens the document in Word 2007/2010, the information in the custom XML part will automatically be copied into the bound controls, and visible to the user.
Note that content control data binding doesn't easily support repeating data (eg populating table rows) in Word 2007/2010, though there are ways to do it.

Exporting data in enhanced Grid to csv or xml format using dojo

In my project we use the dojo framework for the UI. We need to export the data in the EnhancedGrid to Excel/CSV files. The dojo toolkit example writes the exported text into a textarea, but I need those values in an Excel/CSV file. Can anyone help with this? If possible, please tell me how to export the EnhancedGrid data to Excel/CSV files.
If you are already using the Enhanced Data Grid, you should be able to include the exporter plugin - dojox.grid.enhanced.plugins.exporter.CSVWriter - to get the CSV text.
This will give you access to two main functions exportGrid and exportSelected that will take the contents and export them as CSV text.
Unfortunately that doesn't get them as a separate file (click to download), just the formatted text in a textarea (or whatever).
To get a "click to download CSV" function, you could write a servlet/JSP proxy that takes a POST from your page, with the CSV text (from the plugin above) as a form field, and simply copies it back out with the correct headers so that the browser treats it as an attachment:
response.setContentType("text/csv");
response.setHeader("Content-Disposition", "attachment;filename=name.csv");
This requires something on the server side, though, and at that point you may want to consider having a servlet simply produce the CSV text directly.
http://dojotoolkit.org/reference-guide/dojox/grid/EnhancedGrid/plugins/Exporter.html
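The echo-proxy idea from this answer is not tied to servlets. As a rough sketch of the same technique in Python, here is a small WSGI application that reads a POSTed form field and returns it with attachment headers. The field name csv and the function name csv_download_app are assumptions made for the example, and the request is simulated in-process instead of coming from a browser.

```python
import io
from urllib.parse import parse_qs, urlencode


def csv_download_app(environ, start_response):
    """WSGI app: echo the POSTed 'csv' form field back as a file download."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    body = environ["wsgi.input"].read(length).decode("utf-8")
    # 'csv' is a hypothetical field name; use whatever your form posts.
    csv_text = parse_qs(body).get("csv", [""])[0]
    payload = csv_text.encode("utf-8")
    headers = [
        ("Content-Type", "text/csv; charset=utf-8"),
        ("Content-Disposition", 'attachment; filename="name.csv"'),
        ("Content-Length", str(len(payload))),
    ]
    start_response("200 OK", headers)
    return [payload]


# Simulate the form POST that the page would send.
form_body = urlencode({"csv": "id,name\n1,Ann\n"}).encode("ascii")
environ = {
    "REQUEST_METHOD": "POST",
    "CONTENT_LENGTH": str(len(form_body)),
    "wsgi.input": io.BytesIO(form_body),
}
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


result = b"".join(csv_download_app(environ, start_response))
print(captured["headers"]["Content-Disposition"])
```

On the page side, the CSV text from the exporter plugin would go into a hidden form field that is submitted to this endpoint. For local experiments the app could be served with wsgiref.simple_server from the standard library.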

Making a PDF output in raster format instead of vector using itextsharp

I have written C# code to save product specifications to a PDF document using iTextSharp, mainly with PdfPTable and Chunks/Paragraphs in the PdfPCells. However, I have been told that the output is unacceptable because you can highlight and copy the text from the document, and because the document storage and retrieval server software they currently use does not support "vector" based PDFs. I'm not exactly certain what the difference is between a raster PDF and a vector PDF. Basically, every page of the PDF document should be an image so that the text cannot be highlighted.
Is there any way to do this without using the DirectContent? Below is an image illustrating a portion of the PDF that was created, and how the text can be selected and copied, which is the incorrect functionality.
I would like to avoid writing directly to the canvas, unless there is a way to do this and still have iTextSharp handle my formatting and proper paging.
The Windows application PDF2R works well, but doesn't seem to offer any programmatic solution. I have found libraries that claim to do this sort of conversion, but they cost several thousand dollars. I'd like to stay within my budget and use iTextSharp or something much cheaper than that.
I would suggest you try to generate an image using the System.Drawing namespace and then insert that into the PDF document.
Call this code on your PdfWriter object:
writer.SetEncryption(PdfWriter.STRENGTH40BITS, null, null, PdfWriter.AllowPrinting);
This won't prevent users from selecting text, but it will prevent them from copying and pasting it. Give it a try.