PDFs are not displaying apostrophes in field data inserted by iTextSharp - itext

I am using iTextSharp to fill pre-defined fields on an existing PDF document using the following syntax:
PdfStamper stamper = new PdfStamper(reader, stream);
stamper.AcroFields.SetField("A","O'Henry");
stamper.FormFlattening = true;
stamper.Close();
Unfortunately, apostrophes (and likely other forms of common punctuation) are not displayed in the output PDF. For instance, in the code above, field "A" displays the text "OHENRY" instead of "O'HENRY".
How do I get the output PDF to display the text including the apostrophes?
Also, please note that I do not have control over creating/modifying the original PDF being filled. I was given the PDF from an external source and will likely be given new versions of the PDF as the form changes.
Thanks!

An easy workaround is to replace each single quote with a backtick (`).

I found a solution here http://www.nabble.com/Populating-form-fields-with-Unicode-data-td21610346.html.
This solution involves embedding into each field a font that can handle the desired characters.

Related

"Missing" glyphs in iTextSharp 5 AcroFields [PowerShell]

I am using iTextSharp 5.5.13.2 in PowerShell. I'm trying to fill out a PDF form with a pre-encoded string of characters, which will be read as a Code128 barcode once it is displayed in a custom font. The template was made in InDesign and edited in Acrobat (Pro) to provide properties such as text alignment, text size and font.
I am using the code below to fill out appropriate fields of the template:
$pdfReader = New-Object iTextSharp.text.pdf.PdfReader("path\to\template.pdf")
$currentRecord = $kat7_List[$l]
$currentId = $currentRecord.Split("`t")[0]
$currentCode = $currentRecord.Split("`t")[1]
$pdfStamper = New-Object iTextSharp.text.pdf.PdfStamper($pdfReader, [System.IO.File]::Create("$($tempPath)\$($l).pdf"))
$pdfStamper.AcroFields.SetField("id_Field", $currentId) | Out-Null
$pdfStamper.AcroFields.SetField("kod_Field", $currentCode) | Out-Null
$pdfStamper.FormFlattening = 0
$pdfStamper.Close()
$pdfReader.Close()
The problem is that when I open the output file, the "kod_Field" AcroField is missing two glyphs, Ò and Ó (character codes 0210 and 0211 in decimal, U+00D2 and U+00D3 in Unicode), which correspond to the start and end characters in this particular Code128 representation, as pictured here:
[Image: barcode before clicking]
When I click to edit the field in Acrobat, though, it suddenly regains these glyphs and the code works (they are present in the font; I use this system successfully in InDesign with exactly the same encoded string values):
[Image: barcode after clicking]
The coded strings are in a UTF-16LE encoded text file and should stay that way. I have tried setting a BaseFont for the stamper with BaseFont.IDENTITY_H, .CP1250 and .CP1252; in each case the code came out even more "mangled" than now.
FormFlattening should be switched to true in the final version, since I don't want open form fields in the final product. It flattens with missing glyphs, of course, but for testing purposes I leave the flattening set to 0.
Which properties should I verify, and what method governs encoding in this example? I thought that the .SetField() method would do the trick without much problem, since it is provided with a properly encoded value. What might be the cause of such behavior?
Please help me out, regards.
--- Addendum, 17.08.2021 ---
I have temporarily changed the font to Myriad Pro in the template's kod_Field. It filled out properly, with no glyphs missing and proper string value. When I changed it back to code font via Acrobat in the finished pdf, it didn't have any missing glyphs. I really want to avoid using an additional operation (setting "textfont" property on the AcroField after setting the value) - because it should already be working "out of the box".
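When narrowing down a problem like this, it can help to confirm exactly which code points the encoded string contains and whether a given single-byte code page can represent them at all. The Python sketch below is only a diagnostic aid that runs outside iTextSharp; the helper names `code_points` and `encodable` are made up for illustration, and the check says nothing about which glyphs the field's font actually carries.

```python
# Illustrative diagnostic only; not part of iTextSharp.
START_STOP = "\u00d2\u00d3"  # the two characters named in the question


def code_points(text):
    """Return the Unicode code point of each character as 'U+XXXX'."""
    return ["U+%04X" % ord(ch) for ch in text]


def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(code_points(START_STOP))
    # Compare the code pages tried in the question with the file's encoding.
    for name in ("cp1252", "cp1250", "utf-16-le"):
        print(name, encodable(START_STOP, name))
```

If a character cannot be encoded in the code page used for the field's font, it cannot reach the appearance stream intact, whatever the source file's encoding is.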

How to convert a PDF to the AcroForm type?

I use pdftk for filling forms. When I run
F:\GoogleDisk\projects\comparepdfs>pdftk new/file.pdf fill_form new/b2bf7150aa9de8b2ef8edd20a5677f7f.fdf output new/temp_b2bf7150aa9de8b2ef8edd20a5677f7f.pdf
it returns
Warning: input PDF is not an acroform, so its fields were not filled.
How can I fix this, or convert the PDF to an AcroForm?
I solved it.
I used Combine Files in Acrobat, which creates a new PDF.
The new PDF works.
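A PDF with interactive form fields references them through an /AcroForm entry in its document catalog, so a rough first check before running pdftk is to see whether the file mentions that key at all. The Python sketch below is a crude heuristic, not a PDF parser: the function names are hypothetical, and it can give a false negative when the catalog sits inside a compressed object stream, or a false positive when the text appears elsewhere in the file.

```python
def mentions_acroform(pdf_bytes):
    """Crude check: does the raw file data contain an /AcroForm key?"""
    return b"/AcroForm" in pdf_bytes


def file_mentions_acroform(path):
    """Read a file in binary mode and apply the same crude check."""
    with open(path, "rb") as handle:
        return mentions_acroform(handle.read())
```

For example, `file_mentions_acroform("new/file.pdf")` returning False would be consistent with the pdftk warning above.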

Get text properties from PDF file

How can I get text properties using PDF::API2 or CAM::PDF? I need font size and style info.
Something like (from CAM::PDF)
$pdf->getPageContent(1);
but with text info in it.
You can extract text from a PDF with these modules:
PDF::API2
CAM::PDF
CAM::PDF::PageText
From the CAM::PDF::PageText documentation on CPAN:
use CAM::PDF;
use CAM::PDF::PageText;
# $filename holds the path to the PDF
my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
UPDATE
Read a bit more at http://search.cpan.org/dist/CAM-PDF/lib/CAM/PDF.pm
But there are methods like:
$self->getFontNames(pagenum)
And others which may prove helpful.

Writing CR+LF into Open XML from a Database

I'm trying to take some data stored in a database and populate a Word template's Content Controls with it using the Open XML SDK. The data contains paragraphs and so there are carriage return and line feed characters in it. The data is stored in the database as nvarchar.
When I open the generated document, the CR+LF combination shows up as a question mark with a box around it (I'm not sure of the name of this character). There are actually two sequences back to back, so CR+LF CR+LF produces two strange characters.
If I unzip the .docx, take the Custom XML part and do a hex dump, I can clearly see 0d0a 0d0a so the CR+LF is there. Word is just printing it weird.
I've tried enforcing UTF-8 encoding in my XmlWriter's settings, but that didn't seem to help:
Imports System.IO
Imports System.Text
Imports System.Xml
' Write UTF-8 without a byte order mark
Dim docStream As New MemoryStream()
Dim settings As New XmlWriterSettings()
settings.Encoding = New UTF8Encoding(False)
Dim docWriter As XmlWriter = XmlWriter.Create(docStream, settings)
Does anyone know how I can get Word to render these characters correctly when written to a .docx through the Open XML SDK?
To bind to a Word 2013 rich text control, your XML element has to contain a complete docx. See [MS-DOCX]:
the data stored in the XML element will be an escaped string comprised of a flattened WordprocessingML document representing the formatted data in the structured document tag range.
Earlier versions couldn't bind a rich text control.
Things should work though (with CR/LF, not w:br), if you bind to a plain text control, with multiline set to true.

Formatting Field values using itextsharp

How can I format part of a string, such as "i am fine here", using iTextSharp?
fields.SetField("tgPara2", message2);
Here message2 is "i am fine here", and I want only the word "fine" to be bold.
Any help would be appreciated.
iText has partial support for "rich text values" for text fields. You can get and set the rich values, but iText won't actually draw those values properly. You need to turn off SetGenerateAppearances, and open the PDF in Acrobat/Reader to see the rich text.
This means flattening isn't going to work (unless you open the PDF in Acrobat, then save it again... clunky).
You might want to check out the PDF Specification (Section 12.7.3.4, Rich Text Strings) for further information on what is and isn't legal. <b> is legal, as is the font-weight CSS style.
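To make the shape of such a rich text value concrete, here is a minimal Python sketch that builds an XHTML fragment with one word wrapped in <b>. The function name `bold_one_word` is hypothetical, and the bare <body> wrapper with only the XHTML namespace is an assumption; check the specification section above for any further attributes a viewer may expect. The resulting string is what would be handed to the field as its rich value.

```python
from html import escape

XHTML_NS = "http://www.w3.org/1999/xhtml"


def bold_one_word(sentence, word):
    """Wrap the first occurrence of word in <b>...</b> inside an XHTML body."""
    before, found, after = sentence.partition(word)
    if not found:
        # Word not present: emit the escaped sentence without any bold run.
        inner = escape(sentence)
    else:
        inner = "%s<b>%s</b>%s" % (escape(before), escape(word), escape(after))
    return '<body xmlns="%s"><p>%s</p></body>' % (XHTML_NS, inner)
```

Escaping the text parts keeps characters such as < and & from breaking the markup.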