I am using iText's LocationTextExtractionStrategy to retrieve text from PDFs.
I have read two different PDFs and debugged both.
For the first one, I found that the
public void renderText(TextRenderInfo renderInfo)
method receives the text in groups of words.
E.g., I have a PDF with the content
ACCOUNT TYPE A/C. BALANCE (I) FIXED DEPOSITS (LINKED) BAL. (II)
and the renderText method receives the text in a loop like:
ACCOUNT TYPE, then A/C. BALANCE (I), and then FIXED DEPOSITS (LINKED) BAL. (II)
When I debug the second PDF, the text is rendered letter by letter. E.g., I have the content:
Date Details Withdrawals
and the renderText method receives the text in a loop:
D, then a, then t, then e, and so on.
How does it render the text? Why does it sometimes iterate word by word, sometimes by groups of words, and sometimes letter by letter?
The iText parsing framework forwards the atomic strings used in arguments of PDF text drawing operations.
Thus, if the PDF draws text letter by letter, you'll receive one TextRenderInfo instance per letter. If it draws text word by word, you'll receive one instance per word.
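To illustrate, here is a minimal, self-contained sketch that does not use iText at all. ChunkCollector and its renderText(String) method are hypothetical stand-ins for a listener that receives the text of each TextRenderInfo. It only shows that the listener gets one callback per drawn string, so the granularity depends on how the PDF was produced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a render listener; not an iText class.
final class ChunkCollector {
    private final List<String> chunks = new ArrayList<>();

    // Called once per string shown by a text drawing operation.
    void renderText(String text) {
        chunks.add(text);
    }

    int callbackCount() {
        return chunks.size();
    }

    String joined() {
        return String.join("", chunks);
    }

    // Forwards each drawn string to a fresh collector, one callback per string.
    static ChunkCollector replay(String... drawnStrings) {
        ChunkCollector collector = new ChunkCollector();
        for (String s : drawnStrings) {
            collector.renderText(s);
        }
        return collector;
    }
}
```

A PDF that draws "Date ", "Details ", "Withdrawals" as three strings produces three callbacks, while one that draws "D", "a", "t", "e" separately produces four callbacks for the first word alone.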
I'd like to include an image in a mail-merged Word document based on the presence of a single value in a column that contains several values.
E.g., if the cell contains the value BOB, insert the image; if it contains any other value, do nothing.
Most of the { INCLUDEPICTURE } functionality seems built around including a different image based on a filename matching a cell value.
{ INCLUDEPICTURE "{ MERGEFIELD Selection_identifier }.png" \* MERGEFORMAT \d }
works provided I translate Selection_identifier in the spreadsheet itself, but there has to be a better way. There seems to be little information about this particular use case online.
If you are only using a single image and it does not vary between merges, you should probably just use
{ IF "{ MERGEFIELD Selection_identifier }" = "BOB" "<the_image>" }
where <the_image> is a copy of the actual image, sized how you want, pasted between those quotation marks. In that case, there would be no need for an INCLUDEPICTURE field or a reference to an external image file.
As usual, all the { } have to be the special field code brace pairs that you can insert in Windows desktop Word using Ctrl-F9 or similar.
Hi All,
This is a question related to iTextSharp version 5.5.13.1. I am using a custom LocationTextExtractionStrategy implementation to extract sensible words from a PDF document. I am calling the GetSingleSpaceWidth method of TextRenderInfo to determine when to join two adjacent blocks of characters into a single word, as per the SO question
itext java pdf to text creation
This approach has generally worked well. However, if you look at the attached document, the words "Credit" and "Extended" are giving me some problems.
Why do all the characters shown encircled in the screen capture return a zero value for GetSingleSpaceWidth? This causes a problem: instead of two separate words, my logic returns one word, "CreditExtended".
I understand that iTextSharp 5 is not supported any more. Any suggestions would be highly appreciated.
Sample document
https://drive.google.com/open?id=1pPyNRXvnUyIA2CeRrv05-H9q0sTUN97d
As already conjectured in a comment, the cause is that the font in question does not contain a regular space glyph, or even more exactly, does not map any of its glyphs to the Unicode value U+0020 in its ToUnicode map.
If a font has a ToUnicode map, iText uses only the information from that map. Thus, iText does not identify a space glyph in that font, so it cannot provide the actual SingleSpaceWidth value and returns 0 instead.
The font in question is named F5 and has this ToUnicode map:
/CIDInit /ProcSet findresource begin
14 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<0004> <0041>
<0012> <0043>
<001C> <0045>
<002F> <0049>
endbfchar
1 beginbfrange
<0044> <0045> <004D>
endbfrange
13 beginbfchar
<0102> <0061>
<0110> <0063>
<011A> <0064>
<011E> <0065>
<0150> <0067>
<015D> <0069>
<016F> <006C>
<0176> <006E>
<017D> <006F>
<0189> <0070>
<018C> <0072>
<0190> <0073>
<019A> <0074>
endbfchar
5 beginbfrange
<01C0> <01C1> <0076>
<01C6> <01C7> <0078>
<0359> <0359> [<2026>]
<035A> <035B> <2018>
<035E> <035F> <201C>
endbfrange
1 beginbfchar
<0374> <2013>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
As you can see, there is no mapping to <0020>.
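The same check can be done mechanically. The following is a rough sketch, not iText code: ToUnicodeSpaceCheck and mapsTo are hypothetical names, and the parser assumes one entry per line and single code point destinations, as in the map quoted above. It scans the bfchar and bfrange sections for a destination equal to a given code point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a simplified reader for ToUnicode CMap text.
final class ToUnicodeSpaceCheck {
    private static final Pattern HEX = Pattern.compile("<([0-9A-Fa-f]+)>");

    private ToUnicodeSpaceCheck() {}

    // Returns true if some bfchar or bfrange entry maps a code to the target code point.
    static boolean mapsTo(String cmap, int target) {
        boolean inChar = false;
        boolean inRange = false;
        for (String raw : cmap.split("\\R")) {
            String line = raw.trim();
            if (line.endsWith("beginbfchar")) { inChar = true; continue; }
            if (line.endsWith("beginbfrange")) { inRange = true; continue; }
            if (line.equals("endbfchar")) { inChar = false; continue; }
            if (line.equals("endbfrange")) { inRange = false; continue; }
            if (!inChar && !inRange) continue;

            // Collect all <hex> tokens on this line.
            List<Long> values = new ArrayList<>();
            Matcher m = HEX.matcher(line);
            while (m.find()) values.add(Long.parseLong(m.group(1), 16));

            if (inChar && values.size() == 2 && values.get(1) == target) return true;
            if (inRange && values.size() >= 3) {
                if (line.contains("[")) {
                    // Array form: every destination is listed explicitly.
                    for (int i = 2; i < values.size(); i++) {
                        if (values.get(i) == target) return true;
                    }
                } else {
                    // Plain form: destinations count up from the third value.
                    long span = values.get(1) - values.get(0);
                    long first = values.get(2);
                    if (target >= first && target <= first + span) return true;
                }
            }
        }
        return false;
    }
}
```

Calling mapsTo(cmapText, 0x20) on the map above yields false, which matches the observation that no glyph is mapped to the space character.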
The use of fonts in this PDF page is quite funny, by the way:
Its body is (mostly) drawn using Calibri, but it uses two distinct PDF font objects for this, F4 which uses WinAnsiEncoding from character 32 through 122, i.e. including the space glyph, and F5 which uses Identity-H and provides the above quoted ToUnicode map without a space glyph. Each maximal sequence of glyphs without gap is drawn separately; if that whole sequence can be drawn using F4, that font is used, otherwise F5 is used.
Thus, "CMI", "(Credit", and "sub-indexes" are drawn using F4, while "I’ve", "“Credit", and "Extended”" are drawn using F5.
In your problem string “Credit Extended”, therefore, we see two consecutive sequences drawn using F5. Thus, you'll get a 0 SingleSpaceWidth both for the t of “Credit and the E of Extended”.
At first glance these are the only two consecutive sequences using F5, so you have that issue only there.
As a consequence you should develop a fallback strategy for the case of two consecutive characters both coming with a 0 SingleSpaceWidth, e.g. using something like a third of the font size.
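A minimal sketch of such a fallback, written in Java against plain floats rather than the iText API: SpaceWidthFallback and its methods are hypothetical names, and the "gap larger than half a space width" threshold is an assumption you would tune for your documents.

```java
// Hypothetical helper; the inputs would come from your extraction strategy.
final class SpaceWidthFallback {
    private SpaceWidthFallback() {}

    // Uses the reported width when it is positive, otherwise a third of the font size.
    static float effectiveSpaceWidth(float reportedSingleSpaceWidth, float fontSize) {
        return reportedSingleSpaceWidth > 0f ? reportedSingleSpaceWidth : fontSize / 3f;
    }

    // Decides whether the horizontal gap between two adjacent chunks separates two words.
    static boolean isWordBreak(float gap, float prevSpaceWidth, float nextSpaceWidth, float fontSize) {
        // The maximum is 0 only when both neighbours report 0, which triggers the fallback.
        float reported = Math.max(prevSpaceWidth, nextSpaceWidth);
        float spaceWidth = effectiveSpaceWidth(reported, fontSize);
        // Assumption: a gap wider than half a space counts as a word break.
        return gap > spaceWidth / 2f;
    }
}
```

With a 12 pt font and both neighbours reporting 0, the fallback space width is 4, so a gap of 2.5 between “Credit and Extended” would be treated as a word break.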
Let's set up the scenario. This is a text file:
INTRODUCTION
Paragraph 01
Paragraph 02
Paragraph 03
SECTION XXXXXX
Paragraph 01
Paragraph 02
SECTION YYYYYY
Paragraph 01
SECTION ZZZZZZ
Paragraph 01
Paragraph 02
Paragraph 03
Each paragraph can contain more paragraphs inside, but let's keep things simple.
We want to programmatically build text files like this one by following simple rules:
Sections are always present, containing at least one paragraph
Any paragraph can be shown or not based on one or more conditions. Conditions are defined as code evaluated against a context (think about eval function in Python, for example).
Context will be provided at runtime
Paragraphs can start with a number. So we cannot store the numbers as part of the paragraph text: paragraphs may or may not be present, and their numbers must follow a correct sequence (1, 2, 3, etc.).
The texts to be built are terms of use, privacy policies, etc. Put simply, they are legal texts containing more or less content depending on web form responses.
EDIT: The text is generated separately from the forms. We only have the responses.
So, my approach is to store the chunks of text (paragraphs) as columns of a database. Each column along with:
Position inside its section
Code to evaluate as condition(s). The context to evaluate that code against will be provided at runtime, as said.
As we can have one or more conditions to evaluate to determine if a chunk of text will be included or not in the final text, I'm not sure about what kind of data structure to use.
Relational database? The number of columns would be dynamic, due to the existence of an initially unlimited number of conditions to be evaluated in each case.
NoSQL database, storing the structure as JSON containing the text plus an array of conditions?
Any other approach?
I solved the problem with a different approach.
We build the text file in the frontend by dragging and dropping two types of widgets onto a canvas (representing the content of the text file):
Simple text chunks
Conditional widgets: they'll show/hide a text chunk based on a condition evaluated in the frontend at runtime (that's the key). Something similar to Angular ng-if.
Then we compile the full text, also in the frontend.
So, the backend just stores chunks of text. No need to translate conditional logic to the database. It even sounds silly now...
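The idea is independent of the frontend framework. As a rough illustration in Java (all names here, such as ConditionalText and Chunk, are hypothetical, and the real solution runs in the browser), each chunk carries a condition evaluated against a runtime context, and numbering is applied only to the chunks that survive:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of "text chunk + condition" compilation.
final class ConditionalText {
    private ConditionalText() {}

    // A paragraph together with the condition that decides whether it is shown.
    static final class Chunk {
        final String text;
        final Predicate<Map<String, Object>> condition;

        Chunk(String text, Predicate<Map<String, Object>> condition) {
            this.text = text;
            this.condition = condition;
        }
    }

    // Keeps the chunks whose condition holds and numbers them 1, 2, 3, ... in order.
    static List<String> compile(List<Chunk> chunks, Map<String, Object> context) {
        List<String> out = new ArrayList<>();
        int number = 0;
        for (Chunk chunk : chunks) {
            if (chunk.condition.test(context)) {
                number++;
                out.add(number + ". " + chunk.text);
            }
        }
        return out;
    }
}
```

Because the numbers are generated at compile time, a hidden paragraph never leaves a hole in the sequence, and the stored chunks stay free of both numbers and conditions.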
I would like to create a Word document with multiple text boxes that serve as name badges. Each name appears twice, once set normally and once rotated by 180 degrees. Later, they are printed on paper, cut and folded, so they can stand on a table.
I am using docx4j to generate the DOCX file. My idea is to have a fragment in one Word file that serves as a template for a single name badge. I'd like to load that template and fill the placeholders with real names. Multiple fragments are then concatenated and written to a second Word template, so ultimately I have a list with multiple name badges. The paper I use allows two name badges on each page (i.e. four text boxes).
However, I have failed to implement this with docx4j. This is what I tried:
(1) First, I tried filling a single name badge. The placeholders got filled correctly, but the rotated text box of my Word template (input) became an ordinary (not rotated) text box in the output file.
(2) I also tried MainDocumentPart#addParagraph(String) and created the entire paragraph myself, using XML code generated in Word (where the text box was rotated). The output generated by docx4j, however, again did not respect the rotation. It created two text boxes, but when I view them in Word now, I cannot even rotate them there anymore. It seems like the generated text boxes are different from those created by Word in the first place.
Long story short, how can I create a rotated text box with docx4j?
It would be very convenient if I could have a Word template to lay out the name badges, but if I had to create the entire thing programmatically, that would be fine, too. Other ways of rotating text would also be okay for me. But it seems like text boxes are the only objects in Word that can actually be rotated by 180 degrees.
How do I create document info dictionary keys containing Unicode characters (typically Swedish characters, for instance ä, U+00E4, UTF-8 bytes C3 A4)? I would like to use PdfStamper to add my own metadata to the document info dictionary, but I can't get it to accept the Swedish characters.
Entering custom metadata using Acrobat works fine, and looking at the PDF in a text editor I can see that the characters get encoded as, for instance, #C3#A4 for the character mentioned above. So is there a way to achieve this programmatically using iText PdfStamper?
regards
Mattias
PS. There is no problem having Unicode characters in the info dictionary values, but the keys are a different story.
Please take a look at the NameObject example, and give it a try. You'll see that iText automatically escapes special characters in names.
iText follows the ISO-32000-1 specification that states (7.3.5, Name Objects):
Beginning with PDF 1.2 a name object is an atomic symbol uniquely
defined by a sequence of any characters (8-bit values) except null
(character code 0). Uniquely defined means that any two name objects
made up of the same sequence of characters denote the same object.
Atomic means that a name has no internal structure; although it is
defined by a sequence of characters, those characters are not
considered elements of the name.
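When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to
introduce a name. The SOLIDUS is
not part of the name but is a prefix indicating that what follows is a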
sequence of characters representing the name in the PDF file and shall
follow these rules:
a) A NUMBER SIGN (23h) (#) in a name shall be written by using its
2-digit hexadecimal code (23), preceded by the NUMBER SIGN.
b) Any character in a name that is a regular character (other than
NUMBER SIGN) shall be written as itself or by using its 2-digit
hexadecimal code, preceded by the NUMBER SIGN.
c) Any character that is not a regular character shall be written
using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.
NOTE 1: There is not a unique encoding of names into the PDF file
because regular characters may be coded in either of two ways.
White space used as part of a name shall always be coded using the
2-digit hexadecimal notation and no white space may intervene between
the SOLIDUS and the encoded name.
Regular characters that are outside the range EXCLAMATION MARK(21h)
(!) to TILDE (7Eh) (~) should be written using the hexadecimal
notation.
The token SOLIDUS (a slash followed by no regular characters)
introduces a unique valid name defined by the empty sequence of
characters.
NOTE 2 The examples shown in Table 4 and containing # are not valid
literal names in PDF 1.0 or 1.1.
I'm not copy/pasting table 4, but I don't see any example that uses characters that consist of two bytes. Can you share a PDF that contains a name with a two-byte character that behaves in the way you desire? The PDF specification explicitly says that characters in the context of names are 8-bit values. You seem to be talking about 16-bit values...
Additional note: in the current implementation of iText, we only look at 8 bits:
c = (char)(chars[k] & 0xff);
We deliberately throw away all the higher bits when characters with more than 8 bits are passed.
Actually, I think I have answered your question. Initially, I thought you were asking to add this character: http://www.fileformat.info/info/unicode/char/c3a4/index.htm
As it turns out, you only need "\u00e4" (ä). I've made a small code sample that demonstrates how one would add a custom entry containing this character to the document info dictionary: ChangeInfoDictionary.
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    // Start from the existing info dictionary entries and add a custom key.
    Map<String, String> info = reader.getInfo();
    info.put("Special Character: \u00e4", "\u00e4");
    stamper.setMoreInfo(info);
    stamper.close();
    reader.close();
}
Granted, when you open the PDF in a PDF viewer, you don't necessarily see "Special Character: ä" as the key value, but that's a problem of the PDF viewer. When you open the PDF in a text editor, you clearly see:
/Special#20Character:#20#e4(ä)
Which means that iText has correctly escaped the special character.
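For reference, the escaping rules quoted above can be sketched in a few lines of plain Java. This is not iText's implementation: PdfNameEscaper is a hypothetical name, and the choice of lowercase hex digits is arbitrary, made only to match the output shown.

```java
// Hypothetical helper illustrating the name escaping rules of ISO 32000-1, 7.3.5.
final class PdfNameEscaper {
    // Delimiter characters, which are not regular characters.
    private static final String DELIMITERS = "()<>[]{}/%";

    private PdfNameEscaper() {}

    // Writes the bytes of a name as a name token: a leading '/', then each byte
    // either as itself or as '#' followed by its two-digit hexadecimal code.
    static String escape(byte[] nameBytes) {
        StringBuilder sb = new StringBuilder("/");
        for (byte b : nameBytes) {
            int c = b & 0xff; // names are sequences of 8-bit values
            boolean plain = c >= 0x21 && c <= 0x7e && c != '#' && DELIMITERS.indexOf(c) < 0;
            if (plain) {
                sb.append((char) c);
            } else {
                sb.append('#').append(String.format("%02x", c));
            }
        }
        return sb.toString();
    }
}
```

Feeding it the ISO-8859-1 bytes of "Special Character: ä" gives /Special#20Character:#20#e4, whereas the UTF-8 bytes of "ä" alone give /#c3#a4, the form seen in the PDF produced by Acrobat.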
However: as you pointed out in your comment, the character doesn't show up in Adobe Reader. Based on a PDF I created using Acrobat, I found a workaround by using the following code:
StringBuffer buf = new StringBuffer();
buf.append((char) 0xc3);
buf.append((char) 0xa4);
info.put(buf.toString(), "\u00e4");
Now the character is shown correctly. In other words: it's a matter of encoding...
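The two append calls generalize to any key. A small sketch, assuming (as the observation above suggests, but the specification does not state) that the viewer decodes the key's bytes as UTF-8; InfoKeyEncoding and utf8BytesAsChars are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; produces a key whose chars are the UTF-8 bytes of the input.
final class InfoKeyEncoding {
    private InfoKeyEncoding() {}

    // Encodes the key as UTF-8, then maps each byte to the char with the same value
    // (0..255), so that code keeping only the low 8 bits of each char loses nothing.
    static String utf8BytesAsChars(String key) {
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }
}
```

For "\u00e4" this returns the two chars 0xc3 and 0xa4, the same string the StringBuffer code builds by hand, so it could be passed to info.put in its place.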
Just wanted to share a little experiment in C# illustrating one rather effortless way of getting the special characters into the document info dictionary keys.
string inputString = "My key with åäö";
byte[] inputBytes = Encoding.UTF8.GetBytes(inputString);
string convertedString = Encoding.UTF7.GetString(inputBytes);
info.Add(convertedString, "My value with åäö");
(info is the Dictionary used for adding metadata.) Then just use the PdfStamper to write the info into the PDF. The metadata is stored correctly in the PDF and can be interpreted by Adobe Reader.