iText - Manipulate existing PDF - add dashes to end of each paragraph

I need to manipulate an existing PDF with iText to add dashes to the end of each paragraph.
In Word I would do this with tab leaders.
Is this possible with iText on an existing document?
Any help would be greatly appreciated.
Thanks!
Edit for clarifications
The iText version is 5.5.x, but I guess we can upgrade if the task would be easier with a newer version.
There could be some paragraphs that do not need dashes, but I have some control over the original PDF. It is assembled from a different system, and I could add some kind of marker to the paragraphs that need leaders (i.e. I can add text like "~tab~" at the end of such paragraphs).
At the moment the documents that need this kind of editing have headers and footers, nothing but text, and one column with justified alignment.
Edit for even more clarification
I can even (by configuration) set where the dashes have to end (i.e. at 10px) for a specific document. We know every document type (and its structure) that needs to be manipulated this way.

This is insanely hard.
You should think of a PDF document as a container of instructions, rather than a WYSIWYG format. So finding out where lines are (let alone paragraphs) is very hard.
High level plan:
Use IEventListener to process events from the PDF being parsed.
Look out for TextRenderInfo events and store them.
Sort the TextRenderInfo events so that your list is in logical reading order.
Merge items in your list if they appear on the same line and are less than a certain distance apart (for instance, the width of 3 spaces in the font specified by TextRenderInfo).
Now you should have lines.
Merge lines if they are in close vertical proximity to each other and they overlap horizontally. How close they should be, and how much they should overlap, is something you'll have to figure out, and it might differ from page to page and from document to document.
Now you should have paragraphs.
Figure out the bounding box of each paragraph. Or more accurately, the convex hull. There is a good algorithm for this called the gift-wrapping algorithm.
Now you can simply insert lines by inspecting your convex hull. This is the easy step.
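The grouping steps in the plan above can be sketched in a library-independent way. This is only a rough illustration, not iText code: the Chunk class is a hypothetical stand-in for whatever you store from each TextRenderInfo event, the tolerances y_tol and max_gap are made-up defaults you would have to tune per document, and it uses the extent of each paragraph's last line instead of a full convex hull, which is assumed to be enough for a single justified column.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for one stored text-render event."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline; PDF y coordinates grow upwards


def group_into_lines(chunks, y_tol=2.0):
    """Cluster chunks whose baselines lie within y_tol, top of page first."""
    lines = []
    for c in sorted(chunks, key=lambda c: (-c.y, c.x0)):
        if lines and abs(lines[-1][0].y - c.y) <= y_tol:
            lines[-1].append(c)
        else:
            lines.append([c])
    for line in lines:
        line.sort(key=lambda c: c.x0)  # left-to-right reading order
    return lines


def line_extent(line):
    """Horizontal extent (left, right) of a line."""
    return min(c.x0 for c in line), max(c.x1 for c in line)


def group_into_paragraphs(lines, max_gap=14.0):
    """Merge consecutive lines that are close vertically and overlap horizontally."""
    paragraphs = []
    for line in lines:
        if paragraphs:
            prev = paragraphs[-1][-1]
            gap = prev[0].y - line[0].y
            a0, a1 = line_extent(prev)
            b0, b1 = line_extent(line)
            if gap <= max_gap and min(a1, b1) > max(a0, b0):
                paragraphs[-1].append(line)
                continue
        paragraphs.append([line])
    return paragraphs


def leader_segments(paragraphs, right_margin):
    """(x_start, x_end, y) of the empty space after each paragraph's last line."""
    segments = []
    for p in paragraphs:
        _, x1 = line_extent(p[-1])
        if x1 < right_margin:
            segments.append((x1, right_margin, p[-1][0].y))
    return segments
```

Each returned segment is where a dashed line would be drawn. Headers and footers would need to be filtered out before grouping, for example by discarding chunks above or below a known y position.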
If you can insert markers, you can easily do this using iText7. iText7 has an implementation of IEventListener that allows you to look for regular expressions within a PDF document. It returns the locations where the regular expression was found. If you can ensure your markers always satisfy some kind of regular expression, you can easily look for them, get their coordinates, and insert a line at the calculated position.
Of course, then you need to get rid of the marker text.
For that you can use pdfSweep.
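Once the marker's coordinates are known, working out the leader is plain arithmetic. The sketch below assumes you have the marker's x position, the configured x position where the dashes must end, and the width of one dash in the chosen font; the function name dash_leader and its parameters are hypothetical and not part of iText.

```python
def dash_leader(marker_x, end_x, dash_width, gap=0.0):
    """Return the x position of every dash that fits between marker_x and end_x."""
    if dash_width <= 0:
        raise ValueError("dash_width must be positive")
    step = dash_width + gap
    available = end_x - marker_x
    # The last dash needs no trailing gap, hence the "+ gap" before dividing.
    count = max(0, int((available + gap) // step))
    return [marker_x + i * step for i in range(count)]
```

The returned positions, together with the marker's baseline, are what you would hand to whatever drawing call you use to stamp the dashes onto the page.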

Related

Accurately Reading Document Content with Position Using Open XML (Word)

I have a need to retrieve a string which matches perfectly the content of a word document. That is to say, in position 1000 of this string, I should find exactly the text at position 1000 in the document.
We have been through various iterations of reading in the document context text and adjusting for field codes/tables/pictures/inline shapes etc by padding in the right places. This approach does work (well) but we want to move towards Open XML instead for speed.
We have Power Tools for Open Xml installed, and have been looking at ways to recreate this string using Open Xml. We can get all the text by going through the runs (as per Eric White's blogs), but we also need everything else. \r's, \t's etc. I see things like "TabChar" in runs, and wdfldChar, but I am unclear how to use this information to generically get what we want.
For example, "TabChar" in our string should be \t. We presumably need to interpret wdfldChar begin, separate, end in a certain way (maybe by adding spaces). The problem is that we don't want to have to find every possibility and code for each one [If run = "TabChar" append "\t" etc.] because a) it's inefficient and b) it is unsafe.
Can anyone help with a method to reproduce this string with complete accuracy?
Thanks
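One way to keep the special cases in a single place is a lookup table from run child elements to characters. The sketch below is only an illustration using Python's standard XML parser on the main document part. The table RUN_CHILD_TEXT and the characters in it are assumptions that would have to be checked against the positions Word reports, and field characters, deleted text, tables and text boxes are deliberately ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumed mapping from run children to characters; adjust to match Word.
RUN_CHILD_TEXT = {
    f"{{{W}}}tab": "\t",
    f"{{{W}}}br": "\n",
    f"{{{W}}}cr": "\r",
}


def paragraph_text(p):
    """Concatenate the text of every run in a w:p element."""
    parts = []
    # Only look at direct children of runs, so that tab stop
    # definitions inside w:pPr/w:tabs are not mistaken for tab characters.
    for run in p.iter(f"{{{W}}}r"):
        for node in run:
            if node.tag == f"{{{W}}}t":
                parts.append(node.text or "")
            elif node.tag in RUN_CHILD_TEXT:
                parts.append(RUN_CHILD_TEXT[node.tag])
    return "".join(parts)


def document_text(xml):
    """Return the document text with one "\\r" after each paragraph."""
    root = ET.fromstring(xml)
    return "".join(paragraph_text(p) + "\r" for p in root.iter(f"{{{W}}}p"))
```

Anything not in the table is silently skipped, so logging unknown run children while testing would show which cases are still missing.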

Defining what is a line in Tesseract

I'm working on document recognition for scanned bank statements. The statements that I have are organized by lines, such as the one attached. Because Tesseract does such a good job at detecting the areas of text, it breaks the lines in the middle. I'm assuming this is because of the large white space between the first block in the line (blurred for privacy reasons) and the next one ('EUR' or 'COURS').
In the hocr file, the bbox of all the elements in the line are within 2px or so, so I could potentially rebuild a line myself. However, this seems more like a hack. Is there a way to tell Tesseract that lines should be as wide as the document itself? Or would there be another way to go about it? I've tried playing with the psm option, but with no luck.
-psm 6 -- Assume a single uniform block of text -- should work. If not, you may want to use the older version 2.0x, which does not perform page layout analysis.
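If the page segmentation mode does not help, the fallback mentioned in the question (rebuilding lines from the hOCR bounding boxes) could look roughly like this. It is a sketch under assumptions: it expects word spans with class ocrx_word and a title attribute starting with "bbox x0 y0 x1 y1", with class appearing before title, and it uses the 2px tolerance from the question. The function names are hypothetical.

```python
import re

# Assumes class comes before title inside each word span.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.S,
)


def words_from_hocr(hocr):
    """Extract (x0, y0, x1, y1, text) for every recognised word."""
    words = []
    for x0, y0, x1, y1, inner in WORD_RE.findall(hocr):
        text = re.sub(r"<[^>]+>", "", inner).strip()  # drop nested tags
        if text:
            words.append((int(x0), int(y0), int(x1), int(y1), text))
    return words


def rebuild_lines(words, y_tol=2):
    """Group words whose top edges are within y_tol pixels into one line."""
    rows = []
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1][0][1] - w[1]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [" ".join(w[4] for w in sorted(r, key=lambda w: w[0])) for r in rows]
```

Comparing top edges works when all words on a line have similar heights; for mixed font sizes, comparing vertical centres may be more robust.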

Is there a way to detect when a field marked with 'can grow' has truncated the field data?

I do not normally work with crystal, but I have spent nearly 2 days looking for a way to do this.
The problem is that I have a number of lines of text that need to show on a report, but they need to be cut off after 8 lines with a 'more' prompt to inform the user that they need to look at the rest of the details online. This was originally handled by storing the data as individual lines already wrapped to size, counting the lines with a formula, and conditionally showing a separate 'more' field. They have since added the ability to use HTML in the text, but this made the current way of doing things wrap incorrectly and show the HTML mark-up.
I wrote a database function to combine the text into a single field and use the HTML text interpretation to display it correctly on 7 other reports that do not limit the text length. The max line count works great for limiting the text size; I just can't figure out how to show the 'more' prompt when needed.
Any suggestions would be greatly appreciated.
GrumpyGeek,
If your database function now combines the text into a single field does this mean the original way, with the separated lines, is still stored? If so, why not add another calculated field called 'line-count' that tallies the old line-based data?
So you'd still have your new combined HTML field and this new field that you could use to show the 'more' button when 'line-count > x'?
Alternatively, another option might work, but it would be a bit touchy: make the formula that shows the 'more' button trigger when the field length exceeds x. The catch is that HTML mark-up isn't displayed, and heavy use of it would skew the amount of text required before you should show the 'more' button. Put another way, a field with very heavy use of mark-up tags might force the 'more' button earlier than it should, unless you could somehow make either your 'line-count' calculated field exclude the mark-up or make the length calculation do the same.
This would be possible if MSSQL or Crystal Reports could run regex to strip the mark-up.
If NONE of the above works, the only other thing I can suggest is to look into UDFs. Crystal allows you to load an external library that you write. These will read functions you write and show them in the function list inside Crystal. If you do this, then you could easily write a routine that strips the HTML and calculates when the more button should be shown.
Good luck with it.
Ideally, there would be a property of the DB field that would return its displayed line count. Unfortunately, there is no such property.
You could try counting the number of line-ending characters (e.g. carriage return, line feed). If there are more than 7, show the hyperlink. In an HTML situation, you have to count line-ending elements instead. You could make use of a RegEx UFL to make it easier to identify the elements.
Probably the easiest route is to have the DB calculate the number of lines and return that as another field. Use this field to hide/show the hyperlink.
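The logic such a database function or UFL would need can be sketched as follows. This is written in Python purely for illustration, since neither Crystal nor the database will run it directly. Which elements end a line (here br, p, div and li) and the chars_per_line wrap width are assumptions to adapt to the actual mark-up and field width, and the wrap estimate is rough for proportional fonts.

```python
import re
from html import unescape

# Assumed set of elements that end a displayed line.
LINE_END_RE = re.compile(r"<br\s*/?>|</p\s*>|</div\s*>|</li\s*>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def visible_lines(html_text):
    """Strip the mark-up and split the remaining text into displayed lines."""
    text = LINE_END_RE.sub("\n", html_text)
    text = unescape(TAG_RE.sub("", text))
    lines = text.split("\n")
    while lines and not lines[-1].strip():
        lines.pop()  # ignore trailing empty lines
    return lines


def estimated_line_count(html_text, chars_per_line=80):
    """Count lines, treating long lines as wrapped every chars_per_line characters."""
    total = 0
    for line in visible_lines(html_text):
        total += max(1, -(-len(line) // chars_per_line))  # ceiling division
    return total


def needs_more_prompt(html_text, max_lines=8, chars_per_line=80):
    """True when the text would not fit in max_lines displayed lines."""
    return estimated_line_count(html_text, chars_per_line) > max_lines
```

The result of needs_more_prompt is what would be returned as the extra field that hides or shows the 'more' prompt.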

OO:Doc - Perl module for OpenOffice

I want to automate some Writer tasks. I need to create an .odt Writer document with oo:doc using methods such as create paragraph and append paragraph. The problem is that append paragraph and create paragraph do not allow text to start in the middle of the page or at a certain column, i.e.
Name Surname Address
When I unzip the "master" document I want to create and inspect the content.xml file, I see the XML equivalent is
" <text:p text:style-name="Text_20_body"><text:s text:c="115"/><text:span text:style-name="T1"><text:s/>Hallo how are you today</text:span></text:p><text:p text:style-name="P1"><text:s text:c="116"/>I hope you are well also</text:p><text:p text:style-name="P1""
How do I set the text:c and text:s element(s) from within oo::doc?
Question 2:
How do I set the formatting of a paragraph so that it only extends from, e.g., column 20 to column 80?
Thanks
Those elements are for runs of space characters, which would otherwise be collapsed into one. The text:c attribute says how many spaces there are.
That doesn't strike me as a solution to what you want, which is to change the margins and position of a paragraph, yes?
Do you have a document that you want to use as a template, where the text will be inserted? Or are you trying to create the entire page from scratch?
I think you want to use OpenOffice.org to create a Writer document that has the structure you want, then look at the XML to see what the markup is that accomplishes that. Look at paragraph-level styles or even frames if that is what is used. You might be able to create insertion points for your generated content by then adding magic-text phrases that you can scan for.
Then figure out how to get that done with the perl module.

Diff tool to align shuffled lines

Suppose I have two documents that are identical except that the lines are shuffled. Is there a tool that can show me which lines in document A correspond to which lines in document B by drawing lines to connect them (kind of like Cairo does for machine translation word alignments)?
What if the files have some level of differing lines? (I don't want to figure out which lines are similar to each other: if there isn't an exact match for a line, then that line has no match.)
Note: I am not looking to sort the files and compare them, rather I am looking to get a visualization of how far out of order the files are relative to each other, and which particular regions tend to move together, and which tend to be shuffled.
Windiff will show you the line in the left file it thinks the line in the right file came from, but it's often mistaken when lines are the same (e.g. a line with just a } in a cc file).
I just discovered psame in a google search which (at least algorithmically) does the same thing.
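If no existing tool fits, the matching itself is small enough to script, leaving only the drawing to a plotting tool. The sketch below is an assumption-laden illustration, not what WinDiff or psame do: identical lines are paired greedily in order of appearance, so repeated lines such as a lone } are matched first-come, first-served, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def align_exact(lines_a, lines_b):
    """Pair each line of A with an identical, not yet used line of B (or None)."""
    positions = defaultdict(deque)
    for j, line in enumerate(lines_b):
        positions[line].append(j)
    pairs = []
    for i, line in enumerate(lines_a):
        if positions.get(line):
            pairs.append((i, positions[line].popleft()))
        else:
            pairs.append((i, None))  # no exact match in B
    return pairs


def moved_blocks(pairs):
    """Collapse runs of lines that moved together into (start_a, start_b, length)."""
    blocks = []
    for i, j in pairs:
        if j is None:
            continue
        if blocks:
            a, b, n = blocks[-1]
            if i == a + n and j == b + n:
                blocks[-1] = (a, b, n + 1)
                continue
        blocks.append((i, j, 1))
    return blocks
```

Long blocks indicate regions that moved together, while many blocks of length 1 indicate heavily shuffled regions; each pair (i, j) is one connecting line in the visualization.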