Defining what is a line in Tesseract - tesseract

I'm working on document recognition for scanned bank statement. The statements that I have are organized by lines, such as the one attached. Because Tesseract does such a good job at detecting the areas of text, it breaks the lines in the middle (I'm assuming this is because of the large white space between the first block in the line (blurred for privacy reason), and the next one ('EUR', or 'COURS').
In the hocr file, the bbox of all the elements in the line are within 2px or so, so I could potentially rebuild a line myself. However, this seems more like a hack. Is there a way to tell Tesseract that lines should be as wide as the document itself? Or would there be another way to go about it? I've tried playing with the psm option, but with no luck.

-psm 6 -- Assume a single uniform block of text -- should work. If not, you may want to use the older version 2.0x, which does not perform page layout analysis.

Related

ImageMagick: How to batch append 4 parts of images into one (2 rows, 2 columns) (I have 500+ images that need to be combined like this)

everyone!
I am using ImageMagick-7.0.10-Q16 on Windows 10. I’ve tried Googling for answers, but I’m still left very confused about how to do this. Most of the answers have been for UNIX and not Windows, I have no idea what it means, or given me errors. I don’t have any experience with coding or Windows PowerShell, so forgive my slowness
I have scanned pages of books that have been split into four pieces of jpg files. The images are named after the page number and the orientation of the corresponding piece. BL=Bottom left. BR=Bottom right. TR=Top right. TL=Top left. (BM=Bottom pieces merged. TB=Top pieces merged). So “BL0001.jpg" is the bottomleft piece of page 1. I’m not mentioning their sizes because I don’t want them to be resized or whatever. I just want them to be combined via append like a puzzle like this:
Combined jpg pieces.
The borders and the text-boxes there are just to demonstrate, and are not to be included
So the files are for example like this:
BL0001.jpg
BR0001.jpg
TL0001.jpg
BR0001.jpg
BL0002.jpg
BR0002.jpg
TL0002.jpg
BR0002.jpg
And so on...
This was the last thing I’ve tried in Windows PowerShell:
magick convert B*0001.jpg +append 0001BM.jpg
magick convert T*0001.jpg +append 0001TM.jpg
magick convert 0001*.jpg +swap -append 0001merged.jpg
This combines 4 parts into one image just like I want it to. I found out adding * works like a wildcard and merges all the images like BR and TR together in one go. But I can’t do that for the page number (in this case ‘0001’ in ‘B*0001.jpg’), because that would merge all the files in the folder into the same image, something I don’t want. So what I want to figure out is to how to “batch” run this command for with a sequential numbering system for the different pages. In other words, use a command to batch combine pieces of an image into one image, but with all the scanned pages in jpg in the folder. I know the commands above create addition files with the merged top and bottom parts before the final merge, but I don’t know how to make this command otherwise. I'm willing to try other commands/things too
Using ImageMagick v7 in a simple Windows BAT script you could do something like this...
#echo off
setlocal EnableDelayedExpansion
for /l %%n in ( 1 1 9999 ) do (
set V1=000%%n
set V1=!V1:~-4!
magick *!V1!.jpg +append -crop 2x1# +swap -append +repage !V1!merged.jpg
)
exit /b
That uses a "for" loop to read all four "*0001.jpg" images at a time into an ImageMagick command. The "set V1=" lines are to make sure the variables have the correct number of leading zeros.
The IM command appends, crops, and appends the four images into the properly ordered output, and writes the image as "0001merged.jpg". Then it moves on to process "*0002.jpg" and so on.
I put a top limit on the number of image sets to process with that "9999" in the "for" command to work with the number of leading zeros. Make sure that number is the same or more than the number of image sets you have. It will just print an error for each loop after it goes over the number of image sets, but no harm done.
Note: Using ImageMagick v7 you should just use "magick" because when you use "magick convert" it emulates IMv6 behavior. You probably won't usually want that.

iText - Manipulate existing PDF - add dashes to end of each paragraph

I need to manipulate existing PDF in iText to add dashes to the end of each paragraph. Something like this:
I would make this in Word with tab leaders.
Is this possible to do with iText on an existing document.
Any help would be greatly appreciated.
Thanks!
Edit for clarifications
iText version is 5.5.x, but I guess we can upgrade it if the task would be easier with newer version.
There could be some paragraph that do not need dashes, but I have some control of the original PDF. It is assembled from different system and I could add some kind of markers to the paragraphs that need leaders (ie. I can add text like "~tab~" at the end of such paragraphs).
At the moment the documents that need this kind of editing have headers and footer, nothing but the text and one column with justified alignment.
Edit for even more clarification
I can even (by configuration) set where the dashes has to end (ie. at 10px) for specific document. We know every document type (and its structure) that needs to be manipulated this way.
This is insanely hard.
You should think of a PDF document as a container of instructions, rather than a WYSIWYG format. So finding out where lines are (let alone paragraphs) is very hard.
High level plan:
use IEventListener to process events from the PDF being parsed
look out for TextRenderInfo events, store them
sort TextRenderInfo events to ensure your list of events is in logical reading order.
merge items in your list if they appear on the same line and are less than a certain distance apart (for instance the distance of 3 spaces in the font specified by TextRenderInfo)
Now you should have lines
Merge lines if they appear in close vertical proximity of eachother and they overlap horizontally. How close they should be, and how much they overlap is something you'll have to figure out, and might differ from page to page, and document to document.
now you should have paragraphs
Figure out the bounding box of each paragraph. Or more accurately, the convex hull. There is a good algorithm for this called the gift-wrapping algorithm.
Now you can simply insert lines by inspecting your convex hull. This is the easy step.
If you can insert markers, you can easily do this using iText7. iText7 has an implementation of IEventListener that allows you to look for regular expressions within a PDF document. It returns the locations where the regular expression was found. If you can ensure your markers always satisfy some kind of regular expression, you can easily look for them, get their coordinates, and insert a line at the calculated position.
Of course, then you need to get rid of the marker text.
For that you can use pdfSweep.

Editing a Text File from Matlab

I'm still getting used to Matlab, and not sure if this is possible using Matlab or not, but it's just something that popped into my head that I thought could be interesting.
Is there any way to edit the contents of a text file in Matlab?
Moreover, is there any way to edit specific parts of the text file without altering the rest?
To elaborate, let's say I had a text file that was several lines long. For instance:
This is a hypothetical text file.
The cat chased a mouse.
The mouse ran into a hole.
The cat tried to paw at the mouse.
The mouse waited in the hole until the cat got bored.
The mouse came back out when the cat left.
Is there any way to use Matlab to exclusively edit, say, line 6 and change it from "The mouse waited in the hole until the cat got bored" to "The mouse fell asleep and the cat got bored", without having to change the rest of the file?
I know of several methods to read and display contents of text files using Matlab, but I'm not sure if there's any way to actually edit the text files in Matlab.
Thanks!
As far as I know, you will always have to read the file line by line (for instance into a cell-array) and edit it as you need. After that, you write a new file or overwrite the old one.
Of course, you can encapsulate this procedure and then call you own function like
manipulateFile(lineNumber, newLineText)
Some commands that may come in handy are fopen, fscanf, textread, fprintf, and fclose.

Highlight pdf line

Please can any one help me. I am really stuck I don't know how to highlight particular line of pdf. It would be better if any one can provide me sample code or pseudo code
Thanks
This is not trivial.
To do this, I'd render the PDF contents into one layer, and somehow get the position of the said line/object using the CoreGraphics PDF parser (or some other way). After that, you highlight the said object using your own drawing code.
Just highlighting a particular line is quite difficult.
If you need search and highlight, please try FastPDFKit. I played with it for a while and it's quite good as a pdf reader.
http://mobfarm.eu/fastpdfkit
I'm working on the same thing at the moment and it's not trivial indeed.
From what I can figure out you need to load the text and arrange it in lines first. If you are using Poppler, the Poppler.Page.textList() will provide you with a list of TextBoxes and a TextBox.hasSpaceAfter() will tell you the end of line when returning False.
I am using the Qt4 frontend, so the each TextBox has a QRect from which I can figure out where to highlight a word. Highlighting a line is more or less lirstWordOfLine.geometry().united(lastWordOfLine.geometry()) which will provide the geometry of the line to highlight.
Now what I can't figure out is how to save the coordinates of the highlights in the document.

Diff tool to align shuffled lines

Suppose I have two documents that are identical except the lines are shuffled. Is there a tool that can show me which lines in document A correspond to which lines on document B by drawing lines to connect them (kinda like Cairo does for machine translation word alignments)?
What if the files have some level of differing lines (I don't want to figure out which lines are similar to each other -- if there isn't an exact match for a line, then that line has no match.)
Note: I am not looking to sort the files and compare them, rather I am looking to get a visualization of how far out of order the files are relative to each other, and which particular regions tend to move together, and which tend to be shuffled.
Windiff will show you the line in the left file it thinks the line in the right file came from, but it's often mistaken when lines are the same (e.g. a line with just a } in a cc file).
I just discovered psame in a google search which (at least algorithmically) does the same thing.