tesseract-ocr - To extract table - tesseract

I am new to tesseract ocr. I am making use of Google api to extract words and lines from image. I want to extract tables/horizontal & vertical lines. I tried with FindLinesCreateBlockList method. It returns BLOCK_LIST type. I am not aware how to print values from BLOCK_LIST.
Is FindLinesCreateBlockList the right method to extract tables/lines?

Related

Mailmerge single image into a Word Document based on a cell value

I'd like to include an image into a mail merged word document based on the presence of a single value in a column which contains several values.
e.g. if the cell contains the value BOB insert image, if it contains any other value then do nothing.
Most of the {INCLUDEPICTURE} functionality seems built around including a different image based on a filename matching a cell value.
{INCLUDEPICTURE} "MERGEFIELD Selection_identifier).png"\*
MERGEFORMAT \d }
Works provided I translate selection_identifer in the spreadsheet itself, but there has to be a better way. There seems to be little information about this particular usecase online.
If you are only using a single image and it does not vary between merges, you should probably just use
{ IF "{ MERGEFIELD Selection_identifier }" = "BOB" "<the_image>" }
where <the_image> is a copy of the actual image, sized how you want, pasted between those quotation marks. In that case, there would be no need for an INCLUDEPICTURE field or a reference to an external image file.
As usual, all the {} have to be the special field code brace pairs that you can insert on Windows Desktop Word using Carl-F9 or similar.

How extract text geometry using PyPDF2?

I have pdf documents.
And it's clear to me how to extract text from it.
I need to extract not only text but also coordinates associated with this text.
It's my code:
from PyPDF2 import PdfReader
pdf_path = 'docs/doc_3.pdf'
pdf = PdfReader(pdf_path)
page_1_object = pdf.getPage(1)
page_1_object.extractText().split("\n")
The result is:
['Creating value for all stakeholders',
'Anglo\xa0American is re-imagining mining to improve people’s lives.']
I need geometries associated with extracted paragraphs.
Might be something like this for example:
['Creating value for all stakeholders', [1,2,3,4,]]
'Anglo\xa0American is re-imagining mining to improve people’s lives.', [7,8,9,10]]
How I can accomplish it?
Thanks,
Currently that ability is not a PyPDF2 feature, it has the ability for parsing the content as you show extractText() but does not hold the separate glyph xy positions nor output the lines coordinates.
There are other means in python to extract a single or multiple groups of letters that form words.
Using shell commands such as poppler from / in conjunction with a text "word" from PyPDF2 is possible, however the norm would be to run with another Py PDF Lib such as PyMuPDF and here is such an article, https://pyquestions.com/find-text-position-in-pdf-file for highlighting with PyMuPDF input.
The most common means to your goal is probably as described here How to extract text and text coordinates from a PDF file?

Apache fop : Unable to handle if single Word in text is larger than the containing block

I am new to fop , will be greatful if i get help from someone...,
I am not using XSLT tranformation but creating XSLFO file directly using Java code. Everything works fine but the problem comes when particular word(long text without space) is inserted into a cell of a table-column . That bigger word is overlapping the successive block.
I have an fo:block element in fo:table-cell which is in fo:table-row of a fo:table. This table has 6 columns, obviously column width is small. Now, when a Word in the block is larger than the block it is overlapping the next block. Give me some attribute value or any other solution to change my XSLFO file ,so that the bigger word breaks into the new line at end of the column.
Thanks in advance...
The things you need to look into are:
For the fo:table-cell: number-columns-spanned="3"
For setting the width of fo:table-column: column-width="proportional-column-width(1.5)"
The number-columns-spanned is used as attribute of and provides you with a means to select a bigger area where your fo:block fits.
The column-width makes it easy to define absolute width or, when using proportional-column-width, a width relative to the other columns.
I don't know of a way that lets FOP break words into multiple parts when they don't fit.

Is it possible to add a list to an irregular column?

I am using iTextSharp to programmatically generate a PDF file.
I am using an irregular column (a ColumnTextin text mode) and am having difficulty adding an unordered list to the column.
The ColumnText class's AddText method only accepts a Phrase or Chunk, so I can't add a List directly. I have tried adding a List to a Paragraph and then adding the Paragraph to the ColumnText but the result is the list items appear concatenated one after another and not as a bulleted list.
Can this be done or do I need to explore an alternative route?
ANSWER: There is no support in iText to add lists to irregular columns. However, it is possible to render lists in irregular columns with a bit of legwork. I blogged about it here: Rendering Lists in Irregular Columns Using iText / iTextSharp.
With AddText(), you're using ColumnText in text mode. If you want to use a List, you should use ColumnText in composite mode. You can switch from text mode to composite mode by replacing AddText() by AddElement().

updating line in large text file using scala

i've a large text file around 43GB in .ttl contains triples in the form :
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://la.dbpedia.org/resource/Mahatma_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
and i want to find the fastest way to update a specific line inside the file without rewriting all next text. either by updating it or deleting it and appending it to the end of the file
to access the specific line i use this code :
val lines = io.Source.fromFile("text.txt").getLines
val seventhLine = lines drop(10000000) next
If you want to use text files, consider a fixed length/record size for each line/record.
This way you can use a RandomAccessFile to seek to the exact position of each line by number: You just seek to line * LineSize, and then update it.
It will not really help, if you have to insert a new line. Other limitations are: The file size will grow (because of the fixed record length), and there will always be one record which is too big.
As for the initial conversion:
Get the maximum line length of the current file, then add 10% for example.
Now you have to convert the file once: Read a line from the text file, and convert it into a fixed-size record.
You could use a special character like | to separate the fields. If possible, use somthing like ;, so you get a .csv file
I suggest padding the remaining space it with spaces, so it still looks like a text file which you can parse with shell utilities.
You could use a \n to terminate the record.
For example
http://x.com|http://x.com|http://x.com|...\n
or
http://x.com;http://x.com;http://x.com;...\n
where each . at the end represents a space character. So it's still somehow compatible with a "normal" text file.
On the other hand, looking at your data, consider using a key-value data store like Redis: You could use the line number or the 1st URL as the key.