I am using iTextSharp to extract text from a PDF.
The text I am looking for does not have a fixed position; where it appears depends on how the PDF was printed.
Can anyone suggest an approach for doing this?
Thanks.
Related
I'm working on a mind map editor in which the user can draw boxes and write text in them. However, the TMPro input fields I'm using in those boxes show an extra-wide caret when I type in them, and changing the font didn't solve the problem. Here are some images of the issue:
The caret is so wide that it can push the text inside out of the box:
I tried to lower the caret width in my script, but it's an int and has already been set to 1. Can you give me some possible reasons as to why this is happening?
I solved that problem by multiplying the width and height of my input field by 100 and dividing its scale by 100.
Don't forget to increase the font size significantly.
If I export my MATLAB figure as an EPS using:
print('myfig','-depsc')
I then open it in another software, in my case Illustrator CS6.
The text appears ok, but what should be a single text box, say a legend entry, is actually multiple text boxes arranged so that it looks like one.
In the image below, the black text is what it looks like first, but I have also shown a copy of the same text, with each text box a different color.
If I want to edit any of this, it's very difficult, as the spacing will then be messed up. Also, if I change the font, the kerning gets messed up.
I have also tried using the text command to place text on the axis, and this also ends up in multiple text boxes.
Is there any way to fix this?
Am I missing something?
Just to be clear, I would like to fix MATLAB's EPS output, not use different software.
I am developing image processing software that extracts (crops) and enhances a single-page form from an image taken with a cellphone camera. The form has no rectangular boundary that would simplify extraction. It is black text on a white background, but nothing apart from that is fixed. Some text will be present that verifies that the image is of the required form. So my questions are these:
1) Can I search for a specific regular expression using the Leptonica library itself, or do I have to turn to other libraries such as the Tesseract API? So far I have not found anything of this sort.
2) Suppose I know the text at the top-left corner and the bottom-right corner, and I find both successfully. Can I get the coordinates of the text I am searching for and then crop the image accordingly?
Leptonica doesn't do anything with text, it's an image processing library.
To get the position of the text, add tessedit_create_hocr 1 to your Tesseract config file (or set this option in whichever way you configure Tesseract if you're using it as a library).
The result is no longer a text file, but a UTF-8-encoded HTML file (note: it's not valid XML). Its format is self-explanatory. It will contain the positions and dimensions of all words on all pages in pixels, as found on the input image. You need to parse that HTML, find the words you're looking for, and then get the bounding boxes of those words.
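The parsing step can be sketched in Python with only the standard library. This is a minimal sketch, not Tesseract's own API: it assumes each word is an element with class ocrx_word whose title attribute starts with "bbox x1 y1 x2 y2", and the names HocrWordParser, find_word_boxes, crop_rectangle and the SAMPLE markup are made up for illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x1, y1, x2, y2)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._depth = 0     # nesting depth inside the current word element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # nested tag inside a word, e.g. <strong>
            return
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._depth = 1
                self._chunks = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the word element itself has closed
            self.words.append(("".join(self._chunks).strip(), self._bbox))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)


def find_word_boxes(hocr, target):
    """Return the bounding boxes of every word whose text equals target."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return [box for text, box in parser.words if text == target]


def crop_rectangle(top_left_box, bottom_right_box):
    """Rectangle spanning from the first box's top-left to the second's bottom-right."""
    return (top_left_box[0], top_left_box[1], bottom_right_box[2], bottom_right_box[3])


# Hypothetical hOCR fragment, hand-written for this example.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Form</span>
<span class='ocrx_word' id='word_1_2' title='bbox 620 540 700 566; x_wconf 91'><strong>Signature</strong></span>
</div>"""

top_left = find_word_boxes(SAMPLE, "Form")[0]
bottom_right = find_word_boxes(SAMPLE, "Signature")[0]
print(crop_rectangle(top_left, bottom_right))
```

The resulting rectangle is in the pixel coordinates of the image that was given to Tesseract, so it can be passed to whatever cropping routine you use on that same image.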
I am using CGPDFScanner to scan a PDF. Should I use the Td operator to find the positions of text? Could I have an example of how to use this operator to get text positions? Currently I use the Tj and TJ operators to find the text. Now I would like to know the position of each word on a single page of the PDF. How can I do that?
Thanks
Take a look at this library:
https://github.com/KurtCode/PDFKitten/
It can search for and highlight text.
To get the coordinates of the text you need to keep track of the text transformation matrix. See section 5.3.1, "Text Positioning Operators" of the PDF 1.4 Reference. (I'm not sure if later versions of the reference number things the same or not.) While the Td operator offsets the current translation in the text matrix, relative to the start of the current line, there are other operators that affect the text matrix and other text state as well, so you need to keep track of the text matrix as the file is processed. The Tm operator sets the text matrix directly. The TD operator moves to the next line, offset by its x and y parameters, and also sets the text leading to the negated y offset. T* just moves to the next line, using the current leading.
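Here is a rough Python sketch of that bookkeeping, following the operator definitions in the PDF Reference as I understand them. It assumes the operators have already been parsed into an operator name plus numeric operands (for example inside CGPDFScanner callbacks); the TextState class and its method names are hypothetical. It deliberately ignores the advance caused by Tj/TJ (which needs font metrics) and the current transformation matrix, so the result is the text-space origin of each line, not a final page position.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def multiply(m1, m2):
    """Return m1 x m2 for PDF matrices written as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


class TextState:
    """Tracks the text matrix (tm) and text line matrix (tlm)."""

    def __init__(self):
        self.leading = 0.0
        self.begin_text()

    def begin_text(self):
        # BT resets both matrices to the identity matrix.
        self.tm = IDENTITY
        self.tlm = IDENTITY

    def apply(self, operator, operands=()):
        if operator == "BT":
            self.begin_text()
        elif operator == "TL":      # set the leading
            self.leading = operands[0]
        elif operator == "Td":      # move to next line, offset by (tx, ty)
            self._move(*operands)
        elif operator == "TD":      # like Td, but also sets leading to -ty
            self.leading = -operands[1]
            self._move(*operands)
        elif operator == "Tm":      # set both matrices directly
            self.tm = self.tlm = tuple(operands)
        elif operator == "T*":      # same as "0 -leading Td"
            self._move(0.0, -self.leading)

    def _move(self, tx, ty):
        self.tlm = multiply((1.0, 0.0, 0.0, 1.0, tx, ty), self.tlm)
        self.tm = self.tlm

    def origin(self):
        """Text-space origin where the next string would start."""
        return (self.tm[4], self.tm[5])


state = TextState()
state.apply("BT")
state.apply("Tm", (1, 0, 0, 1, 72, 700))
state.apply("TD", (0, -14))
state.apply("T*")
print(state.origin())
```

In a scanner callback for Tj or TJ you would read state.origin() before handling the string, which gives the start of that run of text in text space.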
Using the GetHOCRText(0) method in Tesseract I am able to retrieve the text as HTML, and when I present the HTML in a web view I get the text, but the position of the text in the image is different from the output. Any ideas would be very helpful.
tesseract->SetInputName("word");
tesseract->SetOutputName("xyz");
tesseract->Recognize(NULL);
char *utf8Text=tesseract->GetHOCRText(0);
If you have the hOCR output, you should have a tag for each word. These tags should have class="ocrx_word" and title="bbox x1 y1 x2 y2", where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner of the bounding box around the word. I don't think it's possible to automatically use this information to format a text document, since that would require translating pixel differences into a number of tabs or spaces. But you should be able to render the text at the given location.
The GetBoxText() method returns the position of each character. The result is a UTF-8 string in box-file format, one character per line, rather than an array:
char *boxtext = _tesseract->GetBoxText(0);
NSString* aBoxText = [NSString stringWithUTF8String:boxtext];
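Once you have that string, it still has to be split into per-character boxes. The Python sketch below assumes the usual box-file layout of "symbol left bottom right top page" on each line, with the origin at the bottom-left of the image; both points are my understanding of the format rather than something stated above. The function name parse_box_text and the sample string are hypothetical.

```python
def parse_box_text(box_text, image_height):
    """Parse box-file text into (symbol, x1, y1, x2, y2) tuples.

    Assumes each line is "symbol left bottom right top page" with a
    bottom-left origin, and flips y so the result uses a top-left origin.
    """
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so a multi-character symbol stays in one piece.
        parts = line.strip().rsplit(None, 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        symbol = parts[0]
        left, bottom, right, top = (int(v) for v in parts[1:5])
        boxes.append((symbol, left, image_height - top, right, image_height - bottom))
    return boxes


# Hypothetical output for a 100-pixel-high image containing "Hi".
sample = "H 10 80 20 95 0\ni 22 80 26 96 0\n"
for box in parse_box_text(sample, image_height=100):
    print(box)
```

The y flip matters if you draw the boxes with a toolkit whose origin is at the top-left; skip it if your drawing coordinates already start at the bottom-left.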