itextsharp - extracts text backwards - itext

Can't for the life of me figure out why when I extract text using iTextSharp, some of the text comes in backwards.
using (iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(@"C:\Temp\pdftest\sample.pdf"))
{
string sText = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy());
}
*The reason I'm using a LocationTextExtractionStrategy is that I will be using coordinates to pull the text from this position. I've just included a crop of the full PDF for my example. If I use a SimpleTextExtractionStrategy, the "B uy 5 egt 5" and " eerf" don't show up.
Output (from sample code):
B uy 5 egt 5
eerf
4x6 PRINTS Download free
CVS Mobile App.
Promo code O H m OBILe PICS
sed items available in all stores We reserve the right to
There's definitely something weird going on with "eerf". In the PDF, the cursor goes horizontal when you try to select it (the big red FREE).
If I use Acrobat Professional, Advanced -> PDF Optimizer, select Transparency, then save the file, the text is extracted correctly and the red "FREE" is selectable.
So two questions, how can I emulate the PDF Optimizer in iTextSharp?
Or, how can iTextSharp read this text correctly?
As you can see this is my first post so don't beat me up too bad.
Additional Test:
I even extended the LocationTextExtractionStrategy and RegionTextRenderFilter so I could return the coordinates of each TextChunk. The weird thing about the big red "FREE" is that the F's start and end points were exactly the same. The same goes for the R and the two E's. I would have expected the end point to equal the start point plus the width of the text.

Related

Trouble with font and text extrusion

So, I'm doing some odd things with text extrusion and (perhaps not surprisingly) having some odd issues in OpenSCAD.
This is part of a much larger project, but I've been able to simplify the problem down to the following snippet of code.
use<RingbearerMedium-51mgZ.ttf>
Text = "b";
Font = "Ringbearer:style=Medium";
segment_count = 2;
segment_width = 2;
text_height = 5;
text_thickness = 1;
// Iterate over each "segment" of text
for (segment_number = [0: segment_count - 1])
{
// Calculate the x offset of the current "segment" of text
segment_x_offset = segment_number * segment_width;
// Extrude the "segment" of text to the requested thickness
linear_extrude(text_thickness)
// Grab the current "segment" of text
intersection()
{
text(Text, font=Font, size=text_height);
translate([segment_x_offset, 0])
square([segment_width, text_height]);
}
}
All this does is generate a line of text (just "b", in this case), cuts it into "segments", then extrudes each segment in-place. It's not much use in this example, but in the larger one, I'm translating and rotating each segment.
OpenSCAD's F5 preview renders fine. Here's a screenshot:
F5 Preview
However, the F6 render always drops the left side of the letter "b" and displays the error "ERROR: The given mesh is not closed! Unable to convert to CGAL_Nef_Polyhedron". Here's a screenshot:
F6 Preview
This affects other letters as well, but only seems to be a problem with the "Ringbearer" font I'm trying to use. I don't know if I can upload the font, but it's available for download for free from here: Link to Ringbearer Font. Extruding and rendering the font without breaking it into segments works just fine. It's only when I try to segment this particular font that OpenSCAD fails.
Now, the obvious answer is to use a font that works, which is fine, as far as it goes, but I'm genuinely curious why this is happening. Is it an error with the font or is this a limitation of OpenSCAD? Is this a known issue that I'm just not aware of?
I appreciate any insight I can get.
Within OpenSCAD, go to Help -> Font List.
I think the font you are using isn't supported.
That said, I had a similar problem with a supported font. I changed the font a few times until I found one that worked well for me and looked nice too.
It also depends on the font size. Sometimes the same font fails at a smaller size and works at a larger one.
In any case, you can see this in the slicer before you start printing. Take a close look at the sliced layers of the text; it will save you some printing time and frustration.

Text region extraction by finding co-ordinates of text from an image

I am developing image processing software that extracts/crops and enhances a single-page form from an image taken with a cellphone camera. The form has no rectangular boundary that would simplify extraction. It is black text on a white background, but nothing apart from that is fixed. Some text will be present that verifies the image is of the required form. So my questions are these:
1) Can I search for a specific regular expression using the Leptonica library itself, or do I have to shift focus to other libraries such as the Tesseract API to do this? So far I have not found anything of this sort.
2) Suppose I know the text at the top left corner and the bottom right corner and I find it successfully. Can I get the coordinates of the text I am searching for and then crop the image accordingly?
Leptonica doesn't do anything with text, it's an image processing library.
To get the position of the text, add tessedit_create_hocr 1 to your Tesseract config file (or set this option in whichever way you configure Tesseract if you're using it as a library).
The result is no longer a text file, but a UTF-8-encoded HTML file (note: it's not valid XML). Its format is self-explanatory. It will contain the positions and dimensions of all words on all pages in pixels, as found in the input image. You need to parse that HTML, find the words you're looking for, and then get the bounding boxes of those words.
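The parsing step described above can be sketched in Python with only the standard library. This is a minimal sketch, not a full hOCR reader: it assumes each word is a span with class ocrx_word whose title attribute starts with "bbox x1 y1 x2 y2", and the SAMPLE_HOCR string, the class name and both helper functions are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical hOCR fragment, shaped like Tesseract's output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
 <span class='ocr_line' title='bbox 30 40 400 70'>
  <span class='ocrx_word' id='word_1' title='bbox 30 40 120 70; x_wconf 93'>Form</span>
  <span class='ocrx_word' id='word_2' title='bbox 130 40 260 70; x_wconf 90'>Number</span>
 </span>
</div>
"""


class HocrWordParser(HTMLParser):
    """Collect (text, (x1, y1, x2, y2)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Closing the word span finishes the current word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def find_word_boxes(hocr, target):
    """Return the bounding boxes of every word equal to `target`."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return [bbox for text, bbox in parser.words if text == target]


def crop_box(top_left_bbox, bottom_right_bbox):
    """Crop rectangle spanning from one word's top-left to another's bottom-right."""
    return (top_left_bbox[0], top_left_bbox[1], bottom_right_bbox[2], bottom_right_bbox[3])
```

For the cropping question, crop_box(find_word_boxes(hocr, "Form")[0], find_word_boxes(hocr, "Number")[0]) gives a rectangle in image pixels that can be passed to whatever cropping routine you use. Matching a regular expression instead of an exact word only requires changing the comparison in find_word_boxes.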

Save Matlab Simulink Model as PDF with tight bounding box

Given a Simulink block diagram (model), I would like to produce a 'screenshot' to be used later in a LaTeX document. I want this screenshot to be a PDF (vector graphic, for pdflatex) with a tight bounding box, by which I mean no unnecessary white space around the diagram.
I have searched the net, searched stackexchange, searched the matlab doc. But no success so far. Some notes:
For figures, there are solutions to this question. I have a Simulink block diagram, it's different (see below).
I am aware of solutions using additional software like pdfcrop.
PDF seems to be the only driver that really produces vector graphics (R2013b on Win7 here). The EPS and PS output seems to have bitmaps inside. You zoom, you see it.
What I have tried:
1.
The default behaviour of print
modelName = 'vdp'; % example system
load_system(modelName); % load in background
% print to file as pdf and as jpeg
print(['-s',modelName],'-dpdf','pdfOutput1')
print(['-s',modelName],'-djpeg','jpegOutput1')
The JPEG looks good, with a tight bounding box. The PDF is centered on a page that looks like A4 or US Letter. Not what I want.
2.
There are several parameters for printing block diagrams. See the Simulink reference page http://www.mathworks.com/help/simulink/slref/model-parameters.html. Let's extract some:
modelName = 'vdp'; % example system
load_system(modelName); % load in background
PaperPositionMode = get_param(modelName,'PaperPositionMode');
PaperUnits = get_param(modelName,'PaperUnits');
PaperPosition = get_param(modelName,'PaperPosition');
PaperSize = get_param(modelName,'PaperSize');
According to the documentation, PaperPosition contains a four element vector [left, bottom, width, height]. The last two elements specify the bounding box, the first two specify the distance of the lower left corner of the bounding box from the lower left corner of the paper.
Now when I print the PDF output and measure using a ruler, I find the values of both the bounding box and the position of its lower left corner are totally wrong (Yes, I have measured in PaperUnits). That's a real bummer. I could have calculated the margins to trim off the paper to be used later in \includegraphics[clip=true,trim=...]{pdfpage}.
3.
Of course, what I initially wanted is a PDF that is already cropped. There is a solution for figures, and it goes like this: you move the bounding box to the lower left corner of the paper and then change the paper size to the size of the bounding box.
oldPaperPosition = get_param(modelName,'PaperPosition');
set_param(modelName,'PaperPositionMode','manual');
set_param(modelName,'PaperPosition',[0 0 oldPaperPosition(3:4)]);
set_param(modelName,'PaperSize',oldPaperPosition(3:4));
For Simulink models, there are two problems with this. PaperSize is a read-only parameter for models, and changing PaperPosition has no effect at all on the output.
I'm running out of ideas, really.
EDIT ----------------------------------
All right, to keep you updated: I talked to MATLAB support about this.
In R2013b, there are bugs that cause PaperPositionMode to misbehave and the bounding box from PaperPosition to be wrong.
There is no known way to extract the scale factor from print.
They suggested going this way: Simulink --(print)--> SVG --(Inkscape)--> PDF. It works really well. The (correct) bounding box is an attribute of the svg node, and the scale factor when exporting to SVG is always the same. Furthermore, Inkscape produces an already cropped PDF. So this approach solves all my problems; the only requirement is Inkscape.
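The Inkscape step of that pipeline can be scripted. The sketch below is in Python rather than MATLAB and only builds and runs the command line. It assumes Inkscape 1.0 or later is on the PATH, where --export-type, --export-filename and --export-area-drawing are available; older 0.9x releases used --export-pdf=FILE instead. Passing --export-area-drawing is my own assumption for forcing a tight crop, and the function names and file names are hypothetical.

```python
import subprocess


def inkscape_pdf_command(svg_path, pdf_path, inkscape="inkscape"):
    """Build an Inkscape 1.x command line that converts an SVG to a PDF.

    --export-area-drawing asks Inkscape to use the drawing's bounding box
    as the export area instead of the page.
    """
    return [
        inkscape,
        svg_path,
        "--export-area-drawing",
        "--export-type=pdf",
        "--export-filename=" + pdf_path,
    ]


def svg_to_cropped_pdf(svg_path, pdf_path):
    """Run Inkscape; raises CalledProcessError if the conversion fails."""
    subprocess.run(inkscape_pdf_command(svg_path, pdf_path), check=True)
```

Calling svg_to_cropped_pdf("vdp.svg", "vdp.pdf") after printing the model to SVG would produce the cropped PDF, provided the assumptions above hold for your Inkscape version.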
You can try export_fig to export your figures. WYSIWYG! This function is especially suited to exporting figures for use in publications and presentations, because of the high quality and portability of media produced.
Why don't you want to use pdfcrop?
My code works perfectly, and everything stays inside MATLAB:
function prints(name)
%PRINTS Print the current Simulink model and save it as EPS and PDF
print('-s', '-depsc','-tiff', name)
print('-s', '-dpdf','-tiff', name)
dos(['pdfcrop ' name '.pdf ' name '.pdf &']);
end
You just have to invoke pdfcrop using the "dos" command, and it works fine!
In R2021a you have exportgraphics.
It produces beautiful PDF images.
figure(3);
plot(Time.Data,wSOHO_KpKi.Data,'-',Time.Data,Demanded_Speed.Data,'--');
grid;
xlh = xlabel('$\mathrm{t\left [ s \right ]}$','interpreter','latex',"FontSize",15);
ylh = ylabel('$\mathrm{\omega _{m}\left [ rads/s \right ]}$','interpreter','latex',"FontSize",15);
xlh.Position(2) = xlh.Position(2) - abs(xlh.Position(2) * 0.05);
ylh.Position(1) = ylh.Position(1) - abs(ylh.Position(1) * 0.01);
exportgraphics(figure(3),'Grafico de Escalon Inicial velocidad estimada por algoritmo SOHO-KpKi.pdf');

Drawing text using PdfTextArray in iTextSharp - how?

I am drawing text in a PDF page using iTextSharp, and I have two requirements:
1) the text needs to be searchable by Adobe Reader and such
2) I need character-level control over where the text is drawn.
I can draw the text word-by-word using PdfContentByte.ShowText(), but I don't have control over where each character is drawn.
I can draw the text character-by-character using PdfContentByte.ShowText() but then it isn't searchable.
I'm now trying to create a PdfTextArray, which would seem to satisfy both of my requirements, but I'm having trouble calculating the correct offsets.
So my first question is: do you agree that PdfTextArray is what I need to do, in order to satisfy both of my original requirements?
If so, I have the PdfTextArray working correctly (in that it's outputting text) but I can't figure out how to accurately calculate the positioning offset that needs to get put between each pair of characters (right now I'm just using the fixed value -200 just to prove that the function works).
I believe the positioning offset is the distance from the right edge of the previous character to the left edge of the new character, expressed in "thousandths of a unit of text space". That leaves me two problems:
1) How wide is the previous character (in points), as drawn in the specified font & height? (I know where its left edge is, since I drew it there)
2) How do I convert from points to "units of text space"?
I'm not doing any fancy scaling or rotating, so my transformation matrices should all be identity matrices, which should simplify the calculations ...
Thanks,
Chris
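The two conversions asked about above can be worked through with a little arithmetic. This is a sketch in Python rather than C#, under the stated assumptions: identity matrices, 100% horizontal scaling, and zero character and word spacing. In that case the PDF TJ operator moves the pen by (w - Tj / 1000) * fontSize points after a glyph, where w is the glyph width in text space units (glyph width in font units divided by 1000). The function names are hypothetical; in iTextSharp the glyph width in 1/1000 units should be obtainable from BaseFont.GetWidth, but treat that as something to verify against your version.

```python
def glyph_advance_pt(glyph_width_units, font_size):
    """Natural width of a glyph in points.

    glyph_width_units is the font's width for the glyph in 1/1000 units.
    """
    return glyph_width_units / 1000.0 * font_size


def tj_adjustment(glyph_width_units, font_size, desired_advance_pt):
    """Number to place after a glyph in a TJ array.

    desired_advance_pt is the distance in points from the left edge of this
    glyph to the left edge of the next one. TJ numbers are subtracted from
    the pen position, so a negative result widens the gap and a positive
    result tightens it.
    """
    return glyph_width_units - desired_advance_pt * 1000.0 / font_size
```

For example, a glyph that is 500 units wide at 12 pt is 6 pt wide. Placing the next glyph 8.4 pt to the right of its left edge needs an adjustment of about -200, which happens to be the fixed value mentioned in the question. Converting points to thousandths of text space is therefore just points * 1000 / fontSize under these assumptions.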

get the exact position of text from image in tesseract

Using the GetHOCRText(0) method in Tesseract, I'm able to retrieve the text as HTML. When I present that HTML in a web view I get the text, but its position differs from the position of the text in the image. Any ideas would be very helpful.
tesseract->SetInputName("word");
tesseract->SetOutputName("xyz");
tesseract->Recognize(NULL);
char *utf8Text=tesseract->GetHOCRText(0);
and output image
If you have the hOCR output, you should have a tag for each word. These tags should have class="ocrx_word" and title="bbox x1 y1 x2 y2", where (x1, y1) and (x2, y2) are the top left and bottom right corners of the bounding box around the word. I don't think it's possible to automatically use this information to format a text document, since that would require translating pixel differences into a number of tabs/spaces. But you should be able to render text at the given location.
The GetBoxText() method returns the exact position of each character, one character per line, as a string.
char *boxtext = _tesseract->GetBoxText(0);
NSString* aBoxText = [NSString stringWithUTF8String:boxtext];
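The string returned by GetBoxText() still has to be split into per-character boxes. The sketch below is in Python rather than Objective-C and assumes the usual Tesseract box-file layout, one line per character in the form "symbol left bottom right top page", with the origin at the bottom-left corner of the image. Since hOCR and most drawing APIs measure y from the top, the sketch flips the y axis. The function name and the sample data are hypothetical.

```python
def parse_box_text(box_text, image_height):
    """Parse Tesseract box text into (symbol, x1, y1, x2, y2) tuples.

    Input coordinates have their origin at the bottom-left of the image.
    The returned boxes use a top-left origin, so y1 is the top edge and
    y2 is the bottom edge.
    """
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so the symbol itself is kept intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        symbol, left, bottom, right, top, _page = parts
        left, bottom, right, top = int(left), int(bottom), int(right), int(top)
        boxes.append((symbol, left, image_height - top, right, image_height - bottom))
    return boxes
```

With a 100-pixel-tall image, the line "H 10 80 20 95 0" becomes ("H", 10, 5, 20, 20), that is, a box whose top edge is 5 pixels below the top of the image. A mismatch between these two coordinate conventions is one possible reason for text appearing at a different position than in the image.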