Character extraction - dots are recognized separately from the character - matlab

I'm doing character recognition for a regional language. While extracting characters from the image, the dots are being identified as separate characters instead of staying attached to the character they belong to.
%% Plot Bounding Box
% Draw a green rectangle around every labelled region.
for n = 1:size(propied,1)
    rectangle('Position',propied(n).BoundingBox,'EdgeColor','g','LineWidth',2)
end
hold off
%% Characters being Extracted
figure
for n = 1:Ne
    % Crop the bounding rectangle of the n-th labelled component.
    [r,c] = find(L==n);
    n1 = imagen(min(r):max(r),min(c):max(c));
    imshow(~n1);
end
Original code: http://www.mathworks.com/matlabcentral/fileexchange/22922-image-segmentation-extraction

Since you are doing character/text recognition, you are more likely to want collections of words or lines of text, and not individual characters. And if you really want the latter, it is more robust to extract the characters after you have identified the individual words.
So, the simplest approach here is using the standard morphological opening operator (assuming the text is black, otherwise use closing). Start with a large horizontal structuring element (SE). Applying an opening with this SE will divide your image into lines of text. In each line you use a shorter horizontal SE to obtain the individual words. Then for each word you apply an opening with a vertical SE so that it joins accents and other typographical details to their letters.
For example, here is an input image, its opening with a horizontal SE of radius 35, the opening with a horizontal SE of radius 7, and an opening with a vertical SE of radius 7.
I didn't apply the third operation in isolated components, but you should do so to not risk joining two lines of text. And this is all assuming straight horizontal lines of text, of course. Drawing the bounding boxes on this final image gives the result you are after:
Note that some letters ("ty" and "ny") were connected in the beginning, so they appear as a single letter in this output. This is a separate problem to be handled, which might or might not be a concern for you.
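The effect of the vertical step can be sketched on a toy binary image. This is only an illustration in plain Python rather than MATLAB, and it stores text pixels as 1, so the joining operation is written as a closing (dilation followed by erosion); that corresponds to the opening described above for black text on a white background. The image and the helpers `close_vertical` and `count_components` are made up for this example.

```python
def _vertical_filter(img, radius, combine):
    """Apply `combine` (any = dilation, all = erosion) over a vertical window."""
    h, w = len(img), len(img[0])
    return [[int(combine(img[rr][c]
                         for rr in range(max(0, r - radius), min(h, r + radius + 1))))
             for c in range(w)] for r in range(h)]

def close_vertical(img, radius):
    """Binary closing with a vertical line SE of the given radius."""
    w = len(img[0])
    # Pad with blank rows so the image border does not distort the result.
    blank = [[0] * w for _ in range(radius)]
    padded = [row[:] for row in blank] + [row[:] for row in img] + [row[:] for row in blank]
    dilated = _vertical_filter(padded, radius, any)
    eroded = _vertical_filter(dilated, radius, all)
    return eroded[radius:len(eroded) - radius]

def count_components(img):
    """Count 8-connected foreground components with a flood fill."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
    return count

# Toy image: an "i" (dot + stem) in column 1 and an "l" in column 4.
image = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
]
closed = close_vertical(image, 1)
before = count_components(image)   # dot, stem and "l" are separate
after = count_components(closed)   # dot is now merged with its stem
```

Because the SE is purely vertical, the dot is bridged to the stem below it while the neighbouring letter stays a separate component. The bounding boxes would then be taken from the closed image, while the pixels are still cropped from the original one.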

Related

How to add blank space in parameter dialogs

I would like to add vertical blank space between two lines in a Modelica component GUI. I have (in this case), six similar looking parameters/comments within a group on a tab that I would like to visually break into two plus four by using a little space after the second one (essentially a blank line, but ideally I would have some control over the gap space).
I can put the last four in a separate group with no name, which gives a line in between the two/four. This is better than nothing, but not really what I want. I tried using groupImage, but it locates the image relative to whole group, so if it's smaller than the six entries, it doesn't affect them at all; and if it's larger, it introduces uniform spacing between all the entries, rather than just between 2/3.

precise scaling of matlab textboxes with axes magnification

I would like to have a text box rescale with the level of magnification, such that one unit of text is always assigned one unit of horizontal axis-length. The text width should not change but rather the spacing between characters.
For instance, if the x-axis displayed [0:50], fifty characters should be displayed, one at each integer position. If the magnification was increased such that the display comprised only [0:10], only ten characters would be displayed, again placing one character at each integer position along the horizontal axis.
Finally, the text would ideally not display when the magnification level was below some threshold determined by the number of characters that can be legibly printed along a horizontal line spanning the extent of the axes.
I have tried using the text object, but it doesn't seem to have the relevant properties to allow such dynamic behavior. I have instead considered breaking the N-length string into N unit-length strings and placing each at a defined x-position, but I'm having trouble figuring out how to display only those relevant at the prevailing zoom level (there is some spill-over of characters beyond the bounds of the axis). In contrast, with this approach, all the characters appear as a jumble at zoom levels so low that the number of characters printed cannot be reasonably accommodated.
Thus, I inquire whether another solution besides printing a series of unit-length strings might be advised and, if not, how the twin problems of text spill-over and text overlap can be resolved at high and low zoom, respectively (the first might be done by somehow preventing printing of information outside the axes; the second seems to require some dynamic magnification-aware means of suppressing text output at or above a certain x-axis extent).
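The selection logic described above (show only the characters inside the visible range, and nothing at all when too many would have to fit) can be sketched independently of the plotting code. This is a rough illustration in plain Python, not MATLAB; it assumes character i sits at x = i (0-based), and `visible_characters` and `max_legible` are hypothetical names.

```python
import math

def visible_characters(text, x_min, x_max, max_legible):
    """Return (x, char) pairs to draw for the current axis limits.

    Assumes character i of `text` is anchored at x = i.
    Returns an empty list when more than `max_legible` characters
    would fall inside [x_min, x_max], i.e. the view is zoomed out too far.
    """
    first = max(0, math.ceil(x_min))
    last = min(len(text) - 1, math.floor(x_max))
    if last < first:
        return []                       # no character lies in view
    if last - first + 1 > max_legible:
        return []                       # too crowded to be legible
    return [(i, text[i]) for i in range(first, last + 1)]

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
zoomed_in = visible_characters(alphabet, 2.5, 6.2, 10)
zoomed_out = visible_characters(alphabet, 0, 25, 10)
```

Recomputing this list whenever the axis limits change would address both the spill-over (characters outside the limits are never drawn) and the overlap (nothing is drawn past the legibility threshold).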

Drawing text using PdfTextArray in iTextSharp - how?

I am drawing text in a PDF page using iTextSharp, and I have two requirements:
1) the text needs to be searchable by Adobe Reader and such
2) I need character-level control over where the text is drawn.
I can draw the text word-by-word using PdfContentByte.ShowText(), but I don't have control over where each character is drawn.
I can draw the text character-by-character using PdfContentByte.ShowText() but then it isn't searchable.
I'm now trying to create a PdfTextArray, which would seem to satisfy both of my requirements, but I'm having trouble calculating the correct offsets.
So my first question is: do you agree that PdfTextArray is what I need to do, in order to satisfy both of my original requirements?
If so, I have the PdfTextArray working correctly (in that it's outputting text) but I can't figure out how to accurately calculate the positioning offset that needs to get put between each pair of characters (right now I'm just using the fixed value -200 just to prove that the function works).
I believe the positioning offset is the distance from the right edge of the previous character to the left edge of the new character, expressed in "thousandths of a unit of text space". That leaves me two problems:
1) How wide is the previous character (in points), as drawn in the specified font & height? (I know where its left edge is, since I drew it there)
2) How do I convert from points to "units of text space"?
I'm not doing any fancy scaling or rotating, so my transformation matrices should all be identity matrices, which should simplify the calculations ...
Thanks,
Chris
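The arithmetic behind the two sub-questions can be sketched as follows. In the PDF text model, glyph widths and the numbers in a text array are both expressed in thousandths of a text-space unit, the numbers are subtracted from the current position, and both are scaled by the font size. The sketch below is plain Python rather than C#, assumes identity matrices, no character or word spacing and 100% horizontal scaling, and uses made-up glyph widths; `tj_adjustments` is a hypothetical helper, not an iTextSharp call.

```python
def tj_adjustments(x_positions, glyph_widths, font_size):
    """Compute the numbers to place between glyphs in a text array.

    x_positions  : desired left edge of each glyph, in points
    glyph_widths : advance width of each glyph, in 1/1000 text-space units
    font_size    : font size in points
    """
    adjustments = []
    for i in range(len(x_positions) - 1):
        # Where the next glyph would start with no adjustment at all.
        natural_advance = glyph_widths[i] * font_size / 1000.0
        # Where we actually want it to start.
        desired_advance = x_positions[i + 1] - x_positions[i]
        extra = desired_advance - natural_advance
        # Positive numbers move the next glyph left, hence the sign flip.
        adjustments.append(-extra * 1000.0 / font_size)
    return adjustments

# Hypothetical data: three glyphs, each 500/1000 wide, drawn at 10 pt.
# The second glyph starts exactly where the first ends; the third is
# pushed 2 pt further to the right than its natural position.
adjustments = tj_adjustments([0.0, 5.0, 12.0], [500, 500, 500], 10.0)
```

With these numbers the second adjustment comes out as -200, so a fixed -200 at 10 pt amounts to 2 pt of extra space after each glyph. The per-glyph widths would have to come from the font metrics, which the library is expected to expose in the same 1/1000 units.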

Line detection using PIL

Given an image consisting of black lines (a few pixels wide) on a white background, what is a good way to find coordinates along the lines, say for every 10th pixel or so? I am considering using PIL for the task, but other Python or Java-based libraries would also be OK.
Ideally the coordinates would point to the middle of the line, but as the lines are narrow, it's enough that they point somewhere inside the line.
A very short line or a point should be identified with at least one coordinate.
Usually, Hough transformation is used to find lines. It gives you the parameters describing the line (which can be transformed easily between different representations), and you can sample this function to get your sample points. See http://en.wikipedia.org/wiki/Hough_transform and https://stackoverflow.com/questions/tagged/hough-transform+python
I only found this http://coding-experiments.blogspot.co.at/2011/05/ellipse-detection-in-image-by-using.html implementation in python, which actually searches for ellipses.
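A minimal sketch of the voting idea behind the Hough transform, in plain Python, assuming the dark pixels have already been collected as (x, y) tuples (for example by thresholding the image). The function names and the one-degree angle step are choices made for this illustration and are not part of PIL or any other library.

```python
import math
from collections import defaultdict

def hough_strongest_line(points, theta_step_deg=1):
    """Return (rho, theta_deg, members) for the line with the most votes.

    Each point votes for every (rho, theta) pair satisfying
    rho = x*cos(theta) + y*sin(theta), with rho rounded to whole pixels.
    """
    votes = defaultdict(list)
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            t = math.radians(theta_deg)
            rho = round(x * math.cos(t) + y * math.sin(t))
            votes[(rho, theta_deg)].append((x, y))
    (rho, theta_deg), members = max(votes.items(), key=lambda kv: len(kv[1]))
    return rho, theta_deg, members

def sample_every(members, step=10):
    """Pick every `step`-th pixel among those that voted for the line."""
    return sorted(members)[::step]

# Synthetic input: a horizontal line y = 5, 100 pixels long.
pixels = [(x, 5) for x in range(100)]
rho, theta_deg, members = hough_strongest_line(pixels)
samples = sample_every(members, 10)
```

Sampling the pixels that voted for the winning bin, rather than the fitted line equation, keeps every returned coordinate on an actual dark pixel. A real image with several lines would need more than the single maximum, and very short lines or isolated points collect few votes, so they may be easier to handle with a connected-component pass.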

How can I detect any unicode characters which have descenders, using .NET

I am trying to minimize the vertical distance between controls on a programmatically constructed Windows Form (using C#). This involves setting the Height property appropriately.
I have found that if the text of the control does not contain any letters with descenders in them (i.e. does not have any of the characters j, g, p, q or y) then the control Height can be smaller than when it does contain such letters (if it does contain letters with descenders then the descenders are chopped off if the Height isn't enough).
It will work fine to test for any of the above 5 characters as long as the language is English, or English-like, but I need to be able to cater for (just about) any language.
Is there a way, given some arbitrary Unicode character (and perhaps a font) to determine if that Unicode character has a descender or not?
There is no property defined for Unicode characters to indicate the presence of a descender, and it’s really a feature of glyph design rather than of characters. For example, “Q” has a descender in many fonts, and “J” has one in some. Besides, given the context, you should also consider diacritic marks placed below a letter, not just descenders of base letters. And probably diacritics above letters, too.
So you would need to read the font information (when available) about character dimensions, or tentatively draw characters in your software and measure their dimensions.
As a rule of thumb, any line height below 1.1 times the font size will cause problems with some characters and fonts. Using 1 (“setting solid”) is not enough, because characters may in fact extend outside the font size.
In Windows, you can call GetPath() to get an array containing the X/Y coordinates of every point making up the perimeter or outline of the string of glyphs. Search the array for the minimum and maximum values, which gives you the rectangle that exactly encloses the string, right up to the edges of the letters.
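The min/max search over the outline points can be sketched as below. This is plain Python rather than C#, and it assumes the outline points have already been obtained, that y grows downward (as in the default Windows text mapping mode), and that the baseline position is known from the font metrics. The point lists and the `tolerance` parameter are made up for illustration; a small tolerance can be useful because round letters often overshoot the baseline slightly.

```python
def bounding_box(points):
    """Return (left, top, right, bottom) enclosing all outline points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def has_descender(points, baseline_y, tolerance=0.0):
    """True if the outline extends below the baseline (y grows downward)."""
    _, _, _, bottom = bounding_box(points)
    return bottom > baseline_y + tolerance

# Hypothetical outlines with the baseline at y = 20.
outline_a = [(0, 10), (6, 10), (6, 20), (0, 20)]   # sits on the baseline
outline_p = [(0, 10), (6, 10), (6, 20), (0, 24)]   # reaches 4 units below it
box_p = bounding_box(outline_p)
```

Measuring the actual outline this way also picks up diacritics placed below a letter, since they are part of the same set of points.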