Tesseract - difference in recognized text in whole page and words

Tesseract - difference in recognized text in whole page and words - tesseract

After recognition, I take results with two different ways:
page.GetText();
and
page.GetMeanConfidence();
I expect recognized text to be the same with both ways, but page-level text has much more recognition errors.
Why this happens and what is the right way to obtain page full text?

Related

ICU Message Format and Interaction / Links

In my application I use the ICU message format to localize user-visible strings for different languages. A relevant (contrived) example would be:
(EN) Click on this link to find out more.
(DE) Dieser Link führt zu weiteren Informationen.
The issue I ran into is interactivity and styling for this sentence. I want the bold fragments to be clickable links in the sentence and style them differently.
The go-to solution for such cases would be to separate the bold words and make them a separate ICU message and handle this message separately in the app (to apply interactivity and styling), perhaps like a button. The problem is how this should be implemented in the context of a sentence, since the language prescribes different number and order of sentence-fragments:
(EN) (Click on) (this link) (to find out more)
(DE) (Dieser Link) (führt zu weiteren Informationen)
I could of course feed the string of the link as a parameter into an ICU message containing the surrounding sentence to obtain the sentence as one string, but then I can't apply the link-action or styling in-app, since I end up with one sentence.
What is a solution to this problem? For context: The app is written in flutter targeting mobile.

VSCode custom font ligatures, display two symbols as one?

I would like to display list.emptyqm() as list.empty?() in function names for specific language. So, two symbols qm if they are at the end of the function name should be displayed as ? (possibly some unicode symbol looking similar to question mark).
Is that possible in VSCode?
The VSCode already knows that piece of text is string, or function-name/keyword/variable-name (as it highlights it properly), so the ligature should be displayed only if qm are the last
characters of function-name/keyword/variable-name. It shouldn't be displayed in the middle of the function name, like aqma() shouldn't be displayed as a?a().

You seem to misunderstand what a ligature is. A ligature describes how two individual letters can be combined to form a visual pleasing appearance. A ligature never changes the syntax of a text. Hence, converting qm to ? is a completely different thing.
Replacing text in vscode is of course possible, for instance as part of the format command. You can register your own formatter and determine the text edit actions that you want to be applied, including the transformation of these character sequences.

Is there a "n/a" symbol in unicode?

Is there an unicode symbol for "n/a"? There are some fractions like ½, but a n/a symbol seems to be missing.
If there is none, what would be the most appropriate unicode symbol to use for n/a in a website (which should be contained in common fonts, to avoid needing a webfont)?

Looking at the Unicode code charts, I do not see a single N/A symbol. I do, however, see ⁿ (U+207F) and ₐ (U+2090), which you could separate with / (U+002F) eg: ⁿ/ₐ, or ̷ (U+0337), eg: ⁿ̷ₐ, or ̸ (U+0338), eg: ⁿ̸ₐ. Probably not what you are hoping for, though. And I don't know if "common" fonts implement them, either.

For future reference, the fastest way I know to answer questions like the OP's when I have them myself is to go to unicodelookup.com, because of the way it works: there's a search bar at the top, and you just type a string and it will return any and all unicode characters containing that string (this is also a great way to discover new and useful symbols). So in the OP's case, he could proceed like this:
first try entering "not" (without the quotes) in the search field
visually scan through the results... doing so would not reveal a "not
applicable" character in this case
try again but this time entering "applic" in the search field
again, doing so would not turn up anything along the lines of what he's
looking for
At that point he would be reasonably confident the current Unicode standard does not have a "n/a" symbol.
If you use Firefox you can define a keyword like "uni" to search that site from the URL bar, meaning any time the browser is open and regardless of what page or site is currently showing, you could do this:
hit [F6]... this moves the cursor to the URL bar at the top
type something like "uni applic" and hit [Enter]... this brings up the
unicodelookup.com website with the search results for "applic" already
showing
For the above to work you would need to define your keyword ("uni" or wtv you prefer) to point to location http://unicodelookup.com/#%s.

There's a Negative Acknowlege icon...
␕ symbol for negative acknowledge 022025 9237 0x2415 ␕
Found by searching negative on the Unicode Lookup site.
I'm not a fan, and for my purposes have just gone with __N/A__ (Markdown..)

I see lots of answers going head-on at the "Not Applicable" abbreviation, without exploring what a symbol is. A quick search for the equivalent phrase "out of scope" brings up a couple of variations on the No symbol: ⃠ – this seems to fit the bill (and since I was looking for a way to represent inapplicability, I'll be using it in my technical document).
Per the Wikipedia article, the Unicode codepoint U+20E0 is a combining character, so it is superimposed on the preceding character; e.g. ! ⃠ overlays an exclamation point. To get it to appear isolated, use a non-breaking space
If you don't want to bother with the combining symbol, the article mentions there's also an emoji U+1F6AB 🚫 but it's typically going to be colored red, or won't render!

There's actually a single character that could be repurposed for this: the "Square Na" character ㎁ (U+3381), which is used to represent the nanoampere in fullwidth (CJK) scripts.

What about the "SYMBOL FOR NULL" ␀ (U+2400)?

Where to get a reference image for any unicode code point?

I am looking for an online service (or collection of images) that can return an image for any unicode code point.
Unicode.org does not have an image for each one, consider for example
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=31cf
EDIT: I need to use these images programmatically, so the code chart PDFs provided at unicode.org are not useful.

The images in the PDF are copyrighted, so there are legal issues around extracting them. (I am not a lawyer.) I suspect that those legal issues prevent a simple solution from being provided, unless someone wants to go to the trouble of drawing all of those images. It might happen, but seems unlikely.
Your best bet is to download a selection of fonts that collectively cover the entire range of characters, and display the characters using those fonts. There are two difficulties with this approach: combining characters and invisible characters.
The combining characters can easily be detected from the Unicode database, and you can supply a base character (such as NBSP) to use for displaying them. (There is a special code point intended for this purpose, but I can't find it at the moment.)
Invisible characters could be displayed with a dotted square box containing the abbreviation for the character. Those you may have to locate manually and construct the necessary abbreviations. I am not aware of any shortcuts for that.

Best way to represent format for presentation of cells in a grid?

I am building a dynamic reporting feature for a client. They want to create new stored procedures, and have them correspond to new reports. We are using T-SQL and each cell in a grid/report can have its own formatting and/or functionality.
I'm looking for a format specification to identify presentation, color and conditionals for data... for instance, I am thinking of something like this:
{data}|{format}
123.56|$#,##0.00
Results in ... $123.56
I am looking for standard ways to represent the formatting field, with the potential for colors and conditionals. Is there some standard out there already?

It all depends on what you're looking for. You have to ask yourself what types of formatting you wish to apply. Here's some cases you might want to consider:
In-line formatting
Do you want to have a cell that contains mixed formatting (e.g. "1234.567" shows bold, regular and italic in a single cell)?
Multi-column based output
Do you want to output a value in a cell that's based on multiple cells?
Cell1="1234"
Cell2="56"
Cell3={Cell1}.{Cell2}
---> which would output "1234.56"
If don't need either of those things, then all you want to do is provide a single format for the entire cell. Let's divide it into the two formatting elements: transformations and visual effects:
Formatting "1234.5678" into "1234.56" is a transformation. It has to be done by code that knows how to interpret the value as a number, and how to turn that number into the textual string of digit-characters.
Making a cell blue, or the text red, or bold - these are all visual transformations that are merely a set of attributes regarding the display of data in a cell. We don't care here about the type of data in the cell, since we just have to put pixels on a screen.
So, to bottom-line this: it's all about what you want to happen. If you're producing HTML reports, then HTML & CSS are very convenient methods for describing the visual-effects formatting of the cell, since you won't have to convert it twice.
As far as I know, there's only a couple of standards for encoding visual-effects display, and they are similar to SGML - TeX, HTML, PostScript, etc; they all have "tags" (sometimes with "attributes") to modify the display of the content within the tag.
Which leaves us the transformational formatting. There were two common approaches to this. The first is procedural. You list a set of transformations you wish to do on the data to turn it into text. Nowadays, we often use substitution masks, like in your example, $#,##0.00, or like in sprintf's %.2f, etc.
Again, just choose a formatting specifier that is the simplest to use in your environment. If you're coding in a language that accepts a certain format, then use it!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse