In the Tesseract wiki the format for labeled tif/box file filenames to be used in training is given as [lang].[fontname].exp[num]. Does fontname actually impact training or is this just for bookkeeping?
In my particular case, I have a large number of document images with different fonts (and I don't know which fonts are in them). Can I just use eng.idontknow.exp[num] for each document I label manually or will this mess up training for some reason? Thanks in advance!
It's best to match a real font (to help possible post-OCR analyses), but it can be an arbitrary font name.
What does FC_WEIGHT refer to? Please advise: although a text file was produced, it is large and consists largely of numbers, which makes it hard to proofread. I need relatively good confidence that the output matches the input. If there is a fix, please point me to it and bring joy to my dull, drab existence.
I entered the command:
ps2ascii /Users/dwstclair/Desktop/untitled3/stmt_20181130.pdf a.txt
The result was:
DEBUG: FC_WEIGHT didn't match
On the off chance a default font was missing on my system, I added DroidSansFallback.ttf (no joy).
Basically, I wouldn't use ps2ascii. It's long been deprecated and doesn't even ship in more recent versions of Ghostscript.
Instead, consider using the txtwrite device. It works with a wider range of input (in particular, it can use ToUnicode CMaps in PDF files, which ps2ascii cannot) and is capable of producing output in encodings other than ASCII, which is quite useful. Even if you aren't working with non-Latin languages, the ability to preserve ligatures (e.g. fi, ffi, ffl, etc.) is convenient.
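As a rough sketch, assuming a reasonably recent Ghostscript is installed and on the PATH (the output filename here is just an example):
gs -sDEVICE=txtwrite -o a.txt /Users/dwstclair/Desktop/untitled3/stmt_20181130.pdf
The -o option sets the output file and also implies -dBATCH and -dNOPAUSE, so the command runs non-interactively.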
The actual answer to your question is 'don't worry about it'.
FC_WEIGHT refers to the weight of a font (Light, Bold, Regular, ExtraBold, etc.). This message can only arise when you are using Fontconfig: Ghostscript is enumerating the available fonts from Fontconfig, trying to find a match for a missing font in the input, and a candidate font did not match the target font's weight.
Since you aren't going to use the font, it doesn't affect you.
In order to detect whether a font contains a particular character in JavaScript, I've decided that the best way is to have a fallback font in which ALL Unicode characters map to glyphs of exactly ZERO width. Such a font would let me easily check for the presence of the fallback font itself, and for the presence of any character in any other font (except for control characters), simply by measuring the width of the rendered character.
Do you know if such a font already exists?
It should be very simple to make with FontForge and scripting, but it is hard for me to get into the FontForge and Unicode docs. If someone is fluent in FontForge, could you teach me, or just make this kind of font? I assume it is, what, like 50 lines of Python script?
https://github.com/adobe-fonts/adobe-blank – answered by Mike 'Pomax' Kamermans
Very nice. Just 7 KB for the WOFF version! My own attempts to make such a font in FontForge gave about 1 MB for the 0000–1FFFF Unicode range.
The most recent version of the OpenType font format (1.8 as of late 2016) standardizes two different tables to embed PNG bitmap data: Google's CBDT (together with CBLC) and Apple's sbix. Furthermore, the SVGs in Mozilla's SVG table can also embed or reference PNGs.
Is it possible to embed the PNG chunks once and use them in at least two tables to make cross-platform emoji font files that are not bigger than necessary?
Side question: can PNG chunks be reused for multiple glyphs, e.g. indexed color palettes?
PS: I know that Apple’s operating systems override emojis with those from a font which has the PS name AppleColorEmoji.
You can't share images across tables, e.g. reuse the PNG images from the sbix table in the CBDT table. But if you use the exact same image files, they might be "deduped" in a compressed WOFF.
The weird thing is that the CBDT/CBLC spec says a glyf table shouldn't be present, while the other formats require one. So you can't put CBDT alongside an sbix or SVG table in a font. But you could combine the latter two to get relatively good support on Windows and OS X.
This is with regard to iText 2.1.6.
I have a string containing characters from different languages, for which I'd like to pick a single font (among the registered fonts) that has glyphs for all these characters. I would like to avoid a situation where different substrings in the string are printed using different fonts, if I already have one font that can display all these glyphs.
If there's no such single font, I would still like to pick a minimal set of fonts that covers the characters in my string.
I'm aware of FontSelector, but it doesn't seem to try to find a minimal set of fonts for the given text. Correct? How do I do this?
iText 2.1.6 is obsolete. Please stop using it: http://itextpdf.com/salesfaq
I see two questions in one:
Is there a font that contains all characters for all languages?
Allow me to explain why this is impossible:
There are 1,114,112 code points in Unicode. Not all of these code points are used, but the possible number of different glyphs is huge.
A simple font contains only 256 characters (1 byte per character); a composite font uses CIDs from 0 to 65,535.
65,535 is much smaller than 1,114,112, which means that it is technically impossible to have a single font that contains all possible glyphs.
Does FontSelector find a minimal set of fonts?
FontSelector doesn't look for a minimal set of fonts. You have to tell FontSelector which fonts you want to use and in which order! Suppose that you have this code:
FontSelector selector = new FontSelector();
selector.addFont(font1);
selector.addFont(font2);
selector.addFont(font3);
In this case, FontSelector will first look at font1 for each specific glyph. If it's not there, it will look at font2, and so on. Obviously font1, font2, and font3 will have glyphs for some of the same characters in common, but each draws them differently: for instance, each font has its own design for the letter 'a'. Which glyph is used depends on the order in which you added the fonts.
Bottom line:
Select a wide range of fonts that cover all the glyphs you need and add them to a FontSelector instance. Don't expect to find one single font that contains all the glyphs you need.
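To make that concrete, here is a minimal sketch of how the selector is then applied to a string. The class names assume iText 5; the font file names and the mixedScriptText variable are placeholders, and document is an already opened iText Document:
FontSelector selector = new FontSelector();
// Fonts are consulted in the order they are added, so put your preferred font first.
selector.addFont(FontFactory.getFont("NotoSans-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED));
selector.addFont(FontFactory.getFont("NotoSansCJKsc-Regular.otf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED));
// process() splits the text into chunks, each rendered with the first added font that has the glyph.
Phrase phrase = selector.process(mixedScriptText);
document.add(phrase);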
Does anybody know a set of typefaces that together covers the whole Unicode character range? We know that it is impossible to display all Unicode characters using just one or two fonts, but presumably we can find a set of fonts with which the whole Unicode range could be displayed. Does anybody have any experience?
Thank you so much in advance.
One way to find such a set of fonts is to look into Windows font linking. If you take a look at the registry key HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink, you'll see fonts that "link" together to cover the complete Unicode set.
As far as I know, Arial Unicode MS is one of the most complete.
Everson Mono covers a large portion of the Unicode characters, and SIL International makes a lot of different fonts for minority languages.