After recognition, I retrieve the results in two different ways:
page.GetText();
and
page.GetMeanConfidence();
I expect the recognized text to be the same either way, but the page-level text has many more recognition errors.
Why does this happen, and what is the right way to obtain the full text of a page?
What does FC_WEIGHT refer to? Please advise: although a text file was produced, it is large and consists largely of numbers, which makes it hard to proofread. I need reasonably good confidence that the output matches the input. If there is a fix, please point me to it and bring joy to my dull, drab existence.
I entered the command:
ps2ascii /Users/dwstclair/Desktop/untitled3/stmt_20181130.pdf a.txt
The result was:
DEBUG: FC_WEIGHT didn't match
On the off chance that a default font was missing on my system, I added DroidSansFallback.ttf (no joy).
Basically, I wouldn't use ps2ascii. It has long been deprecated and doesn't even ship with more recent versions of Ghostscript.
Instead, consider using the txtwrite device. It works with a wider range of input (in particular, it can use ToUnicode CMaps in PDF files, which ps2ascii cannot) and can produce output in encodings other than ASCII, which is quite useful. Even if you aren't working with non-Latin languages, the ability to preserve ligatures (e.g. fi, ffi, ffl) is convenient.
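As a rough sketch of what a txtwrite invocation might look like, the Python snippet below only builds the Ghostscript argument list. It assumes the executable is called gs and is on your PATH, and the helper name txtwrite_command and the output name a.txt are made up for illustration. The -sDEVICE=txtwrite and -o switches are standard Ghostscript options.

```python
import subprocess


def txtwrite_command(pdf_path, txt_path, gs_executable="gs"):
    """Build a Ghostscript command line that extracts text with txtwrite.

    gs_executable is an assumption: on Windows the binary is usually
    named differently (for example gswin64c).
    """
    return [
        gs_executable,
        "-sDEVICE=txtwrite",  # select the text extraction device
        "-o", txt_path,       # output file; -o also implies batch mode
        pdf_path,
    ]


def extract_text(pdf_path, txt_path):
    """Run Ghostscript; raises CalledProcessError if gs reports failure."""
    subprocess.run(txtwrite_command(pdf_path, txt_path), check=True)
```

Calling extract_text("stmt_20181130.pdf", "a.txt") would then be roughly equivalent to typing gs -sDEVICE=txtwrite -o a.txt stmt_20181130.pdf in a terminal.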
The actual answer to your question is 'don't worry about it'.
FC_WEIGHT refers to the weight of a font (light, regular, bold, ExtraBold, etc.). This message can only arise when you are using FontConfig and Ghostscript is enumerating the available fonts from FontConfig, trying to find a substitute for a font that is missing from the input. The message means that a candidate font did not match the target font's weight.
Since you aren't going to use the font, it doesn't affect you.
In the Tesseract wiki, the filename format for labeled tif/box files to be used in training is given as [lang].[fontname].exp[num]. Does fontname actually impact training, or is this just for bookkeeping?
In my particular case, I have a large number of document images with different fonts (and I don't know which fonts are in them). Can I just use eng.idontknow.exp[num] for each document I label manually or will this mess up training for some reason? Thanks in advance!
It's best to match a real font (to help possible post-OCR analyses), but it can be some arbitrary font name.
Is there a reliable way to determine whether a glyph in Unifont is half-width, like the Latin characters (i.e. everything in chart 0002), which occupy only the left half of the cell, or full-width, like character 0x06E9 (from chart 0006)?
Pixel analysis is not a good solution for me, as it would fail on many characters, such as spaces.
I'd prefer to use information from UnicodeData.txt:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
Unfortunately, I'm not able to find a good match between Unifont's glyph widths and any field in that data.
Chart 0002: http://unifoundry.com/png/plane00/uni0002.png
Chart 0006: http://unifoundry.com/png/plane00/uni0006.png
Looks like you'll need the source code '.hex' for the version of unifont you're using and the appropriate versions of the Unicode Utilities from [1]. 'unigenwidth' [2] seems to generate code related to the width of characters in Unifont; perhaps you'll need to write a parser to look through that code and give you what you want?
[1] http://unifoundry.com/unicode-utilities.html
[2] http://manpages.ubuntu.com/manpages/trusty/man1/unigenwidth.1.html
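If you do have the '.hex' source, the width can probably be read straight from it without pixel analysis. The sketch below assumes the usual Unifont .hex layout, where each line is codepoint:bitmap and every glyph is 16 rows high, so 32 hex digits mean an 8-pixel-wide (half-width) glyph and 64 hex digits mean a 16-pixel-wide (full-width) one. The function name unifont_widths is made up for illustration, and the bitmaps in the usage lines are blank placeholders, not real glyph data.

```python
def unifont_widths(hex_lines):
    """Map each codepoint in Unifont .hex data to 1 (half) or 2 (full).

    Assumes 16-row glyphs: 32 hex digits -> 8 px wide, 64 -> 16 px wide.
    Lines that do not fit that assumption are skipped.
    """
    widths = {}
    for line in hex_lines:
        codepoint, sep, bitmap = line.strip().partition(":")
        if not sep:
            continue  # blank or malformed line
        if len(bitmap) == 32:
            widths[int(codepoint, 16)] = 1
        elif len(bitmap) == 64:
            widths[int(codepoint, 16)] = 2
    return widths


# Placeholder data: an all-blank half-width and full-width glyph.
sample = ["0041:" + "00" * 16, "06E9:" + "00" * 32]
widths = unifont_widths(sample)
```

Because this looks only at the length of the bitmap string, it also works for blank glyphs such as spaces, which is where pixel analysis breaks down. For a real font you would pass an open file object for the .hex file instead of the sample list.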
I'm trying to use fonts from the Nitti Basic family for programming. These fonts are packaged as OpenType PostScript OTF files.
Its U+002D (HYPHEN-MINUS) glyph works well as a hyphen, but not so well as a minus.
For example, it doesn't line up with the horizontal bar of the plus sign.
On the other hand, Nitti's glyph for U+2212 (MINUS) is perfect as a minus (of course), and this is what I need when programming. It's not feasible for me to actually use codepoint U+2212; after all, U+002D is what you get when you press the minus sign on the keyboard and it's what programming languages use for subtraction.
So instead I'd like to steal the glyph from U+2212 and use it for U+002D, so that that character looks like a minus sign.
How can I do it?
Update: Yes, it is possible to use U+002D as a hyphen in source code.
As mentioned above, a minus sign is what I need.
I agree with Jukka, there are tools to do this.
However, please don't forget that a font is usually protected by contracts very similar to those for software. In this case, for example, the link you provided points to a legal document that reads (among much else):
"Except as permitted herein, you may not rename, modify, adapt,
translate, reverse engineer, decompile, disassemble, alter or
otherwise copy the Bold Monday Font Software."
Note that you are not legally permitted to change this font. If you read the rest of the agreement, you'll see a lot of restrictions on the actual use of the font as well. Make sure that what you are doing doesn't break your license.
For posterity, here's how to do it:
Obtain Adobe's AFDKO font tools and install them.
Put the OTF files into an empty directory.
Run ttx *.otf to convert the OTF files to TTX (XML).
Edit each TTX file in a text editor:
In the cmap section, change occurrences of hyphen to minus. This table maps characters to glyphs. Character U+002D was originally mapped to the hyphen glyph; this change maps it to the minus glyph.
Over the whole file, change occurrences of NittiBasic to NittiBasicM and Nitti Basic to Nitti Basic M. This will distinguish the modified version of the font from the original once it's installed.
Rename the TTX files, replacing Nitti Basic with Nitti Basic M.
Run ttx -b *.ttx to convert the TTX files back to OTF.
Finally, install the newly-created OTF files.
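The cmap edit in the steps above could also be scripted rather than done by hand. This is only a minimal sketch using Python's standard xml.etree module on the TTX text. It assumes the glyphs are named hyphen and minus, as described above, and that the cmap subtables use the usual map elements with code and name attributes. The function name and the tiny sample document are made up for illustration, and note that ElementTree drops XML comments and the XML declaration when it re-serializes.

```python
import xml.etree.ElementTree as ET


def remap_hyphen_to_minus(ttx_xml):
    """Point U+002D at the 'minus' glyph in every cmap subtable."""
    root = ET.fromstring(ttx_xml)
    # Only touch <map> entries that sit inside a cmap subtable.
    for entry in root.findall("cmap/*/map"):
        if int(entry.get("code"), 16) == 0x2D:
            entry.set("name", "minus")
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed TTX content for demonstration only.
SAMPLE_TTX = (
    '<ttFont><cmap>'
    '<cmap_format_4 platformID="3" platEncID="1" language="0">'
    '<map code="0x2d" name="hyphen"/>'
    '<map code="0x2212" name="minus"/>'
    '</cmap_format_4>'
    '</cmap></ttFont>'
)
patched = remap_hyphen_to_minus(SAMPLE_TTX)
```

The renaming of the font family and the ttx conversion steps would still have to be done as described above; this covers only the character-to-glyph mapping.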
Tools like FontForge can be used to edit a font in a simple manner.
Note that in programming, too, HYPHEN-MINUS has multiple uses: as a minus sign, but also (in some languages) as allowed in identifiers, as well as in comments, where it usually appears in the role of hyphen. In some uses, a HYPHEN glyph will look odd.