How to find parameters supported in Tesseract OCR config file

I want to know what parameters the config file used by Tesseract OCR accepts, how to write a config file, etc.
I can't find any documentation about this on their site. How can I determine what parameters are supported, and what they mean?

I found these instructions in the link below. They are about writing the config file and where to place it:
The config file is a simple text file without a BOM and with Unix end-of-line marks (on Windows you can use an advanced text editor, e.g. Notepad++, to achieve this).
If you use the tesseract executable, this is the only way to change Tesseract parameters.
The config file should be located in your tessdata/configs directory. Have a look there for some examples.
There is a list of all the variables plus descriptions of each one in http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version. Note it's for Tesseract 3.02, things may be different in other versions.
Edit: Also adding a pastebin link in case the above link becomes dead.

Tesseract v3.04 now offers the command line option --print-parameters, so you can call tesseract --print-parameters to get a list of the 678 (!) configurable parameters, their default values, and a short description:
Tesseract parameters:
editor_image_xpos 590 Editor image X Pos
editor_image_ypos 10 Editor image Y Pos
editor_image_menuheight 50 Add to image height for menu bar
editor_image_word_bb_color 7 Word bounding box colour
editor_image_blob_bb_color 4 Blob bounding box colour
editor_image_text_color 2 Correct text colour
...and many, many more
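With several hundred parameters, it can help to load the listing into a script and search it. The following is a minimal Python sketch, not an official tool: it assumes each line has the shape name, default value, description, as in the excerpt above. The function name parse_parameters is made up for illustration, and the tab handling is an assumption in case the real output is tab-delimited (which would also allow empty default values).

```python
def parse_parameters(dump: str) -> dict[str, tuple[str, str]]:
    """Map each parameter name to (default value, description)."""
    params = {}
    for line in dump.splitlines():
        line = line.strip()
        if not line:
            continue
        # Prefer tabs if present; otherwise fall back to runs of whitespace.
        parts = line.split("\t", 2) if "\t" in line else line.split(None, 2)
        # Lines without three fields (such as the header) are skipped.
        if len(parts) == 3:
            name, value, description = parts
            params[name] = (value, description)
    return params


sample = """Tesseract parameters:
editor_image_xpos 590 Editor image X Pos
editor_image_ypos 10 Editor image Y Pos
editor_image_menuheight 50 Add to image height for menu bar
"""

params = parse_parameters(sample)
for name, (value, description) in params.items():
    print(f"{name} = {value}  # {description}")
```

To run it on a real installation, the output of tesseract --print-parameters could be captured (for example with subprocess.run) and passed to parse_parameters in place of sample.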

It's just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each on a separate line; for instance:
interactive_display_mode T
tessedit_display_outwords T
There are several standard config files, such as digits and hocr, under Tesseract's tessdata/configs folder.
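If you generate config files from a script, the format described above (one space-delimited key/value pair per line, no BOM, Unix line endings) is easy to get right in Python. This is only a sketch: the helper name write_tesseract_config and the file name myconfig are hypothetical, and the two variables are simply the ones quoted above.

```python
import os
import tempfile


def write_tesseract_config(path, settings):
    """Write key/value pairs, one per line, as UTF-8 without BOM and with LF endings."""
    # newline="\n" stops Python from translating "\n" to "\r\n" on Windows;
    # the plain "utf-8" codec never emits a BOM (unlike "utf-8-sig").
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        for key, value in settings.items():
            fh.write(f"{key} {value}\n")


# Hypothetical config name, written to a temporary directory for the demo.
config_path = os.path.join(tempfile.mkdtemp(), "myconfig")
write_tesseract_config(config_path, {
    "interactive_display_mode": "T",
    "tessedit_display_outwords": "T",
})

# Read the file back as bytes to confirm there is no BOM and no "\r".
with open(config_path, "rb") as fh:
    raw = fh.read()
print(raw)
```

The resulting file would then be copied into tessdata/configs, next to the standard configs such as digits and hocr.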

Related

Error using text2image Font Exocet Light failed with 223518 hits = 99.94% when trying to build image file using Diablo 2 font

I am running tesseract on windows 11 using the command prompt.
The text file is my training data. Words that I want to turn into images.
The output is the next step in the Tesseract process for training my font.
I am passing --find_fonts, but I only have one font in the folder.
text2image --text="C:\PythonProjects\DiabloTesseractTrainFont\text.txt" --outputbase="C:\PythonProjects\DiabloTesseractTrainFont\Output\Dia.font.exp0" --fontconfig_tmpdir="C:\PythonProjects\DiabloTesseractTrainFont" --find_fonts --fonts_dir="C:\PythonProjects\DiabloTesseractTrainFont\Diablo Fonts"
The result:
Total chars = 223645
Font Exocet Light failed with 223518 hits = 99.94%
Not sure why it fails. I have built something similar to this before. I have tried with a font file that I know has worked and it does the exact same thing.
Any help would be appreciated.
I solved it. Some characters in the text file had been changed when I read it into Python. I believe they used to be bullet points, but I had read the file with ASCII encoding and errors ignored, assuming those characters would simply be removed. I was wrong: the bullet points had been replaced with something that showed up as PAD in Notepad++. I highlighted one of them in Notepad++ and replaced them all with a space. Note that when I did the replace, the find field appeared to be empty, but it still replaced all of them. Now it compiles just fine. I was stuck for many hours; I hope this helps someone.
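For anyone who would rather do this cleanup in Python than in Notepad++, here is a minimal sketch. It rests on an assumption: that the PAD markers were Notepad++'s way of displaying an invisible control character (such as U+0080), which would also explain why the find field looked empty. The helper names are made up for illustration. The second helper lists any remaining non-ASCII characters so you can check whether the font actually covers them.

```python
import unicodedata


def strip_control_chars(text, keep="\n\r\t", replacement=" "):
    """Replace Unicode control characters (category Cc), except those in `keep`."""
    return "".join(
        ch if (ch in keep or unicodedata.category(ch) != "Cc") else replacement
        for ch in text
    )


def find_non_ascii(text):
    """Return the sorted list of distinct non-ASCII characters in `text`."""
    return sorted({ch for ch in text if ord(ch) > 127})


# "\x80" stands in for the stray control character; "\u2022" is a real bullet.
dirty = "item one\x80item two\u2022 item three\n"
clean = strip_control_chars(dirty)
print(repr(clean))
print([hex(ord(ch)) for ch in find_non_ascii(clean)])
```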

How can a MATLAB program test whether MATLAB can render a particular font?

I would like to use some special characters in a MATLAB figure. How can my program ensure the fonts are available before using them?
listfonts() is not reliable. It claims "Zapf Dingbats" is available on my machine, but it is not (and text() renders using a default font instead). listfonts() always includes the standard PostScript fonts. I suppose that's because they are always available for PostScript output, but I'm interested in displayed figures. Likewise, uisetfont() and MATLAB -> Preferences -> Fonts -> Custom both list "Zapf Dingbats" but render the sample using a default font.
Just looking for the font file doesn't work, either. For example, "Webdings" works fine on my main machine. However, on a second machine, "Webdings" is installed (there's a file /Library/Fonts/Webdings.ttf, and Word can use it), but MATLAB substitutes a default font.
I thought of one test: Create a small figure with one marker symbol, use print() to write it to a .png file, read in the file as data, compute a hash, and compare that hash with a stored value. Is there a less clumsy method?
I found Unicode equivalents for most of the symbols I need, that work for both of my test machines. However, they too apparently depend on my having the right fonts installed. For example, there are many Unicode versions of a square. Hex codes 2588, 25a0, and 25fc work here, but 25fe, 2b1b, and 2bc0 are rendered as blank. Is there a way to tell whether these characters are available?
I'm running R2017b under macOS version 10.13.5, and "set | grep LANG" displays "LANG=en_US.UTF-8".

How to make fonts available to the LaTeX interpreter in Matlab R2013a?

It is possible to embed LaTeX-formatted text and equations into Matlab plots by setting the text property 'Interpreter' to the value 'latex', e.g.
text(0.1, 0.5, 'Einstein: $E = m c^2$', ...
'Interpreter', 'latex', 'FontSize', 32)
These equations appear on screen as well as in illustrations exported to eps files.
Through the appropriate LaTeX commands, it is also possible to change the font from the default Computer Modern Serif to e.g. Computer Modern Typewriter
text(0.1, 0.5, '\fontfamily{cmtt}\selectfont Einstein: $E = m c^2$', ...
'Interpreter', 'latex', 'FontSize', 32)
My question is: Is it possible to insert additional fonts into the Matlab installation, such that these fonts become available for use with 'Interpreter' 'latex', for rendering on screen as well as producing eps files? And if yes, how?
Background
(All paths relative to the Matlab installation, /opt/MATLAB/R2013a on my Linux system.)
Matlab includes a customized version of the (La)TeX interpreter. It is called via a frontend m-file called tex.m in toolbox/matlab/graphics which takes LaTeX code as an argument and returns dvi data within its output argument. The customized LaTeX installation is found in sys/tex and includes TeX font metric files under sys/tex/tfm.
I do not have any information on the parts of Matlab that render this dvi. However, font data for rendering are found under sys/fonts/ttf and sys/fonts/type1.
Making additional fonts usable therefore consists of two parts: making them available to the LaTeX interpreter, and making them available to the rendering function. The first part can be tackled by manipulating tex.m so that it generates the dvi through an independent regular installation of LaTeX, and installing the font into this LaTeX in the usual way (e.g. font packages). See undocumentedmatlab.
The second part of the question is therefore the crucial one: How to insert additional fonts into sys/fonts/ttf and sys/fonts/type1 such that they become usable by the dvi renderer component of Matlab.
Concrete case
I tried to concretely solve the second problem for a special case: The Computer Modern Sans font is included in the Matlab-LaTeX installation through tex/tfm/cmss10.tfm, but the corresponding ttf and pfb-files are missing from sys/fonts such that it does not get rendered.
Matlab's collection of ttf-files does not appear to have any kind of inventory. I therefore simply copied the file cmss10.ttf from an installation of matplotlib to sys/fonts/ttf/cm/mwa_cmss10.ttf, following the file and folder naming conventions of the other files present. This procedure was reported to be working on Alec's Web Log for Matlab 2011b on Mac OS X, but on my system it has no effect, neither for screen display nor eps export.
Matlab's collection of type1 fonts has a complex inventory, distributed over files fonts.dir, fonts.scale, encodings.dir and a folder encodings full of enc-files. Again I found cmss10.pfb, this time from a TeXlive installation, renamed and copied it, and made entries in the inventory files following the example of the other fonts listed. Again, this procedure has no effect at all.
Does anyone know more about how Matlab uses ttf and pfb-files, and can give me a hint on how to make the cmss10-files accessible to Matlab rendering? Or does anyone have a suggestion how to debug this and find out more about the inner workings of Matlab's LaTeX support?
I invested hours of further research into my question, and came up with some interesting new insights, but no real solution. Still, I'm posting my results here so that others who might investigate this have something to start from. I post it as an "answer" so as not to make my already long question even longer.
Comparison between Matlab's old (R2010a) and current (R2013a) tex and fonts infrastructure
For the standard font Computer Modern Roman, the old infrastructure contains
sys/tex/tfm/cmr10.tfm
sys/fonts/ttf/cm/cmr10.ttf
sys/fonts/type1/cm/cmr10.pfb
sys/fonts/type1/cm/cmr10.pfm
and the current
sys/tex/tfm/cmr10.tfm
sys/fonts/ttf/cm/mwa_cmr10.ttf
sys/fonts/ttf/cm/mwb_cmr10.ttf
sys/fonts/type1/cm/mwa_cmr10.pfb
sys/fonts/type1/cm/mwb_cmr10.pfb
The TeX font metric files are identical. The truetype and type1 files appear to contain the same glyph data, but have been split into files containing latin (mwa) and greek characters (mwb). The pfm file has simply disappeared. The old type1 files have a copyright notice 1997 by the AMS, the new ones 2011 by the MW.
This indicates that in order to make Computer Modern Sans from an old Matlab work in current Matlab, it might be sufficient to copy cmss10.ttf and cmss10.pfb to mwa_cmss10.ttf and mwa_cmss10.pfb, since the tfm file is still present (see question).
Which files are used in R2013a?
The additional dir and enc files in sys/fonts/type1 appear not to be used, because deleting them leaves screen rendering and eps generation fully functional.
I suspected that the ttf files are used for screen rendering and the pfb files for inclusion in generated eps files. The former appears not to be the case, because deleting all ttf files leaves screen rendering and eps generation fully functional, too. Matlab does complain, however, if the folder sys/fonts/ttf/cm does not exist!
This indicates that a) it's not necessary to bother with modifying the dir and enc files, and b) it's not necessary to copy the ttf file.
Is inserting new pfb files enough?
After cmss10.pfb from an old Matlab is copied to sys/fonts/type1/cm/mwa_cmss10.pfb, using Computer Modern Sans in an equation still makes Matlab warn that "cmss10 is not supported", and the screen rendering is not correct. Moreover, a generated eps file does not render correctly.
However, the generated eps file does include the contents of mwa_cmss10.pfb, and the reason it doesn't work is that the included pfb file defines a font named "CMSS10", while the eps refers to a font named "mwa_cmss10". Instead of Daniel E. Shub's solution of changing the references in the eps, one can edit the file mwa_cmss10.pfb and change its /FontName to "mwa_cmss10". This might be done with a simple text editor applied to the pfb. However, the better way is to disassemble the pfb file to PostScript using t1disasm, change the PostScript, and then reassemble using t1asm. These tools are contained in the t1utils package on CTAN.
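The edit between t1disasm and t1asm can be scripted. The following Python sketch assumes the disassembled PostScript contains an entry of the form /FontName /CMSS10 def; the function name rename_font and the sample text are made up for illustration and are not taken from a real font file.

```python
import re


def rename_font(disassembled: str, new_name: str) -> str:
    """Rewrite the first '/FontName /<old>' entry to use `new_name`."""
    pattern = re.compile(r"(/FontName\s+/)[^\s/]+")
    # A function replacement avoids any backslash-escape surprises in new_name.
    patched, count = pattern.subn(
        lambda m: m.group(1) + new_name, disassembled, count=1
    )
    if count == 0:
        raise ValueError("no /FontName entry found")
    return patched


# Hypothetical excerpt of disassembled font source, for illustration only.
sample = (
    "/FontInfo 9 dict dup begin\n"
    "/FullName (CMSS10) readonly def\n"
    "end readonly def\n"
    "/FontName /CMSS10 def\n"
    "/PaintType 0 def\n"
)
print(rename_font(sample, "mwa_cmss10"))
```

Only the /FontName entry is touched; other fields such as /FullName are left as they were.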
The resulting eps does still not work properly though: Characters are not correctly positioned, especially for larger font sizes.
This indicates that the presence of the pfb file alone does not provide Matlab with the correct font metrics, and that the dvi file generated by Matlab's LaTeX does not explicitly position characters but relies on the renderer having those metrics.
See tex.se for a question concerning a workaround for the second point.
Does "hacking" existing fonts work?
Daniel E. Shub proposed in his answer not to add fonts, but to overwrite those existing in the Matlab installation. There are two problems with this:
– The correct font metrics are still not available to Matlab. Overwriting a font therefore only works, and only approximately, if the metrics of the original font and those of the new one are similar.
– Screen rendering only works in some cases. For me, overwriting mwa_cmr10 with a patched cmss10 and using \rm did lead to Computer Modern Sans being rendered to screen and in the eps file, albeit with slightly wrong positioning. However, overwriting mwa_cmtt10 and using \tt did not lead to Computer Modern Sans being rendered on screen; instead, Computer Modern Typewriter was rendered.
This implies a) that there is another independent source of font metrics for Matlab's renderer. As far as I can tell, they come from none of the files under sys/tex or sys/fonts. b) Font outlines are only in some cases read from the pfb files in sys/fonts/type1/cm.
Conclusion
The inner workings of the dvi renderer in recent Matlab therefore remain mysterious. Possible candidates where the missing information may be hidden are toolbox/matlab/graphics/hardcopy.p and / or com/mathworks/hg/uij/TextRasterizer.class in java/jar/hg.jar.
I'll cease my investigations for the time being (and am going to have a look at psfrag ;)
I made the comment on Undocumented Matlab that you refer to. Apparently, I grossly underestimated the difficulty of making the Matlab DVI viewer work with fonts. I have included a non-working solution in the hope that someone can understand the warning it generates. I also have a working solution that is a pretty big hack. I am using Matlab R2013a and TexLive 2013 on Linux. I am not sure what will happen on Mac or Windows.
Non working solution
My first approach was to overload the Matlab tex.m function so I can easily do things in LaTeX and only have to worry about the dvi file
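function [dviout,errout,auxout] = tex(varargin)
% Override of tex.m: ignore the input and return a precompiled dvi file.
fid = fopen('matlab.dvi', 'r');
dviout = fread(fid, 'uint8');   % raw bytes of the dvi file
dviout = uint8(dviout);
fclose(fid);
errout = [];
auxout = [];
end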
I then created matlab.dvi by processing
\documentclass{article}
\setlength\topmargin{-0.5in}
\setlength\oddsidemargin{0in}
\DeclareFontFamily{T1}{myfont}{}
\DeclareFontShape{T1}{myfont}{m}{n}{<-> [1.2] AuriocusKalligraphicus}{}
\begin{document}%
\setbox0=\hbox{\usefont{T1}{myfont}{m}{n}Some text with a distinct font $\alpha$}%
\copy0\special{bounds: \the\wd0 \the\ht0 \the\dp0}%
\end{document}%
I then copied the TexLive font to Matlab
# cp $TEXLIVEROOT/texmf-dist/fonts/type1/public/aurical/AuriocusKalligraphicus.pfb $MATLABROOT/sys/fonts/AuriocusKalligraphicus.pfb
I get the "expected" warnings from
>> text(0.0, 0.5, 'DOES NOT MATTER', 'Interpreter', 'LaTeX', 'FontSize', 20)
Warning: Font AuriocusKalligraphicus10 is not supported.
Warning: Font AuriocusKalligraphicus10 is not supported.
If I try and export the figure (with the missing fonts) to a pdf file via alt+f alt+r I get a whole bunch of warnings including the potentially useful
Warning: Missing
/usr/local/matlab/R2013a/sys/fonts/type1/cm/mwa_auriocuskalligraphicus10.pfb
Working hack solution
After becoming fed up with not knowing what to call the pfb files, I decided to overwrite one that already works (cmr10).
At the CLI
# cp $MATLABROOT/sys/fonts/mwa_cmr10.pfb $MATLABROOT/sys/fonts/mwa_cmr10.pfb.bak
# cp $TEXLIVEROOT/texmf-dist/fonts/type1/public/aurical/AuriocusKalligraphicus.pfb $MATLABROOT/sys/fonts/mwa_cmr10.pfb
and at the Matlab prompt
>> text(0.0, 0.5, 'Some text with a distinct font $\alpha$', 'Interpreter', 'LaTeX', 'FontSize', 20)
renders the text in the AuriocusKalligraphicus font.
In order to export the figure to an eps with the fonts you need to replace all the instances of /mwa_cmr10 with /AuriocusKalligraphicus in the eps file. Presumably this is because this solution is a hack. Ideally I should not only replace the pfb file, but also the fd and tfm files. There are probably enough pfb fonts available to allow you to create most figures.
This is a very crude solution, but you may edit the resulting .eps file using a text editor and get the desired fonts. For example, you can replace the following:
%%IncludeResource: font mwa_cmr10 /mwa_cmr10 /WindowsLatin1Encoding
120 FMSR
with the following:
%%IncludeResource: font Helvetica /Helvetica /WindowsLatin1Encoding
120 FMSR
You may even write a simple script which opens the resulting .eps file and replaces any font with any other you desire. I hope this helps!
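Such a script could look roughly like the Python sketch below. It is a blunt textual substitution, not a PostScript-aware tool, and the function names are made up for illustration. It replaces the font name only where it appears as a whole token, and it reads the file as latin-1 so that any binary data in the eps survives the round trip.

```python
import re


def replace_eps_font(eps_text: str, old: str, new: str) -> str:
    """Replace whole-token occurrences of the font name `old` with `new`."""
    # Lookarounds stop e.g. "mwa_cmr10" from matching inside "mwa_cmr100".
    pattern = re.compile(
        r"(?<![A-Za-z0-9_-])" + re.escape(old) + r"(?![A-Za-z0-9_-])"
    )
    return pattern.sub(lambda m: new, eps_text)


def replace_font_in_file(src: str, dst: str, old: str, new: str) -> None:
    """Apply replace_eps_font to the file `src` and write the result to `dst`."""
    # latin-1 maps every byte to one character; newline="" disables
    # line-ending translation in both directions.
    with open(src, encoding="latin-1", newline="") as fh:
        text = fh.read()
    with open(dst, "w", encoding="latin-1", newline="") as fh:
        fh.write(replace_eps_font(text, old, new))


sample = (
    "%%IncludeResource: font mwa_cmr10 /mwa_cmr10 /WindowsLatin1Encoding\n"
    "120 FMSR\n"
)
print(replace_eps_font(sample, "mwa_cmr10", "Helvetica"))
```

Note that if the eps embeds the font program itself, a global replacement also renames the font inside that embedded program, so it is worth checking the result in a viewer.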

Display contents of text file in MATLAB shell

I'm using MATLAB under Windows, and trying to display (dump) the contents of a text file in the command shell. It seems like overkill to open a small file in the editor, or to load the file to use disp.
Use type and specify the explicit file name (including the extension), for instance:
type('myfile.txt')
As well as type, there's also dbtype which lets you pick a start and end range to print, and shows line numbers - handy for listing source files.

how to remove characters from a font file?

I've downloaded the DejaVu open source font and want to use it as a web font, but even after converting it I get a large file. Because the website will only be in a few languages (Arabic, French, Amazigh), I don't need many of the characters.
So is there a way to browse the font file and delete the ranges of Unicode characters that I won't need?
Using FontForge, you may open Element->Font Info->Unicode Ranges. You will see all available ranges and you can select a whole Unicode range with a single click. Then, you can tune your selection and delete using Encoding->Detach & Remove Glyphs.
Also, you can use Edit->Select->Select by Script.
The easiest method I found is to use the pyftsubset tool from FontTools. Here's an example:
$ pyftsubset NotoSans-Regular.ttf \
--unicodes=U+0400-045F,U+0490-0491,U+04B0-04B1,U+2116 \
--output-file=NotoSans-Regular.cyrillic.woff2 \
--flavor=woff2
Note: woff2 output requires Brotli.
I wrote a simple script around it which automates the whole process including generation of a CSS file after splitting the font file. You may find it here: https://github.com/johncf/ttf2web
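For the languages in the question, the --unicodes value gets long, so it may be easier to build the command from a list of ranges. This Python sketch only assembles the command line; the ranges are illustrative and should be checked against the text you actually serve, and the input and output file names are hypothetical.

```python
def format_unicodes(ranges):
    """Turn (start, end) code point pairs into a pyftsubset --unicodes value."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(f"U+{start:04X}")
        else:
            parts.append(f"U+{start:04X}-{end:04X}")
    return ",".join(parts)


# Illustrative ranges only (an assumption, not a complete language profile).
RANGES = [
    (0x0020, 0x007E),  # Basic Latin
    (0x00A0, 0x00FF),  # Latin-1 Supplement (most French accents)
    (0x0152, 0x0153),  # OE ligatures used in French
    (0x0600, 0x06FF),  # Arabic
    (0x2D30, 0x2D7F),  # Tifinagh (Amazigh)
]

command = [
    "pyftsubset", "DejaVuSans.ttf",  # hypothetical input file name
    f"--unicodes={format_unicodes(RANGES)}",
    "--output-file=DejaVuSans.subset.woff2",  # hypothetical output name
    "--flavor=woff2",
]
print(" ".join(command))
```

With FontTools (and Brotli, for woff2) installed, the list can be handed to subprocess.run(command, check=True) to perform the subsetting.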