Tesseract tesstrain.sh cannot find the font

I am trying to train Tesseract by following this guide:
https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html
Even though I did not pass a font argument on the command line, I get this error:
Could not find font named 'Arial Bold'.
Pango suggested font 'Palatino'.
Please correct --font arg.
ERROR: Program text2image failed. Abort.
I can't figure out where I am making the mistake.

I found the solution here: https://groups.google.com/forum/#!topic/tesseract-ocr/CpxTGGUlWIo
I listed the fonts I want to train on, to get their exact names, with:
text2image --fonts_dir path/to/fonts --list_available_fonts
Then I edited language-specific.sh to use the names of the fonts I want to train Tesseract on.

Related

Error using text2image Font Exocet Light failed with 223518 hits = 99.94% when trying to build image file using Diablo 2 font

I am running tesseract on windows 11 using the command prompt.
The text file is my training data. Words that I want to turn into images.
The output is the next step in the Tesseract process for training my font.
I am passing --find_fonts even though I only have one font in the folder.
text2image --text="C:\PythonProjects\DiabloTesseractTrainFont\text.txt" --outputbase="C:\PythonProjects\DiabloTesseractTrainFont\Output\Dia.font.exp0" --fontconfig_tmpdir="C:\PythonProjects\DiabloTesseractTrainFont" --find_fonts --fonts_dir="C:\PythonProjects\DiabloTesseractTrainFont\Diablo Fonts"
The result:
Total chars = 223645
Font Exocet Light failed with 223518 hits = 99.94%
Not sure why it fails. I have built something similar to this before. I have tried with a font file that I know has worked and it does the exact same thing.
Any help would be appreciated.
I solved it. The text file contained some characters that had been altered when I read the file in Python. I believe they were originally bullet points, but I had read the file with ASCII encoding and errors set to ignore, expecting those characters to be removed. I was wrong: the bullet points were replaced with something that showed up as PAD. I found them in Notepad++, highlighted one of them, and replaced them all with a space. Note that the find field in Notepad++ looked empty when I did the replace, but it still replaced all of them. Now it compiles just fine. I was stuck for many hours, so I hope this helps someone.
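For anyone who would rather clean the training text in Python than in an editor, here is a minimal sketch. It assumes the stray characters are control characters (Unicode category Cc, which includes U+0080, the one usually labelled PAD); that is my guess about the file above, not something confirmed in the post. The function name is made up for illustration.

```python
import unicodedata


def strip_control_chars(text, replacement=" "):
    """Replace control characters with `replacement`.

    Newline, carriage return and tab are kept, because the training
    text still needs its line structure.
    """
    keep = {"\n", "\r", "\t"}
    return "".join(
        ch if ch in keep or unicodedata.category(ch) != "Cc" else replacement
        for ch in text
    )
```

Reading the file with its real encoding (for example encoding="utf-8") and running the text through this function before writing it back should leave visible characters such as real bullet points untouched. Whether text2image can render those with your font is a separate question.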

Can Tesseract be used for Sinhala handwritten text recognition?

I wish to restore damaged Sinhala handwritten documents. Please let me know: Can Tesseract be used for Sinhala language also?
Check out the tessdata folder from the tesseract-ocr GitHub repository:
There's sin.traineddata for the actual Sinhala language, and
there's script/Sinhala.traineddata for the Sinhala script.
Copy one of them (or both) to your tessdata folder, maybe located at C:\tesseract\tessdata on some Windows machine.
For example, running Tesseract from the command line, you can then use
tesseract myimage.png output -l sin
or
tesseract myimage.png output -l Sinhala
I took a screenshot of the Sinhala script Wikipedia page, and cropped the following part:
Both of the above commands result in the following output:
සිංහල අක්ෂර මාලාව
That seems fine to me, but I don't claim to be able to read or understand any Sinhala script or language!
So, in general: Yes, it seems, you can OCR Sinhala texts!
BUT: As with any script, and perhaps even more so with non-Latin scripts, you probably won't get good results on handwritten texts. OCR on handwritten text is a field of research of its own.
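Before running the commands above, it can help to check that the traineddata files really ended up in the tessdata folder. The following is a small sketch; the function name is hypothetical, and it only looks at files placed directly in the folder (not in subfolders such as script/).

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the names that can be passed to -l, based on the
    *.traineddata files found directly inside tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))
```

For example, installed_languages(r"C:\tesseract\tessdata") should include "sin" and/or "Sinhala" once the files have been copied there.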

pytesseract results different from tesseract command line results

I am trying to convert a scanned page to text using both pytesseract and the tesseract command line on Ubuntu. The results are remarkably different (pytesseract performs way better than the tesseract command line), and I am unable to understand why. I looked at the default parameter values and tried altering some of them on the tesseract command line (like psm), but I am unable to get the same result as pytesseract. Because pytesseract lacks proper documentation, I cannot figure out which default parameter values it uses.
Here is my pytesseract code
print(pytesseract.image_to_string(Image.open('test.tiff')))
Looking at the source code of pytesseract, it seems the image is always converted into a .bmp file.
Working with a .bmp file and a psm of 6 at the command line with Tesseract gives the same result as pytesseract.
Also, Tesseract can work with uncompressed bmp files only. Hence, if ImageMagick is used to convert .pdf to .bmp, the following will work:
convert -density 300 -quality 100 mypdf.pdf BMP3:mypdf.bmp
tesseract mypdf.bmp -psm 6 mypdf txt
In Tesseract v5.3.0+
pytesseract does not convert images to BMP. You can verify this by commenting out cleanup(f.name) in the save context manager, which is found in the source file /pytesseract/pytesseract.py. You will also need to retrieve the filename of the temp file (pytesseract was saving files within the user's temp files directory, i.e. "[path-to-user]\AppData\Local[file-name]"). What pytesseract actually does to the image can be found in the prepare function.
Basically, taking the temp file and using that same file with the tesseract command directly will yield the same results.
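To compare the two like for like, it helps to pass the same page segmentation mode to both sides explicitly. The sketch below shows one way to do that; the helper names are made up. It uses the --psm spelling from Tesseract 4 and later (the -psm form shown earlier is the 3.x spelling) and pytesseract's config argument.

```python
def tesseract_command(image_path, output_base, psm=None, lang=None):
    """Build the argument list for a direct tesseract call."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def ocr_with_pytesseract(image_path, psm=6):
    """Run pytesseract with the same page segmentation mode."""
    # Imported here so the command builder works without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), config=f"--psm {psm}"
    )
```

The list returned by tesseract_command("page.bmp", "page", psm=6) can be handed to subprocess.run, and the resulting page.txt compared with the string from ocr_with_pytesseract("page.bmp", psm=6).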

How to make fonts available to the LaTeX interpreter in Matlab R2013a?

It is possible to embed LaTeX-formatted text and equations into Matlab plots by setting the text property 'Interpreter' to the value 'latex', e.g.
text(0.1, 0.5, 'Einstein: $E = m c^2$', ...
'Interpreter', 'latex', 'FontSize', 32)
These equations appear on screen as well as in illustrations exported to eps files.
Through the appropriate LaTeX commands, it is also possible to change the font from the default Computer Modern Serif to e.g. Computer Modern Typewriter
text(0.1, 0.5, '\fontfamily{cmtt}\selectfont Einstein: $E = m c^2$', ...
'Interpreter', 'latex', 'FontSize', 32)
My question is: Is it possible to insert additional fonts into the Matlab installation, such that these fonts become available for use with 'Interpreter' 'latex', for rendering on screen as well as producing eps files? And if yes, how?
Background
(All paths relative to the Matlab installation, /opt/MATLAB/R2013a on my Linux system.)
Matlab includes a customized version of the (La)TeX interpreter. It is called via a frontend m-file called tex.m in toolbox/matlab/graphics which takes LaTeX code as an argument and returns dvi data within its output argument. The customized LaTeX installation is found in sys/tex and includes TeX font metric files under sys/tex/tfm.
I do not have any information on the parts of Matlab that render this dvi. However, font data for rendering are found under sys/fonts/ttf and sys/fonts/type1.
Making an additional font usable therefore consists of two parts: making it available to the LaTeX interpreter, and making it available to the rendering function. The first part can be tackled by manipulating tex.m such that it generates the dvi through an independent regular installation of LaTeX, and installing the font into this LaTeX in the usual way (e.g. font packages). See undocumentedmatlab.
The second part of the question is therefore the crucial one: How to insert additional fonts into sys/fonts/ttf and sys/fonts/type1 such that they become usable by the dvi renderer component of Matlab.
Concrete case
I tried to concretely solve the second problem for a special case: The Computer Modern Sans font is included in the Matlab-LaTeX installation through tex/tfm/cmss10.tfm, but the corresponding ttf and pfb-files are missing from sys/fonts such that it does not get rendered.
Matlab's collection of ttf-files does not appear to have any kind of inventory. I therefore simply copied the file cmss10.ttf from an installation of matplotlib to sys/fonts/ttf/cm/mwa_cmss10.ttf, following the file and folder naming conventions of the other files present. This procedure was reported to work on Alec's Web Log for Matlab 2011b on Mac OS X, but on my system it has no effect, neither for screen display nor for eps export.
Matlab's collection of type1 fonts has a complex inventory, distributed over files fonts.dir, fonts.scale, encodings.dir and a folder encodings full of enc-files. Again I found cmss10.pfb, this time from a TeXlive installation, renamed and copied it, and made entries in the inventory files following the example of the other fonts listed. Again, this procedure has no effect at all.
Does anyone know more about how Matlab uses ttf and pfb-files, and can give me a hint on how to make the cmss10-files accessible to Matlab rendering? Or does anyone have a suggestion how to debug this and find out more about the inner workings of Matlab's LaTeX support?
I invested hours of further research into my question and came up with some interesting new insights, but no real solution. Still, I'm posting my results here as a starting point for others who might investigate this. I post it as an "answer" so as not to make my already long question even longer.
Comparison between Matlab's old (R2010a) and current (R2013a) tex and fonts infrastructure
For the standard font Computer Modern Roman, the old infrastructure contains
sys/tex/tfm/cmr10.tfm
sys/fonts/ttf/cm/cmr10.ttf
sys/fonts/type1/cm/cmr10.pfb
sys/fonts/type1/cm/cmr10.pfm
and the current
sys/tex/tfm/cmr10.tfm
sys/fonts/ttf/cm/mwa_cmr10.ttf
sys/fonts/ttf/cm/mwb_cmr10.ttf
sys/fonts/type1/cm/mwa_cmr10.pfb
sys/fonts/type1/cm/mwb_cmr10.pfb
The TeX font metric files are identical. The TrueType and Type 1 files appear to contain the same glyph data, but have been split into files containing Latin (mwa) and Greek characters (mwb). The pfm file has simply disappeared. The old Type 1 files carry a 1997 copyright notice by the AMS, the new ones a 2011 notice by the MW.
This indicates that in order to make Computer Modern Sans from an old Matlab work in current Matlab, it might be sufficient to copy cmss10.ttf and cmss10.pfb to mwa_cmss10.ttf and mwa_cmss10.pfb, since the tfm file is still present (see question).
Which files are used in R2013a?
The additional dir and enc files in sys/fonts/type1 appear not to be used, because deleting them leaves screen rendering and eps generation fully functional.
I suspected that the ttf files are used for screen rendering and the pfb files for inclusion in generated eps files. The former appears not to be the case, because deleting all ttf files leaves screen rendering and eps generation fully functional, too. Matlab does complain, however, if the folder sys/fonts/ttf/cm does not exist!
This indicates that a) it's not necessary to bother with modifying the dir and enc files, and b) it's not necessary to copy the ttf file.
Is inserting new pfb files enough?
After cmss10.pfb from an old Matlab is copied to sys/fonts/type1/cm/mwa_cmss10.pfb, using Computer Modern Sans in an equation still makes Matlab warn that "cmss10 is not supported", and the screen rendering is not correct. Moreover, a generated eps file does not render correctly.
However, the generated eps file does include the contents of mwa_cmss10.pfb, and the reason it doesn't work is that the included pfb file defines a font named "CMSS10", while the eps refers to a font named "mwa_cmss10". Instead of Daniel E. Shub's solution of changing the references in the eps, one can edit the file mwa_cmss10.pfb and change its /FontName to "mwa_cmss10". This might be done with a simple text editor applied to the pfb. However, the better way is to disassemble the pfb file to PostScript using t1disasm, change the PostScript, and then reassemble it using t1asm. These tools are contained in the t1utils package on CTAN.
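The renaming step on the disassembled PostScript can be scripted. This is only a sketch: it assumes the output of t1disasm has been saved as text and contains an entry of the form "/FontName /CMSS10 def". The function name is hypothetical. It does not touch other name fields such as /FullName.

```python
import re


def rename_font(ps_source, new_name):
    """Rewrite the first /FontName entry in disassembled Type 1 text."""
    pattern = re.compile(r"(/FontName\s+/)[^\s/]+")
    patched, count = pattern.subn(
        lambda m: m.group(1) + new_name, ps_source, count=1
    )
    if count == 0:
        raise ValueError("no /FontName entry found")
    return patched
```

The patched text would then be reassembled with t1asm and copied to sys/fonts/type1/cm/mwa_cmss10.pfb.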
The resulting eps still does not work properly, though: characters are not correctly positioned, especially at larger font sizes.
This indicates that the presence of the pfb file alone does not provide Matlab with the correct font metrics, and that the dvi file generated by Matlab's LaTeX does not explicitly position characters but relies on the renderer having those metrics.
See tex.se for a question concerning a workaround for the second point.
Does "hacking" existing fonts work?
Daniel E. Shub proposed in his answer not to add fonts, but to overwrite those existing in the Matlab installation. There are two problems with this:
– The correct font metrics are still not available to Matlab. Overwriting a font therefore only works, and only approximately, if the metrics of the original font and those of the new one are similar.
– Screen rendering only works in some cases. For me, overwriting mwa_cmr10 with a patched cmss10 and using \rm did lead to Computer Modern Sans being rendered to screen and in the eps file, albeit with slightly wrong positioning. However, overwriting mwa_cmtt10 and using \tt did not lead to Computer Modern Sans being rendered on screen; instead, Computer Modern Typewriter was rendered.
This implies a) that there is another independent source of font metrics for Matlab's renderer. As far as I can tell, they come from none of the files under sys/tex or sys/fonts. b) Font outlines are only in some cases read from the pfb files in sys/fonts/type1/cm.
Conclusion
The inner workings of the dvi renderer in recent Matlab therefore remain mysterious. Possible candidates where the missing information may be hidden are toolbox/matlab/graphics/hardcopy.p and / or com/mathworks/hg/uij/TextRasterizer.class in java/jar/hg.jar.
I'll cease my investigations for the time being (and am going to have a look at psfrag ;)
I made the comment on Undocumented Matlab that you refer to. Apparently, I grossly underestimated the difficulty of making the Matlab DVI viewer work with fonts. I have included a non-working solution in the hope that someone can understand the warning it generates. I also have a working solution that is a pretty big hack. I am using Matlab R2013a and TexLive 2013 on Linux. I am not sure what will happen on Mac or Windows.
Non working solution
My first approach was to overload the Matlab tex.m function so I can easily do things in LaTeX and only have to worry about the dvi file
function [dviout,errout,auxout] = tex(varargin)
fid = fopen('matlab.dvi');
dviout = fread(fid, 'uint8');
dviout = uint8(dviout);
fclose(fid);
errout = [];
auxout = [];
end
I then created matlab.dvi by processing
\documentclass{article}
\setlength\topmargin{-0.5in}
\setlength\oddsidemargin{0in}
\DeclareFontFamily{T1}{myfont}{}
\DeclareFontShape{T1}{myfont}{m}{n}{<-> [1.2] AuriocusKalligraphicus}{}
\begin{document}%
\setbox0=\hbox{\usefont{T1}{myfont}{m}{n}Some text with a distinct font $\alpha$}%
\copy0\special{bounds: \the\wd0 \the\ht0 \the\dp0}%
\end{document}%
I then copied the TexLive font to Matlab
# cp $TEXLIVEROOT/texmf-dist/fonts/type1/public/aurical/AuriocusKalligraphicus.pfb $MATLABROOT/sys/fonts/AuriocusKalligraphicus.pfb
I get the "expected" warnings from
>> text(0.0, 0.5, 'DOES NOT MATTER', 'Interpreter', 'LaTeX', 'FontSize', 20)
Warning: Font AuriocusKalligraphicus10 is not supported.
Warning: Font AuriocusKalligraphicus10 is not supported.
If I try and export the figure (with the missing fonts) to a pdf file via alt+f alt+r I get a whole bunch of warnings including the potentially useful
Warning: Missing
/usr/local/matlab/R2013a/sys/fonts/type1/cm/mwa_auriocuskalligraphicus10.pfb
Working hack solution
After becoming fed up with not knowing what to call the pfb files, I decided to overwrite one that already works (cmr10).
At the CLI
# cp $MATLABROOT/sys/fonts/mwa_cmr10.pfb $MATLABROOT/sys/fonts/mwa_cmr10.pfb.bak
# cp $TEXLIVEROOT/texmf-dist/fonts/type1/public/aurical/AuriocusKalligraphicus.pfb $MATLABROOT/sys/fonts/mwa_cmr10.pfb
and at the Matlab prompt
>> text(0.0, 0.5, 'Some text with a distinct font $\alpha$', 'Interpreter', 'LaTeX', 'FontSize', 20)
renders the text on screen in the replacement font.
In order to export the figure to an eps with the fonts you need to replace all the instances of /mwa_cmr10 with /AuriocusKalligraphicus in the eps file. Presumably this is because this solution is a hack. Ideally I should not only replace the pfb file, but also the fd and tfm files. There are probably enough pfb fonts available to allow you to create most figures.
This is a very crude solution, but you can edit the resulting .eps file in a text editor to get the desired fonts. For example, you can replace the following:
%%IncludeResource: font mwa_cmr10 /mwa_cmr10 /WindowsLatin1Encoding
120 FMSR
with the following:
%%IncludeResource: font Helvetica /Helvetica /WindowsLatin1Encoding
120 FMSR
You could even write a simple script that opens the resulting .eps file and replaces any font with any other you desire. I hope this helps!
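Such a script might look like the following sketch. The function names are made up. The file is read as latin-1 so that any binary font data embedded in the eps survives the round trip unchanged; that is a design choice of this sketch, not something Matlab requires.

```python
import re


def replace_eps_font(eps_text, old_font, new_font):
    """Replace whole-name occurrences of old_font, e.g. both
    'mwa_cmr10' and '/mwa_cmr10', but not 'mwa_cmr100'."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(old_font) + r"(?![A-Za-z0-9_])"
    )
    return pattern.sub(lambda m: new_font, eps_text)


def patch_eps_file(src_path, dst_path, old_font, new_font):
    """Read an eps file, swap the font name, write the result."""
    # latin-1 maps every byte to exactly one character and back.
    with open(src_path, encoding="latin-1", newline="") as f:
        text = f.read()
    with open(dst_path, "w", encoding="latin-1", newline="") as f:
        f.write(replace_eps_font(text, old_font, new_font))
```

For example, patch_eps_file("figure.eps", "figure_helv.eps", "mwa_cmr10", "Helvetica") performs the replacement shown above on every matching line.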

How to find parameters supported in Tesseract OCR config file

I want to know what parameters the config file used by Tesseract OCR accepts, how to write a config file, etc.
I can't find any documentation about this on their site. How can I determine what parameters are supported, and what they mean?
I found these instructions in the link below. They are about writing the config file and where to place it:
The config file is a simple text file without a BOM and with Unix end-of-line marks (on Windows you can use an advanced text editor, e.g. Notepad++, to achieve this).
If you use the tesseract executable, this is the only way to change Tesseract parameters.
The config file should be located in your tessdata/configs directory. Have a look there for some examples.
There is a list of all the variables plus descriptions of each one in http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version. Note it's for Tesseract 3.02, things may be different in other versions.
Edit: Also adding a pastebin link in case the above link becomes dead.
Tesseract v3.04 now offers the command line option --print-parameters, so you can call tesseract --print-parameters to get a list of the 678 (!) configurable parameters, their default values, and a short description:
Tesseract parameters:
editor_image_xpos 590 Editor image X Pos
editor_image_ypos 10 Editor image Y Pos
editor_image_menuheight 50 Add to image height for menu bar
editor_image_word_bb_color 7 Word bounding box colour
editor_image_blob_bb_color 4 Blob bounding box colour
editor_image_text_color 2 Correct text colour
...and many, many more
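If you want to search that list programmatically, a rough parser could look like this. It is a sketch under assumptions: it expects each line to hold a name, a default value and a description, separated by tabs or (as in the excerpt above) by whitespace, and that the list is written to standard output. Parameters whose default value is empty will only parse correctly when the fields are tab-separated. The function names are hypothetical.

```python
import subprocess


def parse_parameters(output):
    """Turn --print-parameters output into {name: (value, description)}."""
    params = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or line.lower().startswith("tesseract parameters"):
            continue  # skip blank lines and the header line
        parts = line.split("\t") if "\t" in line else line.split(None, 2)
        if len(parts) < 2:
            continue
        description = parts[2] if len(parts) > 2 else ""
        params[parts[0]] = (parts[1], description)
    return params


def read_parameters():
    """Call the tesseract executable found on PATH and parse its output."""
    result = subprocess.run(
        ["tesseract", "--print-parameters"],
        capture_output=True, text=True, check=True,
    )
    return parse_parameters(result.stdout)
```

With this, something like [name for name in read_parameters() if "psm" in name] gives a quick way to find parameters by keyword.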
It's just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each on a separate line; for instance:
interactive_display_mode T
tessedit_display_outwords T
There are several standard config files, such as digits and hocr, in Tesseract's tessdata/configs folder.
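A config file in that format can also be generated from a script. The sketch below writes one key/value pair per line, without a BOM and with Unix line endings, as described above; the function name and the file name in the example are made up.

```python
def write_tesseract_config(path, params):
    """Write a Tesseract config file: 'key value' per line,
    UTF-8 without BOM, '\\n' line endings on every platform."""
    lines = [f"{key} {value}" for key, value in params.items()]
    # newline="\n" stops Python from writing "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")
```

For example, write_tesseract_config("myconfig", {"interactive_display_mode": "T", "tessedit_display_outwords": "T"}) produces the two-line file shown above, which can then be placed in tessdata/configs.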