I wish to restore damaged Sinhala handwritten documents. Please let me know: can Tesseract be used for the Sinhala language as well?
Check out the tessdata folder from the tesseract-ocr GitHub repository:
There's sin.traineddata for the actual Sinhala language, and
there's script/Sinhala.traineddata for the Sinhala script.
Copy one of them (or both) to your tessdata folder, which might be located at, for example, C:\tesseract\tessdata on a Windows machine.
For example, running Tesseract from the command line, you can then use
tesseract myimage.png output -l sin
or
tesseract myimage.png output -l Sinhala
I took a screenshot of the Sinhala script Wikipedia page, and cropped the following part:
Both above commands result in the following output:
සිංහල අක්ෂර මාලාව
That seems fine to me, but I don't claim to be able to read or understand any Sinhala script or language!
So, in general: Yes, it seems, you can OCR Sinhala texts!
BUT: As with any script, and probably even more so for non-Latin scripts, you probably won't get good results on handwritten texts. OCR on handwritten text is a field of research in its own right.
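If you want to try both language models from a script rather than by hand, the two command lines above can be wrapped as follows. This is only a minimal Python sketch: the helper names `build_ocr_command` and `ocr_to_text` are made up for illustration, and it assumes the `tesseract` executable is on your PATH and the traineddata files are already in your tessdata folder.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(image, output_base, lang="sin"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_to_text(image, output_base, lang="sin"):
    """Run Tesseract and return the recognized text.

    Returns None when no tesseract executable is found on PATH
    (hypothetical helper, not part of Tesseract itself).
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_ocr_command(image, output_base, lang), check=True)
    # Tesseract writes its result to "<output_base>.txt".
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Only show the commands here; call ocr_to_text() on a real image to run them.
    for lang in ("sin", "Sinhala"):
        print(" ".join(build_ocr_command("myimage.png", "output", lang)))
```

Calling `ocr_to_text("myimage.png", "output", "sin")` and then the same with `"Sinhala"` lets you compare the two models on your own scans.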
I want to use a package called "ESC" (elliptic surface calculator) that can be loaded into Maple.
The instructions from the creator are:
Save as a text file with character encoding ISO-8859-1 (ISO Latin 1)
and load within Maple using the "read" command.
I have problems loading the file into Maple and saving it with this encoding. Can anybody give the exact load command, with details, and explain how to save text with this encoding?
Here is the package page: http://c-faculty.chuo-u.ac.jp/~kuwata/2012-13/Maple_resources/ESC.mpl
I use Windows 8.1 and Maple 18. Thanks!
On the webpage, just right-click the source code file and choose to save it as a text file.
After that, open a Maple worksheet and read the file ESC.mpl.txt, like this:
restart:
read("C:/tcdata/ESC.mpl.txt"): # I saved the file in a folder named `tcdata` on drive C.
Once the file has been read into Maple, you can do whatever you are supposed to do with it. I tried these to check whether the source code is working:
ESC();
elliptic_surface(1,1,1,1,1);
Apparently, the source file has been read and is working properly.
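If the saved file does not end up in ISO-8859-1 (for example because your editor wrote UTF-8), it can be re-encoded before reading it in Maple. The following is a minimal Python sketch, not part of the package instructions: the function name `save_as_latin1` is made up, and it assumes you know the encoding the file currently has (UTF-8 is used here as a guess).

```python
from pathlib import Path


def save_as_latin1(source_path, target_path, source_encoding="utf-8"):
    """Re-encode a text file as ISO-8859-1 (ISO Latin 1).

    source_encoding is an assumption about the file you downloaded.
    A UnicodeEncodeError is raised if the text contains a character
    that ISO-8859-1 cannot represent.
    """
    text = Path(source_path).read_text(encoding=source_encoding)
    target = Path(target_path)
    target.write_text(text, encoding="iso-8859-1")
    return target
```

For example, `save_as_latin1("ESC.mpl.txt", "C:/tcdata/ESC.mpl.txt")` would write a Latin-1 copy to the folder used in the `read` command above.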
In a textbook, the examples on the book's companion website are given in Maple, in the .mws file format.
I do not have Maple but I am interested in studying the code of the examples.
I wonder if there is a conversion tool for Linux that allows me to export to text or at least to view the content of the script.
I am aware that a free Maple Player exists, but from the product description it is not clear whether it lets you see the script content or just "play" with the inputs it defines.
I did also try this Maple-to-Python converter, but it's very alpha and it just doesn't work.
The Maple Player should definitely be able to open any .mws file. You won't be able to run commands, but you can at least read the code contained in the file.
Maple itself can export .mws files to text, but other than that, I haven't heard of any other converters for extracting the code from these files.
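As a rough fallback on Linux, the input lines can sometimes be pulled out of a .mws file directly. The Python sketch below rests on an assumption that is not confirmed anywhere in this thread: that the classic worksheet is a plain-text file in which each input command is stored in a group like `{MPLTEXT 1 0 8 "restart;" }`. The function name `extract_maple_input` is made up, and files that deviate from this layout will simply yield nothing.

```python
import re

# Assumed layout: {MPLTEXT <int> <int> <length> "<escaped text>" }
_MPLTEXT = re.compile(r'\{MPLTEXT\s+\d+\s+\d+\s+\d+\s+"((?:[^"\\]|\\.)*)"', re.DOTALL)


def _unescape(raw):
    # Assumption: a backslash followed by a newline is a line continuation,
    # and any other backslash simply escapes the next character.
    return re.sub(
        r"\\(.)",
        lambda m: "" if m.group(1) == "\n" else m.group(1),
        raw,
        flags=re.DOTALL,
    )


def extract_maple_input(worksheet_text):
    """Return the list of input strings found in the worksheet text."""
    return [_unescape(raw) for raw in _MPLTEXT.findall(worksheet_text)]
```

Reading the file with `open(path, encoding="latin-1").read()` and printing each item of `extract_maple_input(...)` gives a plain listing of the commands, if the assumption about the format holds for your files.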
I'm wondering if this is a limitation of fileutil::magic::mimetype, or whether something has gotten messed up in my installation. TCLLIB 1.15/TCL 8.5
Take an ordinary Microsoft Word .doc file and pass it to fileutil::magic::mimetype e.g.
package require fileutil
package require fileutil::magic::mimetype
set result [fileutil::magic::mimetype "/tmp/test.doc"]
It returns an empty string. The same happens for mp3 and other file formats. It does recognise GIF, PNG, TIFF and other image formats.
Calling fileutil::fileType returns binary for the Word document.
Standard Linux command file -i returns "application/msword" for the same file.
Can anyone confirm whether this is expected behaviour? I'm a little confused about the relationship between the fileutil and fumagic libraries, so maybe I've broken something in my install around that area.
In a Perl script, I am trying to convert SVG files to PDF. This works fine by simply calling Inkscape:
system "inkscape -D -z --file=$in --export-pdf=$out";
But it is enormously slow even for small 100 KB files: it can take minutes per file, which causes the script to fail when it runs under a time-out constraint, e.g. on a web server.
To speed things up, I have read about svg2pdf as a standalone tool, but I never found a binary for Win7 or managed to compile it, even with the libcairo DLLs present.
My last idea now is to use the CPAN module Cairo. I am hoping that it can convert an SVG file to PDF, but in the documentation I only find drawing operations and surfaces, and no method to write or convert a file.
Does anyone have experience with that?
Making my comment an answer: You could try rsvg-convert which is part of the librsvg library. It's probably faster than Inkscape but it's still an external command.
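A call to rsvg-convert could look like the following. The sketch is in Python rather than Perl, the helper names are made up, and it assumes the `rsvg-convert` binary is installed and on the PATH; `-f pdf` selects the output format and `-o` the output file. The timeout value is an arbitrary illustration.

```python
import subprocess


def build_rsvg_command(svg_path, pdf_path):
    """Return the argument list for: rsvg-convert -f pdf -o <pdf> <svg>."""
    return ["rsvg-convert", "-f", "pdf", "-o", str(pdf_path), str(svg_path)]


def svg_to_pdf(svg_path, pdf_path, timeout_seconds=30):
    """Convert one SVG file to PDF.

    Raises subprocess.TimeoutExpired if the conversion takes longer than
    timeout_seconds, and CalledProcessError if rsvg-convert fails.
    """
    subprocess.run(
        build_rsvg_command(svg_path, pdf_path),
        check=True,
        timeout=timeout_seconds,
    )
```

Passing the arguments as a list instead of one interpolated string also avoids quoting problems with file names that contain spaces.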
I want to know what parameters the config file used by Tesseract OCR accepts, how to write a config file, etc.
I can't find any documentation about this on their site. How can I determine what parameters are supported, and what they mean?
I found these instructions in the link below. They are about writing the config file and where to place it:
The config file is a simple text file without a BOM and with Unix end-of-line marks (on Windows you can use an advanced text editor such as Notepad++ to achieve this).
If you use the tesseract executable, this is the only way to change Tesseract parameters.
The config file should be located in your tessdata/configs directory. Have a look there for some examples.
There is a list of all the variables plus descriptions of each one in http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version. Note it's for Tesseract 3.02, things may be different in other versions.
Edit: Also adding a pastebin link in case the above link becomes dead.
Tesseract v3.04 now offers the command line option --print-parameters, so you can call tesseract --print-parameters to get a list of the 678 (!) configurable parameters, their default values, and a short description:
Tesseract parameters:
editor_image_xpos 590 Editor image X Pos
editor_image_ypos 10 Editor image Y Pos
editor_image_menuheight 50 Add to image height for menu bar
editor_image_word_bb_color 7 Word bounding box colour
editor_image_blob_bb_color 4 Blob bounding box colour
editor_image_text_color 2 Correct text colour
...and many, many more
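To search that long listing, it can be loaded into a dictionary. This is a minimal Python sketch with made-up function names; it assumes each line has the shape "name, default value, description" as in the excerpt above, separated by tabs or other whitespace, and it may misread parameters whose default value is empty when the separators are not tabs.

```python
import subprocess


def parse_parameters(listing):
    """Map parameter name -> (default value, description)."""
    params = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.startswith("Tesseract parameters"):
            continue  # skip blank lines and the heading
        # Prefer tab-separated fields; fall back to runs of whitespace.
        parts = line.split("\t", 2) if "\t" in line else line.split(None, 2)
        if len(parts) < 2:
            continue
        description = parts[2] if len(parts) > 2 else ""
        params[parts[0]] = (parts[1], description)
    return params


def fetch_parameter_listing():
    """Run `tesseract --print-parameters` and return what it prints to stdout."""
    result = subprocess.run(
        ["tesseract", "--print-parameters"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With Tesseract installed, `parse_parameters(fetch_parameter_listing())` gives a dictionary you can filter by name, for example all keys starting with `tessedit_`.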
It's just a plain text file containing space-delimited key/value pairs for Tesseract config variables, each on a separate line; for instance:
interactive_display_mode T
tessedit_display_outwords T
There are several standard config files (such as digits and hocr) in Tesseract's tessdata/configs folder.
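Such a file can also be generated from a script, which makes it easy to guarantee the "no BOM, Unix line endings" requirement mentioned above. This is a minimal Python sketch; the function name `write_tesseract_config` and the file name `myconfig` are hypothetical.

```python
from pathlib import Path


def write_tesseract_config(path, settings):
    """Write one "key value" pair per line, UTF-8 without BOM, LF line endings."""
    lines = [f"{key} {value}" for key, value in settings.items()]
    # Encoding to bytes ourselves avoids platform newline translation on Windows.
    data = ("\n".join(lines) + "\n").encode("utf-8")
    target = Path(path)
    target.write_bytes(data)
    return target


if __name__ == "__main__":
    write_tesseract_config(
        "myconfig",  # hypothetical name; move the file into tessdata/configs
        {"interactive_display_mode": "T", "tessedit_display_outwords": "T"},
    )
```

Once the file is in tessdata/configs, its name is passed as a trailing argument, e.g. `tesseract myimage.png output myconfig`.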