Tesseract OCR command line for a single character

I'm trying to run tesseract-ocr over this image, without success:
> wget http://i.imgur.com/dOtlrvx.png
...
> convert dOtlrvx.png dOtlrvx.tif
> tesseract dOtlrvx.tif out -psm 10 && cat out.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0
.
The recognized character is a dot ".".
-psm 10 stands for "treat the image as a single character", so I think it's the correct option to use. I also tried the other possible psm values, and none of them work either.
Does anyone have an idea why this is not working? Any suggestion is welcome!
Thanks

Create a new config file for Tesseract containing the line tessedit_char_whitelist 0123456789, then process your image with: tesseract dOtlrvx.tif out -psm 10 your_config_file
This worked for me.
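
The config-file approach above could be scripted roughly as follows. This is only a sketch: digits_only is a made-up file name, the -psm spelling is the Tesseract 3.x one used in the question, and the OCR step is skipped unless tesseract and the image are actually present.

```shell
# Write a config file that restricts recognition to digits.
# "digits_only" is a hypothetical name; any file name should do.
printf 'tessedit_char_whitelist 0123456789\n' > digits_only

# Config files are listed last, after the options.
if command -v tesseract >/dev/null 2>&1 && [ -f dOtlrvx.tif ]; then
    tesseract dOtlrvx.tif out -psm 10 digits_only && cat out.txt
fi
```

The whitelist only narrows the set of characters Tesseract may return, so a very small or noisy image may still need preprocessing.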

Related

How do I set up configuration variables in Tesseract to better recognize code?

I want to use Tesseract to recognize code. It is said on their website that I can disable dictionaries by setting both of the configuration variables load_system_dawg and load_freq_dawg to false.
However, I haven't been able to do it correctly.
$ tesseract img.jpg output.txt --oem 0 -c load_system_dawg=0 load_freq_dawg=0
read_params_file: Can't open load_freq_dawg=0
Error: Tesseract (legacy) engine requested, but components are not present in /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Any ideas on best ways to handle it?

First, get an eng.traineddata that includes the legacy engine, or use a different OCR engine mode (--oem) value.
Next, read the output of tesseract --help-extra carefully:
-c VAR=VALUE Set value for config variables.
Multiple -c arguments are allowed.
In other words, each variable needs its own -c flag.
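
A sketch of what the corrected call might look like, assuming a traineddata file with legacy components is installed (otherwise --oem 0 will still fail as in the question). In the original command the second assignment had no -c in front of it, so Tesseract read it as a config file name, which explains the "read_params_file: Can't open load_freq_dawg=0" message. Note also that Tesseract appends .txt to the output base itself, so "output" is used here instead of "output.txt".

```shell
# One -c flag per variable.
dict_opts="-c load_system_dawg=0 -c load_freq_dawg=0"
echo "tesseract img.jpg output --oem 0 $dict_opts"

# Run only if tesseract and the image are available.
if command -v tesseract >/dev/null 2>&1 && [ -f img.jpg ]; then
    # $dict_opts is deliberately unquoted so it splits into four arguments.
    tesseract img.jpg output --oem 0 $dict_opts
fi
```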

Disable output of "Page x" in tesseract when using stdout as output [duplicate]

I am trying to use Tesseract for OCR of pictures, and I would like to disable the somewhat verbose output about the pages Tesseract is scanning:
:~$ tesseract stdin stdout -l eng txt
Page 1
<ocr output>
Is it possible to remove the "Page 1" from the output?
:~$ tesseract --version
tesseract 4.0.0-146-gc39a

Try adding the quiet option at the end of the command.

If you only want to see the OCR'd text, just redirect stderr to /dev/null.
foo | tesseract - - 2>/dev/null
Or of course, to a log file if you so desire.
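
To see why the redirect works, here is a small sketch that does not need Tesseract installed. fake_ocr is a hypothetical stand-in that behaves the way the question describes: the "Page 1" line goes to stderr and the recognized text goes to stdout.

```shell
# Hypothetical stand-in for tesseract: status on stderr, text on stdout.
fake_ocr() {
    echo "Page 1" >&2
    echo "recognized text"
}

# Discarding stderr keeps only the recognized text.
text=$(fake_ocr 2>/dev/null)
echo "$text"

# With the real tool, the two suggestions above would look like this
# (image.png is a placeholder name):
#   tesseract stdin stdout -l eng quiet < image.png
#   tesseract stdin stdout -l eng 2>/dev/null < image.png
```

Redirecting stderr also hides real error messages, so sending it to a log file may be the safer choice in scripts.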

Need help to write a basic Command Line code

I'm using Windows 10 if it matters and I'm trying to feed a file to the "oeminst" app that will convert this file from .EDR to .CCSS. According to the app's website its usage summary is this:
oeminst [-options] [inputfiles]
-v Verbose
-n Don't install, show where files would be installed
-c Don't install, save files to current directory
-S d Specify the install scope u = user (def.), l = local system]
infile Manufacturers setup.exe install file(s) or .dll(s) containing install files
infile.[edr|ccss|ccmx] EDR file(s) to translate and install or CCSS or CCMX files to install
If no file is provided, oeminst will look for the install CD.
More info can be found here: https://www.argyllcms.com/doc/oeminst.html
So far I have tried this:
C:\Users\PC>oeminst infile. [C:\Users\PC\testfile.edr]
oeminst: Error - Unable to load file 'infile [C:\Users\PC\testfile]'
I'd appreciate it if someone could at least tell me whether I'm doing it right or not.
P.S. sorry for the messed up text. Not sure how to fix it. It looks good in editing mode.

Try this: oeminst infile.edr C:\Users\PC\testfile.edr

Never mind, I got it. "infile" in the usage summary is just a placeholder for the actual file path:
C:\Users\PC>oeminst C:\Users\PC\testfile.edr

How do you save a file to an output.txt when using tessedit_char_whitelist in tesseract?

I have managed to use
tesseract image.jpg output.txt
to read the text on an image file and save it as a text file, but now I am trying to use more specific options with tesseract, and it tries to open the output file rather than saving into it.
I am trying to use
tesseract image.jpg stdout -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ%/-15 TextOutput
I have literally just started using tesseract, so I may well be making a stupid mistake.

I figured out that it works if you insert a > after the options, so the shell redirects stdout into the file, like this:
tesseract image.jpg stdout -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ%/-1250 > TextOutput.txt
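
For comparison, a sketch of both ways to get the text into a file. Tesseract's argument order is image, output base, then options and config files, so a trailing TextOutput after the options is read as a config file name rather than as an output file. The commands only run if tesseract and image.jpg are present.

```shell
whitelist='ABCDEFGHIJKLMNOPQRSTUVWXYZ%/-1250'

if command -v tesseract >/dev/null 2>&1 && [ -f image.jpg ]; then
    # Output base in second position: Tesseract appends .txt itself
    # and writes TextOutput.txt.
    tesseract image.jpg TextOutput -c "tessedit_char_whitelist=$whitelist"

    # Same result via stdout and a shell redirect, as in the answer above.
    tesseract image.jpg stdout -c "tessedit_char_whitelist=$whitelist" > TextOutput.txt
fi
```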

gsutil will not download files to my windows machine from powershell

gsutil -m cp -R 'gs://[BUCKET]/' 'C:/Users/[USER]/[FOLDER]'
displays the following error:
[Errno 22] invalid mode ('ab') or filename: u'C:\\Users\\[USER]\\[FOLDER]\\\\[BUCKET]\\[FILE].gstmp'
I've tried changing the '/'s to '//', '\\' and '\', with no results whatsoever.

After hours of trying to find out why this was happening, it turned out that the filenames contained a character that can't be used in filenames on Windows. Hope this helps if anybody else runs into this error.
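
One way to spot such names before copying is sketched below. has_windows_invalid_chars is a hypothetical helper and the sample names are invented; the character set is the usual list of characters Windows rejects in file names (< > : " \ | ? *). The forward slash is left out because it separates folders in object names.

```shell
# Succeeds if the given name contains a character Windows does not allow.
has_windows_invalid_chars() {
    printf '%s\n' "$1" | grep -q '[<>:"\|?*]'
}

# Invented sample names; in practice the list could come from "gsutil ls -r".
for name in 'report.csv' 'report:final.csv' 'what?.txt'; do
    if has_windows_invalid_chars "$name"; then
        echo "cannot be saved on Windows: $name"
    fi
done
```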
