How to get character wise confidence in tesseract using command line?

I am able to get word-level confidence scores from Tesseract 4.0 on the command line. I would like to know whether there is a way to get character-level confidence too.
For word-level confidence I used the command below:
tesseract [Image name] outputbase --oem 1 -l eng --psm 8 tsv

Set hocr_char_boxes to 1 in your config file. Or, at the command line, your updated command would be:
tesseract [Image name] outputbase --oem 1 -l eng --psm 8 -c hocr_char_boxes=1 hocr
Note the hocr output option, and look in the resulting file for the ..._wconf confidence values in the title attributes, e.g.
<span class='ocrx_word' id='word_1_1' title='bbox 127 344 4618 6915; x_wconf 1'>
Let me know if this works for you, otherwise I'll just delete the answer.
Source: https://github.com/tesseract-ocr/tesseract/issues/1465#issuecomment-513139976
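With hocr_char_boxes=1 the per-character entries should show up as nested spans inside each word span. As a minimal sketch, assuming those spans carry the class ocrx_cinfo and an x_conf field in their title attribute (check this against your own file; the sample fragment and the helper name char_conf are made up for illustration), the values can be pulled out with grep and sed:

```shell
# Illustrative hOCR fragment. The ocrx_cinfo class and the x_conf field are
# assumptions about the output format, and all numbers are invented.
hocr_sample="<span class='ocrx_word' id='word_1_1' title='bbox 10 10 90 40; x_wconf 93'>
<span class='ocrx_cinfo' title='x_bboxes 10 10 30 40; x_conf 98.5'>c</span>
<span class='ocrx_cinfo' title='x_bboxes 32 10 60 40; x_conf 91.2'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 62 10 90 40; x_conf 89.7'>t</span>
</span>"

# Read hOCR on stdin and print one "character confidence" pair per line.
char_conf() {
  grep -o "<span class='ocrx_cinfo'[^>]*>[^<]*</span>" |
    sed "s/.*x_conf \([0-9.]*\)'>\([^<]*\)<\/span>/\2 \1/"
}

printf '%s\n' "$hocr_sample" | char_conf
# c 98.5
# a 91.2
# t 89.7
```

For a real run, feed it the generated file instead: char_conf < outputbase.hocr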

Related

Using sed, delete from a specific line until the first match (not included)

I have some data that looks like this:
1:Alice 2313
2:Desctop 456
3:Cook 111
4:.filename 50
...
...
100:Good 3
Dir num:10
File num:90
...
...
I want to delete all lines from a specific line (e.g. line 3) up to the line where "Dir num:" shows up.
The ideal output (for the example above) would be:
1:Alice 2313
2:Desctop 456
Dir num:10
File num:90
...
...
I have googled several solutions, such as sed -i '/somestring/,$!d' file.
But these solutions are not suitable, because the deletion has to start at a specific line.
How can I do this in one command, without any temporary file?
Forgive my poor English; I'm not a native English speaker.

You need to specify the address range from the specified line number (3) to the line matching the pattern (/Dir num/). However, it's not quite as simple as
sed '3,/Dir num/ d' file
because that will delete the "Dir num" line. Try this instead:
sed '3,/Dir num/ {/Dir num/! d}' file
Within the range, this checks each line against the pattern: if the pattern is not matched, the line is deleted.
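A quick way to convince yourself is to run the command on a shortened copy of the sample data from the question (the variable names are only for illustration):

```shell
# Shortened sample modeled on the question's data.
sample='1:Alice 2313
2:Desctop 456
3:Cook 111
4:.filename 50
100:Good 3
Dir num:10
File num:90'

# Delete from line 3 up to, but not including, the first "Dir num" line.
result=$(printf '%s\n' "$sample" | sed '3,/Dir num/{/Dir num/!d;}')

printf '%s\n' "$result"
# 1:Alice 2313
# 2:Desctop 456
# Dir num:10
# File num:90
```

One caveat: sed only starts looking for the end pattern on the line after the range begins, so if line 3 itself were the "Dir num" line, the range would run on to the next match (or to the end of the file). Adding -i, as in the question's own attempt, should apply the edit to the file directly.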

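Use sed's range address, /pattern1/,/pattern2/. The empty pattern // reuses the last regular expression that was used, so both boundary lines are kept: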
$ sed -e '/2:Desctop 456/,/Dir num:10/{//!d}' inputFile
1:Alice 2313
2:Desctop 456
Dir num:10
File num:90
...
...

Listing the volumes on Solaris OS

I am new to Solaris and am trying to write a script that collects volume data from a Solaris box.
We wrote a similar script for Linux, where we used the "df -P" command to list the volumes and selected the entries that start with "/dev".
By default, on Linux, I can see a volume such as "/dev/sda1".
When I run the df command on the Solaris box (df -k), I cannot see any entry resembling /dev/* in the output.
When I mounted a CD, I could see an entry in the df output, as below.
/dev/dsk/c1t1d0s2 57632 57632 0 100% /media/VBOXADDITIONS_5.0.14_105127
So, on Solaris, what pattern should I look for to pick out the volumes?
And why am I not seeing at least one volume matching /dev/?
Is it "/dev" or something else?
I am using a Solaris 11 image on Oracle VirtualBox.
When I try the "format" command, I can see 3 disks:
AVAILABLE DISK SELECTIONS:
0. c1d0 <VBOX HAR-8ea18e8b-2b2a0a5-0001-31.25GB> testvolu
/pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
1. c2d0 <VBOX HAR-b4343b55-dbed77c-0001 cyl 1020 alt 2 hd 64 sec 32>
/pci@0,0/pci-ide@1,1/ide@1/cmdk@0,0
2. c3t0d0 <ATA-VBOX HARDDISK-1.0 cyl 1009 alt 2 hd 64 sec 32>
/pci@0,0/pci8086,2829@d/disk@0,0
But I don't see any partition in "df -k".
Also, I read here (https://docs.oracle.com/cd/E19455-01/805-6331/6j5vgg680/index.html) that disk names should be in "/dev/dsk/*" format.

Solaris 11 uses ZFS, which has no one-to-one relationship between volumes (partitions) and file systems.
You can look at the zpool status output to find the underlying devices.
$ zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        rpool     ONLINE       0     0     0
          c1t0d0  ONLINE       0     0     0
Here the whole c1t0d0 disk is used, hence there is no slice (sN) or partition (pN) suffix.
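For the script itself, one option is to filter the zpool status output for names that look like Solaris disk devices. This is only a sketch under a few assumptions: the helper name list_pool_devices is made up, the regular expression only covers cNtNdN-style names with an optional sN or pN suffix, and the /dev/dsk/ prefix is added on the assumption that zpool prints names relative to that directory. The sample text stands in for real command output:

```shell
# Sample output copied from the answer above; on a real system you would
# pipe "zpool status" into the function instead.
sample_status='  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        rpool     ONLINE       0     0     0
          c1t0d0  ONLINE       0     0     0'

# Print every first column that looks like a disk device (c1t0d0, c1d0,
# c1t0d0s2, ...), prefixed with /dev/dsk/.
list_pool_devices() {
  awk '$1 ~ /^c[0-9]+(t[0-9]+)?d[0-9]+([sp][0-9]+)?$/ { print "/dev/dsk/" $1 }'
}

devices=$(printf '%s\n' "$sample_status" | list_pool_devices)
printf '%s\n' "$devices"
# /dev/dsk/c1t0d0
```

On the Solaris box this would be used as: zpool status | list_pool_devices. If what you actually need is per-file-system usage, as df -P gave you on Linux, zfs list may be the closer equivalent.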

save ELKI output to a file

I want to save the cluster outputs to files. For example, I want to save the points of cluster 1 into c1.txt, the points of cluster 2 into c2.txt, and so on.
ELKI release 0.7
java -jar elki.jar -dbc.in ./f1 -dbc.out ./dir1 -algorithm clustering.DBSCAN -dbscan.epsilon 5 -dbscan.minpts 10
but it has this error:
the following parameter were not processed: [-dbc.out,./dir1]
So the command is evidently not correct. How can I save the clusters?

To save to a file, use
-resulthandler ResultWriter
-out folder/
To limit which files are written, you can use -out.filter.
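Put together with the command from the question, that would look roughly like the sketch below. It simply drops the unrecognized -dbc.out and appends the two options above; ./f1 and ./dir1 are the paths from the question, and the variable name elki_cmd is only for illustration. I have not checked what the files inside the output folder are called.

```shell
# The question's command with -dbc.out replaced by the result handler options.
elki_cmd="java -jar elki.jar -dbc.in ./f1 -algorithm clustering.DBSCAN -dbscan.epsilon 5 -dbscan.minpts 10 -resulthandler ResultWriter -out ./dir1"

# Run it only when the jar is in the current directory; otherwise just print it.
if [ -f elki.jar ]; then
  $elki_cmd
else
  printf '%s\n' "$elki_cmd"
fi
```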

Tesseract OCR mftraining Error code 2001 with 2 fonts in the font_properties file

I am trying to create new language data for Japanese with 2 fonts.
1. Arial Unicode MS
2. MS ゴシック (MS Gothic)
I am not sure how to create font_properties file with command line for two fonts.
Usually I run > echo Arial_Unicode_MS 0 0 1 0 0 >font_properties -- to create the font_properties file with one font.
Since I am going to use two fonts, I edited the file to add the second font.
But when I execute mftraining, it works fine for the first font, whether that is Arial Unicode MS or MS ゴシック. I get "malloc allocation error 2001" if I reference the second font in the file.
I even used Serak trainer to create font_properties file.
I want to do, as shown below.
1. mftraining.exe -F font_properties -U unicharset -O lang.unicharset lang.font1.exp0.tr
2. mftraining.exe -F font_properties -U unicharset -O lang.unicharset lang.font2.exp0.tr
#1 throws no error if font1 is the first font in the file, but then #2 fails.
#2 throws no error if font2 is the first font in the file, but then #1 fails.
What is wrong with my steps?
Regards,
Sharon

Your TIFF/Box files should have "Arial_Unicode_MS" in their names, not "font1" or "font2".
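To illustrate the naming: the font name embedded in each training file name should be the same string that appears in font_properties. A sketch, assuming jpn as the language code and MS_Gothic as the entry for the second font (both placeholders, as are the flag values reused on the second line). It only writes the file to a temporary directory and prints the command, it does not run mftraining:

```shell
lang=jpn                              # placeholder language code
fonts="Arial_Unicode_MS MS_Gothic"    # MS_Gothic is a placeholder name

workdir=$(mktemp -d)

# One line per font: <fontname> <italic> <bold> <fixed> <serif> <fraktur>.
# The question's flag values are reused for both fonts purely for illustration.
for font in $fonts; do
  printf '%s 0 0 1 0 0\n' "$font"
done > "$workdir/font_properties"

# The .tr file for each font embeds exactly the same font name.
tr_files=""
for font in $fonts; do
  tr_files="$tr_files $lang.$font.exp0.tr"
done

cat "$workdir/font_properties"
echo "mftraining -F font_properties -U unicharset -O $lang.unicharset$tr_files"
```

As far as I know, mftraining accepts several .tr files in one call, so both fonts can go into a single invocation as printed here.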

finding error in command line

I am trying to run a program from the command line, but I get an error.
The command line is:
quantisnp2.exe --outdir D:\output\ --config "C:\Program files\QuantiSNP\params.dat" --levels "C:\Program files\QuantiSNP\levels.dat" --sampleid CNV1 --gender female --emiters 10 --Lsettings 2000000 --doXcorrect --genotypes --gcdir D:\gc\ --input-files C:\Program files\CNV1.txt
QuantiSNP:Single-file mode input find.
QuantiSNP:Processing file: C:|Program
QuantiSNP:Local CG content directory specified. Local CG content correction will be used.
??? Error using ==>textread at 167
File not found.
Error in ==> quantisnp2 at 293

The first thing I'd look at is the unquoted C:\Program files\CNV1.txt at the end of the command (your other paths that contain spaces are quoted).
There's a good chance it is being treated as two arguments, C:\Program and files\CNV1.txt; the "Processing file: C:|Program" line in the output is consistent with that.
You may also want to check the spelling of emiters. I'm pretty certain the correct English word is emitters, though of course this could be a case of the QuantiSNP developers not knowing how to spell :-)
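The splitting is easy to demonstrate. The sketch below uses a POSIX shell rather than the Windows prompt, and a made-up path with a space in it, but the principle is the same: without quotes, the space ends the argument.

```shell
# Report how many arguments a command would receive.
count_args() { echo "$#"; }

# Hypothetical path with a space, standing in for C:\Program files\CNV1.txt.
unquoted=$(count_args --input-files /data/Program files/CNV1.txt)
quoted=$(count_args --input-files "/data/Program files/CNV1.txt")

echo "unquoted: $unquoted arguments"   # unquoted: 3 arguments
echo "quoted:   $quoted arguments"     # quoted:   2 arguments
```

So quoting the last path, as you already did for --config and --levels, should give the program one file name instead of two fragments.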
