How to get Hocr output using python-tesseract

How to get Hocr output using python-tesseract - tesseract

I had been getting really good results using pytesseract but it is not able to preserve double spaces and they are really important for me.
And, so i decided to retrieve hocr output rather than pure text.But;there doesn't appear to be any way of specifying config file using pytessearct.
So, is it possible to specify cofiguration file using pytesseract or is there some default config file that i can change to get hocr output?
#run method from pytessearct.py
def run_tesseract(input_filename, output_filename_base, lang=None, boxes=False, config=None):
'''
runs the command:
`tesseract_cmd` `input_filename` `output_filename_base`
returns the exit status of tesseract, as well as tesseract's stderr output
'''
command = [tesseract_cmd, input_filename, output_filename_base]
if lang is not None:
command += ['-l', lang]
if boxes:
command += ['batch.nochop', 'makebox']
if config:
command += shlex.split(config)
#command+=['C:\\Program Files (x86)\\Tesseract-OCR\\tessdata\\configs\\hocr']
#print "command:",command
proc = subprocess.Popen(command,
stderr=subprocess.PIPE)
return (proc.wait(), proc.stderr.read())

You can use this another library to use Tesseract in Python: pyslibtesseract
Image:
Code:
import pyslibtesseract
tesseract_config = pyslibtesseract.TesseractConfig(psm=pyslibtesseract.PageSegMode.PSM_SINGLE_LINE, hocr=True)
print(pyslibtesseract.LibTesseract.simple_read(tesseract_config, 'phrase0.png'))
Output:
<div class='ocr_page' id='page_1' title='image ""; bbox 0 0 319 33; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 0 0 319 33">
<p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 10 13 276 25">
<span class='ocr_line' id='line_1_1' title="bbox 10 13 276 25; baseline 0 0"><span class='ocrx_word' id='word_1_1' title='bbox 10 14 41 25; x_wconf 75' lang='eng' dir='ltr'><strong>the</strong></span> <span class='ocrx_word' id='word_1_2' title='bbox 53 13 97 25; x_wconf 84' lang='eng' dir='ltr'><strong>book</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 111 13 129 25; x_wconf 79' lang='eng' dir='ltr'><strong>is</strong></span> <span class='ocrx_word' id='word_1_4' title='bbox 143 17 164 25; x_wconf 83' lang='eng' dir='ltr'>on</span> <span class='ocrx_word' id='word_1_5' title='bbox 178 14 209 25; x_wconf 75' lang='eng' dir='ltr'><strong>the</strong></span> <span class='ocrx_word' id='word_1_6' title='bbox 223 14 276 25; x_wconf 76' lang='eng' dir='ltr'><strong>table</strong></span>
</span>
</p>
</div>
</div>

This worked for me :)
from pytesseract import pytesseract
pytesseract.run_tesseract('image.png', 'output', lang=None, boxes=False, config="hocr")
where : image.png is the image file besides this python file. The output file named output.hocr will be generated next to these files. Open this file in text editor to see the hocr output.

Just add hocr at the end of your command like this
tesseract input_filename output_filename_base hocr
Output file will be a html file

I've edited krozaine's answer:
import pytesseract
pytesseract.run_and_get_output('image.jpg', 'hocr', lang=None, config="hocr")

Related

import txt file in SAS

I have a text file with comments that I need to import in SAS.
the text file look like this
# DATA1
#
# --
#
ID nbmiss x1 x2 x3 x4
1 1 45 38 47
2 0 37 45 39 51
3 3 58
4 4
5 0 68 45 73 76
6 2 52 48
my output in SAS must look like this
Obs x1 x2 x3 x4
1 . 45 38 47
2 37 45 39 51
3 . . . 58
4 . . . .
5 68 45 73 76
6 . . 52 48
here is what I did. It gives me what I am looking for but it's long. I think there is a more simple way.
proc import datafile= 'Z:\bloc1data\data\data1.txt'
out=class
dbms=dlm
replace;
datarow=6;
delimiter='09'x;
run;
proc print data = work.class label;
var VAR3 VAR4 VAR5 VAR6;
label VAR3='x1' VAR4='x2' VAR5='x3' VAR6='x4';
run;
My question is how to have the same output in a simplify way?
Thank you for your time.

This is the part that's doing the import:
proc import datafile= 'Z:\bloc1data\data\data1.txt'
out=class
dbms=dlm
replace;
datarow=6;
delimiter='09'x;
run;
That seems pretty short to me. Four actual lines of code, around a hundred characters... The equivalent code in the data step is basically the same.
data want;
infile 'z:\bloc1data\data\data1.txt' dlm='09'x dsd firstobs=6;
input id nbmiss x1 x2 x3 x4;
run;
That file unfortunately doesn't work well for determining the names automatically (which otherwise you could do). DBMS=DLM does not have a namerow option to tell it where to pick names up from, so you would need to preprocess the file to remove the extraneous lines to do that. You're welcome to ask as a separate question how to do so, but it's not "simpler" than the above (though it is probably "better").

Modifying perl script to not print duplicates and extract sequences of a certain length

I want to first apologize for the biological nature of this post. I thought I should post some background first. I have a set of gene files that contain anywhere from one to five DNA sequences from different species. I used a bash shell script to perform blastn with each gene file as a query and a file of all transcriptome sequences (all_transcriptome_seq.fasta) from the five species as the subject. I now want to process these output files (and there are many) so that I can get all subject sequences that hit into one file per gene, with duplicate sequences removed (except to keep one), and ensure I'm getting the length of the sequences that actually hit the query.
Here is what the blastn output looks like for one gene file (columns: qseqid qlen sseqid slen qframe qstart qend sframe sstart send evalue bitscore pident nident length)
Acur_01000750.1_OFAS014956-RA-EXON04 248 Apil_comp17195_c0_seq1 1184 1 1 248 1 824 1072 2e-73 259 85.60 214 250
Acur_01000750.1_OFAS014956-RA-EXON04 248 Atri_comp5613_c0_seq1 1067 1 2 248 1 344 96 8e-97 337 91.16 227 249
Acur_01000750.1_OFAS014956-RA-EXON04 248 Acur_01000750.1 992 1 1 248 1 655 902 1e-133 459 100.00 248 248
Acur_01000750.1_OFAS014956-RA-EXON04 248 Btri_comp17734_c0_seq1 1001 1 1 248 1 656 905 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Atri_comp5613_c0_seq1 1067 1 2 250 1 344 96 1e-60 217 82.33 205 249
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Acur_01000750.1 992 1 1 250 1 655 902 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Btri_comp17734_c0_seq1 1001 1 1 250 1 656 905 1e-134 462 100.00 250 250
I've been working on a perl script that would, in short, take the sseqid column to pull out the corresponding sequences from the all_transcriptome_seq.fasta file, place these into a new file, and trim the transcripts to the sstart and send positions. Here is the script, so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
############################################################################
# blastn_post-processing.pl v. 1.0 by Michael F., XXXXXX
############################################################################
my($progname) = $0;
############################################################################
# Initialize variables
############################################################################
my($jter);
my($com);
my($t1);
if ( #ARGV != 2 ) {
print "Usage:\n \$ $progname <infile> <transcriptomes>\n";
print " infile = tab-delimited blastn text file\n";
print " transcriptomes = fasta file of all transcriptomes\n";
print "exiting...\n";
exit;
}
my($infile)=$ARGV[0];
my($transcriptomes)=$ARGV[1];
############################################################################
# Read the input file
############################################################################
print "Reading the input file... ";
open (my $INF, $infile) or die "Unable to open file";
my #data = <$INF>;
print #data;
close($INF) or die "Could not close file $infile.\n";
my($nlines) = $#data + 1;
my($inlines) = $nlines - 1;
print "$nlines blastn hits read\n\n";
############################################################################
# Extract hits and place sequences into new file
############################################################################
my #temparray;
my #templine;
my($seqfname);
open ($INF, $infile) or die "Could not open file $infile for input.\n";
#temparray = <$INF>;
close($INF) or die "Could not close file $infile.\n";
$t1 = $#temparray + 1;
print "$infile\t$t1\n";
$seqfname = "$infile" . ".fasta";
if ( -e $seqfname ) {
print " --> $seqfname exists. overwriting\n";
unlink($seqfname);
}
# iterate through the individual hits
for ($jter=0; $jter<$t1; $jter++) {
(#templine) = split(/\s+/, $temparray[$jter]);
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
} # end for ($jter=0; $jter<$t1...
# Arguments for "extract_from_genome2"
# // argv[1] = name of genome file
# // argv[2] = gi number for contig
# // argv[3] = start of subsequence
# // argv[4] = end of subsequence
# // argv[5] = name of output sequence
Using this script, here is the output I'm getting:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
As you can see, it's pretty close to what I'm wanting. Here are the two issues I have and cannot seem to figure out how to resolve with my script. The first is that a sequence may occur more than once in the sseqid column, and with the script in its current form, it will print out duplicates of these sequences. I only need one. How can I modify my script to not duplicate sequences (i.e., how do I only retain one but remove the other duplicates)? Expected output:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
The second is the script is not quite extracting the right base pairs. It's super close, off by one or two, but its not exact.
For example, take the first subject hit Apil_comp17195_c0_seq1. The sstart and send values are 824 and 1072, respectively. When I go to the all_transcriptome_seq.fasta, I get
AAGATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAAC
at that base pair range, not
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
as outputted by my script, which is what I'm expecting. You will also notice that the sequence outputted by my script is slightly shorter than it should be. Does anyone know how I can fix these issues in my script?
Thanks, and sorry for the lengthy post!
Edit 1: a solution was offered that work for some of the infiles. However, some were causing the script to output fewer sequences than expected. Here is one such infile with 9 hits, from which I was expecting only 4 sequences.
Note: this issue has been largely resolved based on the solution provided below the answer section
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Apil_comp16418_c0_seq1 2079 1 1 1587 1 416 2002 0.0 2931 100.00 1587 1587
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Atri_comp13712_c0_seq1 1938 1 1 1587 1 1651 75 0.0 1221 80.73 1286 1593
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Ctom_01003023.1 2162 1 1 1406 1 1403 1 0.0 1430 85.07 1197 1407
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Apil_comp16418_c0_seq1 2079 1 1 1437 1 1866 430 0.0 1170 81.43 1175 1443
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Atri_comp13712_c0_seq1 1938 1 1 1441 1 201 1641 0.0 2662 100.00 1441 1441
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Acur_01000228.1 2415 1 1 1440 1 2231 797 0.0 1906 90.62 1305 1440
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Apil_comp16418_c0_seq1 2079 1 3 1284 1 1714 430 0.0 1351 85.69 1102 1286
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Acur_01000228.1 2415 1 1 1287 1 2084 797 0.0 1219 83.81 1082 1291
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Ctom_01003023.1 2162 1 1 1289 1 106 1394 0.0 2381 100.00 1289 1289
Edit 2: There is still an occasional output lacking fewer sequences than expected, although not as many after incorporating modifications to my script from Edit 1 suggestion (i.e., accounting for reverse direction). I cannot figure out why the script would be outputting fewer sequences in these other cases. Below the infile in question. The output is lacking Btri_comp15171_c0_seq1:
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Apil_comp19456_c0_seq1 3549 1 1 2464 1 761 3224 0.0 4551 100.00 2464 2464
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Btri_comp15171_c0_seq1 3766 1 1 2456 1 3046 591 0.0 1877 80.53 1985 2465
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Apil_comp19456_c0_seq1 3549 1 1 2457 1 3214 758 0.0 1879 80.54 1986 2466
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Atri_comp28646_c0_seq1 1403 1 1256 2454 1 1401 203 0.0 990 81.60 980 1201
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Btri_comp15171_c0_seq1 3766 1 1 2457 1 593 3049 0.0 4538 100.00 2457 2457

You can use hash to remove duplicates
The bellow code remove duplicates depending on their subject length (keep larger subject length rows).
Just update your # iterate through the individual hits part with
# iterate through the individual hits
my %filterhash;
my $subject_length;
for ($jter=0; $jter<$t1; $jter++) {
(#templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] -$templine[8];
if(exists $filterhash{$templine[2]} ){
if($filterhash{$templine[2]} < $subject_length){
$filterhash{$templine[2]}= $subject_length;
}
}
else{
$filterhash{$templine[2]}= $subject_length;
}
}
my %printhash;
for ($jter=0; $jter<$t1; $jter++) {
(#templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] -$templine[8];
if(not exists $printhash{$templine[2]})
{
$printhash{$templine[2]}=1;
if(exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length ){
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
}
}
else{
if(exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length ){
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
#print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
}
}
} # end for ($jter=0; $jter<$t1...
Hope this will help you.
Edit part update
for negative stand you need to replace
$subject_length = $templine[9] -$templine[8];
with
if($templine[8] > $templine[9]){
$subject_length = $templine[8] -$templine[9];
}else{
$subject_length = $templine[9] -$templine[8];
}
You also need to update your extract_from_genome2 code for negative strand sequences.

Data-event-click only called once, why?

I am making a custom widget in Shopify Dashing with the following code:
<div class="main">
<div id="bar">
Info
</div>
<div id="summary" data-bind="summary"></div>
<div id="info" data-bind="info"></div>
<h2 data-bind="more"></h3>
</div>
When the "Info" tag is clicked it is supposed to toggle the views shown in the widget. This works fine the first time it is clicked and the function toggleDetails is called as expected. When it is clicked a second or third time however nothing happens, the function is never called.
Why is this? I really don't understand.
This is my coffescript class:
1 class Dashing.Toggle extends Dashing.Widget
2
3 ready: ->
4 # This is fired when the widget is done being rendered
5 ##showDetails = false
6 #showSummary = true
7 #initiateView()
8
9 onData: (data) ->
10 # Handle incoming data
11 # You can access the html node of this widget with `#node`
12 # Example: $(#node).fadeOut().fadeIn() will make the node flash each time data comes in.
13
14
15 #accessor 'toggleDetails', ->
16 #switchView()
17 console.log "toggling 1"
18
19 initiateView: ->
20 #summaryView().show()
21 #infoView().hide()
22
23 switchView: ->
24 console.log "Switching"
25 if #showSummary
26 #summaryView().fadeOut #animationLength()
27 #infoView().fadeIn #animationLength()
28 #showSummary = false
29 else
30 #infoView().fadeOut #animationLength()
31 #summaryView().fadeIn #animationLength()
32 #showSummary = true
33
34 summaryView: ->
35 console.log "getting summary view"
36 $(#node).find("#summary")
37
38 infoView: ->
39 console.log "getting info view"
40 $(#node).find("#info")
41
42 animationLength: ->
43 1000

Ok I just solved it myself :P I still don't know why but changing form calling the #accessor method toggleDetails to switchView worked wonders. Does anyone know why?

Is default '\n' line terminator can be changed?

Is default '\n' Matlab line terminator can be changed? Can I use ',' instead of '\n'? Because the serial port that will be reading it is programmed to terminate when ',' is read.Is it possible?
Any answers are highly appreciated! Thanks in advance!

use eg:
ser = serial('COM1'); % establish connection between Matlab and COM1
set(ser, 'Terminator', 'CR'); % set communication string to end on ASCII 13
and replace 'CR' with ','
See
http://www.swarthmore.edu/NatSci/ceverba1/Class/e5/E5MatlabExamples.html
http://www.mathworks.co.uk/help/matlab/matlab_external/terminator.html

A simple string in matlab is not terminated by \n or \0 since it is a simple array of chars, as seen here:
>> a = string('Hello World')
Warning: string is obsolete and will be discontinued.
Use char instead.
a =
Hello World
>> double(a)
ans =
72 101 108 108 111 32 87 111 114 108 100
to add a \0 to the end simply use:
>> a(end+1)=0;
>> a
a =
Hello World
>> double(a) %the zero is there, but not printable as seen here
ans =
72 101 108 108 111 32 87 111 114 108 100 0

Ghostscript postscript pswrite is encoding text

Why is Ghostscript pswrite encoding my text in its output? Consider the following MWE:
%!PS-Adobe-3.0
%%Title: mwe.ps
%%Pages: 001
%%BoundingBox: 0 0 595 842
%%EndComments
%%Page: 1 1
%%PageBoundingBox: 0 0 595 842
0 0 1 setrgbcolor
0 0 595 842 rectfill
1 0 0 setrgbcolor
247 371 100 100 rectfill
/Times-Roman findfont
72 scalefont
setfont
newpath
247 300 moveto
(Chris) show
showpage
Saving this MWE to file and viewing in GSview will display a blue page with red square and my name underneath. Now run this file through Ghostscript 9.06 with the following command line:
"c:\Program Files\gs\gs9.06\bin\gswin64c.exe" ^
-dSAFER -dBATCH -dNOPAUSE ^
-sDEVICE=pswrite -sPAPERSIZE=a4 -r72 -sOutputFile=mwe_gs.ps mwe.ps
See Ghostscript output below. Can someone please explain what is happening here. Whilst the two rectfill commands are still apparent, my text (Chris) has been encoded and is no longer distinguishable.
Is there an alternative postscript device which would retain my text please?
<snip>
%%Page: 1 1
%%PageBoundingBox: 0 0 595 842
%%BeginPageSetup
GS_pswrite_2_0_1001 begin
595 842 /a4 setpagesize
/pagesave save store 197 dict begin
1 1 scale
%%EndPageSetup
gsave mark
255 0 r6
0 0 595 842 rf
255 0 r3
247 371 100 100 rf
Q q
0 0 595 0 0 842 ^ Y
255 0 r3
249 299 43 50 /5D
$C
,6CW56m1G"ZORNkWR*rB:!c2;9rlWTH="2^^[(q"h>cG<omZ2l^=qC[XbO:8_[?kji-8^"N#3q*
jhL~>
,
289 300 41 49 /0P
$C
4r?0p$m<EkK3,0>s8W-!s8W-!s8W,u]<1irI=*p=<t0>_#<)>Is8K6,aTi'$~>
,
325 300 30 33 /5I
$C
49S"pc4+Rhs8W-!s8W)oqdD:saRZq[4+k%):]~>
,
349 300 24 49 /0T
$C
4q%Ms%;PqCs8W-!s8W%1_qkn/K?*sYFSGd:5Q~>
,
377 299 23 34 /5M
$C
-TQR7$&O'!K+D:XribR9;$mr4#sqUi.T#,dX=Y&Llg+F`d^HC#%$"]~>
,
cleartomark end end pagesave restore
showpage
%%PageTrailer
%%Trailer
%%Pages: 1
%%EOF
NOTE: This might seem an odd activity but I'm exploring the idea of using Ghostscript to 'clean up' postscript output from Matlab application..

The 'text' has been converted to images, not vector paths. This is a serious limitation of the pswrite device, and one of the reasons it is deprecated, you should use the ps2write device instead. The only reason the pswrite device is still included at all is for epswrite which uses it (which is why the pswrite and epswrite output looks the same). At some point there will be an eps2write device and pswrite will be binned.
ps2write output is, by default, compressed. If you want uncompressed output, use the -dCompressPages=false switch on the command line.
If all you want is the location of the text you might consider the txtwrite device. The default implementation of this creates a plain text representation of the input, but you can have it output a faked up XML instead which includes things like the origin of the text.

Here is a simple example of the show operator being redefined to display position information about the show, along with performing the standard show operation. With ghostscript you can run multiple files, so the header file would be a prefix to the other file, which alters standard behavior.
The redefined show could have included font name and size. The data could have been written to a disk file, rather than dumped to the console. Any of other operator could have also been redefined, like rectfill, fill, stroke... Because the original operator is also called, you can convert a .ps to .pdf using a pdfwrite device, while at the same time obtaining position information.
gswin32c.exe -dBATCH -dNOPAUSE header.ps trash.ps
gswin32c.exe -sDEVICE=pdfwrite -dCompressPages=false -sOutputFile=test.pdf header.ps trash.ps
output
currentpoint x:247.0 y:300.0 pathbbox 249.015,298.992 400.066,349.184 text:Chris currentrgbcolor:1.0,0.0,0.0( )
currentpoint x:50.0 y:90.0 pathbbox 50.8682,89.2852 181.327,139.184 text:Fred currentrgbcolor:1.0,0.0,0.0( )
currentpoint x:150.0 y:200.0 pathbbox 150.867,184.298 304.154,247.673 text:Mary currentrgbcolor:1.0,0.0,0.0( )
currentpoint x:300.0 y:350.0 pathbbox 300.867,348.993 598.79,398.681 text:Mr. Green currentrgbcolor:0.0,1.0,0.0( )
currentpoint x:100.0 y:400.0 pathbbox 100.866,399.202 358.547,449.183 text:Mr. Blue currentrgbcolor:0.0,0.0,1.0( )
Header.ps
/mydict 5 dict def
mydict begin
/show
{
(currentpoint ) print
currentpoint exch 10 string cvs ( x:) print print 10 string cvs ( y:) print print
gsave dup false charpath flattenpath
( pathbbox ) print
pathbbox
4 -1 roll 10 string cvs print (,) print
3 -1 roll 10 string cvs print ( ) print
2 -1 roll 10 string cvs print (,) print
10 string cvs print ( ) print
grestore
( text:) 10 string cvs print
dup print ( ) print
( currentrgbcolor:) print
currentrgbcolor
3 -1 roll 10 string cvs print (,) print
2 -1 roll 10 string cvs print (,) print
10 string cvs print ( ) ==
systemdict /show get exec
} def
trash.ps
%!PS-Adobe-3.0
%%Title: mwe.ps
%%Pages: 001
%%BoundingBox: 0 0 595 842
%%EndComments
%%Page: 1 1
%%PageBoundingBox: 0 0 595 842
0 0 1 setrgbcolor
0 0 595 842 rectfill
1 0 0 setrgbcolor
247 371 100 100 rectfill
/Times-Roman findfont
72 scalefont
setfont
newpath
247 300 moveto (Chris) show
50 90 moveto (Fred) show
150 200 moveto (Mary) show
0 1 0 setrgbcolor
300 350 moveto (Mr. Green) show
0 0 1 setrgbcolor
100 400 moveto (Mr. Blue) show
showpage

The text has been converted to vector paths. 249 299 43 50 /5D begins the first letter "C", then 289 300 is the "h", 289 300 the "r"....
What pswrite has done is eliminate the need for a font, so while your original code used /Times-Roman, the distilled code doesn't need any font, but rather draws the text using vectors.
I'm not sure exactly what you are after, but you could try "ps2write" or "epswrite" as alternatives to "pswrite". pswrite is used to write to ps level 1 standard and ps2write will write ps level 2 output. Nobody requires ps level 1 anymore, so level 2 would be acceptable. The epswrite will write to encapsulated postscript (eps).

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to get Hocr output using python-tesseract - tesseract

Just add hocr at the end of your command like this tesseract input_filename output_filename_base hocr Output file will be a html file

I've edited krozaine's answer: import pytesseract pytesseract.run_and_get_output('image.jpg', 'hocr', lang=None, config="hocr")

Related

import txt file in SAS

Modifying perl script to not print duplicates and extract sequences of a certain length

Data-event-click only called once, why?

Is default '\n' line terminator can be changed?

Ghostscript postscript pswrite is encoding text

Categories

Resources