Tesseract OCR line breaks on command line


I am using tesseract.exe on Windows 7 from the command line, and when I run OCR on an image, the output appears as one continuous line. I want the line breaks to match those in the image. Is there a command line argument for this? Any help will be appreciated.

This is because Tesseract ends each line with just a line feed instead of the carriage return + line feed that Windows Notepad expects. An easy workaround is to send the results to stdout and redirect that output into a file:
tesseract.exe eurotext.tif - > result.txt
instead of
tesseract.exe eurotext.tif result
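If the output file has already been written with bare line feeds, the line endings can also be converted afterwards. The following is only a minimal Python sketch, assuming the file is named result.txt as produced by the second command; the function name lf_to_crlf is made up for illustration.

```python
import re
from pathlib import Path


def lf_to_crlf(data: bytes) -> bytes:
    """Turn bare LF line endings into CRLF, leaving existing CRLF pairs alone."""
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


if __name__ == "__main__":
    path = Path("result.txt")  # assumed file name, adjust as needed
    if path.exists():
        # Work on bytes so Python does not translate line endings itself.
        path.write_bytes(lf_to_crlf(path.read_bytes()))
```

Working on bytes rather than text keeps the conversion explicit and makes it safe to run twice on the same file.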

Related

SED - using $ inserts string at beginning of line instead of end

I am not getting the expected results from sed 's/$/|2021-07-21/' demotoytable.csv
Before the command the top 3 lines look like:
urlhm|main_code|description|taxable|itemnum|xtras
t3mr.com/guitar/qrc/G19RTE000000753|G19RTE0000007530|Promo_labor_day_006|Consignment|7522831|bag
t3mr.com/guitar/qrc/G19RTE000000753|G19RTE0000007530|Promo_labor_day_006|Consignment|7522835|box
t3mr.com/guitar/qrc/G19RTE000000753|G19RTE0000007530|Promo_labor_day_006|Consignment|7522839|case
But after running the command sed 's/$/|2021-07-21/' demotoytable.csv
I get this result:
|2021-07-21code|description|taxable|itemnum|xtras
|2021-07-21itar/qrc/G19RTE000000753|G19RTE0000007530|Promo_labor_day_006|Consignment|7522831|bag
|2021-07-21itar/qrc/G19RTE000000753|G19RTE0000007530|Promo_labor_day_006|Consignment|7522835|box
|2021-07-21itar/qrc/G19RTE000000753|G19RTE0000007530|Promo_labor_day_006|Consignment|7522839|case
Any ideas on why this is happening, or better yet, how to fix it? I want each line to end with "|2021-07-21", not begin with it. I am on a Mac Pro running Big Sur.
Thanks

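The file apparently has Windows-style line endings (carriage return + line feed). sed appends your text after the carriage return, so the terminal moves the cursor back to the start of the line and prints the text over its beginning. Remove the carriage returns first, then append the text:
sed 's/\r$//; s/$/|2021-07-21/' demotoytable.csv
s/\r$// removes the carriage return at the end of each line; s/$/|2021-07-21/ then appends the value of your choice at the end of the line.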
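The same two steps can be sketched in Python, which may be useful if the sed on your system does not interpret \r as a carriage return (not every implementation does). This is an illustrative sketch, not a replacement for the sed command; the function names are hypothetical and the file name is taken from the question.

```python
def append_suffix(lines, suffix):
    """Strip any trailing CR/LF from each line, then append suffix."""
    return [line.rstrip("\r\n") + suffix for line in lines]


def print_with_suffix(path="demotoytable.csv", suffix="|2021-07-21"):
    # newline="" stops Python from translating line endings,
    # so a stray carriage return is still visible to rstrip().
    with open(path, newline="") as f:
        for line in append_suffix(f, suffix):
            print(line)
```

Calling print_with_suffix() writes the corrected lines to stdout, so the result can be redirected into a new file just like the sed output.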

Redirect C file output to a text file

I need to redirect my C program output to a text file with the command line.
I need a command that automatically puts a C program output to a text file, and a command that takes the input to a C program from a text file.
Can someone help please?

Basically, you need input and output redirection.
Say your code expects input from the user (for example, it calls fgets()) and you want the program to read from a file instead of waiting for keyboard input at the command prompt. You would call it like this:
program.exe < input.txt
Similarly, if you want printf() to write to a file instead of the command prompt, you would run:
program.exe > output.txt
To combine both in one line:
program.exe < input.txt > output.txt
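If the program has to be launched from a script rather than typed at the prompt, the same redirection can be set up by the launcher. Below is a minimal Python sketch using the standard subprocess module; run_redirected is a hypothetical helper name, and program.exe, input.txt and output.txt are the placeholder names used above.

```python
import subprocess


def run_redirected(cmd, input_path, output_path):
    """Run cmd with stdin read from input_path and stdout written to output_path."""
    with open(input_path, "rb") as fin, open(output_path, "wb") as fout:
        # The child process inherits these files as its stdin and stdout,
        # which is what the shell does for "< input.txt > output.txt".
        return subprocess.run(cmd, stdin=fin, stdout=fout, check=True).returncode
```

For example, run_redirected(["program.exe"], "input.txt", "output.txt") corresponds to the combined command line. In both cases the C program itself needs no changes.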

Remove '^#' line from a file

I have a file in which I have a particular line of this type:
^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^ ...
Actually, all the other lines are a list (a matrix) of numbers or *******. The problem is that I cannot open this file with normal editors, so I cannot remove this line.
I can open the file from the shell using nano.
To eliminate this line (it is the second line from the top) I used the simple command:
sed '2d' fort.21.dat
But this does not eliminate it.
Can someone help me eliminate this line and make this .dat file readable normally?
Thanks a lot

Try:
tr -d '\0' < fort.21.dat > fixed.21.dat
This uses the tr utility to delete the zero (NUL) bytes, which are displayed as ^# above, from the file and writes the cleaned copy to fixed.21.dat.
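Where tr is not available, the same cleanup can be sketched in Python. This assumes, as the answer does, that the unreadable characters are zero bytes; strip_nul is a hypothetical helper name and the file names are the ones used above.

```python
from pathlib import Path


def strip_nul(data: bytes) -> bytes:
    """Remove every zero (NUL) byte, leaving all other bytes untouched."""
    return data.replace(b"\x00", b"")


if __name__ == "__main__":
    src = Path("fort.21.dat")  # file name from the question
    if src.exists():
        # Write to a new file so the original stays available for comparison.
        Path("fixed.21.dat").write_bytes(strip_nul(src.read_bytes()))
```

Like the tr command, this removes only the zero bytes; if the second line contained nothing else, it remains in the output as an empty line.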

How to increase the input line length (max) in Windows?

In one of my batch files I set a variable.
When I run this batch file, it shows the error
"The input line is too long."
How can I overcome this problem?
Or is it possible to increase the maximum input line length in Windows?

As per my comment, to continue a line in a batch file, append the ^ character to the line. E.g.:
somelong command ^
carries on here ^
and finally ends here
This will behave as one line.
I am not sure, however, whether this will overcome the input length limitation.

Very basic replace using sed

Really would appreciate help on this.
I am using sed to create a CSV file. Essentially multiple html files are all merged to a single html file and sed is then used to remove all the junk pictures etc to get to the raw columnar data.
I have all this working but am stuck on the last bit.
What I want to do is very basic - I want to replace the following lines:
"a variable string"
"end td"
"begin td"
with a single line:
"a variable string"
(with a tab character at the end of this line)
I'M USING DOS.
As you can see, I'm new to all this. Getting this working would save me a lot of time in the future, so I would appreciate the help.
At the moment I have to inject some HTML headers back into the text file, open it in an HTML editor, select the table and then paste it into a spreadsheet, which is a bit of a pain.
P.S. Is there an easy way to get sed to remove the parentheses '(' and ')' from a given line?

I doubt that this is what you really want, but it's what you asked for.
sed "s/\"a variable string\"/&\t/; s/\"end td\"//; s/\"begin td\"//" inputfile
What you probably want to do is replace them when they appear consecutively. Here's how you might do that:
sed "1{N;N}; /\"a variable string\"\n\"end td\"\n\"begin td\"/ s/\n.*$/\t/;ta;bb;:a;N;N;:b;$!P;N;D" inputfile
This will remove all parentheses in a file:
sed "s/[()]//g" inputfile
To select particular lines, you could do something like this:
sed "/foo/ s/[()]//g" inputfile
which will only make the replacement if the word "foo" is somewhere on a line.
Edit: Changed single quotes to double quotes to accommodate GNUWin32 and CMD.EXE.
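For comparison, the "replace them when they appear consecutively" idea can be sketched in Python, where the three-line lookahead is easier to follow than in sed. This is only an illustration under the assumption that "end td" and "begin td" appear as literal lines exactly as quoted in the question; merge_cells is a hypothetical name.

```python
def merge_cells(lines, end='"end td"', begin='"begin td"'):
    """Replace each 'X', end, begin triple with the single line X + tab."""
    out = []
    i = 0
    while i < len(lines):
        # Look at the two lines following the current one.
        if lines[i + 1:i + 3] == [end, begin]:
            out.append(lines[i] + "\t")
            i += 3  # skip the two marker lines
        else:
            out.append(lines[i])
            i += 1
    return out
```

The input is a list of lines without their newline characters, for example the result of open(path).read().splitlines().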

A previous comment I left doesn't appear to have been saved, so I will try again.
The code to remove the ( and ) worked perfectly, thanks.
You are right: I was looking to merge the 3 lines into one line, so the second example you gave, where it looks like it's reading the next two lines into the pattern space, looks more promising. The output wasn't what I was expecting, however.
I now realize the code is going to have to be more complicated, and I don't want to trouble you any more. My manual method of injecting some HTML code back into the text file, opening it in OpenOffice and pasting into a spreadsheet only takes a few seconds, and I have a feeling that manually producing the sed code for this would be a nightmare.
Essentially the rules for converting the html would need to be:
[each tag has been formatted so it appears on its own line]
I have given example of an input file and desired output file below for reference
1) if < tr > is followed by < td > on the next line, completely remove the < tr > and < td > lines [i.e. do not output a carriage return] and on the NEXT line put a " at the start of that line [it doesn't matter about a carriage return at the end of this line as it is going to be edited later]
2) if < /td > is followed by < td >, completely remove both of these lines [again, do not output a carriage return after these lines], on the PREVIOUS line output a ", [do not output a carriage return] and on the NEXT line put a " at the start of the line [don't worry about the ending carriage return as it will be edited later]
3) if < /td > is followed by < /tr >, delete both of these lines and on the previous line add a " to the end of the line and a final carriage return.
I have given an example of what the input and desired output would be:
input: http://medinfo.redirectme.net/input.txt
[the wanted file will be posted in the next message - this board will not allow new users to post a message with more than one hyperlink!]
There is an added issue: the address column spans multiple lines in the input file. This could be reduced to one line by checking whether the first character of the NEXT line is a ". If it isn't, do not output the carriage return at the end of the current line.
Phew, that was a nightmare just to type out, never mind actually code. But thanks again for all your help in getting this far!
:-)
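The three rules amount to "each < tr > ... < /tr > block becomes one CSV row and each < td > ... < /td > block becomes one quoted field". A rough Python sketch of that reading follows; it swaps sed for Python and makes several assumptions not confirmed by the post: every tag sits alone on its own line, tags may or may not contain spaces, and multi-line cells (such as the address) are joined with a single space. The function names are hypothetical.

```python
import csv
import io


def rows_from_tag_lines(lines):
    """Collect the text between <td> and </td> into cells, grouped per <tr>."""
    rows, row, cell = [], None, None
    for raw in lines:
        line = raw.strip()
        tag = line.replace(" ", "").lower()  # "< td >" and "<TD>" both become "<td>"
        if tag == "<tr>":
            row = []
        elif tag == "<td>":
            cell = []
        elif tag == "</td>":
            if row is not None and cell is not None:
                row.append(" ".join(cell))  # join multi-line cells (assumed separator)
            cell = None
        elif tag == "</tr>":
            if row is not None:
                rows.append(row)
            row = None
        elif cell is not None and line:
            cell.append(line)  # plain text inside a cell
    return rows


def to_csv(rows):
    """Render rows as CSV text with every field quoted."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerows(rows)
    return buf.getvalue()
```

Text outside any cell is ignored, so leftover markup lines between rows do not end up in the output.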
