Tesseract: Short words are not detected

I'm trying to detect names with Tesseract 3.02.
The detection works perfectly without any error on long names like "Peggy Scott-Adams" and "Jonathan Miller".
However, if the names are too short, like "Jim Lee", the OCR detection no longer works.
Here is what the image looks like:
The first name is on the first line and the last name is on the second line. There is nothing else in the image.
What can I do?

Related

Error using text2image Font Exocet Light failed with 223518 hits = 99.94% when trying to build image file using Diablo 2 font

I am running Tesseract on Windows 11 from the command prompt.
The text file is my training data: the words that I want to turn into images.
The output feeds the next step in the Tesseract process for training my font.
I am passing --find_fonts, but I only have one font in the folder.
text2image --text="C:\PythonProjects\DiabloTesseractTrainFont\text.txt" --outputbase="C:\PythonProjects\DiabloTesseractTrainFont\Output\Dia.font.exp0" --fontconfig_tmpdir="C:\PythonProjects\DiabloTesseractTrainFont" --find_fonts --fonts_dir="C:\PythonProjects\DiabloTesseractTrainFont\Diablo Fonts"
The result:
Total chars = 223645
Font Exocet Light failed with 223518 hits = 99.94%
Not sure why it fails. I have built something similar to this before. I have tried with a font file that I know has worked and it does the exact same thing.
Any help would be appreciated.
I solved it. Some characters in the text file had been changed when I read it into Python. I believe they used to be bullet points, but I had read the file with ASCII encoding and errors set to ignore, and I assumed those characters would simply be removed. I was wrong: the bullet points had been replaced with something that showed up as PAD. I found it in Notepad++, highlighted one occurrence, and replaced them all with a space. Note that when I did the replace in Notepad++, the find field appeared empty, but it still replaced all of them. Now it compiles just fine. I was stuck for many hours, so I hope this helps someone.
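The manual search in Notepad++ can also be done programmatically. The following is a minimal Python sketch, not part of the original fix. It assumes the training text is meant to contain only printable ASCII plus tabs and line breaks, which may not hold for every font. The function names find_suspect_chars and replace_suspect_chars are hypothetical.

```python
import unicodedata

ALLOWED_CONTROL = "\t\r"  # assumption: tabs and carriage returns are harmless


def is_suspect(ch):
    """True for control characters and anything outside printable ASCII."""
    return ch not in ALLOWED_CONTROL and not (32 <= ord(ch) <= 126)


def find_suspect_chars(text):
    """Return (line, column, name) for each suspect character, 1-based."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if is_suspect(ch):
                # Control characters have no Unicode name, so fall back to U+XXXX.
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                hits.append((line_no, col, name))
    return hits


def replace_suspect_chars(text, replacement=" "):
    """Replace every suspect character, keeping line breaks intact."""
    return "".join(
        replacement if ch != "\n" and is_suspect(ch) else ch for ch in text
    )
```

Running find_suspect_chars on the training text before calling text2image shows where invisible characters are hiding, and replace_suspect_chars mirrors the replace-with-space step described above.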

tesseract output is in single line instead of multiple lines

I tried to use Tesseract for OCR and the recognition is fine.
I want to recognize addresses from letters. When I read one in, the following happens:
input:
Name Name
Street
Code City
output:
Name NameStreetCode City
I tried all the -psm variants with no effect. After googling, I think -psm 4 would be the right one, but I get an error:
`set_count == gridheight():Error:Assert failed:in file ..\..\textord\colfind.cpp, on line 648`
This only happens on Windows. On my MacBook the lines are correct.
Can anybody help me?
Use unix2dos to convert the file to Windows (CRLF) line endings.
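If unix2dos is not available, the same conversion can be sketched in a few lines of Python. This assumes the problem is that the output file contains LF-only line endings, which some Windows viewers do not display as line breaks. The function name to_crlf is hypothetical.

```python
def to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CRLF without doubling existing CRLFs."""
    # Normalise any existing CRLF to LF first, then expand every LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
```

Read the OCR output in binary mode, pass the bytes through to_crlf, and write the result back in binary mode so that Python does not translate the line endings a second time.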

ORSSerialPort data that should be a single line is coming in over multiple lines

I have an Arduino that sends a single line of GPS data down the serial line every half second. I know this works because the serial monitor in the Arduino IDE shows a new single line of data every half second.
I'm now writing a Mac program in Swift that puts each coordinate on a map as it comes in through the serial port, using the ORSSerialPort library to connect to the Arduino and receive its data. This works and I had a basic version running earlier, but I noticed gaps in the GPS data: the points appeared in small groups on the map, with a noticeable space in between, when they should form a continuous line.
Before I had the map, I had a text field to which each GPS data line was added as it came in. It produced exactly the same output as the Arduino IDE serial monitor, so I thought everything was working fine.
To try to fix the problem with the map, I removed the map code and simply print()ed each line in Xcode as it came in through the serial port. To my surprise there were random line breaks in the data, and I don't understand why. I suspect this is causing my problems (I split the string at every comma to extract the individual values), so I would like to know why the data comes out as a single line in the Arduino IDE and the text field, but not in Xcode and presumably wherever else I work with the string.
EDIT: I prefixed both the print to Xcode and the output to the text field with five plusses and suffixed them with five dashes, then set the serial port to close after sending a single report (what should be a single line of data). The output in both places ended up being three lines, each prefixed and suffixed with the plusses and dashes. See the photo below, which shows what should be a single line:
Why are my single lines of data coming through as multiple lines and behaving like separate values (for example, getting the last character of the line returns the last character of the first of the three lines, not a semicolon)?
The issue isn't likely that there are extra newlines being inserted. Rather, ORSSerialPort (like the underlying POSIX API it uses) simply reports data to its delegate as it comes in. It has no way of knowing that for your particular use case you only want complete lines.
You need to buffer the incoming data and only process it when you've received a complete "line"/packet. ORSSerialPort includes an API, ORSSerialPacketDescriptor that makes this easier. There is further documentation for that API here: https://github.com/armadsen/ORSSerialPort/wiki/Packet-Parsing-API
Do note that this API doesn't (yet) support the use of an end delimiter only. You need to validate the entire packet from beginning to end, as the parsing routine is "lazy": it tries to find the smallest match possible starting from the end of the packet.
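The buffering idea itself is independent of ORSSerialPort. As a rough illustration, here is a minimal sketch in Python rather than Swift. It does not use ORSSerialPacketDescriptor, and it splits on an end delimiter only. The class name LineBuffer is hypothetical, and the sample data format in the usage note is made up.

```python
class LineBuffer:
    """Accumulates raw chunks and yields only complete, delimited lines."""

    def __init__(self, delimiter=b"\n"):
        self._delimiter = delimiter
        self._pending = bytearray()

    def feed(self, chunk):
        """Add a chunk of bytes; return the complete lines received so far."""
        self._pending.extend(chunk)
        # Everything before the last delimiter is complete; the tail is not.
        *complete, rest = bytes(self._pending).split(self._delimiter)
        self._pending = bytearray(rest)
        return [
            line.rstrip(b"\r").decode("ascii", errors="replace")
            for line in complete
        ]
```

Each call to feed corresponds to one delivery of data from the serial port. A report that arrives as "$GPS,51.5" followed by ",-0.12;\r\n" produces nothing on the first call and one complete line on the second, so the comma splitting only ever sees whole lines.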

Defining what is a line in Tesseract

I'm working on document recognition for scanned bank statements. The statements that I have are organized by lines, such as the one attached. Because Tesseract does such a good job of detecting the areas of text, it breaks the lines in the middle. I'm assuming this is because of the large white space between the first block in the line (blurred for privacy reasons) and the next one ('EUR' or 'COURS').
In the hOCR file, the bboxes of all the elements in a line are within 2px or so of each other, so I could potentially rebuild a line myself. However, this seems more like a hack. Is there a way to tell Tesseract that lines should be as wide as the document itself? Or is there another way to go about it? I've tried playing with the psm option, but with no luck.
-psm 6 -- Assume a single uniform block of text -- should work. If not, you may want to use the older version 2.0x, which does not perform page layout analysis.
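If neither option helps, the fallback mentioned in the question (rebuilding lines from the hOCR bboxes) could look roughly like the Python sketch below. It assumes the bboxes have already been parsed out of the hOCR file into (x0, y0, x1, y1, text) tuples, and that comparing the top edge y0 with a small tolerance is enough to decide that two elements share a line. The function name group_into_lines and the sample values in the usage note are hypothetical.

```python
def group_into_lines(boxes, tolerance=2):
    """Group (x0, y0, x1, y1, text) boxes into text lines by their top edge."""
    groups = []  # each entry is [reference_y0, [boxes on that line]]
    for box in sorted(boxes, key=lambda b: b[1]):
        if groups and abs(box[1] - groups[-1][0]) <= tolerance:
            groups[-1][1].append(box)
        else:
            groups.append([box[1], [box]])
    # Within each line, order the elements left to right before joining.
    return [
        " ".join(b[4] for b in sorted(members, key=lambda b: b[0]))
        for _, members in groups
    ]
```

For example, a box at (10, 100, 200, 120) with text "ACME" and a box at (400, 101, 450, 120) with text "EUR" differ by 1px at the top edge, so they end up joined as one line. The tolerance would need tuning for skewed scans.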

Editing a Text File from Matlab

I'm still getting used to Matlab, and not sure if this is possible using Matlab or not, but it's just something that popped into my head that I thought could be interesting.
Is there any way to edit the contents of a text file in Matlab?
Moreover, is there any way to edit specific parts of the text file without altering the rest?
To elaborate, let's say I had a text file that was several lines long. For instance:
This is a hypothetical text file.
The cat chased a mouse.
The mouse ran into a hole.
The cat tried to paw at the mouse.
The mouse waited in the hole until the cat got bored.
The mouse came back out when the cat left.
Is there any way to use Matlab to exclusively edit, say, line 6 and change it from "The mouse waited in the hole until the cat got bored" to "The mouse fell asleep and the cat got bored", without having to change the rest of the file?
I know of several methods to read and display contents of text files using Matlab, but I'm not sure if there's any way to actually edit the text files in Matlab.
Thanks!
As far as I know, you will always have to read the file line by line (for instance into a cell-array) and edit it as you need. After that, you write a new file or overwrite the old one.
Of course, you can encapsulate this procedure and then call your own function, like
manipulateFile(lineNumber, newLineText)
Some commands that may come in handy are fopen, fscanf, textread, fprintf, and fclose.
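The read, edit, and write-back procedure is the same in most languages. As an illustration only, here it is sketched in Python rather than Matlab. The function name replace_line is hypothetical and plays the role of manipulateFile above, with the file path added as an explicit argument.

```python
from pathlib import Path


def replace_line(path, line_number, new_text):
    """Replace one line (1-based) of a text file and overwrite the file."""
    # Read the whole file, keeping each line's trailing newline.
    lines = Path(path).read_text().splitlines(keepends=True)
    if not 1 <= line_number <= len(lines):
        raise IndexError(f"file has {len(lines)} lines, got {line_number}")
    old = lines[line_number - 1]
    # Preserve the line ending so the following line is not merged in.
    ending = "\n" if old.endswith("\n") else ""
    lines[line_number - 1] = new_text + ending
    Path(path).write_text("".join(lines))
```

Even though only one line changes, the whole file is read and rewritten, which matches the point made in the answer: there is no in-place edit of a single line when the new text has a different length.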