i'm doing ocr of a bunch of old newspaper. Tesseract is doing a really good, but there's only one problem:
is detecting almost (99%) "o" characters as "º".
If a make a whitelist character list excluding the "º" Tesseract doesn't replace with the most identical one (maybe the "o"), instead of that it only reject the recognition. So the word "Hola" become recognized has "Hla".
So do you know a configuration string to replace all "º" with "o" character?
I can do to the txt output simply with sed, but i need it for a PDF output.
Thanks in advance
Try replacing the unicharambigs file embedded in .traineddata file for Tesseract 3.03–3.05.
https://github.com/tesseract-ocr/tessdoc/blob/master/Training-Tesseract-3.03%E2%80%933.05.md#the-unicharambigs-file
Related
I have been researching this for quite some time but cannot seem to find an answer. Perhaps someone here can help.
I am trying to use sed to replace words in yml / yaml files. Since some of the words are included in the names I want to only replace words that appear after the colon (':').
For example. If the .yml file includes:
en:
label_some_tracker: A tracker
label_all_tracker: All trackers
label_attachment_type_trackers: Select trackers.
tracker_plural: trackers
and I want to replace all occurrences of tracker with issue in all values. The pattern:
s/tracker/issue/
also changes the names of the fields, which breaks my code.
I can reduce the size of the problem somewhat by including terms for all possible variants of a word. For example:
s/trackers/issues/
s/tracker/issue/
but that doesn't deal with all situations.
I have tried inserting a space before the search term:
s/ tracker/ issue/
but that matches names where the search term is at the beginning of the line.
If I search for whole words then it still seems to pick up the names because ':' and '_' are 'non word' characters.
If I try to put spaces at the beginning and end of the search term but then it misses words that are at the end of a line or words patterns with punctuation marks before the training space.
The only sure way seems to be to only replace words after a colon (':') but I cannot seem to figure out how to do that with sed.
Does anyone here know how?
With GNU sed:
sed -E 's/(:.*)tracker/\1issue/g' file
Output:
en:
label_some_tracker: A issue
label_all_tracker: All issues
label_attachment_type_trackers: Select issues.
tracker_plural: issues
Replace second occurance:
sed 's/tracker/issue/2' file
I'm trying to accomplish following task in Octave:
Read filename from text file
Search for this file in particular location on hard drive
My script works for most files, but for certain files containing unicode characters I'm unable to match the filename from textfile with filename as it appears in the file system.
Filenames in textfile are in UTF-8 encoding and I read them in Octave with function fgetl().
Filenames from file system are obtained via function readdir(). I'm on Windows, NTFS file system.
For example, one problematic filename contains character "Č".
When printed out in Octave console, the characters appear exactly the same. However, a HEX viewer reveals that the characters are not actually the same. In the first case the character is encoded as 0x010C, in the second case as 0x0043 + 0x030C. Comparing both of them via strcmp() fails, of course.
What I tried to do is to omitt all non-ASCII characters from the filename and then compare them. But this didn't work, probably because in the second variant the first part of the character (0x0043) is actually ASCII.
Now I'm looking for some way of converting one format to another to be able to compare them. Any ideas?
EDIT:
As I discovered later, the character Č in the filename on Windows is actually written as C+ˇ, which is just another way you can write that character. So the difference probably insn't in encoding standard, but in 2 different ways to achieve 1 visible character (glyph).
This question basically then changes to a task of matching characters written "at once" and corresponding pair of letter+combining character.
I copy text from PDF files into word 2010 documents using Abbyy conversion software. I find the result will contain many line breaks which are incorrect. Is there any way I can remove any such marks if they are not preceded by either "." or "?" or "!"
I write macros in excel but have no experience of word coding
You could do a search and replace depending if you can find some sort of rules wich you can apply. Mayeby a little screenshot?
My CSS files have become contaminated with "file separator" characters (AKA "INFORMATION SEPARATOR FOUR" or ALT/028 characters). How can I get rid of them?
This is the character:
http://www.fileformat.info/info/unicode/char/1c/index.htm
Background
I manage a number of .CSS text files that are fairly similar. Unfortunately a number of these file have somehow got "file separator" characters pasted into them. Although they do still seem to work in browsers any file that has one of these characters anywhere within it can not be indexed by my desktop search utility (X1 Search). And this is making them extremely hard for me to manage because I need to compare CSS files contantly.
[Bizarrely X1 Search ignores the character if the filename extension is .TXT but files to index the entire file if the filename extension is .CSS]
Worse this "file separator" character is almost invisible within my text editor (TextPad 7.2). The only way I can detect it is to make spaces and carriage returns visible and then it appears as blank space. Worse still it appears to be impossible to search for using text search.
To make it clear what I mean an example that I have pasted into this page. The "file separator" character is on LineB below
LineA
LineB
LineC
LineD
Is there any way to remove this character from multiple text (in this case CSS) files at once?
NB I do NOT want to remove the whole line, just the one character(!)
Thanks
J
P.S. I am running on Windows7 (x64). I am using TextPad 7.3.
I have eventually managed to answer my own question.
Text Crawler and the use of a regular expression of "\x1c" appears to be the answer.
Fwiw, both Agent Ransack and FileLocator Pro filter out any characters in the ASCII range 0-31 (excluding 0x09 - tab) from the input field.
i am trying to find a clean way to decode some "special characters" using php, i have an RTF file (aslo PDF and DOC with sane kind of issue), i manage to open it and find clear text in it, put at the end it still output some character like : é for é or ç for ç.
I have tried mb_detect_encoding (with "auto" also) but it detect "ACSII", i have try to convert using mb_convert_encoding($mytext,'ISO-8859-1'), mb_convert_encoding($mytext,'ISO-8859-15'),
mb_convert_encoding($mytext,'UTF-8') and then UTF-8 to ISO-8859-1, htmlspecialchars, utf8_decode (recursively).
I made a mapping table but i don't think it the best way to do it ?
Xavier VILAIN
PS : Most of the document have been create in france latin charset.