I use Tesseract version 4 (swedish).
When interpreting images, it can never recognize some special characters like #, § etc.
For example, if I have xxx#gmail.com, it interprets xxxdgmail.com
How can I solve this problem?
Try combining with eng language -- or one that supports your special characters -- such as -l swe+eng or -l eng+swe
Related
I’m trying to remove the accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters by using a command. Currently, I’m using iconv to remove the accented characters. It turns out that all the Chinese characters are encoded as “?????”. I can’t figure out the way to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only encodes the code points between NUL (0x00) and 0x7F (DEL) which basically means the basic control characters plus basic
English alphabetics and punctuation. (Look at the ASCII chart for an enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be comfortably easy to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by Remove accents from accented characters and Chinese characters to test with gleaned from What's the complete range for Chinese characters in Unicode? (the ones on the boundary of the range are not particularly good test cases so I just guessed a bit).
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.
I am trying to grep across a list of tokens that include several non-ASCII characters. I want to match only emojis, other characters such as ð or ñ are fine. The unicode range for emojis appears to be U+1F600-U+1F1FF but when I search for it using grep this happens:
grep -P "[\x1F6-\x1F1]" contact_names.tokens
grep: range out of order in character class
https://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0077_e006c_e0073_e007f
You need to specify the code points with full value (not 1F6 but 1F600) and wrap them with curly braces. In addition, the first value must be smaller than the last value.
So the regex should be "[\x{1F1FF}-\x{1F600}]".
The unicode range for emojis is, however, more complex than you assumed. The page you referred does not sort characters by code point and emojis are placed in many blocks. If you want to cover almost all of emoji:
grep -P "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]" contact_names.tokens
(The range is borrowed from Suhail Gupta's answer on a similar question)
If you need to allow/disallow specific emoji blocks, see sequence data on unicode.org. List of emoji on Wikipedia also show characters in ordered tables but it might not list latest ones.
You could use ugrep as a drop-in replacement for grep to do this:
ugrep "[\x{1F1FF}-\x{1F600}]" contact_names.tokens
ugrep matches Unicode patterns by default (disabled with option -U).
The regular expression syntax is POSIX ERE compliant, extended with
Unicode character classes, lazy quantifiers, and negative patterns to
skip unwanted pattern matches to produce more precise results.
ugrep searches UTF-encoded input when UTF BOM (byte order mark) are
present and ASCII and UTF-8 when no UTF BOM is present. Option
--encoding permits many other file formats to be searched, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, 1250 to 1258.
ugrep searches text and binary files and produces hexdumps for binary matches.
The Unicode ranges for emojis is larger than the range 1F1FF+U to 1F600+U. See the official Unicode 12 publication https://unicode.org/emoji/charts-12.0/full-emoji-list.html
I want to use a couple of non-printable ASCII characters to encode special meanings to text generated by an application (under Linux).
Some non-printable ASCII characters are interpreted to have special meanings in a sheer number of applications, such as LF (linefeed), character number 10 in the ASCII table. I am unsure if there are more of these. Basically, I am looking for a character, where the function of which is outdated and safe to use in the modern world.
Thank you in advance.
How to display 192 character symbol ( └ ) in perl ?
What you want is to be able to print unicode, and the answer is in perldoc perluniintro.
You can use \x{nnnn} where n is the hex identifier, or you can do \N{...} with the name:
perl -E 'say "\x{2514}"; use charnames; say "\N{BOX DRAWINGS LIGHT UP AND RIGHT}"'
To use exactly these codes your terminal must support Code Page 437, which contains frames. Alternatively you can use derived CP850 with less boxing characters.
Such boxing characters also exist as Unicode Block Elements. The char which you want in perl is noted as \N{U+2514}. More details in perlunicode
That looks like the Code page 437 encoding. Perl is probably just outputting bytes that you give it. And your terminal is probably expecting UTF8.
So you need to decode it to Unicode, then re-encode it in UTF-8.
EDIT: Correct encoding.
As usual, Jon Skeet nails it: the 192 code is in the "extended ASCII" range. I suggest you follow #Douglas Leeder's advice, but I'm not sure which encoding www.LookupTables.com is giving you; ISO-8859-1 thinks 192 maps to "À", and Mac OS Roman thinks its "¿".
Is there a solution that works on ALL characters?
The user says they wanted to use an latin-1 extended charset character — so let's try an example from this block! So, if they wanted the Æ character, they would run...
print "\x{00C6}";
Output: �
Full Testable, Online Demo
TDLR Character Encoding Modes in Perl
So, wait, what just happened there? You'll notice that other ways of invoking UTF-8, such as char(...), \N{U+...}, and even unpack(...) also have the same issue. That's right -- the problem isn't with any of these functions, but an underlying character abstraction layer. In this case, you'll want to indicate this layer early in your code..
use open qw( :std :encoding(UTF-8) );
print "\x{00C6}";
Output: Æ
Now I can spell 'Ælf' correctly!
Full Testable, Online Demo
Why did that happen?
There is a note within the PerlDoc regarding the chr() function....
Note that characters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons.
For this reason, this special block needs to have that special use open to indicate std encoding.
This is a double question for you amazingly kind Stacked Overflow Wizards out there.
How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL? At the moment I cannot send any non-roman characters to swank-clojure, and using the command-line REPL garbles things.
It's really easy to do regular expressions on latin text:
(re-seq #"[\w]+" "It's really true that Japanese sentences don't need spaces?")
But what if I had some japanese? I thought that this would work, but I can't test it:
(re-seq #"[(?u)\w]+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
It gets harder if we have to use a dictionary to find word breaks, or to find a katakana-only word ourselves:
(re-seq #"[アイウエオ-ン]" "日本語の文章にはスペースが必要ないって、本当?")
Thanks!
Can't help with swank or Emacs, I'm afraid. I'm using Enclojure on NetBeans and it works well there.
On matching: As Alex said, \w doesn't work for non-English characters, not even the extended Latin charsets for Western Europe:
(re-seq #"\w+" "prøve") =>("pr" "ve") ; Norwegian
(re-seq #"\w+" "mañana") => ("ma" "ana") ; Spanish
(re-seq #"\w+" "große") => ("gro" "e") ; German
(re-seq #"\w+" "plaît") => ("pla" "t") ; French
The \w skips the extended chars. Using [(?u)\w]+ instead makes no difference, same with the Japanese.
But see this regex reference: \p{L} matches any Unicode character in category Letter, so it actually works for Norwegian
(re-seq #"\p{L}+" "prøve")
=> ("prøve")
as well as for Japanese (at least I suppose so, I can't read it but it seems to be in the ballpark):
(re-seq #"\p{L}+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
=> ("日本語" "の" "文章" "に" "は" "スペース" "が" "必要" "ない" "って" "本当")
There are lots of other options, like matching on combining diacritical marks and whatnot, check out the reference.
Edit: More on Unicode in Java
A quick reference to other points of potential interest when working with Unicode.
Fortunately, Java generally does a very good job of reading and writing text in the correct encodings for the location and platform, but occasionally you need to override it.
This is all Java, most of this stuff does not have a Clojure wrapper (at least not yet).
java.nio.charset.Charset - represents a charset like US-ASCII, ISO-8859-1, UTF-8
java.io.InputStreamReader - lets you specify a charset to translate from bytes to strings when reading. There is a corresponding OutputStreamWriter.
java.lang.String - lets you specify a charset when creating a String from an array of bytes.
java.lang.Character - has methods for getting the Unicode category of a character and converting between Java chars and Unicode code points.
java.util.regex.Pattern - specification of regexp patterns, including Unicode blocks and categories.
Java characters/strings are UTF-16 internally. The char type (and its wrapper Character) is 16 bits, which is not enough to represent all of Unicode, so many non-Latin scripts need two chars to represent one symbol.
When dealing with non-Latin Unicode it's often better to use code points rather than characters. A code point is one Unicode character/symbol represented as an int. The String and Character classes have methods for converting between Java chars and Unicode code points.
unicode.org - the Unicode standard and code charts.
I'm putting this here since I occasionally need this stuff, but not often enough to actually remember the details from one time to the next. Sort of a note to my future self, and it might be useful to others starting out with international languages and encodings as well.
I'll answer half a question here:
How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL?
A more interactive way:
M-x customize-group
"slime-lisp"
Find the option for slime coding system, and select utf-8-unix. Save this so Emacs picks it up in your next session.
Or place this in your .emacs:
(custom-set-variables '(slime-net-coding-system (quote utf-8-unix)))
That's what the interactive menu will do anyway.
Works on Emacs 23 and works on my machine
For katakana, Wikipedia shows you the Unicode ordering. So if you wanted to use a regex character class that caught all the katakana, I suppose you could do something like this:
user> (re-seq #"[\u30a0-\u30ff]+" "日本語の文章にはスペースが必要ないって、本当?")
("スペース")
Hiragana, for what it's worth:
user> (re-seq #"[\u3040-\u309f]+" "日本語の文章にはスペースが必要ないって、本当?")
("の" "には" "が" "ないって")
I'd be pretty amazed if any regex could detect Japanese word breaks.
for international characters you need to use Java Character classes, something like [\p{javaLowerCase}\p{javaUpperCase}]+ to match any word character... \w is used for ASCII - see java.util.Regex documentation
Prefix your regex with (?U) like so: (re-matches #"(?U)\w+" "ñé2_hi") => "ñé2_hi".
This sets the UNICODE_CHARACTER_CLASS flag to true so that the typical character classes do what you want with non-ASCII Unicode.
See here for more info: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS