Compare Chinese unicode strings, when multiple code points are the same character? - unicode

I'm writing some Java code that deals with Chinese characters, and I got some unexpected results -- strings that should be equal were not. Here is one of the offending characters, which means "six" (pinyin: liù): 六. This character can be represented with either of two code points:
F9D1 in the block: CJK Compatibility Ideographs
516D in the block: CJK Unified Ideographs
Wikipedia has a page about these character ranges, and the short section on compatibility ideographs does mention some duplicates, but the list omits this specific character.
So I'm wondering:
Is there a list of duplicate unicode characters somewhere so I can transform Strings before trying to compare them?
Is this normal when dealing with CJK characters, or have I done something else wrong?

Just normalize them. U+F9D1 becomes U+516D under any of the four normalization schemes:
$ export PERL_UNICODE=S
$ perl -le 'print "\x{F9D1}\x{516D}"' | uniquote -v
\N{CJK COMPATIBILITY IDEOGRAPH-F9D1}\N{CJK UNIFIED IDEOGRAPH-516D}
$ perl -le 'print "\x{F9D1}\x{516D}"' | nfd | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
$ perl -le 'print "\x{F9D1}\x{516D}"' | nfc | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
$ perl -le 'print "\x{F9D1}\x{516D}"' | nfkd | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
$ perl -le 'print "\x{F9D1}\x{516D}"' | nfkc | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
Many essential Unicode tools, including those, are available here.

Related

command line filtering of Unicode block

I've been trying for a couple hours to create a conceptually trivial filter that I can use on the command line, without success. The task is to filter out all lines containing Hangul Jamo characters, while retaining all other lines (which may contain ASCII, characters in the Hangul Syllable block, etc.).
So for example if the input was
foo
ᅤᆨ
간
the output would contain the first and third lines, but not the second, since the second line contains Jamo characters. (The above is not meant to be real Korean, just a simple test case.)
I'm very disappointed with the Gnu grep utility (version 2.20). I would have thought the ff. would work:
grep -Pv '[\x{1100}-\x{11FF}]'
but instead I get the error message grep: character value in \x{...} sequence is too large. (The \u1100 syntax, which is the actual Perl syntax, simply isn't supported.)
(I do notice that our version 2.20 is rather old. If someone tries the above with a newer version of grep, and it works, I'll certainly consider that an answer--and I'll get our IT folks to upgrade!)
I tried sed, but didn't get any further. (Sorry, I don't remember exactly what sed commands I tried, but sed's support for Unicode blocks doesn't seem any better than grep's.)
Finally, I tried perl (v5.16.3):
perl -ne 'print unless /[\u1100-\u11ff]/'
This at least succeeds in eliminating the Jamo lines while retaining the Hangul Syllable lines, but it also eliminates the ASCII lines, which I don't want to do. I also would have thought one of the ff. would work:
perl -ne 'print unless /\p{InHangul_Jamo}/'
perl -ne 'print unless /\p{Block: Hangul_Jamo}/'
but neither appears to have any effect. (Afaik, I shouldn't have to have a .* on each side of the \p{...}, but I tried that too; no luck.)
Locale: in case it matters, I have LANG=en_US.UTF-8.
I'm sure I could do this in Python, but I'd like to understand why neither grep nor perl seems to work, because they'd be a lot simpler. (And if I'm right about the Gnu utilities having poor Unicode support, why that is...and when it will be fixed. It's not like Unicode is new!) Of course I realize the problem may be that I'm not holding my mouth right when I try, but if so, it would be nice for grep at least to have better documentation on Unicode usage. Right now the documentation for grep -P says "This is highly experimental and grep -P may warn of unimplemented features." And it seems to have been that way roughly forever.
Decode inputs, encode outputs. If the encoding in question is UTF-8, the command-line switch -CSD will come in useful.
perl -CSD -ne'print if !/\p{Block: Hangul_Jamo}/'
perl -CSD -ne'print if !/\p{Block: Jamo}/'
perl -CSD -ne'print if !/\p{Blk=Jamo}/'
perl -CSD -ne'print if !/\p{InJamo}/'
perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}]/'
perl -CSD -ne'print if !/[\x{1100}-\x{11FF}]/'
grep -vP '[\x{1100}-\x{11FF}]'
You might want to add the Hangul_Jamo_Extended_A, Hangul_Jamo_Extended_B and Hangul_Compatibility_Jamo blocks.
perl -CSD -ne'print if !/[\p{Block: Hangul_Jamo}\p{Block: Hangul_Jamo_Extended_A}\p{Block: Hangul_Jamo_Extended_B}\p{Block: Hangul_Compatibility_Jamo}]/'
perl -CSD -ne'print if !/[\p{Block: Jamo}\p{Block: JamoExtA}\p{Block: JamoExtB}\p{Block: CompatJamo}]/'
perl -CSD -ne'print if !/[\p{Blk=Jamo}\p{Blk=JamoExtA}\p{Blk=JamoExtB}\p{Blk=CompatJamo}]/'
perl -CSD -ne'print if !/[\p{InJamo}\p{InJamoExtA}\p{InJamoExtB}\p{InCompatJamo}]/'
perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}\N{U+A960}-\N{U+A97F}\N{U+D7B0}-\N{U+D7FF}\N{U+3130}-\N{U+318F}]/'
perl -CSD -ne'print if !/[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]/'
grep -vP '[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]'
Let's look at your failed attempts.
grep -Pv '[\x{1100}-\x{11FF}]'
Actually, this one should work, and it does for me.
$ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | od -t x1
0000000 61 62 63 0a 64 e1 84 80 66 0a 67 68 69 0a
0000016
$ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | grep -Pv '[\x{1100}-\x{11FF}]'
abc
ghi
$ grep --version | head -1
grep (GNU grep) 2.16
I do get your error on an older machine with grep (GNU grep) 2.10.
perl -ne'print unless /\p{Block: Hangul_Jamo}/'
You didn't get any matches from /\p{Block: Hangul_Jamo}/ because you were matching against encoded text (UTF-8 bytes, chars in the range 00..FF) instead of decoded text (Unicode Code Points, chars in the range 00000..10FFFF).
perl -ne 'print unless /\p{InHangul_Jamo}/'
\p{Block: X}, \p{Blk=X} and \p{InX} are equivalent.
perl -ne'print unless /[\x{1100}-\x{11FF}]/'
[\x{1100}-\x{11FF}] is equivalent to \p{Block: Hangul_Jamo}.
perl -ne'print unless /[\u1100-\u11ff]/'
You got too many matches since \u in double-quoted string literals and in regex pattern literals titlecases the next character. (e.g. "\uxyx" is equivalent to "Xyz".)
As such, [\u1100-\u11ff] is equivalent to [01f].
for what it's worth, this is my own jamo filter in gnu-grep :
noJamo is an alias for
ggrep -vP '[\x{1100}-\x{11FF}
\x{A960}-\x{A97F}
\x{D7B0}-\x{D7FF}
\x{3130}-\x{318F}]'
However, if you only care about the core Jamo set that maps to 11,172 syllables, and don't mind using something other than grep, then this should be extremely fast :
\341\204[\200-\222]|
\341\205[\241-\265]|
\341\206[\250-\277]|\341\207[\200-\202]
if you add up the octals in each line, they're exactly 19 cho in row 1, 21 jung in row 2, and 28 jong in row 3.
I did a quick benchmark with a synthetic 5.55 GB .txt file containing lines that add up to some 4.3 GB.
And this regex's filtering throughput was some 1.55 GB/sec, practically at the limit of my SSD I/O.
(time (pvE0 < jamotest000001.txt|
mawk2 'BEGIN{ FS=ORS }
/\341(\204[\200-\222]|
\205[\241-\265]|
\206[\250-\277]|
\207[\200-\202] )/'
| pvE9 | xxh128sum))| ecp;
in0: 5.55GiB 0:00:03 [1.55GiB/s] [1.55GiB/s]
[=================>] 100%
out9: 4.29GiB 0:00:03 [1.20GiB/s] [1.20GiB/s]
[ <=> ]
( pvE 0.1 in0 < jamotest000001.txt | mawk2 | pvE 0.1 out9 | xxh128sum; )
3.70s user 2.73s system 178% cpu 3.597 total
f4ef119214a3c39c7c560ad24491b96c stdin

Perl regex replacement of logical unicode characters

Here is a simple substitution that adds parentheses arounds upper-case characters in an unicode string. As you can see, the result is rather ugly:
~$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5
My understanding is that the regex operates on "code points" instead of "logical characters", which splits my 'é' into meaningless characters. Is there a way to force the regex to work on logical unicode characters at once ?
Thanks,
As illustrated by the other answers, turning on UTF-8 in Perl is a piecemeal process. There's use utf8 for the syntax and raw strings. Then you have to make sure all your filehandles are UTF-8. What about #ARGV? readdir? glob? The output from ``?
There's nothing worse than having half your program working in ASCII and the other half working in UTF-8. utf8::all to the rescue!
Install it, add use utf8::all, and it will turn on UTF-8... all of it. Someone else figured it out, you don't have to worry about it.
$ echo "Whatéver 5" | perl -ape "use utf8::all; s/(\p{Upper})/(\1)/g"
(W)hatéver 5
You haven't told Perl to expect UTF-8 input, so it is treating each byte of the encoding as a separate character
Within a program you can set the default encoding for the three standard IO channels like this
use open ':std' => ':encoding(UTF-8)'
On the command line, the option -CS does the same thing, so this should work for you. I have removed the unnecessary autosplit option and replaced \1 with the correct $1 in the replacement string
echo "Whatéver 5" | perl -CS -pe "s/(\p{Upper})/($1)/g"
Assuming that your terminal uses UTF-8 encoding,
$ echo -n "é" | perl -ne 'printf "%vX\n", $_'
gives
C3.A9
so the input to the Perl program has not been converted internally to Unicode (it is still a string of UTF-8 bytes)
To convert the input to a Perl string, add a UTF-8 layer on the standard input stream using option -CI :
$ echo -n "é" | perl -CI -ne 'printf "%vX\n", $_'
the output is now
E9
However, if you also try to print the character back to standard output
you will not get é but a unicode replacement character � from the terminal. This is because the character 0xE9 is Unicode, but the terminal expect UTF-8, and 0xE9 is not valid UTF-8:
$ echo -n "é" | perl -CI -nE 'printf "$_: %vX\n", $_, $_'
�: E9
To get correct output, you can add an UFT-8 encoding layer on the standard output stream also (using -CO flag):
$ echo -n "é" | perl -CIO -nE 'printf "$_: %vX\n", $_, $_'
é: E9
According to perlunicode
"Upper" is a synonym for "Uppercase" , and we could have written
\p{Uppercase} equivalently as \p{Upper}
and
For instance, \p{Uppercase} matches any single character with the
Unicode "Uppercase" property
It seems like if you try to use \p{Upper} on a byte string, you will not get any warnings from Perl. Also bytes in the range 0xC0 to 0xDE will match the uppercase property. Try
perl -E 'for $i (0x80..0xFF) {$_=chr $i; printf "%x\n", $i if /\p{Upper}/}'
This explains the output you got:
$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5
Here, the letter é is represented as 2 bytes (in UTF-8) 0xC3 and 0xA9, and 0xC3 will match the Unicode Upper property.
A solution to your problem is therefore to add UTF-8 encoding layers on the standard input and output (you can combine -CI and -CO using -CS):
echo "Whatéver 5" | perl -CS -ape "s/(\p{Upper})/(\1)/g"
with output:
(W)hatéver 5

Is there a way to run a tr command (or something like it) in Windows Perl?

I have a line of shell script that I need to recreate the functionality of in Windows Perl?
tr -cd '[:print:]\n\r' | tr -s ' ' | sed -e 's/ $//'
The sed part is easy but not being a shell scripting expert (or perl expert for that matter) I'm having trouble recreating the functionality of the tr command in Perl.
Break down the bits to what each does and convert that to Perl:
tr -cd '[:print:]\n\r'
Take the complement (-c) of printable characters, newlines, and carriage returns and delete them (-d). The tr manpage tells you which each of those do.
tr -s ' '
Collapse multiple same characters (-s) to one character.
sed -e 's/ $//'
Get rid of a trailing space.
Now just do that in whatever language you want to use. Putting it together, you might like something like:
perl -pe 's/\P{PosixPrint}//g; tr/ //s; s/ \z//;'
Note that Perl's tr does not do character classes, but I can use a complemented (\P{...} with the uppercase P) Unicode character class to do the same thing. Perl character classes also understand the regular POSIX character classes.

(e)grep: accented characters not recognised as part of a word

I would like to use (e)grep to match a whole word using the -w switch. I've set the locale, but accented characters are being treated as word boundaries as in this example:
$ locale
LANG=es_VE.utf8
LC_CTYPE="es_VE.utf8"
LC_NUMERIC="es_VE.utf8"
LC_TIME="es_VE.utf8"
LC_COLLATE="es_VE.utf8"
LC_MONETARY="es_VE.utf8"
LC_MESSAGES="es_VE.utf8"
LC_ALL=es_VE.utf8
$ echo -e "cáñamo\namo" | egrep -w amo
cáñamo
amo
How can I find amo while ignoring cáñamo
Which code points count as a word-class character is not locale-dependent in Unicode, and LATIN SMALL LETTER N WITH TILDE is always a word character.
Here’s an all-UTF8 workflow demonstrating searching for amo after a word boundary, and after a non-(word-boundary):
$ perl -Mutf8 -CSDA -e 'print "cáñamo\namo\n"' |
perl -Mutf8 -CSDA -ne 'print if /\bamo\b/'
amo
$ perl -Mutf8 -CSDA -e 'print "cáñamo\namo\n"' |
perl -Mutf8 -CSDA -ne 'print if /\Bamo\b/'
cáñamo
I cannot help but be amused by your choice of search strings. Thanks for the chuckle.

Finding non-Ascii character [duplicate]

I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:
grep -e "[\x{00FF}-\x{FFFF}]" file.xml
But this returns every line in the file, regardless of whether the line contains a character in the range specified.
Do I have the syntax wrong or am I doing something else wrong? I've also tried:
egrep "[\x{00FF}-\x{FFFF}]" file.xml
(with both single and double quotes surrounding the pattern).
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number, and will highlight non-ascii chars in red.
In some systems, depending on your settings, the above will not work, so you can grep by the inverse
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also, that the important bit is the -P flag which equates to --perl-regexp: so it will interpret your pattern as a Perl regular expression. It also says that
this is highly experimental and grep -P may warn of unimplemented
features.
Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters instead.
So the first solution for instance would become:
grep --color='auto' -P -n '[^\x00-\x7F]' file.xml
(which basically greps for any character outside of the hexadecimal ASCII range: from \x00 up to \x7F)
On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:
pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml
Any pros or cons that anyone can think off?
The following works for me:
grep -P "[\x80-\xFF]" file.xml
Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.
The easy way is to define a non-ASCII character... as a character that is not an ASCII character.
LC_ALL=C grep '[^ -~]' file.xml
Add a tab after the ^ if necessary.
Setting LC_COLLATE=C avoids nasty surprises about the meaning of character ranges in many locales. Setting LC_CTYPE=C is necessary to match single-byte characters — otherwise the command would miss invalid byte sequences in the current encoding. Setting LC_ALL=C avoids locale-dependent effects altogether.
In perl
perl -ane '{ if(m/[[:^ascii:]]/) { print } }' fileName > newFile
Here is another variant I found that produced completely different results from the grep search for [\x80-\xFF] in the accepted answer. Perhaps it will be useful to someone to find additional non-ascii characters:
grep --color='auto' -P -n "[^[:ascii:]]" myfile.txt
Note: my computer's grep (a Mac) did not have -P option, so I did brew install grep and started the call above with ggrep instead of grep.
Searching for non-printable chars. TLDR; Executive Summary
search for control chars AND extended unicode
locale setting e.g. LC_ALL=C needed to make grep do what you might expect with extended unicode
SO the preferred non-ascii char finders:
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
as in top answer, the inverse grep:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
as in top answer but WITH LC_ALL=C:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
. . more . . excruciating detail on this: . . .
I agree with Harvey above buried in the comments, it is often more useful to search for non-printable characters OR it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests "use this: "[^\n -~]". Add \r for DOS text files. That translates to "[^\x0A\x020-\x07E]" and add \x0D for CR"
Also, adding -c (show count of patterns matched) to grep is useful when searching for non-printable chars as the strings matched can mess up terminal.
I found adding range 0-8 and 0x0e-0x1f (to the 0x80-0xff range) is a useful pattern. This excludes the TAB, CR and LF and one or two more uncommon printable chars. So IMHO a quite a useful (albeit crude) grep pattern is THIS one:
grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *
ACTUALLY, generally you will need to do this:
LC_ALL=C grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *
breakdown:
LC_ALL=C - set locale to C, otherwise many extended chars will not match (even though they look like they are encoded > 0x80)
\x00-\x08 - non-printable control chars 0 - 7 decimal
\x0E-\x1F - more non-printable control chars 14 - 31 decimal
\x80-1xFF - non-printable chars > 128 decimal
-c - print count of matching lines instead of lines
-P - perl style regexps
Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches
E.g. practical example of use find to grep all files under current directory:
LC_ALL=C find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} +
You may wish to adjust the grep at times. e.g. BS(0x08 - backspace) char used in some printable files or to exclude VT(0x0B - vertical tab). The BEL(0x07) and ESC(0x1B) chars can also be deemed printable in some cases.
Non-Printable ASCII Chars
** marks PRINTABLE but CONTROL chars that is useful to exclude sometimes
Dec Hex Ctrl Char description Dec Hex Ctrl Char description
0 00 ^# NULL 16 10 ^P DATA LINK ESCAPE (DLE)
1 01 ^A START OF HEADING (SOH) 17 11 ^Q DEVICE CONTROL 1 (DC1)
2 02 ^B START OF TEXT (STX) 18 12 ^R DEVICE CONTROL 2 (DC2)
3 03 ^C END OF TEXT (ETX) 19 13 ^S DEVICE CONTROL 3 (DC3)
4 04 ^D END OF TRANSMISSION (EOT) 20 14 ^T DEVICE CONTROL 4 (DC4)
5 05 ^E END OF QUERY (ENQ) 21 15 ^U NEGATIVE ACKNOWLEDGEMENT (NAK)
6 06 ^F ACKNOWLEDGE (ACK) 22 16 ^V SYNCHRONIZE (SYN)
7 07 ^G BEEP (BEL) 23 17 ^W END OF TRANSMISSION BLOCK (ETB)
8 08 ^H BACKSPACE (BS)** 24 18 ^X CANCEL (CAN)
9 09 ^I HORIZONTAL TAB (HT)** 25 19 ^Y END OF MEDIUM (EM)
10 0A ^J LINE FEED (LF)** 26 1A ^Z SUBSTITUTE (SUB)
11 0B ^K VERTICAL TAB (VT)** 27 1B ^[ ESCAPE (ESC)
12 0C ^L FF (FORM FEED)** 28 1C ^\ FILE SEPARATOR (FS) RIGHT ARROW
13 0D ^M CR (CARRIAGE RETURN)** 29 1D ^] GROUP SEPARATOR (GS) LEFT ARROW
14 0E ^N SO (SHIFT OUT) 30 1E ^^ RECORD SEPARATOR (RS) UP ARROW
15 0F ^O SI (SHIFT IN) 31 1F ^_ UNIT SEPARATOR (US) DOWN ARROW
UPDATE: I had to revisit this recently. And, YYMV depending on terminal settings/solar weather forecast BUT . . I noticed that grep was not finding many unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xff, 3 and 4 byte unicode characters were not matched. ??? Can anyone explain this? YES. #frabjous asked and #calandoa explained that LC_ALL=C should be used to set locale for the command to make grep match.
e.g. my locale LC_ALL= empty
$ locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
.
.
LC_ALL=
grep with LC_ALL= empty matches 2 byte encoded chars but not 3 and 4 byte encoded:
$ grep -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" notes_unicode_emoji_test
5:© copyright c2a9
7:call underscore c2a0
9:CTRL
31:5 © copyright
32:7 call underscore
grep with LC_ALL=C does seem to match all extended characters that you would want:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
1:���� unicode dashes e28090
3:��� Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5:� copyright c2a9
7:call� underscore c2a0
11:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29:1 ���� unicode dashes
30:3 ��� Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31:5 � copyright
32:7 call� underscore
33:11 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
34:52 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
81:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep on the top answer DO seem to find ALL the ~weird~ and ~wonderful~ "non-ascii" characters without setting locale:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
1 ‐‐ unicode dashes e28090
3 💘 Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5 © copyright c2a9
7 call underscore c2a0
9 CTRL-H CHARS URK URK URK
11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29 1 ‐‐ unicode dashes
30 3 💘 Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31 5 © copyright
32 7 call underscore
33 11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
34 52 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
73 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
SO the preferred non-ascii char finders:
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
as in top answer, the inverse grep:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
as in top answer but WITH LC_ALL=C:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
The following code works:
find /tmp | perl -ne 'print if /[^[:ascii:]]/'
Replace /tmp with the name of the directory you want to search through.
This method should work with any POSIX-compliant version of awk and iconv.
We can take advantage of file and tr as well.
curl is not POSIX, of course.
Solutions above may be better in some cases, but they seem to depend on GNU/Linux implementations or additional tools.
Just get a sample file somehow:
$ curl -LOs http://gutenberg.org/files/84/84-0.txt
$ file 84-0.txt
84-0.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
Search for UTF-8 characters:
$ awk '/[\x80-\xFF]/ { print }' 84-0.txt
or non-ASCII
$ awk '/[^[:ascii:]]/ { print }' 84-0.txt
Convert UTF-8 to ASCII, removing problematic characters (including BOM which should not be in UTF-8 anyway):
$ iconv -c -t ASCII 84-0.txt > 84-ascii.txt
Check it:
$ file 84-ascii.txt
84-ascii.txt: ASCII text, with CRLF line terminators
Tweak it to remove DOS line endings / ^M ("CRLF line terminators"):
$ tr -d '\015' < 84-ascii.txt > 84-tweaked.txt && file 84-tweaked.txt
84-tweaked.txt: ASCII text
This method discards any "bad" characters it cannot deal with, so you may need to sanitize / validate the output. YMMV
Strangely, I had to do this today! I ended up using Perl because I couldn't get grep/egrep to work (even in -P mode). Something like:
cat blah | perl -en '/\xCA\xFE\xBA\xBE/ && print "found"'
For unicode characters (like \u2212 in example below) use this:
find . ... -exec perl -CA -e '$ARGV = #ARGV[0]; open IN, $ARGV; binmode(IN, ":utf8"); binmode(STDOUT, ":utf8"); while (<IN>) { next unless /\N{U+2212}/; print "$ARGV: $&: $_"; exit }' '{}' \;
It could be interesting to know how to search for one unicode character. This command can help. You only need to know the code in UTF8
grep -v $'\u200d'
Finding all non-ascii characters gives the impression that one is either looking for unicode strings or intends to strip said characters individually.
For the former, try one of these (variable file is used for automation):
file=file.txt ; LC_ALL=C grep -Piao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
file=file.txt ; pcregrep -iao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
file=file.txt ; pcregrep -iao '[^\x00-\x19\x21-\x7F]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
Vanilla grep doesn't work correctly without LC_ALL=C as noted in the previous answers.
ASCII range is x00-x7F, space is x20, since strings have spaces the negative range omits it.
Non-ASCII range is x80-xFF, since strings have spaces the positive range adds it.
String is presumed to be at least 7 consecutive characters within the range. {7,}.
For shell readable output, uchardet $file returns a guess of the file encoding which is passed to iconv for automatic interpolation.
if you're trying to grab/grep UTF8-compliant multibyte-characters, use this :
( [\302-\337][\200-\277]|
[\340][\240-\277][\200-\277]|
[\355][\200-\237][\200-\277]|
[\341-\354\356-\357][\200-\277][\200-\277]|
[\360][\220-\277][\200-\277][\200-\277]|
[\361-\363][\200-\277][\200-\277][\200-\277]|
[\364][\200-\217][\200-\277][\200-\277] )
* please delete all newlines, spaces, or tabs in between (..)
* feel free to use bracket ranges {1,3} etc to optimize
the redundant listings of [\200-\277]. but don't change that
[\200-\277]+, as that might result in invalid encodings
due to either insufficient or too many continuation bytes
* although some historical UTF-8 references considers 5- and
6-byte encodings to be valid, as of Unicode 13 they only
consider up to 4-bytes
I've tested this string even against random binary files, and it would report the same multi-byte character count as gnu-wc.
Add in another [\000-\177]| at the front just after ( of that if you need full UTF8 matching string.
This regex is truly hideous yes, but it's also POSIX-compliant, cross-language and cross-platform compatible (doesn't depend on any special regex notation, (should be) fully UTF-8 compliant (Unicode 13), and completely independent of locale-setting.
if you're running grep with this, please use grep -P
If you just need the other bytes, then others have suggested already.
if you need the 11,172 characters of NFC-composed korean hangul it's
(([\352][\260-\277]|[\353\354][\200-\277]|
[\355][\200-\235])[\200-\277]|[\355][\236][\200-\243])
and if you need Japanese hiragana+katakana, it's
([\343]([\201-\203][\200-\277]|[\207][\260-\277]))