command line filtering of Unicode block - perl

I've been trying for a couple hours to create a conceptually trivial filter that I can use on the command line, without success. The task is to filter out all lines containing Hangul Jamo characters, while retaining all other lines (which may contain ASCII, characters in the Hangul Syllable block, etc.).
So for example if the input was
foo
ᅤᆨ
간
the output would contain the first and third lines, but not the second, since the second line contains Jamo characters. (The above is not meant to be real Korean, just a simple test case.)
I'm very disappointed with the GNU grep utility (version 2.20). I would have thought the following would work:
grep -Pv '[\x{1100}-\x{11FF}]'
but instead I get the error message grep: character value in \x{...} sequence is too large. (The \u1100 syntax, which is the actual Perl syntax, simply isn't supported.)
(I do notice that our version 2.20 is rather old. If someone tries the above with a newer version of grep, and it works, I'll certainly consider that an answer--and I'll get our IT folks to upgrade!)
I tried sed, but didn't get any further. (Sorry, I don't remember exactly what sed commands I tried, but sed's support for Unicode blocks doesn't seem any better than grep's.)
Finally, I tried perl (v5.16.3):
perl -ne 'print unless /[\u1100-\u11ff]/'
This at least succeeds in eliminating the Jamo lines while retaining the Hangul Syllable lines, but it also eliminates the ASCII lines, which I don't want. I also would have thought one of the following would work:
perl -ne 'print unless /\p{InHangul_Jamo}/'
perl -ne 'print unless /\p{Block: Hangul_Jamo}/'
but neither appears to have any effect. (Afaik, I shouldn't have to have a .* on each side of the \p{...}, but I tried that too; no luck.)
Locale: in case it matters, I have LANG=en_US.UTF-8.
I'm sure I could do this in Python, but I'd like to understand why neither grep nor perl seems to work, because they'd be a lot simpler. (And if I'm right about the GNU utilities having poor Unicode support, why that is...and when it will be fixed. It's not like Unicode is new!) Of course I realize the problem may be that I'm not holding my mouth right when I try, but if so, it would be nice for grep at least to have better documentation on Unicode usage. Right now the documentation for grep -P says "This is highly experimental and grep -P may warn of unimplemented features." And it seems to have been that way roughly forever.

Decode inputs, encode outputs. If the encoding in question is UTF-8, the command-line switch -CSD will come in useful.
perl -CSD -ne'print if !/\p{Block: Hangul_Jamo}/'
perl -CSD -ne'print if !/\p{Block: Jamo}/'
perl -CSD -ne'print if !/\p{Blk=Jamo}/'
perl -CSD -ne'print if !/\p{InJamo}/'
perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}]/'
perl -CSD -ne'print if !/[\x{1100}-\x{11FF}]/'
grep -vP '[\x{1100}-\x{11FF}]'
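For a quick check against the sample input from the question (a sketch assuming a UTF-8 terminal and locale, so the literal characters reach perl as UTF-8):
$ printf 'foo\nᅤᆨ\n간\n' | perl -CSD -ne'print if !/\p{Block: Hangul_Jamo}/'
foo
간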
You might want to add the Hangul_Jamo_Extended_A, Hangul_Jamo_Extended_B and Hangul_Compatibility_Jamo blocks.
perl -CSD -ne'print if !/[\p{Block: Hangul_Jamo}\p{Block: Hangul_Jamo_Extended_A}\p{Block: Hangul_Jamo_Extended_B}\p{Block: Hangul_Compatibility_Jamo}]/'
perl -CSD -ne'print if !/[\p{Block: Jamo}\p{Block: JamoExtA}\p{Block: JamoExtB}\p{Block: CompatJamo}]/'
perl -CSD -ne'print if !/[\p{Blk=Jamo}\p{Blk=JamoExtA}\p{Blk=JamoExtB}\p{Blk=CompatJamo}]/'
perl -CSD -ne'print if !/[\p{InJamo}\p{InJamoExtA}\p{InJamoExtB}\p{InCompatJamo}]/'
perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}\N{U+A960}-\N{U+A97F}\N{U+D7B0}-\N{U+D7FF}\N{U+3130}-\N{U+318F}]/'
perl -CSD -ne'print if !/[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]/'
grep -vP '[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]'
Let's look at your failed attempts.
grep -Pv '[\x{1100}-\x{11FF}]'
Actually, this one should work, and it does for me.
$ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | od -t x1
0000000 61 62 63 0a 64 e1 84 80 66 0a 67 68 69 0a
0000016
$ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | grep -Pv '[\x{1100}-\x{11FF}]'
abc
ghi
$ grep --version | head -1
grep (GNU grep) 2.16
I do get your error on an older machine with grep (GNU grep) 2.10.
perl -ne'print unless /\p{Block: Hangul_Jamo}/'
You didn't get any matches from /\p{Block: Hangul_Jamo}/ because you were matching against encoded text (UTF-8 bytes, chars in the range 00..FF) instead of decoded text (Unicode Code Points, chars in the range 00000..10FFFF).
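A quick way to see the difference is to dump the ordinals of what the regex engine is actually handed (a sketch assuming a UTF-8 pipeline):
$ perl -CSD -e'print "\x{1100}\n"' | perl -nE'printf "%vX\n", $_'
E1.84.80.A
$ perl -CSD -e'print "\x{1100}\n"' | perl -CSD -nE'printf "%vX\n", $_'
1100.A
Without decoding, the Jamo character is just the three bytes E1 84 80, none of which falls in the U+1100..U+11FF block.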
perl -ne 'print unless /\p{InHangul_Jamo}/'
\p{Block: X}, \p{Blk=X} and \p{InX} are equivalent.
perl -ne'print unless /[\x{1100}-\x{11FF}]/'
[\x{1100}-\x{11FF}] is equivalent to \p{Block: Hangul_Jamo}.
perl -ne'print unless /[\u1100-\u11ff]/'
You got too many matches since \u in double-quoted string literals and in regex pattern literals titlecases the next character. (e.g. "\uxyz" is equivalent to "Xyz".)
As such, [\u1100-\u11ff] is equivalent to [01f].
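A quick demonstration of both effects, in case it helps:
$ perl -E'say "\uperl"'
Perl
$ perl -E'say "match" if "foo" =~ /[\u1100-\u11ff]/'
match
That is why the ASCII lines disappeared: the class is really just [01f], and "foo" contains an "f".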

for what it's worth, this is my own jamo filter in gnu-grep:
noJamo is an alias for
ggrep -vP '[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]'
However, if you only care about the core Jamo set that maps to the 11,172 syllables, and don't mind using something other than grep, then this should be extremely fast:
\341\204[\200-\222]|
\341\205[\241-\265]|
\341\206[\250-\277]|\341\207[\200-\202]
If you add up the octal ranges, they cover exactly the 19 cho in row 1, the 21 jung in row 2, and the 27 jong in row 3 (add the no-jong case and you get 19 × 21 × 28 = 11,172 syllables).
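If you want to double-check those byte ranges, here is a small sketch (Encode is a core module) that prints the UTF-8 bytes of the range endpoints in octal:
$ perl -MEncode -E'printf "U+%04X -> %vo\n", $_, encode("UTF-8", chr $_) for 0x1100, 0x1112, 0x1161, 0x1175, 0x11A8, 0x11C2'
U+1100 -> 341.204.200
U+1112 -> 341.204.222
U+1161 -> 341.205.241
U+1175 -> 341.205.265
U+11A8 -> 341.206.250
U+11C2 -> 341.207.202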
I did a quick benchmark with a synthetic 5.55 GB .txt file; the lines that survive the filter add up to some 4.3 GB.
And this regex's filtering throughput was some 1.55 GB/sec, practically at the limit of my SSD I/O.
(time (pvE0 < jamotest000001.txt|
mawk2 'BEGIN{ FS=ORS }
/\341(\204[\200-\222]|
\205[\241-\265]|
\206[\250-\277]|
\207[\200-\202] )/'
| pvE9 | xxh128sum))| ecp;
in0: 5.55GiB 0:00:03 [1.55GiB/s] [1.55GiB/s]
[=================>] 100%
out9: 4.29GiB 0:00:03 [1.20GiB/s] [1.20GiB/s]
[ <=> ]
( pvE 0.1 in0 < jamotest000001.txt | mawk2 | pvE 0.1 out9 | xxh128sum; )
3.70s user 2.73s system 178% cpu 3.597 total
f4ef119214a3c39c7c560ad24491b96c stdin

Why does Encode::decode with non-latin letter locales blow up on localised strftime output?

On Ubuntu with Perl 5.26.1 I have encountered the following problem when working on Dancer::Logger::Console. I've lifted this code out of Dancer2::Core::Role::Logger.
In order to run this, you need to generate the following locales:
sudo locale-gen de_DE.UTF-8
sudo locale-gen ko_KR.UTF-8
This example code uses the Korean locale, and fails without an error message; $@ is empty.
$ LC_ALL=ko_KR.UTF-8 perl -MPOSIX -MEncode -E 'eval {
say Encode::decode("UTF-8", strftime("%b", localtime))
};
say $@;
'
Wide character at -e line 1.
When run with a German locale, it succeeds (but throws a wide character warning, which we can ignore for this test).
$ LC_ALL=de_DE.UTF-8 perl -MPOSIX -MEncode -E 'eval {
say Encode::decode("UTF-8", strftime("%b", localtime))
};
say $@;
'
Wide character in say at -e line 2.
M�r
The %b format is the abbreviated month name as a localised word (see http://strftime.net/).
If we don't Encode::decode("UTF-8", ...), it works, and the version above with Korean produces 3월.
What's going on here?
Under ko_KR.UTF-8, strftime("%b", localtime(1552997524)) returns 20.33.C6D4. When interpreted as Unicode Code Points, this is "␠3월" ("March", with a leading space).
Under de_DE.UTF-8, strftime("%b", localtime(1552997524)) returns 4D.E4.72. When interpreted as Unicode Code Points, this is "Mär" (short form of "März", "March").
So it seems decoded text (Unicode Code Points) are being returned, which is perfect. All that's left to do is to encode the outputs.
$ LC_ALL=ko_KR.UTF-8 perl -CSD -MPOSIX -e'CORE::say strftime("%b", localtime)'
3월
$ LC_ALL=de_DE.UTF-8 perl -CSD -MPOSIX -e'CORE::say strftime("%b", localtime)'
Mär
In a program (as opposed to a one-liner), you could use something like the following instead of -CSD:
use open ':std', ':encoding(UTF-8)';
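As a small, untested sketch, the strftime one-liners above would look something like this as a program (run it under whichever locale you need, e.g. LC_ALL=ko_KR.UTF-8):
#!/usr/bin/env perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # UTF-8-encode STDIN, STDOUT and STDERR
use POSIX qw(strftime);

# strftime returns decoded text here (see above), so the output
# layer takes care of encoding it.
print strftime("%b", localtime), "\n";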

Perl regex replacement of logical unicode characters

Here is a simple substitution that adds parentheses around upper-case characters in a Unicode string. As you can see, the result is rather ugly:
~$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5
My understanding is that the regex operates on "code points" instead of "logical characters", which splits my 'é' into meaningless characters. Is there a way to force the regex to work on logical Unicode characters at once?
Thanks,
As illustrated by the other answers, turning on UTF-8 in Perl is a piecemeal process. There's use utf8 for the syntax and raw strings. Then you have to make sure all your filehandles are UTF-8. What about @ARGV? readdir? glob? The output from ``?
There's nothing worse than having half your program working in ASCII and the other half working in UTF-8. utf8::all to the rescue!
Install it, add use utf8::all, and it will turn on UTF-8... all of it. Someone else figured it out, you don't have to worry about it.
$ echo "Whatéver 5" | perl -ape "use utf8::all; s/(\p{Upper})/(\1)/g"
(W)hatéver 5
You haven't told Perl to expect UTF-8 input, so it is treating each byte of the encoding as a separate character.
Within a program you can set the default encoding for the three standard IO channels like this:
use open ':std' => ':encoding(UTF-8)';
On the command line, the option -CS does the same thing, so this should work for you. I have removed the unnecessary autosplit option and replaced \1 with the correct $1 in the replacement string:
echo "Whatéver 5" | perl -CS -pe "s/(\p{Upper})/($1)/g"
Assuming that your terminal uses UTF-8 encoding,
$ echo -n "é" | perl -ne 'printf "%vX\n", $_'
gives
C3.A9
so the input to the Perl program has not been converted internally to Unicode (it is still a string of UTF-8 bytes)
To convert the input to a Perl string, add a UTF-8 layer on the standard input stream using the option -CI:
$ echo -n "é" | perl -CI -ne 'printf "%vX\n", $_'
the output is now
E9
However, if you also try to print the character back to standard output, you will not get é but a Unicode replacement character � from the terminal. This is because the character is now the code point 0xE9, but the terminal expects UTF-8, and the lone byte 0xE9 is not valid UTF-8:
$ echo -n "é" | perl -CI -nE 'printf "$_: %vX\n", $_, $_'
�: E9
To get correct output, you can also add a UTF-8 encoding layer on the standard output stream (using the -CO flag):
$ echo -n "é" | perl -CIO -nE 'printf "$_: %vX\n", $_, $_'
é: E9
According to perlunicode
"Upper" is a synonym for "Uppercase" , and we could have written
\p{Uppercase} equivalently as \p{Upper}
and
For instance, \p{Uppercase} matches any single character with the
Unicode "Uppercase" property
It seems like if you try to use \p{Upper} on a byte string, you will not get any warnings from Perl. Also bytes in the range 0xC0 to 0xDE will match the uppercase property. Try
perl -E 'for $i (0x80..0xFF) {$_=chr $i; printf "%x\n", $i if /\p{Upper}/}'
This explains the output you got:
$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5
Here, the letter é is represented as 2 bytes (in UTF-8) 0xC3 and 0xA9, and 0xC3 will match the Unicode Upper property.
A solution to your problem is therefore to add UTF-8 encoding layers on the standard input and output (you can combine -CI and -CO using -CS):
echo "Whatéver 5" | perl -CS -ape "s/(\p{Upper})/(\1)/g"
with output:
(W)hatéver 5

Line number of a file in Perl when multiple files are passed as arguments to perl cli

In awk if I give more than one file as an argument to awk, there are two special variables:
NR=line number corresponding to all the lines in all the files.
FNR=line number of the current file.
I know that in Perl, $. corresponds to NR (current line among lines in all of the files).
Is there anything comparable to FNR of AWK in Perl too?
Let's say I have some command line:
perl -pe 'print filename, <something special which holds the current file's line number>' *.txt
This should give me output like:
file1.txt 1
file1.txt 2
file2.txt 1
Actually, the eof documentation shows a way to do this:
# reset line numbering on each input file
while (<>) {
    next if /^\s*#/;    # skip comments
    print "$.\t$_";
} continue {
    close ARGV if eof;  # Not eof()!
}
An example one-liner that prints the first line of all files:
$ perl -ne 'print "$ARGV : $_" if $. == 1; } continue { close ARGV if eof;' *txt
There is no such variable in Perl. But you should study eof to be able to write something like
perl -ne 'print join ":", $. + $sum, $., "\n"; $sum += $., $.=0 if eof;' *txt
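If all you want is the filename plus a per-file line number, as in the question's expected output, a minimal sketch using that same close-ARGV trick (assuming file1.txt has two lines and file2.txt has one):
$ perl -ne 'print "$ARGV $.\n"; close ARGV if eof' file1.txt file2.txt
file1.txt 1
file1.txt 2
file2.txt 1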
I find this situation to be a very large drawback to using Perl. While this answer is suboptimal, performance-wise, and only fits situations involving xargs, I've typically found it the workaround I use 95% of the time. So the problem scenario:
git ls-files -z | xargs -0 perl -ne 'print "$ARGV:$.\t$_" if /#define oom/'
file.h:85806 #define oom() exit(-1)
That line number is obviously not correct, and you'll obviously get the same behavior with find. Of course, with this regex, or other simpler Perl regexes, I'd just use awk or git grep -P. However, if you're dealing with a fairly complicated regular expression, or need other Perl features, that won't work...and the correct answers to this question further complicate what is already likely to be a complicated Perl one-liner.
So I just use the following:
git ls-files -z | xargs -0 -n1 perl -ne 'print "$ARGV:$.\t$_" if /#define oom/'
file.h:43 #define oom() exit(-1)
The -n1 xargs argument causes xargs to kick off a Perl process for each file, which results in the correct line numbers. You're looking at very significant performance impact, but I've found it acceptable for very large projects with millions of lines of code vs. solving it in Perl, which I find almost always requires an actual script vs. a one-liner. It is not acceptable for system-wide searches in the vast majority of cases.

grep regex to perl or awk

I have been using a Linux environment and recently migrated to Solaris. Unfortunately, one of my bash scripts requires grep with the -P switch (PCRE support). As Solaris doesn't support the PCRE option for grep, I am obliged to find another solution to the problem. And pcregrep seems to have an obvious loop bug, and the sed -r option is unsupported!
I hope that using perl or nawk will solve the problem on Solaris.
I have not yet used perl in my script and am unaware of either its syntax or its flags.
Since it is PCRE, I believe that a Perl scripter can help me out in a matter of minutes. The patterns should match over multiple lines.
Which one would be a better solution in terms of efficiency, the awk or the perl solution?
Thanks for the replies.
These are some grep to perl conversions you might need:
grep -P PATTERN FILE(s) ---> perl -nle 'print if m/PATTERN/' FILE(s)
grep -Po PATTERN FILE(s) ---> perl -nle 'print $1 while m/(PATTERN)/g' FILE(s)
That's my guess as to what you're looking for, if grep -P is out of the question.
Here's a shorty:
grep -P /regex/ ====> perl -ne 'print if /regex/;'
The -n takes each line of the file as input. Each line is put into a special perl variable called $_ as Perl loops through the whole file.
The -e says the Perl program is on the command line instead of passing it a file.
The Perl print command automatically prints out whatever is in $_ if you don't specify for it to print out anything else.
The if /regex/ matches the regular expression against whatever line of your file is in the $_ variable.
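A tiny example of the equivalence (the sample input is made up):
$ printf 'foo\nbar\nbaz\n' | perl -ne 'print if /ba/'
bar
baz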

Finding non-Ascii character [duplicate]

I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:
grep -e "[\x{00FF}-\x{FFFF}]" file.xml
But this returns every line in the file, regardless of whether the line contains a character in the range specified.
Do I have the syntax wrong or am I doing something else wrong? I've also tried:
egrep "[\x{00FF}-\x{FFFF}]" file.xml
(with both single and double quotes surrounding the pattern).
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number, and will highlight non-ascii chars in red.
In some systems, depending on your settings, the above will not work, so you can grep by the inverse
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also, that the important bit is the -P flag which equates to --perl-regexp: so it will interpret your pattern as a Perl regular expression. It also says that
this is highly experimental and grep -P may warn of unimplemented
features.
Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters instead.
So the first solution for instance would become:
grep --color='auto' -P -n '[^\x00-\x7F]' file.xml
(which basically greps for any character outside of the hexadecimal ASCII range: from \x00 up to \x7F)
On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:
pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml
Any pros or cons that anyone can think of?
The following works for me:
grep -P "[\x80-\xFF]" file.xml
Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.
The easy way is to define a non-ASCII character... as a character that is not an ASCII character.
LC_ALL=C grep '[^ -~]' file.xml
Add a tab after the ^ if necessary.
Setting LC_COLLATE=C avoids nasty surprises about the meaning of character ranges in many locales. Setting LC_CTYPE=C is necessary to match single-byte characters — otherwise the command would miss invalid byte sequences in the current encoding. Setting LC_ALL=C avoids locale-dependent effects altogether.
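A quick sanity check of that approach on a UTF-8 terminal (the sample lines are made up):
$ printf 'plain ascii\ncafé au lait\n' | LC_ALL=C grep '[^ -~]'
café au lait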
In perl
perl -ne 'print if /[[:^ascii:]]/' fileName > newFile
Here is another variant I found that produced completely different results from the grep search for [\x80-\xFF] in the accepted answer. Perhaps it will be useful to someone to find additional non-ascii characters:
grep --color='auto' -P -n "[^[:ascii:]]" myfile.txt
Note: my computer's grep (a Mac) did not have -P option, so I did brew install grep and started the call above with ggrep instead of grep.
Searching for non-printable chars. TL;DR / Executive Summary:
search for control chars AND extended Unicode
a locale setting, e.g. LC_ALL=C, is needed to make grep do what you might expect with extended Unicode
SO the preferred non-ASCII char finders:
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
as in top answer, the inverse grep:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
as in top answer but WITH LC_ALL=C:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
. . more . . excruciating detail on this: . . .
I agree with Harvey above, buried in the comments: it is often more useful to search for non-printable characters, or it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests: "use this: "[^\n -~]". Add \r for DOS text files. That translates to "[^\x0A\x20-\x7E]"; add \x0D for CR."
Also, adding -c (show count of patterns matched) to grep is useful when searching for non-printable chars as the strings matched can mess up terminal.
I found that adding the ranges 0x00-0x08 and 0x0E-0x1F (to the 0x80-0xFF range) gives a useful pattern. This excludes TAB, CR and LF and one or two more uncommon printable chars. So IMHO quite a useful (albeit crude) grep pattern is THIS one:
grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *
ACTUALLY, generally you will need to do this:
LC_ALL=C grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *
breakdown:
LC_ALL=C - set locale to C, otherwise many extended chars will not match (even though they look like they are encoded > 0x80)
\x00-\x08 - non-printable control chars 0 - 8 decimal
\x0E-\x1F - more non-printable control chars 14 - 31 decimal
\x80-\xFF - chars 128 decimal and above (extended / non-ASCII)
-c - print count of matching lines instead of lines
-P - perl style regexps
Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches
E.g. a practical example of using find to grep all files under the current directory:
LC_ALL=C find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} +
You may wish to adjust the grep at times, e.g. the BS (0x08, backspace) char is used in some printable files, or you may want to exclude VT (0x0B, vertical tab). The BEL (0x07) and ESC (0x1B) chars can also be deemed printable in some cases.
Non-Printable ASCII Chars
** marks PRINTABLE but CONTROL chars that it is sometimes useful to exclude
Dec Hex Ctrl Char description               Dec Hex Ctrl Char description
  0  00  ^@  NULL                            16  10  ^P  DATA LINK ESCAPE (DLE)
  1  01  ^A  START OF HEADING (SOH)          17  11  ^Q  DEVICE CONTROL 1 (DC1)
  2  02  ^B  START OF TEXT (STX)             18  12  ^R  DEVICE CONTROL 2 (DC2)
  3  03  ^C  END OF TEXT (ETX)               19  13  ^S  DEVICE CONTROL 3 (DC3)
  4  04  ^D  END OF TRANSMISSION (EOT)       20  14  ^T  DEVICE CONTROL 4 (DC4)
  5  05  ^E  END OF QUERY (ENQ)              21  15  ^U  NEGATIVE ACKNOWLEDGEMENT (NAK)
  6  06  ^F  ACKNOWLEDGE (ACK)               22  16  ^V  SYNCHRONIZE (SYN)
  7  07  ^G  BEEP (BEL)                      23  17  ^W  END OF TRANSMISSION BLOCK (ETB)
  8  08  ^H  BACKSPACE (BS)**                24  18  ^X  CANCEL (CAN)
  9  09  ^I  HORIZONTAL TAB (HT)**           25  19  ^Y  END OF MEDIUM (EM)
 10  0A  ^J  LINE FEED (LF)**                26  1A  ^Z  SUBSTITUTE (SUB)
 11  0B  ^K  VERTICAL TAB (VT)**             27  1B  ^[  ESCAPE (ESC)
 12  0C  ^L  FF (FORM FEED)**                28  1C  ^\  FILE SEPARATOR (FS) RIGHT ARROW
 13  0D  ^M  CR (CARRIAGE RETURN)**          29  1D  ^]  GROUP SEPARATOR (GS) LEFT ARROW
 14  0E  ^N  SO (SHIFT OUT)                  30  1E  ^^  RECORD SEPARATOR (RS) UP ARROW
 15  0F  ^O  SI (SHIFT IN)                   31  1F  ^_  UNIT SEPARATOR (US) DOWN ARROW
UPDATE: I had to revisit this recently. And, YMMV depending on terminal settings/solar weather forecast, BUT . . I noticed that grep was not finding many Unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xff, 3- and 4-byte Unicode characters were not matched. ??? Can anyone explain this? YES. @frabjous asked and @calandoa explained that LC_ALL=C should be used to set the locale for the command to make grep match.
e.g. my locale LC_ALL= empty
$ locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
.
.
LC_ALL=
grep with LC_ALL= empty matches 2 byte encoded chars but not 3 and 4 byte encoded:
$ grep -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" notes_unicode_emoji_test
5:© copyright c2a9
7:call underscore c2a0
9:CTRL
31:5 © copyright
32:7 call underscore
grep with LC_ALL=C does seem to match all extended characters that you would want:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
1:���� unicode dashes e28090
3:��� Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5:� copyright c2a9
7:call� underscore c2a0
11:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29:1 ���� unicode dashes
30:3 ��� Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31:5 � copyright
32:7 call� underscore
33:11 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
34:52 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
81:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep on the top answer DO seem to find ALL the ~weird~ and ~wonderful~ "non-ascii" characters without setting locale:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
1 ‐‐ unicode dashes e28090
3 💘 Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5 © copyright c2a9
7 call underscore c2a0
9 CTRL-H CHARS URK URK URK
11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29 1 ‐‐ unicode dashes
30 3 💘 Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31 5 © copyright
32 7 call underscore
33 11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
34 52 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
73 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
The following code works:
find /tmp | perl -ne 'print if /[^[:ascii:]]/'
Replace /tmp with the name of the directory you want to search through.
This method should work with any POSIX-compliant version of awk and iconv.
We can take advantage of file and tr as well.
curl is not POSIX, of course.
Solutions above may be better in some cases, but they seem to depend on GNU/Linux implementations or additional tools.
Just get a sample file somehow:
$ curl -LOs http://gutenberg.org/files/84/84-0.txt
$ file 84-0.txt
84-0.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
Search for UTF-8 characters:
$ awk '/[\x80-\xFF]/ { print }' 84-0.txt
or non-ASCII
$ awk '/[^[:ascii:]]/ { print }' 84-0.txt
Convert UTF-8 to ASCII, removing problematic characters (including BOM which should not be in UTF-8 anyway):
$ iconv -c -t ASCII 84-0.txt > 84-ascii.txt
Check it:
$ file 84-ascii.txt
84-ascii.txt: ASCII text, with CRLF line terminators
Tweak it to remove DOS line endings / ^M ("CRLF line terminators"):
$ tr -d '\015' < 84-ascii.txt > 84-tweaked.txt && file 84-tweaked.txt
84-tweaked.txt: ASCII text
This method discards any "bad" characters it cannot deal with, so you may need to sanitize / validate the output. YMMV
Strangely, I had to do this today! I ended up using Perl because I couldn't get grep/egrep to work (even in -P mode). Something like:
cat blah | perl -ne '/\xCA\xFE\xBA\xBE/ && print "found"'
For unicode characters (like \u2212 in example below) use this:
find . ... -exec perl -CA -e '$ARGV = $ARGV[0]; open IN, $ARGV; binmode(IN, ":utf8"); binmode(STDOUT, ":utf8"); while (<IN>) { next unless /\N{U+2212}/; print "$ARGV: $&: $_"; exit }' '{}' \;
It can be useful to know how to search for a single Unicode character. This command can help; you only need to know its code in UTF-8:
grep -v $'\u200d'
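Dropping the -v finds the lines that do contain the character instead. For example (bash's printf and $'...' understand \u from version 4.2 on; a UTF-8 locale and made-up sample text are assumed):
$ printf 'caf\u00e9\nplain ascii\n' | grep $'\u00e9'
café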
Finding all non-ascii characters gives the impression that one is either looking for unicode strings or intends to strip said characters individually.
For the former, try one of these (variable file is used for automation):
file=file.txt ; LC_ALL=C grep -Piao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
file=file.txt ; pcregrep -iao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
file=file.txt ; pcregrep -iao '[^\x00-\x19\x21-\x7F]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
Vanilla grep doesn't work correctly without LC_ALL=C as noted in the previous answers.
ASCII range is x00-x7F, space is x20, since strings have spaces the negative range omits it.
Non-ASCII range is x80-xFF, since strings have spaces the positive range adds it.
String is presumed to be at least 7 consecutive characters within the range. {7,}.
For shell readable output, uchardet $file returns a guess of the file encoding which is passed to iconv for automatic interpolation.
if you're trying to grab/grep UTF-8-compliant multibyte characters, use this:
( [\302-\337][\200-\277]|
[\340][\240-\277][\200-\277]|
[\355][\200-\237][\200-\277]|
[\341-\354\356-\357][\200-\277][\200-\277]|
[\360][\220-\277][\200-\277][\200-\277]|
[\361-\363][\200-\277][\200-\277][\200-\277]|
[\364][\200-\217][\200-\277][\200-\277] )
* please delete all newlines, spaces, or tabs in between (..)
* feel free to use bracket ranges {1,3} etc. to optimize the redundant listings of [\200-\277], but don't change that to [\200-\277]+, as that might result in invalid encodings due to either insufficient or too many continuation bytes
* although some historical UTF-8 references consider 5- and 6-byte encodings to be valid, as of Unicode 13 only encodings of up to 4 bytes are considered valid
I've tested this string even against random binary files, and it would report the same multi-byte character count as gnu-wc.
Add in another [\000-\177]| at the front, just after the opening ( of that, if you need a full UTF-8 matching string (ASCII included).
This regex is truly hideous, yes, but it's also POSIX-compliant, cross-language and cross-platform compatible (it doesn't depend on any special regex notation), (should be) fully UTF-8 compliant (Unicode 13), and completely independent of locale settings.
if you're running grep with this, please use grep -P
If you just need the other bytes, then others have suggested approaches already.
if you need the 11,172 characters of NFC-composed Korean hangul, it's
(([\352][\260-\277]|[\353\354][\200-\277]|
[\355][\200-\235])[\200-\277]|[\355][\236][\200-\243])
and if you need Japanese hiragana+katakana, it's
([\343]([\201-\203][\200-\277]|[\207][\260-\277]))
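As a quick byte-level sanity check of the hangul-syllables pattern above, here is a perl sketch (no -C switches, so the regex sees the raw UTF-8 bytes; the sample lines and a UTF-8 terminal are assumptions):
$ printf '간\nfoo\nᄀ\n' | perl -ne 'print if /(([\352][\260-\277]|[\353\354][\200-\277]|[\355][\200-\235])[\200-\277]|[\355][\236][\200-\243])/'
간
The syllable 간 (U+AC04, bytes EA B0 84, i.e. octal 352 260 204) matches the first branch, while the plain ASCII line and the lone jamo ᄀ (U+1100) do not.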