grep for emojis in linux - unicode

I am trying to grep across a list of tokens that include several non-ASCII characters. I want to match only emojis, other characters such as ð or ñ are fine. The unicode range for emojis appears to be U+1F600-U+1F1FF but when I search for it using grep this happens:
grep -P "[\x1F6-\x1F1]" contact_names.tokens
grep: range out of order in character class
https://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0077_e006c_e0073_e007f

You need to specify the code points with full value (not 1F6 but 1F600) and wrap them with curly braces. In addition, the first value must be smaller than the last value.
So the regex should be "[\x{1F1FF}-\x{1F600}]".
The unicode range for emojis is, however, more complex than you assumed. The page you referred does not sort characters by code point and emojis are placed in many blocks. If you want to cover almost all of emoji:
grep -P "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]" contact_names.tokens
(The range is borrowed from Suhail Gupta's answer on a similar question)
If you need to allow/disallow specific emoji blocks, see sequence data on unicode.org. List of emoji on Wikipedia also show characters in ordered tables but it might not list latest ones.

You could use ugrep as a drop-in replacement for grep to do this:
ugrep "[\x{1F1FF}-\x{1F600}]" contact_names.tokens
ugrep matches Unicode patterns by default (disabled with option -U).
The regular expression syntax is POSIX ERE compliant, extended with
Unicode character classes, lazy quantifiers, and negative patterns to
skip unwanted pattern matches to produce more precise results.
ugrep searches UTF-encoded input when UTF BOM (byte order mark) are
present and ASCII and UTF-8 when no UTF BOM is present. Option
--encoding permits many other file formats to be searched, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, 1250 to 1258.
ugrep searches text and binary files and produces hexdumps for binary matches.
The Unicode ranges for emojis is larger than the range 1F1FF+U to 1F600+U. See the official Unicode 12 publication https://unicode.org/emoji/charts-12.0/full-emoji-list.html

Related

How to remove accents and keep Chinese characters using a command?

I’m trying to remove the accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters by using a command. Currently, I’m using iconv to remove the accented characters. It turns out that all the Chinese characters are encoded as “?????”. I can’t figure out the way to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only encodes the code points between NUL (0x00) and 0x7F (DEL) which basically means the basic control characters plus basic
English alphabetics and punctuation. (Look at the ASCII chart for an enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be comfortably easy to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by Remove accents from accented characters and Chinese characters to test with gleaned from What's the complete range for Chinese characters in Unicode? (the ones on the boundary of the range are not particularly good test cases so I just guessed a bit).
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.

Perl regex presumably removing non ASCII characters

I found a code with regex where it is claimed that it strips the text of any non-ASCII characters.
The code is written in Perl and the part of code that does it is:
$sentence =~ tr/\000-\011\013-\014\016-\037\041-\055\173-\377//d;
I want to understand how this regex works and in order to do this I have used regexr. I found out that \000, \011, \013, \014, \016, \037, \041, \055, \173, \377 mean separate characters as NULL, TAB, VERTICAL TAB ... But I still do not get why "-" symbols are used in the regex. Do they really mean "dash symbol" as shown in regexr or something else? Is this regex really suited for deleting non-ASCII characters?
This isn't really a regex. The dash indicates a character range, like inside a regex character class [a-z].
The expression deletes some ASCII characters, too (mainly whitespace) and spares a range of characters which are not ASCII; the full ASCII range would simply be \000-\177.
To be explicit, the d flag says to delete any characters not between the first pair of slashes. See further the documentation.

How to remove <96> from a .txt file

I receive a .txt file with a lot of <96> which should be space instead.
In vi, I have done:
:%s/<96>//g
or
:%s/\<96>\//g
but it is still there. I did dos2unix, but it still doesn't remove it. Is it Unicode? If yes, how can I remove it? Thank you!
There's a good chance those aren't the four literal characters <, 9, 6 and >. Instead, they're probably the single character formed by the byte 0x96, which Vim renders as <96>.
You can see that by executing (from bash):
printf '123\x96abc\x96def' > file.txt ; vi file.txt
and you should see:
123<96>abc<96>def
To get rid of them, you can just use sed with something like (assuming your sed has in-place replacement):
sed -i.save 's/\x96//g' file.txt
You can also do this within vim itself, you just have to realise that you can enter arbitrary characters with CTRL-V (or CTRL-Q if CTRL-V is set up for paste). See here for details, paraphrased and shortened here to ensure answer is self-contained:
It is possible to enter any character which can be displayed in your current encoding, if you know the character value, as follows (^V means CTRL-V, or CTRL-Q if you use CTRL-V to paste):
Decimal: ^Vnnn, 000..255.
Octal: ^Vonnn, 000..377.
Hex: ^Vxnn, 00..ff.
Hex, BMP Unicode: ^Vunnnn, 0000..FFFF.
Hex, any Unicode: ^VUnnnnnnnn, 00000000..7FFFFFFF.
In all cases, initial zeros may be omitted if the next character typed is not a digit in the given base (except, of course, that the value zero must be entered as at least one zero).
Hex digits A-F, when used, can be typed in upper or lower case, or even in any mixture of them.
The key sequence you therefore want (assuming you want them replaced with spaces) is:
:%s/<CTRL-V>x96/ /g

Flex(lexer) support for unicode

I am wondering if the newest version of flex supports unicode?
If so, how can use patterns to match Chinese characters?
More:
Use regular expression to match ANY Chinese character in utf-8 encoding
At the moment, flex only generates 8-bit scanners which basically limits you to use UTF-8. So if you have a pattern:
肖晗 { printf ("xiaohan\n"); }
it will work as expected, as the sequence of bytes in the pattern and in the input will be the same. What's more difficult is character classes. If you want to match either the character 肖 or 晗, you can't write:
[肖晗] { printf ("xiaohan/2\n"); }
because this will match each of the six bytes 0xe8, 0x82, 0x96, 0xe6, 0x99 and 0x97, which in practice means that if you supply 肖晗 as the input, the pattern will match six times. So in this simple case, you have to rewrite the pattern to (肖|晗).
For ranges, Hans Aberg has written a tool in Haskell that transforms these into 8-bit patterns:
Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0
This isn't pretty, but it should work.
Flex does not support Unicode. However, Flex supports "8 bit clean" binary input. Therefore you can write lexical patterns which match UTF-8. You can use these patterns in specific lexical areas of the input language, for instance identifiers, comments or string literals.
This will work for well for typical programming languages, where you may be able to assert to the users of your implementation that the source language is written in ASCII/UTF-8 (and no other encoding is supported, period).
This approach won't work if your scanner must process text that can be in any encoding. It also won't work (very well) if you need to express lexical rules specifically for Unicode elements. I.e. you need Unicode characters and Unicode regexes in the scanner itself.
The idea is that you can recognize a pattern which includes UTF-8 bytes using a lex rule, (and then perhaps take the yytext, and convert it out of UTF-8 or at least validate it.)
For a working example, see the source code of the TXR language, in particular this file: http://www.kylheku.com/cgit/txr/tree/parser.l
Scroll down to this section:
ASC [\x00-\x7f]
ASCN [\x00-\t\v-\x7f]
U [\x80-\xbf]
U2 [\xc2-\xdf]
U3 [\xe0-\xef]
U4 [\xf0-\xf4]
UANY {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UANYN {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UONLY {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
As you can see, we can define patterns to match ASCII characters as well as UTF-8 start and continuation bytes. UTF-8 is a lexical notation, and this is a lexical analyzer generator, so ... no problem!
Some explanations: The UANY means match any character, single-byte ASCII or multi-byte UTF-8. UANYN means like UANY but no not match the newline. This is useful for tokens that do not break across lines, like say a comment from # to the end of the line, containing international text. UONLY means match only a UTF-8 extended character, not an ASCII one. This is useful for writing a lex rule which needs to exclude certain specific ASCII characters (not just newline) but all extended characters are okay.
DISCLAIMER: Note that the scanner's rules use a function called utf8_dup_from to convert the yytext to wide character strings containing Unicode codepoints. That function is robust; it detects problems like overlong sequences and invalid bytes and properly handles them. I.e. this program is not relying on these lex rules to do the validation and conversion, just to do the basic lexical recognition. These rules will recognize an overlong form (like an ASCII code encoded using several bytes) as valid syntax, but the conversion function will treat them properly. In any case, I don't expect UTF-8 related security issues in the program source code, since you have to trust source code to be running it anyway (but data handled by the program may not be trusted!) If you're writing a scanner for untrusted UTF-8 data, take care!
I am wondering if the newest version of flex supports unicode?
If so, how can use patterns to match Chinese characters?
To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++.
RE/flex safely supports the full Unicode standard and accepts UTF-8, UTF-16, and UTF-32 input files without requiring UTF-8 hacks (that can't even support UTF-16/32 input and handle UTF BOM.)
Also, UTF-8 hacks with Flex don't allow you to write Unicode regular expressions such as [肖晗] that are fully supported in RE/flex.
It works seamlessly with Bison to build lexers and parsers.
In fact, with RE/flex we can write any Unicode patterns as UTF-8-based regular expressions in lexer .l specifications, such as:
%option flex unicode
%%
[肖晗] { printf ("xiaohan/2\n"); }
%%
This generates a lexer that scans UTF-8, UTF-16, and UTF-32 files automatically. As per UTF standardization, for UTF-16/32 input a UTF BOM is expected in the input, while an UTF-8 BOM is optional.
We can use global %option unicode to enable Unicode and %option flex to specify Flex specifications. A local modifier (?u:) can be used to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):
%option flex
%%
(?u:[肖晗]) { printf ("xiaohan/2\n"); }
(?u:\p{Han}) { printf ("Han character %s\n", yytext); }
. { printf ("8-bit character %d\n", yytext[0]); }
%%
Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.

What is the range of Unicode Printable Characters?

Can anybody please tell me what is the range of Unicode printable characters? [e.g. Ascii printable character range is \u0020 - \u007f]
See, http://en.wikipedia.org/wiki/Unicode_control_characters
You might want to look especially at C0 and C1 control character http://en.wikipedia.org/wiki/C0_and_C1_control_codes
The wiki says, the C0 control character is in the range U+0000—U+001F and U+007F (which is the same range as ASCII) and C1 control character is in the range U+0080—U+009F
other than C-control character, Unicode also has hundreds of formatting control characters, e.g. zero-width non-joiner, which makes character spacing closer, or bidirectional text control. This formatting control characters are rather scattered.
More importantly, what are you doing that requires you to know Unicode's non-printable characters? More likely than not, whatever you're trying to do is the wrong approach to solve your problem.
This is an old question, but it is still valid and I think there is more to usefully, but briefly, say on the subject than is covered by existing answers.
Unicode
Unicode defines properties for characters.
One of these properties is "General Category" which has Major classes and subclasses. The Major classes are Letter, Mark, Punctuation, Symbol, Separator, and Other.
By knowing the properties of your characters, you can decide whether you consider them printable in your particular context.
You must always remember that terms like "character" and "printable" are often difficult and have interesting edge-cases.
Programming Language support
Some programming languages assist with this problem.
For example, the Go language has a "unicode" package which provides many useful Unicode-related functions including these two:
func IsGraphic(r rune) bool
IsGraphic reports whether the rune is defined as a Graphic by Unicode. Such
characters include letters, marks, numbers, punctuation, symbols, and spaces,
from categories L, M, N, P, S, Zs.
func IsPrint(r rune) bool
IsPrint reports whether the rune is defined as printable by Go. Such
characters include letters, marks, numbers, punctuation, symbols, and
the ASCII space character, from categories L, M, N, P, S and the ASCII
space character. This categorization is the same as IsGraphic except
that the only spacing character is ASCII space, U+0020.
Notice that it says "defined as printable by Go" not by "defined as printable by Unicode". It is almost as if there are some depths the wizards at Unicode dare not plumb.
Printable
The more you learn about Unicode, the more you realise how unexpectedly diverse and unfathomably weird human writing systems are.
In particular whether a particular "character" is printable is not always obvious.
Is a zero-width space printable? When is a hyphenation point printable? Are there characters whose printability depends on their position in a word or on what characters are adjacent to them? Is a combining-character always printable?
Footnotes
ASCII printable character range is \u0020 - \u007f
No it isn't. \u007f is DEL which is not normally considered a printable character. It is, for example, associated with the keyboard key labelled "DEL" whose earliest purpose was to command the deletion of a character from some medium (display, file etc).
In fact many 8-bit character sets have many non-consecutive ranges which are non-printable. See for example C0 and C1 controls.
First, you should remove the word 'UTF8' in your question, it's not pertinent (UTF8 is just one of the encodings of Unicode, it's something orthogonal to your question).
Second: the meaning of "printable/non printable" is less clear in Unicode. Perhaps you mean a "graphical character" ; and one can even dispute if a space is printable/graphical. The non-graphical characters would consist, basically, of control characters: the range 0x00-0x0f plus some others that are scattered.
Anyway, the vast majority of Unicode characters (more than 200.000) are "graphical". But this certainly does not imply that they are printable in your environment.
It seems to me a bad idea, if you intend to generate a "random printable" unicode string, to try to include all "printable" characters.
What you should do is pick a font, and then generate a list of which Unicode characters have glyphs defined for your font. You can use a font library like freetype to test glyphs (test for FT_Get_Char_Index(...) != 0).
Taking the opposite approach to #HoldOffHunger, it might be easier to list the ranges of non-printable characters, and use not to test if a character is printable.
In the style of Regex (so if you wanted printable characters, place a ^):
[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF]
Which accounts for things like separator spaces and joiners
Note that unlike their answer which is a whitelist that ignores all non-latin languages, this blacklist wont permit non-printable characters just because they're in blocks with printable characters (their answer wholly includes Non-Latin, Language Supplement blocks as 'printable', even though it contains things like 'zero-width non-joiner'..).
Be aware though, that if using this or any other solution, for sanitation for example, you may want to do something more nuanced than a blanket replace.
Arguably in that case, non-breaking spaces should change to space, not be removed, and invisible separator should be replaced with comma conditionally.
Then there's invalid character ranges, either [yet] unused or reserved for encoding purposes, and language-specific variation selectors..
NB when using regular expressions, that you enable unicode awareness if it isn't that way by default (for javascript it's via /.../u).
You can tell if you have it correct by attempting to create the regular expression with some multi-byte character ranges.
For example, the above, plus the invalid character range \u{E0100}-\u{E01EF} in javascript:
/[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF\u{E0100}-\u{E01EF}]/u
Without u \u{E0100}-\u{E01EF} equates to \uDB40(\uDD00-\uDB40)\uDDEF, not (\uDB40\uDD00)-(\uDB40\uDDEF), and if replacing you should always enable u even when not including multbyte unicode in the regex itself as you might break surrogate pairs that exist in the text.
What characters are valid?
At present, Unicode is defined as starting from U+0000 and ending at U+10FFFF. The first block, Basic Latin, spans U+0000 to U+007F and the last block, Supplementary Private Use Area-B, spans U+100000 to 10FFFF. If you want to see all of these blocks, see here: Wikipedia.org: Unicode Block; List of Blocks.
Let's break down what's valid/invalid in the Latin Block1.
The Latin Block: TLDR
If you're interested in filtering out either invisible characters, you'll want to filter out:
U+0000 to U+0008: Control
U+000E to U+001F: Device (i.e., Control)
U+007F: Delete (Control)
U+008D to U+009F: Device (i.e., Control)
The Latin Block: Full Ranges
Here's the Latin block, broken up into smaller sections...
U+0000 to U+0008: Control
U+0009 to U+000C: Space
U+000E to U+001F: Device (i.e., Control)
U+0020: Space
U+0021 to U+002F: Symbols
U+0030 to U+0039: Numbers
U+003A to U+0040: Symbols
U+0041 to U+005A: Uppercase Letters
U+005B to U+0060: Symbols
U+0061 to U+007A: Lowercase Letters
U+007B to U+007E: Symbols
U+007F: Delete (Control)
U+0080 to U+008C: Latin1-Supplement symbols.
U+008D to U+009F: Device (i.e., Control)
U+00A0: Non-breaking space. (i.e., )
U+00A1 to U+00BF: Symbols.
U+00C0 to U+00FF: Accented characters.
The Other Blocks
Unicode is famous for supporting non-Latin character sets, so what are these other blocks? This is just a broad overview, see the wikipedia.org page for the full, complete list.
Latin1 & Latin1-Related Blocks
U+0000 to U+007F : Basic Latin
U+0080 to U+00FF : Latin-1 Supplement
U+0100 to U+017F : Latin Extended-A
U+0180 to U+024F : Latin Extended-B
Combinable blocks
U+0250 to U+036F: 3 Blocks.
Non-Latin, Language blocks
U+0370 to U+1C7F: 55 Blocks.
Non-Latin, Language Supplement blocks
U+1C80 to U+209F: 11 Blocks.
Symbol blocks
U+20A0 to U+2BFF: 22 Blocks.
Ancient Language blocks
U+2C00 to U+2C5F: 1 Block (Glagolitic).
Language Extensions blocks
U+2C60 to U+FFEF: 66 Blocks.
Special blocks
U+FFF0 to U+FFFF: 1 Block (Specials).
One approach is to render each character to a texture and manually check if it is visible. This solution excludes spaces.
I've written such a program and used it to determine there are roughly 467241 printable characters within the first 471859 code points. I've selected this number because it covers all of the first 4 Planes of Unicode, which seem to contain all printable characters. See https://en.wikipedia.org/wiki/Plane_(Unicode)
I would much like to refine my program to produce the list of ranges, but for now here's what I am working with for anyone who needs immediate answers:
https://editor.p5js.org/SamyBencherif/sketches/_OE8Y3kS9
I am posting this tool because I think this question attracts a lot of people who are looking for slightly different applications of knowing printable ranges. Hopefully this is useful, even though it does not fully answer the question.
The printable Unicode character range, excluding the hex, is 32 to 126 in the int datatype.
Unicode, stict term, has no range. Numbers can go infinite.
What you gave is not UTF8 which has 1 byte for ASCII characters.
As for the range, I believe there is no range of printable characters. It always evolves. Check the page I gave above.