Two different eye emojis? - unicode

As far as I knew, there are currently two emojis for eyes. The pair of eyes (U+1F440) with hex code f09f9180 (👀), and a single eye (U+1F441) with hex code f09f9181 (👁).
Using the emoji keyboard on my phone, I have now found that another eye emoji exists, with hex code f09f9181efb88f (👁️).
The Gajim messenger on the PC and the Conversations app on the phone can both display them, but Gajim's emoji chooser only contains the short sequence, while the SwiftKey keyboard's emoji chooser only offers the longer one.
When I copy and paste the emojis, e.g. into the Firefox URL address bar, they look the same (a blue eye, while both messengers display them in black). When I Google for the emojis, I only find pages describing the shorter code point.
Firefox renders both emojis the same, but Vivaldi (Chromium based) shows the one with the shorter code point as a narrow black-and-white emoji and the other one as a larger brown eye.
When I Google for the hex dump, I find a lot of emojipedia sites for the shorter dump, and nothing useful at all for the longer one.
Is there somewhere any documentation about the additional emoji? Why aren't both emojis available in both emoji choosers?

f0 9f 91 80 is the UTF-8 encoded form of codepoint U+1F440.
f0 9f 91 81 is the UTF-8 encoded form of codepoint U+1F441.
f0 9f 91 81 ef b8 8f is the UTF-8 encoded form of codepoints U+1F441 U+FE0F.
U+FE0F is a Variation Selector:
Variation Selectors is a Unicode block containing 16 Variation Selector format characters (designated VS1 through VS16). They are used to specify a specific glyph variant for a Unicode character. They are currently used to specify standardized variation sequences for mathematical symbols, emoji symbols, 'Phags-pa letters, and CJK unified ideographs corresponding to CJK compatibility ideographs. At present only standardized variation sequences with VS1, VS15 and VS16 have been defined.
Where U+FE0F is VARIATION SELECTOR-16:
U+FE0F was added to Unicode in version 3.2 (2002). It belongs to the block Variation Selectors in the Basic Multilingual Plane.
This character is a Nonspacing Mark and inherits its script property from the preceding character.
The glyph is not a composition. It has an Ambiguous East Asian Width. In bidirectional context it acts as Nonspacing Mark and is not mirrored. In text U+FE0F behaves as Combining Mark regarding line breaks. It has type Extend for sentence and Extend for word breaks. The Grapheme Cluster Break is Extend.
This codepoint may change the appearance of the preceding character. If that is a symbol, dingbat or emoji, U+FE0F forces it to be rendered as a colorful image as compared to a monochrome text variant. The Unicode standard defines some standardized variants. See also “Unicode symbol as text or emoji” for a discussion of this codepoint.
In other words, U+FE0F tells VS-aware software to render U+1F441 as a colorful emoji instead of as monochromatic text.
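You can check this yourself; a minimal sketch, assuming Python 3 and its standard unicodedata module:

import unicodedata

for dump in ("f09f9180", "f09f9181", "f09f9181efb88f"):
    text = bytes.fromhex(dump).decode("utf-8")
    print(dump, [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text])

# prints the code point(s) behind each dump:
# f09f9180:       U+1F440 EYES
# f09f9181:       U+1F441 EYE
# f09f9181efb88f: U+1F441 EYE, U+FE0F VARIATION SELECTOR-16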

The singular ‘👁’ is used as an emoji, but is defined as being text-style (i.e. black-and-white rather than colourful) by default. This isn’t implemented consistently across all platforms, however, so sometimes the character will also display as emoji style instead. In order to explicitly force one or the other style, the characters U+FE0E and U+FE0F can be appended to 👁 to make it appear as text style (👁︎) or emoji style (👁️) respectively. Because of the inconsistencies I mentioned, some devices and applications automatically add U+FE0F to the character (resulting in the longer code your phone keyboard produced), while others leave the character as-is (leaving just the code for the eye itself).
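As a concrete illustration of that last point, here is a small sketch (assuming Python 3) that builds both explicit forms and prints their UTF-8 dumps, matching the ones from the question:

EYE = "\U0001F441"
text_style  = EYE + "\uFE0E"   # U+FE0E VARIATION SELECTOR-15: text presentation
emoji_style = EYE + "\uFE0F"   # U+FE0F VARIATION SELECTOR-16: emoji presentation
print(text_style.encode("utf-8").hex())    # f09f9181efb88e
print(emoji_style.encode("utf-8").hex())   # f09f9181efb88f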

Related

How do you determine the byte width of a UTF-16 character?

What are the rules for reading a UTF-16 byte stream, to determine how many bytes a character takes up? I've read the standards, but based on empirical observations of real-world UTF-16 encoded streams, it looks like there are certain cases where the standards don't hold true (or there's an aspect of the standard that I'm missing).
From reading the UTF-16 standard https://www.rfc-editor.org/rfc/rfc2781:
Value of leading 2 bytes   Resulting character length (bytes)
0x0000-0xD7FF              2
0xD800-0xDBFF              4
0xDC00-0xDFFF              Invalid sequence (RFC2781 2.2.2)
0xE000-0xFFFF              2
In practice, this appears to hold true, for some cases at least. I verified it using an ad-hoc SQL script (SQL Server 2019; UTF-16 collation), and also with an online decoder:
Character   Unicode Name                 ISO 10646   UTF-16 Encoding (hexadecimal, big endian)   Size (bytes)
A           LATIN CAPITAL LETTER A       U+0041      00 41                                       2
Б           CYRILLIC CAPITAL LETTER BE   U+0411      04 11                                       2
ァ           KATAKANA LETTER SMALL A      U+30A1      30 A1                                       2
🐰          RABBIT FACE                  U+1F430     D8 3D DC 30                                 4
However when encoding the following ISO 10646 character into UTF-16, it appears to be 4 bytes, but reading the leading 2 bytes appears to give no indication that it will be this long:
Character   Unicode Name           UTF-16 Encoding (hexadecimal, big endian)   Size (bytes)
⚕️          STAFF OF AESCULAPIUS   26 95 FE 0F                                 4
Whilst I'd rather keep my question software-agnostic, the following SQL will reproduce this behaviour on Microsoft SQL Server 2019, with the default collation and default language. (Note that SQL Server is little endian.)
select cast(N'⚕️' as varbinary);
----------
0x95260FFE
Quite simply, how/why do you read 0x2695 and think "I'll need to read in the next word for this character."? Why doesn't this appear to align with the published UTF-16 standard?
The formal definition of all of this is called an "extended grapheme cluster," and it's defined in the Unicode Text Segmentation report. As Joachim Sauer notes, it's wise to be careful with the term "character" in Unicode.
Code points are what the "U+...." syntax refers to; a code point attempts to capture a "unit" of written language, for example "an acute accent." But what a reader would think of as a character (for example "an e with an acute accent") is a "grapheme cluster" and is made up of one or more code points. What is ultimately rendered to the screen is a "glyph", which is both context- and font-dependent.
Grapheme clusters in Unicode are actually more subtle than this. Unicode attempts to define them in a "neutral" way. (There's really no such thing as "neutral" when thinking about languages, but Unicode does try.) For example, in Slovak, ch, dz, and dž are each one letter, but are considered two grapheme clusters in Unicode. (Try to count the "letters" in a Slovak word. There are words that contain the letter dz and other words that have the letter d followed by the letter z. Oh human writing systems. I love you so much.)
The mapping of grapheme clusters to glyphs is also complex. For example, in Arabic, the single glyph لا is actually two grapheme clusters, ل (ARABIC LETTER LAM) followed by ا (ARABIC LETTER ALEF). If you use your mouse to select the glyph, you'll see there are two selectable pieces, and if you copy and paste them to another window you'll see them transform into their component parts. (Just to make things even more complicated, Unicode also defines a single code point for the ligature, ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM: ﻻ. If you try to select part of that one, you'll find you can't. It's one "character.")
Your specific case is a bit more special. The Variation Selector predates emoji, and is mostly designed to handle different variations of Han (Chinese) characters. However, as with every Unicode feature, it eventually has come to be used primarily for emoji. VS-16 is the "emoji" presentation form. The most famous example is the red heart, which is HEAVY BLACK HEART ❤, followed by VS-16: ❤️.
Similarly, your character U+2695 STAFF OF AESCULAPIUS is a single code point, and it looks like this by default (text style): ⚕. When you add VS-16, it is rendered in "emoji style": ⚕️. In some ways it's the same "character." Or is it? Depends on what you're using it for.
Emoji style is typically a bit larger and centered in its block, sometimes adding color. Notice where the period after the staff is drawn in each case (there are no extra spaces in the second example; the glyph is just much wider).
There are other combining systems as well:
U+0031: 1
U+0031 U+20e3: 1⃣ (+ COMBINING ENCLOSING KEYCAP, default text style)
U+0031 U+20e3 U+fe0f: 1⃣️ (+ VARIATION SELECTOR-16, emoji style)
All of these predate emoji. Modern emoji is dramatically more complicated, and includes several combining systems of its own (including two that are currently just used for flags).
But luckily, to your actual question, your wife is correct, and you can generally just consume all trailing code points that are marked "combining" to form an extended grapheme cluster, and that is kind of a "character" for some broad enough definition of "character."
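To make that concrete, here is a rough sketch (assuming Python 3 and its standard unicodedata module) that glues trailing combining marks (general categories Mn, Mc, Me) onto the preceding base character. Real extended grapheme clusters follow UAX #29 and handle many more cases (ZWJ sequences, regional indicators, Hangul, and so on), e.g. via the third-party regex module's \X, but this naive version already groups the sequences discussed here:

import unicodedata

def naive_clusters(text):
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch          # attach the combining mark to its base
        else:
            clusters.append(ch)         # start a new cluster
    return clusters

print(naive_clusters("\u2695\uFE0F1\u20E3"))   # ['⚕️', '1⃣']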
All of your assertions are completely correct; your interpretation of the UTF-16 standards is correct and complete.
In your empirical observations, however, you've assumed that you only have one character. In actuality, you've run into a nuance of the Unicode implementation. Your "character" is actually two (albeit technically, not visually): U+2695 "STAFF OF AESCULAPIUS" followed by U+FE0F "VARIATION SELECTOR-16". The second character is a non-spacing mark which combines with the base character for the purpose of rendering a character variant.
This results in the byte sequence 26 95 FE 0F; however, as you note, neither of the words falls within the UTF-16 surrogate range reserved for extension. That is because neither of them requires the UTF-16 4-byte extension mechanism: they are simply two discrete Unicode characters, each encoded in 2 bytes.
From 7.9 Combining Marks in ISO 10646: Universal Coded Character Set (UCS):
Combining marks are a special class of characters in the Unicode Standard that are intended to combine with a preceding character, called their base.
Combining marks usually have a visible glyphic form... a combining mark may interact graphically with neighbouring characters in various ways.
http://unicode.org/L2/L2010/10038-fcd10646-main.pdf
To explain why I'm answering my own question: I had my SO question all ready to fire off. My wife came into my office; after looking over my shoulder she whispered into my ear, "You know combination characters are a thing, right?". I've still asked the question and answered it myself, however, in case my wife's sweet nothings help another member of the community.
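For completeness, here is a minimal UTF-16BE reader sketch (assuming Python 3) that follows the leading-word rule from RFC 2781 and shows the dump above decoding into exactly two BMP code points:

import struct

def decode_utf16be(data: bytes):
    i = 0
    while i < len(data):
        (w1,) = struct.unpack_from(">H", data, i)
        if 0xD800 <= w1 <= 0xDBFF:          # lead surrogate: consume a second word
            (w2,) = struct.unpack_from(">H", data, i + 2)
            yield 0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00)
            i += 4
        elif 0xDC00 <= w1 <= 0xDFFF:        # lone trail surrogate: invalid
            raise ValueError(f"invalid sequence at offset {i}")
        else:                               # any other word is a whole BMP code point
            yield w1
            i += 2

print([hex(cp) for cp in decode_utf16be(bytes.fromhex("2695FE0F"))])   # ['0x2695', '0xfe0f']
print([hex(cp) for cp in decode_utf16be(bytes.fromhex("D83DDC30"))])   # ['0x1f430']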

What range of unicode characters should be kept in a #font-face web font for a US based website with a US audience?

As part of optimizing a web development project, we need to strip out unnecessary characters that are never going to be used to reduce the size of font files. I have searched Google and found nothing canonical on the subject of which characters are required and which are safe to remove.
I've found the following ranges that may be of interest:
0020 — 007F Basic Latin
00A0 — 00FF Latin-1 Supplement
0100 — 017F Latin Extended-A
0180 — 024F Latin Extended-B
0250 — 02AF IPA Extensions
02B0 — 02FF Spacing Modifier Letters
0300 — 036F Combining Diacritical Marks
27C0 — 27EF Miscellaneous Mathematical Symbols-A
It seems that the most aggressive approach would be to only keep "Basic Latin", 0020 — 007F, which provides upper and lower-case letters, numbers and a few basic symbols, like the $, +, (, ), etc.
Latin-1 Supplement contains some extra goodies like Trademark and Copyright symbols and fractions.
Latin Extended-A and -B contain letters with accent marks, and since our copy is in English, I'm not sure if these will ever be needed.
If we use only those ranges (0020 — 007F) and (00A0 — 00FF), will we run into problems down the line with missing characters, should some user decide to post a comment in Spanish (for example)? Or will the browser fall back to a default font for characters that aren't included in the web font?
The point of a web-font is to make the main bodies of text and headlines look pretty, which the basic latin set should cover, but I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font, etc.
What range of Unicode characters should be kept in a #font-face web font for a US based website with a US audience? Are there any best practices or guidelines for stripping unnecessary characters from a font for web use?
I would recommend subsetting to one of the common "code page" definitions that support US/Western Europe. Most code page definitions pre-date Unicode and typically have the bits and pieces needed for various regional support without including entire Unicode blocks. Suggestions:
Windows Code Page 1252
ISO/IEC 8859-1 "Latin 1"*
ISO/IEC 8859-15
*This is the same as Unicode Ranges 0020-007F Basic Latin + 00A0-00FF Latin-1 Supplement
These include much more than is strictly required for US English, though as noted above, several accented characters commonly appear in English text (é, ñ, as well as other punctuation marks and symbols). These sets include those characters, so you should be in good shape for the vast majority of text for a U.S. audience. Note also that in most fonts, these characters are typically "composites", which means that they use a reference to the components (e.g. 'é' is built from references to 'e' and '´'); as such, they don't normally require as much size to store them, so retaining them usually won't incur a major size penalty.
If you might encounter European financial text, I'd suggest either Windows 1252 or ISO/IEC 8859-15 which include the Euro currency symbol.
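If you end up subsetting programmatically, one possible sketch uses the fontTools library (the pyftsubset command-line tool wraps the same code); the font file names here are placeholders, and woff2 output additionally requires the brotli package:

from fontTools import subset

subset.main([
    "MyWebFont.ttf",                          # hypothetical input font
    "--unicodes=U+0020-007F,U+00A0-00FF",     # Basic Latin + Latin-1 Supplement
    "--flavor=woff2",
    "--output-file=MyWebFont-subset.woff2",
])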
I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font
Any characters that don't exist in the font you are using will fall back to any default font the browser can find with the characters in. This will likely be ugly when interleaved with other characters from your custom font, but modern OSes provide decent font coverage for commonly-used characters from the above blocks so typically it will still be readable.
So you should include characters based on whether you think they'll be used commonly enough that having them rendered in an ugly font is a deal-breaker. For what it's worth, a pretty minimal set I have used before for a similar purpose is ¡£°±²³¿ÉËÑéëñ‘’“”–—•€™, but your site's exact requirements may vary. (For example, if you coöpted the New-Yorker-style diaeresis you would certainly want äëïöü.)
(How exactly default fallback fonts work varies between browsers and was famously troublesome in older versions of IE, and IE Mobile. But the basic accented Latin letters are pretty safe.)

What is the meaning of the indicator XXX in the Unicode charts

Consider the Unicode chart for C1 Controls and Latin-1 Supplement in the Unicode Charts. If a character has a glyph, it is shown; if it does not have a glyph, a special dotted box with a symbolic marker or identifier is given. In this case, both 0080 and 0081 seem to have some "invalid marker", which I think is what "XXX" means. Is that what it means?
Secondly, what should be the behaviour of a Unicode-aware string type that has the value 0x80 (hex) or 128 (decimal) stored into the string? Should it be converted to some other code point, such as a mapping like this:
Byte Value 128 in many ANSI Codepages is the EURO marker.
Storing a 128 decimal value is equivalent to storing U+20AC ?
The magic "non-orthogonality" I have encountered in a particular language or operating system API implementation of its MBCS and Unicode types, and Java's interesting handling, leads me to wonder: what is the real intended use of the U+0080 character? This reference link confuses me by showing that Java treats this character as a Euro symbol (ANSI codepage to Unicode one-way friendliness), but its name is <control>, which is not anything I know how to deal with. Wikipedia says it's PAD here
Can anyone help me? Did I skip a foundational concepts day at Unicode School? What am I missing?
Update: The block from 0080 to 0098 consists of non-printable control characters. This much I know. What I wonder is: what does the XXX mean, and how am I to think of this character when I am processing Unicode data with this value in it?
According to the explanation in Ch. 17 (About the Code Charts) of the Unicode Standard, p. 573, by the “Dashed Box Convention”, characters that have no visible rendering as such “are represented by a square dashed box. This box surrounds a short mnemonic abbreviation of the character’s name.” The characters referred to in the questions are control characters, in the C1 Controls area.
The Unicode Standard says, in Ch. 16, p. 544, about C0 and C1 Controls: “The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.” And the abbreviations in the square dashed boxes reflect the meanings given in ISO/IEC 6429:1992.
Some code points in the C1 Controls area are not defined in ISO/IEC 6429:1992. For them, such as U+0080, the code chart has “XXX” in place of a mnemonic abbreviation. So this indicates that the Unicode standard does not refer to any meaning for those code points, beyond their being control characters with some abstract properties.
Thus, “XXX” does not mean “invalid”, but rather “completely undefined meaning”. The meaning of such code points can be defined by various standards or other conventions, as long as they are consistent with the general definitions—e.g., it would be incompatible to define U+0080 as a graphic character.
Such code points must not be replaced or omitted in any character-level processing; applications that actually change data may do whatever they want, but any general conversion routines, for example, must keep these code points (characters) intact. They must not be treated as malformed or invalid; but an application may treat them as undefined. By Unicode principles, it’s OK to be ignorant of a character, but not completely wrong about it.
This has nothing to do with the meaning of bytes like 0x80 in 8-bit codes like Windows-1252. But if you send e.g. data labeled as ISO-8859-1 encoded (where e.g. 0x80 is in principle U+0080) to a web browser, it will actually treat it as Windows-1252 encoded. The reason is that characters like U+0080 are practically never used in ISO-8859-1 data; occurrence of 0x80 in ISO-8859-1 labeled data is virtually always either windows-1252 mislabeled or messed-up data that cannot be meaningfully processed. So browsers take the practical route and treat ISO-8859-1 as windows-1252; this is being formalized in HTML5 and related specifications.
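A small sketch (assuming Python 3) of both points above: U+0080 is a C1 control character as far as Unicode is concerned, while the same byte 0x80 means the Euro sign only under windows-1252:

import unicodedata

b = b"\x80"
as_latin1 = b.decode("latin-1")    # ISO-8859-1: byte 0x80 -> U+0080, a C1 control
as_cp1252 = b.decode("cp1252")     # windows-1252: byte 0x80 -> U+20AC EURO SIGN
print(hex(ord(as_latin1)), unicodedata.category(as_latin1))   # 0x80 Cc
print(hex(ord(as_cp1252)), unicodedata.name(as_cp1252))       # 0x20ac EURO SIGN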

What is the unicode variation selector

I was wondering: what are the Unicode Variation Selectors U+FE00 to U+FE0F used for?
Example: ︀︁︂︂ 
The Unicode standard talks about this. Here's a bit of the relevant section from 3.2.0, annex 28 (I'm sure there are more recent versions around; this is the first I found):
Unicode characters can be represented by a wide variety of glyphs, as discussed in Chapter 2, General Structure in The Unicode Standard, Version 3.0. Occasionally the need arises in text processing to restrict or change the set of glyphs that are to be used to represent a character. Normally such changes are indicated by choice of font or style in rich-text documents. In special circumstances, such a variation from the normal range of appearance needs to be expressed side-by-side in the same document in plain-text contexts, where it is impossible or inconvenient to exchange formatted text. For example, in languages employing the Mongolian script, sometimes a specific variant range of glyphs is needed for a specific textual purpose for which the range of “generic” glyphs is considered inappropriate. The variation selectors are used when characters have essentially the same semantic.
Variation selectors provide a mechanism for specifying a restriction on the set of glyphs that are used to represent a particular character. They also provide a mechanism for specifying variants, such as for CJK Ideographs and Mongolian, that have essentially the same semantic but have substantially different ranges of glyphs. A variation sequence, which always consists of a base character followed by the variation selector, may be specified as part of the Unicode Standard. That sequence is referred to as a variant of the base character. The variation selector affects only the appearance of the base character,* and only in the variation sequences defined in this Standard. The variation selector is not used as a general code extension mechanism.
(It goes on...)
You may also be interested in the Standardized Variants (this time from 6.0.0).
This is not a complete answer to the question, but it's pertinent to Emojis and Variant Selectors:
The ❤ character (U+2764 code point) is a Unicode character from 1993.
But the ❤️ emoji is actually the ❤ (U+2764) character followed by the Variant Selector-16 (U+FE0F).
Why?
Exclusively speaking about Emojis (documentation):
VS15 and VS16 are reserved to determine whether or not a character
should be displayed as an emoji. [...]
Emoji variation sequences contain VS16 (U+FE0F) for emoji-style (with color) or VS15 (U+FE0E) for text style (monochrome)
If there is a character (or symbol, glyph, etc.) that is also intended to be an emoji, appending Variation Selector-16 tells the renderer to render it as an emoji. If the same character is instead followed by Variation Selector-15, the renderer is told to render it as plain text. If no variation selector is appended, the default representation depends on the Unicode specification: for Emoticons the default is emoji style; for other characters like ❤, the default is text style.
Another example from Emoticons (Unicode_block)'s documentation:
Each emoticon has two variants:
U+FE0E (VARIATION SELECTOR-15) selects text presentation (e.g. 😊︎ 😐︎ ☹︎)
U+FE0F (VARIATION SELECTOR-16) selects emoji-style (e.g. 😊️ 😐️ ☹️).
If there is no variation selector appended, the default is the
emoji-style. Example:
U+1F610 (NEUTRAL FACE) 😐
U+1F610 (NEUTRAL FACE), U+FE0E (VARIATION SELECTOR-15) 😐︎
U+1F610 (NEUTRAL FACE), U+FE0F (VARIATION SELECTOR-16) 😐️
Note: VS15 and VS16 are not mandatory for a valid emoji. There are a lot of emoji without variation selectors.
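A minimal sketch of the sequences listed above (assuming Python 3 and its standard unicodedata module), which makes the invisible selectors easy to see:

import unicodedata

for seq in ("\u2764", "\u2764\uFE0F", "\U0001F610", "\U0001F610\uFE0E"):
    print(" + ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in seq))

# U+2764 HEAVY BLACK HEART
# U+2764 HEAVY BLACK HEART + U+FE0F VARIATION SELECTOR-16
# U+1F610 NEUTRAL FACE
# U+1F610 NEUTRAL FACE + U+FE0E VARIATION SELECTOR-15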
Your guess is as good as mine... but according to this source... has got it:
Emoji Character Encoding Data Hints: (1) In iOS 5 / OSX 10.7, the underlying code that the Apple OS generates for this emoji was changed. (2) The code generated for this emoji was changed slightly in iOS 7 / OSX 10.9 (a variation selector was added) to make it easier for this emoji to be identified and shown in OSX and iOS. We don't mind, Apple, thank you! We just love our emojis!
Their chart goes on to note that this "new", post-10.9 version has a UTF-8 Character Count of 2 vs the previous 1... if that helps.
The Variation Selectors range was introduced with version 3.2 of the Unicode Standard, and is located in Plane 0, the Basic Multilingual Plane. Further selectors can be found in the Variation Selectors Supplement range.
Most Unicode characters can be represented by a wide variety of glyphs, and in rich text a particular glyph can be indicated by choosing a particular font or style. This mechanism is not available in plain text, and so variation selectors have been introduced as a way of indicating that the glyphs applicable to a particular character should be changed or restricted. The base character is followed by the variation selector, the combination being called a variation sequence. This is not intended to be a general-purpose mechanism, and the only permitted variation sequences are those defined in the Standardized Variants file, which forms part of the Unicode Character Database.
From http://www.alanwood.net/unicode/variation_selectors.html

What is the range of Unicode Printable Characters?

Can anybody please tell me what the range of Unicode printable characters is? [e.g. the ASCII printable character range is \u0020 - \u007f]
See http://en.wikipedia.org/wiki/Unicode_control_characters
You might want to look especially at the C0 and C1 control codes: http://en.wikipedia.org/wiki/C0_and_C1_control_codes
The wiki says the C0 control characters are in the range U+0000—U+001F plus U+007F (the same range as in ASCII), and the C1 control characters are in the range U+0080—U+009F.
Other than the C0/C1 control characters, Unicode also has hundreds of formatting control characters, e.g. the zero-width non-joiner, which makes character spacing closer, or the bidirectional text controls. These formatting control characters are rather scattered.
More importantly, what are you doing that requires you to know Unicode's non-printable characters? More likely than not, whatever you're trying to do is the wrong approach to solve your problem.
This is an old question, but it is still valid and I think there is more to usefully, but briefly, say on the subject than is covered by existing answers.
Unicode
Unicode defines properties for characters.
One of these properties is "General Category" which has Major classes and subclasses. The Major classes are Letter, Mark, Punctuation, Symbol, Separator, and Other.
By knowing the properties of your characters, you can decide whether you consider them printable in your particular context.
You must always remember that terms like "character" and "printable" are often difficult and have interesting edge-cases.
Programming Language support
Some programming languages assist with this problem.
For example, the Go language has a "unicode" package which provides many useful Unicode-related functions including these two:
func IsGraphic(r rune) bool
IsGraphic reports whether the rune is defined as a Graphic by Unicode. Such
characters include letters, marks, numbers, punctuation, symbols, and spaces,
from categories L, M, N, P, S, Zs.
func IsPrint(r rune) bool
IsPrint reports whether the rune is defined as printable by Go. Such
characters include letters, marks, numbers, punctuation, symbols, and
the ASCII space character, from categories L, M, N, P, S and the ASCII
space character. This categorization is the same as IsGraphic except
that the only spacing character is ASCII space, U+0020.
Notice that it says "defined as printable by Go" not by "defined as printable by Unicode". It is almost as if there are some depths the wizards at Unicode dare not plumb.
Printable
The more you learn about Unicode, the more you realise how unexpectedly diverse and unfathomably weird human writing systems are.
In particular whether a particular "character" is printable is not always obvious.
Is a zero-width space printable? When is a hyphenation point printable? Are there characters whose printability depends on their position in a word or on what characters are adjacent to them? Is a combining-character always printable?
Footnotes
ASCII printable character range is \u0020 - \u007f
No it isn't. \u007f is DEL which is not normally considered a printable character. It is, for example, associated with the keyboard key labelled "DEL" whose earliest purpose was to command the deletion of a character from some medium (display, file etc).
In fact many 8-bit character sets have many non-consecutive ranges which are non-printable. See for example C0 and C1 controls.
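If you want something along the lines of Go's IsPrint/IsGraphic outside Go, here is a rough sketch (assuming Python 3 and its standard unicodedata module) built on the General Category; as discussed above, what counts as "printable" is ultimately your own judgement call:

import unicodedata

def is_print(ch: str) -> bool:
    cat = unicodedata.category(ch)
    # Letters, Marks, Numbers, Punctuation, Symbols, plus the ASCII space.
    return cat[0] in "LMNPS" or ch == " "

print(is_print("A"), is_print("\u00e9"), is_print("\u007f"), is_print("\u200b"))
# True True False False   (é is a letter; DEL is a control; ZERO WIDTH SPACE is a format character)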
First, you should remove the word 'UTF8' in your question, it's not pertinent (UTF8 is just one of the encodings of Unicode, it's something orthogonal to your question).
Second: the meaning of "printable/non-printable" is less clear in Unicode. Perhaps you mean a "graphical character"; one can even dispute whether a space is printable/graphical. The non-graphical characters would consist, basically, of control characters: the range 0x00-0x1f plus some others that are scattered.
Anyway, the vast majority of Unicode characters (more than 200,000) are "graphical". But this certainly does not imply that they are printable in your environment.
It seems to me a bad idea, if you intend to generate a "random printable" unicode string, to try to include all "printable" characters.
What you should do is pick a font, and then generate a list of which Unicode characters have glyphs defined for your font. You can use a font library like freetype to test glyphs (test for FT_Get_Char_Index(...) != 0).
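A possible sketch of that glyph test, using the freetype-py binding (the font path is a placeholder); a code point is covered by the font if it maps to a non-zero glyph index, mirroring the FT_Get_Char_Index(...) != 0 check mentioned above:

import freetype

face = freetype.Face("MyFont.ttf")            # hypothetical font file
covered = [cp for cp in range(0x20, 0x30000) if face.get_char_index(cp) != 0]
print(len(covered), "code points have a glyph in this font")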
Taking the opposite approach to HoldOffHunger's answer, it might be easier to list the ranges of non-printable characters, and negate the match to test whether a character is printable.
In the style of a regex character class (so if you want printable characters, place a ^ at the start):
[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF]
Which accounts for things like separator spaces and joiners
Note that unlike their answer, which is a whitelist that ignores all non-Latin languages, this blacklist won't permit non-printable characters just because they're in blocks with printable characters (their answer wholly includes the Non-Latin, Language Supplement blocks as 'printable', even though they contain things like the zero-width non-joiner...).
Be aware though, that if using this or any other solution, for sanitation for example, you may want to do something more nuanced than a blanket replace.
Arguably, in that case, non-breaking spaces should be changed to a space rather than removed, and the invisible separator should conditionally be replaced with a comma.
Then there are invalid character ranges, either [as yet] unused or reserved for encoding purposes, and language-specific variation selectors...
NB: when using regular expressions, make sure you enable Unicode awareness if it isn't on by default (for JavaScript it's via /.../u).
You can tell if you have it correct by attempting to create the regular expression with some multi-byte character ranges.
For example, the above, plus the invalid character range \u{E0100}-\u{E01EF} in javascript:
/[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF\u{E0100}-\u{E01EF}]/u
Without u, \u{E0100}-\u{E01EF} equates to \uDB40(\uDD00-\uDB40)\uDDEF, not (\uDB40\uDD00)-(\uDB40\uDDEF); and if replacing, you should always enable u even when the regex itself contains no multibyte Unicode, as you might otherwise break surrogate pairs that exist in the text.
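For comparison, here is the same blacklist idea as a sketch in Python 3 (re module): Python strings are sequences of code points, so astral ranges such as U+E0100-U+E01EF can be written directly with \U escapes and no /u-style flag is needed.

import re

NON_PRINTABLE = re.compile(
    "[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F"
    "\u205F-\u206F\u3000\uFEFF\U000E0100-\U000E01EF]"
)

print(NON_PRINTABLE.sub("", "Hi\u200Bthere\uFEFF"))   # -> Hithere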
What characters are valid?
At present, Unicode is defined as starting from U+0000 and ending at U+10FFFF. The first block, Basic Latin, spans U+0000 to U+007F and the last block, Supplementary Private Use Area-B, spans U+100000 to 10FFFF. If you want to see all of these blocks, see here: Wikipedia.org: Unicode Block; List of Blocks.
Let's break down what's valid/invalid in the Latin block (U+0000 to U+00FF).
The Latin Block: TLDR
If you're interested in filtering out invisible characters, you'll want to filter out:
U+0000 to U+0008: Control
U+000E to U+001F: Device (i.e., Control)
U+007F: Delete (Control)
U+008D to U+009F: Device (i.e., Control)
The Latin Block: Full Ranges
Here's the Latin block, broken up into smaller sections...
U+0000 to U+0008: Control
U+0009 to U+000C: Space
U+000E to U+001F: Device (i.e., Control)
U+0020: Space
U+0021 to U+002F: Symbols
U+0030 to U+0039: Numbers
U+003A to U+0040: Symbols
U+0041 to U+005A: Uppercase Letters
U+005B to U+0060: Symbols
U+0061 to U+007A: Lowercase Letters
U+007B to U+007E: Symbols
U+007F: Delete (Control)
U+0080 to U+008C: Latin1-Supplement symbols.
U+008D to U+009F: Device (i.e., Control)
U+00A0: Non-breaking space. (i.e., )
U+00A1 to U+00BF: Symbols.
U+00C0 to U+00FF: Accented characters.
The Other Blocks
Unicode is famous for supporting non-Latin character sets, so what are these other blocks? This is just a broad overview, see the wikipedia.org page for the full, complete list.
Latin1 & Latin1-Related Blocks
U+0000 to U+007F : Basic Latin
U+0080 to U+00FF : Latin-1 Supplement
U+0100 to U+017F : Latin Extended-A
U+0180 to U+024F : Latin Extended-B
Combinable blocks
U+0250 to U+036F: 3 Blocks.
Non-Latin, Language blocks
U+0370 to U+1C7F: 55 Blocks.
Non-Latin, Language Supplement blocks
U+1C80 to U+209F: 11 Blocks.
Symbol blocks
U+20A0 to U+2BFF: 22 Blocks.
Ancient Language blocks
U+2C00 to U+2C5F: 1 Block (Glagolitic).
Language Extensions blocks
U+2C60 to U+FFEF: 66 Blocks.
Special blocks
U+FFF0 to U+FFFF: 1 Block (Specials).
One approach is to render each character to a texture and manually check if it is visible. This solution excludes spaces.
I've written such a program and used it to determine there are roughly 467241 printable characters within the first 471859 code points. I've selected this number because it covers all of the first 4 Planes of Unicode, which seem to contain all printable characters. See https://en.wikipedia.org/wiki/Plane_(Unicode)
I would much like to refine my program to produce the list of ranges, but for now here's what I am working with for anyone who needs immediate answers:
https://editor.p5js.org/SamyBencherif/sketches/_OE8Y3kS9
I am posting this tool because I think this question attracts a lot of people who are looking for slightly different applications of knowing printable ranges. Hopefully this is useful, even though it does not fully answer the question.
Expressed as decimal integers rather than hex, the basic printable character range is 32 to 126.
Unicode, in the strict sense, has no fixed range; the numbers can keep growing.
What you gave is not UTF-8, which uses 1 byte for ASCII characters.
As for the range, I believe there is no fixed range of printable characters; it keeps evolving. Check the page I gave above.