How do you determine the byte width of a UTF-16 character? - unicode

What are the rules for reading a UTF-16 byte stream, to determine how many bytes a character takes up? I've read the standards, but based on empirical observations of real-world UTF-16 encoded streams, it looks like there are certain where the standards don't hold true (or there's an aspect of the standard that I'm missing).
From the reading the UTF-16 standard https://www.rfc-editor.org/rfc/rfc2781:
Value of leading 2 bytes
Resulting character length (bytes)
0x0000-0xC7FF
2
0xD800-0xDBFF
4
0xDC00-0xDFFF
Invalid sequence (RFC2781 2.2.2)
0xDFFF-0xFFFF
4
In practice, this appears to hold true, for some cases at least. Using an ad-hoc SQL script (SQL Server 2019; UTF-16 collation), but also verified with an online decoder:
Character
Unicode Name
ISO 10646
UTF-16 Encoding (hexadecimal, big endian)
Size (bytes)
A
LATIN CAPITAL LETTER A
U+0041
00 41
2
Б
CYRILLIC CAPITAL LETTER BE
U+0411
04 11
2
ァ
KATAKANA LETTER SMALL A
U+30A1
30 A1
2
🐰
RABBIT FACE
U+1F430
D8 3D DC 30
4
However when encoding the following ISO 10646 character into UTF-16, it appears to be 4 bytes, but reading the leading 2 bytes appears to give no indication that it will be this long:
Character
Unicode Name
UTF-16 Encoding (hexadecimal, big endian)
Size (bytes)
⚕️
STAFF OF AESCULAPIUS
26 95 FE 0F
4
Whilst I'd rather keep my question software-agnostic; the following SQL will reproduce this behaviour on Microsoft SQL Server 2019, with default collation and default language. (Note that SQL Server is little endian).
select cast(N'⚕️' as varbinary);
----------
0x95260FFE
Quite simply, how/why do you read 0x2695 and think "I'll need to read in the next word for this character."? Why doesn't this appear to align with the published UTF-16 standard?

The formal definition of all of this is called an "extended grapheme cluster," and it's defined in the Unicode Text Segmentation report. As Joachim Sauer notes, it's wise to be careful with the term "character" in Unicode.
Code points are what "U+...." syntax is referring to, and is attempting to capture a "unit" of written language, for example "an acute accent." But what a reader would think of a character (for example "an e with an acute accent") is a "grapheme cluster" and is made up of one or more code points. What is ultimately rendered to the screen is a "glyph" which is both context- and font-dependent.
Grapheme clusters in Unicode are actually more subtle than this. Unicode attempts to define them in a "neutral" way. (There's really no such thing as "neutral" when thinking about languages, but Unicode does try.) For example, in Slovak, ch, dz, and dž are each one letter, but are considered two grapheme clusters in Unicode. (Try to count the "letters" in a Slovak word. There are words that contain the letter dz and other words that have the letter d followed by the letter z. Oh human writing systems. I love you so much.)
The mapping of grapheme clusters to glyphs is also complex. For example, in Arabic, the single glyph لا is actually two grapheme clusters, ل (ARABIC LETTER LAM) followed by ا (ARABIC LETTER ALEF). If you use your mouse to select the glyph, you'll see there are two selectable pieces, and if you copy and paste them to another window you'll see them transform into their component parts. (Just to make thing even more complicated, Unicode also defines a single code point for ligature, ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM: ﻻ. If you try to select part of that one, you'll find you can't. It's one "character.")
Your specific case is a bit more special. The Variation Selector predates Unicode, and is mostly designed to handle different variations of Han (Chinese) characters. However, as with every Unicode feature, it eventually has come to be used primarily for emoji. VS-16 is the "emoji" presentation form. The most famous example is the red heart, which is HEAVY BLACK HEART ❤, followed by VS-16: ❤️.
Similarly, your character U+2695 STAFF OF AESCULAPIUS is a single code point, and it looks like this by default (text style): ⚕. When you add VS-16, it is rendered in "emoji style": ⚕️. In some ways it's the same "character." Or is it? Depends on what you're using it for.
Emoji style is typically a bit larger and centered in its block, sometimes adding color. Notice where the period after the staff is drawn in each case (there are no extra spaces in the second example; the glyph is just much wider).
There are other combining systems as well:
U+0031: 1
U+0031 U+20e3: 1⃣ (+ COMBINING ENCLOSING KEYCAP, default text style)
U+0031 U+20e3 U+fe0f: 1⃣️ (+ VARIATION SELECTOR-16, emoji style)
All of these predate Unicode. Modern emoji is dramatically more complicated, and includes several combining systems of its own (including two that are currently just used for flags).
But luckily, to your actual question, your wife is correct, and you can generally just consume all trailing code points that are marked "combining" to form an extended grapheme cluster, and that is kind of a "character" for some broad enough definition of "character."

All of your assertions are completely correct; your interpretation of the UTF-16 standards is correct and complete.
In your empirical observations however, you've assumed that you only have one character. In actuality, you've ran into a nuance of the Unicode implementation. Your "character" is actually two (albeit technically, not visually): U+2695 "STAFF OF AESCULAPIUS" followed by U+FE0F "VARIATION SELECTOR-16". The second character is a non-spacing mark which combines with the base character for the purpose of rendering a character variant.
This results in the byte sequence 26 95 FE 0F, however as you note neither of the words fall within the UTF-16 reserved extension character range. But this is because neither of them require the UTF-16 4 byte extension. They're simply classified as two discrete Unicode characters.
From 7.9 Combining Marks in ISO 10646: Universal Coded Character Set (UCS):,
Combining marks are a special class of characters in the Unicode Standard that are
intended to combine with a preceding character, called their base.
Combining marks usually have a visible glyphic form... a combining mark may interact graphically with neighbouring characters in various ways.
http://unicode.org/L2/L2010/10038-fcd10646-main.pdf
To explain why I'm answering my own question; I had my SO question all ready to fire off. My wife came into my office; after looking over my shoulder she whispered into my ear, "You know combination characters are a thing, right?". I've however still asked the question and answered it myself, in case my wife's sweet nothings help another member of the community.

Related

What is the meaning of the indicator XXX in the Unicode charts

Consider the unicode chart for C1 Controls and Latin-1 supplement in Unicode Charts. If a character has a glyph, it is shown, if it does not have a glyph, a special dotted line and symbolic marker or identifier is given. In this case, both 0080 and 0081 seem to have some "invalid marker", which I think is what "XXX" means. Is that what it means?
Secondly, what should be the behaviour of a Unicode aware string type that has a value stored into the string of value 0x80 (hex) or 128 (decimal)? Should it be converted to some other point, such as the mapping like this:
Byte Value 128 in many ANSI Codepages is the EURO marker.
Storing a 128 decimal value is equivalent to storing U+20AC ?
The magic "non orthogonality" I have encountered in a particular language or operating system API implementation of its MBCS and Unicode types, and Java's interesting handling, leads me to wonder, what is the real intended use of the U+0080 character? This reference link confuses me by showing that Java treats this character as a Euro symbol (ANSI codepage to Unicode one way friendliness) but that it's name is <control> which is not anything I know how to deal with. Wikipedia says it's PAD here
Can anyone help me? Did I skip a foundational concepts day at Unicode School? What am I missing?
Update The block from 0080 to 0098 is non printable control characters. This much I know. What I wonder is what does the XXX mean and how am I to think of this character when I am processing unicode data with this value in it?
According to the explanation in Ch. 17 (About the Code Charts) of the Unicode Standard, p. 573, by the “Dashed Box Convention”, characters that have no visible rendering as such “are represented by a square dashed box. This box surrounds a short mnemonic abbreviation of the character’s name.” The characters referred to in the questions are control characters, in the C1 Controls area.
The Unicode Standard says, in Ch. 16, p. 544, about C0 and C1 Controls: “The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. The semantics of the control codes are gen-erally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.” And the abbreviations in the square dashed boxes reflect the meanings given in ISO/IEC 6429:1992.
Some code points in the C1 Controls area are not defined in ISO/IEC 6429:1992. For them, such as U+0080, the code chart has “XXX” in place of a mnemonic abbreviation. So this indicates that the Unicode standard does not refer to any meaning for those code points, beyond their being control characters with some abstract properties.
Thus, “XXX” does not mean “invalid”, but rather “completely undefined meaning”. The meaning of such code points can be defined by various standards or other conventions, as long as they are consistent with the general definitions—e.g., it would be incompatible to define U+0080 as a graphic character.
Such code points must not be replaced or omitted in any character-level processing; applications that actually change data may do whatever they want, but any general conversion routines, for example, must keep these code points (characters) intact. They must not be treated as malformed or invalid; but an application may treat them as undefined. By Unicode principles, it’s OK to be ignorant of a character, but not completely wrong about it.
This has nothing to do with the meaning of bytes like 0x80 in 8-bit codes like Windows-1252. But if you send e.g. data labeled as ISO-8859-1 encoded (where e.g. 0x80 is in principle U+0080) to a web browser, it will actually treat it as Windows-1252 encoded. The reason is that characters like U+0080 are practically never used in ISO-8859-1 data; occurrence of 0x80 in ISO-8859-1 labeled data is virtually always either windows-1252 mislabeled or messed-up data that cannot be meaningfully processed. So browsers take the practical route and treat ISO-8859-1 as windows-1252; this is being formalized in HTML5 and related specifications.

Subset of Unicode normally used in writing?

What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of all of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the removing characters from MES-1 part was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski

What's the purpose of the noncharacters U+FDD0 to U+FDEF?

U+FFFE needs to be a noncharacter in order to allow the Byte Order Mark to work.
U+FFFF is described in The Unicode Standard as "useful for internal purposes as sentinels". Makes sense.
But I can't figure out, and The Unicode Standard doesn't really explain, why the set of noncharacters includes some random block within "Arabic Presentation Forms-A". What are these for? (Besides the eye of the basilisk?)
OK the question is "what are they for" and "Why are they in the middle of the Arabic Presentation Forms".
There was a need for a block of 32 non-characters "to make additional codes available to programmers to use for internal processing purposes" http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=IWS-Chapter04a#4d3110c8
It was required that it be in the Basic Multilingual Plane (BMP), i.e. 0x0000 to 0xFFFF, so that they could have single-codepoint representations in UTF-16.
There was a block of unused codepoints in the Arabic Presentation Forms block.
It had been agreed not to encode any more Arabic Presentation Forms, so these were never going to be used.
http://www.unicode.org/mail-arch/unicode-ml/y2001-m10/0014.html
Therefore it was agreed that these codepoints, which were never going to be used otherwise, would be designated noncharacters so they could be used internally by applications/programmers.
These noncharacters are for internal use by application and should not be interchanged.
I tried to explain based on what is said in Unicode standard.
Unicode got 66 non-characters. For all 17 planes they have two each, last two code points of the plane ending with FFFE FFFF. 32 other no-characters are continuous block U+FDD0 to U+FDEF.
So total count
17*2 + 32 = 66
Read following text from the unicode chapter 16, which says that its in some random place because of "historic reason", I'm curious but I don't think there is any ambiguity.
For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not
"Arabic noncharacters" or "right-to-left noncharacters," and are not distinguished in any
other way from the other noncharacters, except in their code point values
U+FEFF is BOM and U+FFFE is byte-swapped version of it. But since U+FFFE is a noncharacter, when an interpreting process finds U+FFFE as the first character, it signals either that the process has encountered text that is of the incorrect byte order or that the file is not valid Unicode text, It just gives a signal, not a standard way. It can be either of the one, reverse bytes or a wrong text.
In the Unicode section 3.2 clause C2 says
C2 A process shall not interpret a noncharacter code point as an abstract character.
The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.
So as application developers you are free to use these characters as you wish. They are used as sentinel or delimter or may be some baslik characters, but they should not be interchanged.
Section 16.7 says
In effect, noncharacters can be thought of as application-internal private-use code points.
Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which
are assigned characters and which are intended for use in open interchange, subject to
interpretation by private agreement, noncharacters are permanently reserved (unassigned)
and have no interpretation whatsoever outside of their possible application-internal private uses
Again U+FFFF is not reserved as sentinel by Unicode standard but just given the typical use case. Read in section 16.7
U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being
associated with the largest code unit values for particular Unicode encoding forms. In
UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF16
U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF16
This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For
example, they might be used to indicate the end of a list, to represent a value in an index
guaranteed to be higher than any valid character value, and so on
As mentioned here at xkcd, U+FDD0 is actually the Unicode character for the eye of a basilisk. For (obvious) reasons of personal safety however, the character is not rendered to the screen... :)

What issues would come from treating UTF-16 as a fixed 16-bit encoding?

I was reading a few questions on SO about Unicode and there were some comments I didn't fully understand, like this one:
Dean Harding: UTF-8 is a
variable-length encoding, which is
more complex to process than a
fixed-length encoding. Also, see my
comments on Gumbo's answer: basically,
combining characters exist in all
encodings (UTF-8, UTF-16 & UTF-32) and
they require special handling. You can
use the same special handling that you
use for combining characters to also
handle surrogate pairs in UTF-16, so
for the most part you can ignore
surrogates and treat UTF-16 just like
a fixed encoding.
I've a little confused by the last part ("for the most part"). If UTF-16 is treated as fixed 16-bit encoding, what issues could this cause? What are the chances that there are characters outside of the BMP? If there are, what issues could this cause if you'd assumed two-byte characters?
I read the Wikipedia info on Surrogates but it didn't really make things any clearer to me!
Edit: I guess what I really mean is "Why would anyone suggest treating UTF-16 as fixed encoding when it seems bogus?"
Edit2:
I found another comment in "Is there any reason to prefer UTF-16 over UTF-8?" which I think explains this a little better:
Andrew Russell: For performance:
UTF-8 is much harder to decode than
UTF-16. In UTF-16 characters are
either a Basic Multilingual Plane
character (2 bytes) or a Surrogate
Pair (4 bytes). UTF-8 characters can
be anywhere between 1 and 4 bytes
This suggests the point being made was that UTF-16 would not have any three-byte characters, so by assuming 16bits, you wouldn't "totally screw up" by ending up one-byte off. But I'm still not convinced this is any different to assuming UTF-8 is single-byte characters!
UTF-16 includes all "base plane" characters. The BMP covers most of the current writing systems, and includes many older characters that one can practically encounter. Take a look at them and decide whether you really are going to encounter any characters from the extended planes: cuneiform, alchemical symbols, etc. Few people will really miss them.
If you still encounter characters that require extended planes, these are encoded by two code points (surrogates), and you'll see two empty squares or question marks instead of such a non-character. UTF is self-synchronizing, so a part of a surrogate character never looks like a legitimate character. This allows things like string searches to work even if surrogates are present and you don't handle them.
Thus issues arising from treating UTF-16 as effectively USC-2 are minimal, aside from the fact that you don't handle the extended characters.
EDIT: Unicode uses 'combining marks' that render at the space of previous character, like accents, tilde, circumflex, etc. Sometimes a combination of a diacritic mark with a letter can be represented as a distinct code point, e.g. á can be represented as a single \u00e1 instead of a plain 'a' + accent which are \u0061\u0301. Still you can't represent unusual combinations like z̃ as one code point. This makes search and splitting algorithms a bit more complex. If you somehow make your string data uniform (e.g. only using plain letters and combining marks), search and splitting become simple again, but anyway you lose the 'one position is one character' property. A symmetrical problem happens if you're seriously into typesetting and want to explicitly store ligatures like fi or ffl where one code point corresponds to 2 or 3 characters. This is not a UTF issue, it's an issue of Unicode in general, AFAICT.
It is important to understand that even UTF-32 is fixed-length when it comes to code points, not characters. There are many characters that are composed from multiple code points, and therefore you can't really have a Unicode encoding where one number (code unit) corresponds to one character (as perceived by users).
To answer your question - the most obvious issue with treating UTF-16 as fixed-length encoding form would be to break a string in a middle of a surrogate pair so you get two invalid code points. It all really depends what you are doing with the text.
I guess what I really mean is
"Why would anyone suggest treating
UTF-16 as fixed encoding when it seems
bogus?"
Two words: Backwards compatibility.
Unicode was originally intended to use a fixed-width 16-bit encoding (UCS-2), which is why early adopters of Unicode (e.g., Sun with Java and Microsoft with Windows NT), used a 16-bit character type. When it turned out that 65,536 characters wasn't enough for everyone, UTF-16 was developed in order to allow this 16-bit character systems to represent the 16 new "planes".
This meant that characters were no longer fixed-width, so people created the rationalization that "that's OK because UTF-16 is almost fixed width."
But I'm still not convinced this is
any different to assuming UTF-8 is
single-byte characters!
Strictly speaking, it's not any different. You'll get incorrect results for things like "\uD801\uDC00".lower().
However, assuming UTF-16 is fixed width is less likely to break than assuming UTF-8 is fixed-width. Non-ASCII characters are very common in languages other than English, but non-BMP characters are very rare.
You can use the same special handling
that you use for combining characters
to also handle surrogate pairs in
UTF-16
I don't know what he's talking about. Combining sequences, whose constituent characters have an individual identity, are nothing at all like surrogate characters, which are only meaningful in pairs.
In particular, the characters within a combining sequence can be converted to a different encoding form one characters at a time.
>>> 'a'.encode('UTF-8') + '\u0301'.encode('UTF-8')
b'a\xcc\x81'
But not surrogates:
>>> '\uD801'.encode('UTF-8') + '\uDC00'.encode('UTF-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
UTF-16 is a variable-length encoding. The older UCS-2 is not. If you treat a variable-length encoding like fixed (constant length) you risk introducing error whenever you use "number of 16-bit numbers" to mean "number of characters", since the number of characters might actually be less than the number of 16-bit quantities.
The Unicode standard has changed several times along the way. For example, UCS-2 is not a valid encoding anymore. It has been deprecated for a while now.
As mentioned by user 9000, even in UTF-32, you have sequences of characters that are interdependent. The à is a good example, although this character can be canonicalized to \x00E1. So you can make it simple.
Unicode, even when using the UTF-32 encoding, supports up to 30 code points, one after the other, to represent the most complex characters. (The existing characters do not use that many, I think the longest in existence is currently 17 if I'm correct.)
For that reason, Unicode developed Normalization Forms. It actually considers five different forms:
Unnormalized -- a sequence you create manually, for example; text editors are expected to save properly normalized (NFC) code sequences
NFD -- Normalization Form Decomposition
NFKD -- Normalization Form Compatibility Decomposition
NFC -- Normalization Form Canonical Composition
NFKC -- Normalization Form Compatibility Canonical Composition
Although in most situations it does not matter much because long compositions are rare, even in languages that use them.
And in most cases, your code already deals with canonical compositions. However, if you create strings manually in your code, you are not unlikely to create an unnormalized string (assuming you use such long forms).
Properly implemented servers on the Internet are expected to refused strings that are not canonical compositions as per Unicode. Long forms are also forbidden over connections. For example, the UTF-8 encoding technically allows for ASCII characters to be encoded using 1, 2, 3, or 4 bytes (and the old encoding allowed up to 6 bytes!) but those encoding are not permitted.
Any comment on the Internet that contradicts the Unicode Normalization Form document is simply incorrect.

What is the range of Unicode Printable Characters?

Can anybody please tell me what is the range of Unicode printable characters? [e.g. Ascii printable character range is \u0020 - \u007f]
See, http://en.wikipedia.org/wiki/Unicode_control_characters
You might want to look especially at C0 and C1 control character http://en.wikipedia.org/wiki/C0_and_C1_control_codes
The wiki says, the C0 control character is in the range U+0000—U+001F and U+007F (which is the same range as ASCII) and C1 control character is in the range U+0080—U+009F
other than C-control character, Unicode also has hundreds of formatting control characters, e.g. zero-width non-joiner, which makes character spacing closer, or bidirectional text control. This formatting control characters are rather scattered.
More importantly, what are you doing that requires you to know Unicode's non-printable characters? More likely than not, whatever you're trying to do is the wrong approach to solve your problem.
This is an old question, but it is still valid and I think there is more to usefully, but briefly, say on the subject than is covered by existing answers.
Unicode
Unicode defines properties for characters.
One of these properties is "General Category" which has Major classes and subclasses. The Major classes are Letter, Mark, Punctuation, Symbol, Separator, and Other.
By knowing the properties of your characters, you can decide whether you consider them printable in your particular context.
You must always remember that terms like "character" and "printable" are often difficult and have interesting edge-cases.
Programming Language support
Some programming languages assist with this problem.
For example, the Go language has a "unicode" package which provides many useful Unicode-related functions including these two:
func IsGraphic(r rune) bool
IsGraphic reports whether the rune is defined as a Graphic by Unicode. Such
characters include letters, marks, numbers, punctuation, symbols, and spaces,
from categories L, M, N, P, S, Zs.
func IsPrint(r rune) bool
IsPrint reports whether the rune is defined as printable by Go. Such
characters include letters, marks, numbers, punctuation, symbols, and
the ASCII space character, from categories L, M, N, P, S and the ASCII
space character. This categorization is the same as IsGraphic except
that the only spacing character is ASCII space, U+0020.
Notice that it says "defined as printable by Go" not by "defined as printable by Unicode". It is almost as if there are some depths the wizards at Unicode dare not plumb.
Printable
The more you learn about Unicode, the more you realise how unexpectedly diverse and unfathomably weird human writing systems are.
In particular whether a particular "character" is printable is not always obvious.
Is a zero-width space printable? When is a hyphenation point printable? Are there characters whose printability depends on their position in a word or on what characters are adjacent to them? Is a combining-character always printable?
Footnotes
ASCII printable character range is \u0020 - \u007f
No it isn't. \u007f is DEL which is not normally considered a printable character. It is, for example, associated with the keyboard key labelled "DEL" whose earliest purpose was to command the deletion of a character from some medium (display, file etc).
In fact many 8-bit character sets have many non-consecutive ranges which are non-printable. See for example C0 and C1 controls.
First, you should remove the word 'UTF8' in your question, it's not pertinent (UTF8 is just one of the encodings of Unicode, it's something orthogonal to your question).
Second: the meaning of "printable/non printable" is less clear in Unicode. Perhaps you mean a "graphical character" ; and one can even dispute if a space is printable/graphical. The non-graphical characters would consist, basically, of control characters: the range 0x00-0x0f plus some others that are scattered.
Anyway, the vast majority of Unicode characters (more than 200.000) are "graphical". But this certainly does not imply that they are printable in your environment.
It seems to me a bad idea, if you intend to generate a "random printable" unicode string, to try to include all "printable" characters.
What you should do is pick a font, and then generate a list of which Unicode characters have glyphs defined for your font. You can use a font library like freetype to test glyphs (test for FT_Get_Char_Index(...) != 0).
Taking the opposite approach to #HoldOffHunger, it might be easier to list the ranges of non-printable characters, and use not to test if a character is printable.
In the style of Regex (so if you wanted printable characters, place a ^):
[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF]
Which accounts for things like separator spaces and joiners
Note that unlike their answer which is a whitelist that ignores all non-latin languages, this blacklist wont permit non-printable characters just because they're in blocks with printable characters (their answer wholly includes Non-Latin, Language Supplement blocks as 'printable', even though it contains things like 'zero-width non-joiner'..).
Be aware though, that if using this or any other solution, for sanitation for example, you may want to do something more nuanced than a blanket replace.
Arguably in that case, non-breaking spaces should change to space, not be removed, and invisible separator should be replaced with comma conditionally.
Then there's invalid character ranges, either [yet] unused or reserved for encoding purposes, and language-specific variation selectors..
NB when using regular expressions, that you enable unicode awareness if it isn't that way by default (for javascript it's via /.../u).
You can tell if you have it correct by attempting to create the regular expression with some multi-byte character ranges.
For example, the above, plus the invalid character range \u{E0100}-\u{E01EF} in javascript:
/[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF\u{E0100}-\u{E01EF}]/u
Without u \u{E0100}-\u{E01EF} equates to \uDB40(\uDD00-\uDB40)\uDDEF, not (\uDB40\uDD00)-(\uDB40\uDDEF), and if replacing you should always enable u even when not including multbyte unicode in the regex itself as you might break surrogate pairs that exist in the text.
What characters are valid?
At present, Unicode is defined as starting from U+0000 and ending at U+10FFFF. The first block, Basic Latin, spans U+0000 to U+007F and the last block, Supplementary Private Use Area-B, spans U+100000 to 10FFFF. If you want to see all of these blocks, see here: Wikipedia.org: Unicode Block; List of Blocks.
Let's break down what's valid/invalid in the Latin Block1.
The Latin Block: TLDR
If you're interested in filtering out either invisible characters, you'll want to filter out:
U+0000 to U+0008: Control
U+000E to U+001F: Device (i.e., Control)
U+007F: Delete (Control)
U+008D to U+009F: Device (i.e., Control)
The Latin Block: Full Ranges
Here's the Latin block, broken up into smaller sections...
U+0000 to U+0008: Control
U+0009 to U+000C: Space
U+000E to U+001F: Device (i.e., Control)
U+0020: Space
U+0021 to U+002F: Symbols
U+0030 to U+0039: Numbers
U+003A to U+0040: Symbols
U+0041 to U+005A: Uppercase Letters
U+005B to U+0060: Symbols
U+0061 to U+007A: Lowercase Letters
U+007B to U+007E: Symbols
U+007F: Delete (Control)
U+0080 to U+008C: Latin1-Supplement symbols.
U+008D to U+009F: Device (i.e., Control)
U+00A0: Non-breaking space. (i.e., )
U+00A1 to U+00BF: Symbols.
U+00C0 to U+00FF: Accented characters.
The Other Blocks
Unicode is famous for supporting non-Latin character sets, so what are these other blocks? This is just a broad overview, see the wikipedia.org page for the full, complete list.
Latin1 & Latin1-Related Blocks
U+0000 to U+007F : Basic Latin
U+0080 to U+00FF : Latin-1 Supplement
U+0100 to U+017F : Latin Extended-A
U+0180 to U+024F : Latin Extended-B
Combinable blocks
U+0250 to U+036F: 3 Blocks.
Non-Latin, Language blocks
U+0370 to U+1C7F: 55 Blocks.
Non-Latin, Language Supplement blocks
U+1C80 to U+209F: 11 Blocks.
Symbol blocks
U+20A0 to U+2BFF: 22 Blocks.
Ancient Language blocks
U+2C00 to U+2C5F: 1 Block (Glagolitic).
Language Extensions blocks
U+2C60 to U+FFEF: 66 Blocks.
Special blocks
U+FFF0 to U+FFFF: 1 Block (Specials).
One approach is to render each character to a texture and manually check if it is visible. This solution excludes spaces.
I've written such a program and used it to determine there are roughly 467241 printable characters within the first 471859 code points. I've selected this number because it covers all of the first 4 Planes of Unicode, which seem to contain all printable characters. See https://en.wikipedia.org/wiki/Plane_(Unicode)
I would much like to refine my program to produce the list of ranges, but for now here's what I am working with for anyone who needs immediate answers:
https://editor.p5js.org/SamyBencherif/sketches/_OE8Y3kS9
I am posting this tool because I think this question attracts a lot of people who are looking for slightly different applications of knowing printable ranges. Hopefully this is useful, even though it does not fully answer the question.
The printable Unicode character range, excluding the hex, is 32 to 126 in the int datatype.
Unicode, stict term, has no range. Numbers can go infinite.
What you gave is not UTF8 which has 1 byte for ASCII characters.
As for the range, I believe there is no range of printable characters. It always evolves. Check the page I gave above.