Understanding binary String delimiters - encoding

I am confused about the difference between encodings that are
represented with \x, like \x68\x65\x6c\x6c\x6f vs.
ones using \u, such as \u0068\u0065\u006c\u006c\u006f.
I've been playing around with https://convertcodes.com/unicode-converter-encode-decode-utf/ and it seems like UTF-16 uses \u and UTF-8 uses \x, but other sources say that \x is not specific to UTF-8 and \u is not specific to UTF-16. What is the difference? Can both encodings use both of these delimiters? Furthermore, is the title of this question even correct: can these be referred to as binary delimiters? Are the example strings (\x68\x65\x6c\x6c\x6f and \u0068\u0065\u006c\u006c\u006f) considered binary Strings, BLOBs, or something else? What is the proper name for these types of Strings?

Everything depends entirely on who interprets it, which means a minimum of context is always implied:
JSON only knows \u (not bound to a specific UTF encoding), always wants exactly 4 hex digits for it, doesn't know \x, and String literals must be enclosed in "double quotation marks".
PHP knows \x (expecting 1 or 2 hex digits) and \u{...} (a codepoint in braces, emitted as UTF-8), and only when the String literal uses "double quotation marks".
MySQL knows neither of those escape sequences, and String literals can be in either 'single quotation marks' or "double quotation marks". Hexadecimal values must be written as separate hexadecimal literals (x'68656c6c6f' or 0x68656c6c6f), which are not bound to any particular encoding.
C++ knows \x (greedily consuming as many hex digits as follow it), \u (expecting 4 digits) and \U (expecting 8 digits), which in conjunction with the String literal's prefix (u8, u, U, L) then ends up in a different encoding. String literals are always in "double quotation marks", single character literals always in 'single quotation marks'.
Perl regular expressions know \x (expecting 1 or 2 digits, or any number inside braces as \x{...}) and \N (expecting a codepoint or character name). Different RegEx flavors have different support; some also accept an \x with 4 digits. Mostly \x is bound to the input encoding (sometimes implied with the u modifier for UTF-8).
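To make the point concrete, here is a minimal Python 3 sketch (added for illustration, not part of the original answer; it uses only built-in string literals): the escape form is a property of the string-literal syntax, and the encoding is only chosen when the string is serialized.
# Python 3: \x takes exactly 2 hex digits, \u exactly 4 - both denote code points,
# not bytes of a particular encoding.
s1 = "\x68\x65\x6c\x6c\x6f"
s2 = "\u0068\u0065\u006c\u006c\u006f"
print(s1 == s2 == "hello")        # True: same code points either way
print(s1.encode("utf-8"))         # b'hello'
print(s1.encode("utf-16-be"))     # b'\x00h\x00e\x00l\x00l\x00o'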
See also: what is a String literal.

Related

Mapping arbitrary unicode alphanumeric characters to their ascii equivalents

When I encounter an arbitrary unicode string, such as in a hashtag, I would like to express only its alphanumeric components in a string of their ascii equivalents. For example,
x='€𝙋𝙖𝙩𝙧𝙞𝙤𝙩'
would be rendered as
x='Patriot'
Since I cannot anticipate the unicode that could appear in such strings, I would like the method to be as general as possible. Any suggestions?
The unicodedata.normalize method can translate Unicode code points to a canonical value. Then, run the value through ascii encoding ignoring non-ASCII values for a byte string, then back through ascii decode to get a Unicode string again:
>>> import unicodedata as ud
>>> x='€𝙋𝙖𝙩𝙧𝙞𝙤𝙩'
>>> ud.normalize('NFKC',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'
If you need to remove accents from letters but still keep the base letter, use 'NFKD' instead.
>>> x='€𝙋𝙖𝙩𝙧𝙞ô𝙩'
>>> ud.normalize('NFKD',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'
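If you need this in several places, the same idea can be wrapped in a small helper (a sketch; the name ascii_fold is just a suggestion, not from the original answer):
import unicodedata

def ascii_fold(text):
    # NFKD splits compatibility characters and accented letters into base characters
    # plus combining marks; the ascii/ignore round trip then drops everything
    # that has no ASCII equivalent.
    return unicodedata.normalize('NFKD', text).encode('ascii', errors='ignore').decode('ascii')

print(ascii_fold('€𝙋𝙖𝙩𝙧𝙞ô𝙩'))   # 'Patriot'
Note that characters without any compatibility decomposition, such as €, are simply dropped rather than transliterated.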

Formal definition of a unicode string

I am trying to understand what a "Unicode string" is, and the more I read the Unicode standard the less I understand it. Let's start from a definition coming from the standard.
A unicode scalar value is any integer between 0x0 and 0xD7FF inclusive, or between 0xE000 and 0x10FFFF inclusive (D76, p. 119).
My feeling was that a unicode string is a sequence of unicode scalar values. I would define a UTF-8 unicode string as a sequence of unicode scalar values encoded in UTF-8. But I am not sure that it is the case. Here is one of the many definitions we can see in the standard.
"Unicode string: A code unit sequence containing code units of a particular Unicode encoding form" (D80, p:120)
But to me this definition is very fuzzy. Just to understand how bad it is, here are a few other "definitions" or strange things in the standard.
(p: 43) "A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units."
According to this definition, any sequence of uint8 is a valid UTF-8 string. I would rule out this definition as it would accept anything as a unicode string!!!
(p: 122) "Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form. For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is."
I would rule out this definition as it would be impossible to define a sequence of unicode scalar values for a unicode string encoded in UTF-16, since this definition would allow surrogate pairs to be cut!!!
For a start, let's look for a clear definition of a UTF-8 unicode string. So far, I can propose 3 definitions, but the real one (if there is one) might be different:
(1) Any array of uint8
(2) Any array of uint8 that comes from a sequence of unicode scalar values encoded in UTF-8
(3) Any subarray of an array of uint8 that comes from a sequence of unicode scalar values encoded in UTF-8
To make things concrete, here are a few examples:
[ 0xFF ] would be a UTF-8 unicode string according to definition 1, but not according to definitions 2 and 3, as the byte 0xFF can never appear in a sequence of code units that comes from a UTF-8 encoded unicode scalar value.
[ 0xB0 ] would be a UTF-8 unicode string according to definition 3, but not according to definition 2, since on its own it is only a continuation (trailing) byte of a multi-byte sequence (for example, the second byte of 0xC2 0xB0).
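(Both examples can be checked quickly; a Python 3 sketch added here for illustration, using only the standard codecs:)
# Neither byte is valid UTF-8 on its own:
for raw in (b"\xff", b"\xb0"):
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(raw, "is not valid UTF-8 on its own:", exc.reason)

# 0xB0 does occur in valid UTF-8, but only as a continuation byte,
# e.g. as the second byte of the degree sign:
print("°".encode("utf-8"))   # b'\xc2\xb0'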
I am just lost with this "standard". Do you have any clear definition?
My feeling was that a unicode string is a sequence of unicode scalar values.
No, a Unicode string is a sequence of code units. The standard doesn't contain "many definitions", but only a single one:
D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
This doesn't require the string to be well-formed (see the following definitions). None of your other quotes from the standard contradict this definition. To the contrary, they only illustrate that a Unicode string, as defined by the standard, can be ill-formed.
An application shall only create well-formed strings, of course:
If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence.
But the standard also contains some sections on how to deal with ill-formed input sequences.
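The surrogate example quoted from the standard can be reproduced directly (a Python 3 sketch, added for illustration; it uses only the built-in UTF-16 codec):
part1 = b"\x00\x4d\xd8\x00"   # UTF-16BE code units <004D D800>
part2 = b"\xdf\x02\x00\x4d"   # UTF-16BE code units <DF02 004D>

# Each part on its own is an ill-formed UTF-16 sequence (an unpaired surrogate)...
for part in (part1, part2):
    try:
        part.decode("utf-16-be")
    except UnicodeDecodeError as exc:
        print("ill-formed:", exc.reason)

# ...but the concatenation is well-formed: D800 DF02 is a valid surrogate pair.
print((part1 + part2).decode("utf-16-be"))   # 'M\U00010302M'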

How to print escaped hexadecimal in a string in C++?

I have questions related to Unicode and to printing escaped hexadecimal values in a const char*.
From what I have understood, UTF-8 includes 2-, 3- or 4-byte characters, ranging from the pound symbol to Kanji characters. Within strings these are represented as hexadecimal values using \u as an escape sequence. Also, I have understood that while using a hexadecimal escape in a string, any following characters whose values can belong to the escape will be included in it. For example, "abc\x0f0dab" will treat 0f0dab as part of the \x escape even though you only want 0f0d to be considered.
Now while writing a Unicode string, say you want to write "abc𤭢def£ghi", where the Unicode for 𤭢 is 0x24B62 and for £ is 0x00A3. So I will have to compose the string as "abc0x24B62def0x00A3ghi". The 0x will consider all values that can be included in it. So if you want to print "abc𤭢62" the string will be "abc0x24B6262". Won't the entire string be taken as a 4-byte Unicode value (0x24B6262) considered within 0x? How do I solve this? How do I print "abc𤭢62" and not abc(0x24B6262)?
I have a string const char* tmp = "abc\x0fdef";. When I print using printf("\n string = %s", tmp); then it will print abcdef. Where is 0f here? I know the decimal value of \x0f will be stored in the string, i.e. 15, so when we try to print, 15 should be printed right? I mean, it should be "abc15def" but it prints only "abcdef".
From reading your post, I think you may be unfamiliar with the concept of encodings.
For instance, you say "unicode of ... £ is 0x00A3". That is true - the Unicode codepoint U+00A3 is the pound sign. But 0x00A3 is not how you represent the pound sign in, for example, UTF-8 (a particularly common encoding of Unicode). Take a look here to see what I mean. As you can see, the UTF-8 encoding of U+00A3 is the two bytes 0xc2, 0xa3 (in that order).
There are several things that happen between your call to printf() and when something appears on your screen.
First, your program runs the code printf("abc\x0fdef"), and that means that the following bytes, in order, are written to stdout by your program:
0x61, 0x62, 0x63, 0x0f, 0x64, 0x65, 0x66
(Strictly speaking, a C compiler treats \x0fdef as one hex escape, because 'd', 'e' and 'f' are themselves hex digits - the greedy behaviour you noticed yourself. To actually get the bytes above you would split the literal as "abc\x0f" "def", so the escape stops after the two digits.)
Note: I'm assuming your source code is ASCII (or UTF-8), which is very common. Technically, the interpretation of your source code's character set is implementation-defined, I believe.
Now, in order to see output, you will typically be running this program inside some kind of shell, and it has to eventually transform those bytes into visual characters. It does this by using an encoding. Again, something ASCII-compatible is common, such as UTF-8. On Windows, CP1252 is common.
And if that is the case, you get the following mapping:
0x61 - a
0x62 - b
0x63 - c
0x0f - the 'shift in' ASCII control code
0x64 - d
0x65 - e
0x66 - f
This prints out as "abcdef" because the 'shift in' control code is a non-printing character.
Note: The above can change depending on what exact character sets are involved, but ASCII or UTF-8 is very likely what you're dealing with unless you have an exotic setup.
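To make the "it depends on the encoding" point concrete, here is how the same two bytes look when decoded two different ways (a Python 3 sketch, added only as an illustration):
raw = b"\xc2\xa3"               # the two bytes of the UTF-8 encoding of U+00A3
print(raw.decode("utf-8"))      # £   - interpreted as UTF-8
print(raw.decode("cp1252"))     # Â£  - the same bytes as a CP1252 terminal would show them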
If you have a UTF-8 compatible terminal, the following should print out "abc£def", just as an example to get you started:
printf("abc\xc2\xa3" "def");
(The literal is split so that the \xa3 escape is not extended by the 'd', 'e' and 'f' that follow it.)
Make sense?
Update: To answer the question from your comment: you need to distinguish between a codepoint and the byte values for an encoding of that codepoint.
The Unicode standard defines 'codepoints', which are numerical values for characters. These are commonly written as U+XYZ where XYZ is a hexadecimal value.
For instance, the character U+219e is LEFTWARDS TWO HEADED ARROW.
This might also be written 0x219e. You would know from context that the writer is talking about a codepoint.
When you need to encode that codepoint (to print, or save to file, etc), you use an encoding, such as UTF-8. Note, if you used, for example, the UTF-32 encoding, every codepoint corresponds exactly to the encoded value. So in UTF-32, the codepoint U+219e would indeed be encoded simply as 0x219e. But other encodings will do things differently. UTF-8 will encode U+219e as the three bytes 0xE2 0x86 0x9E.
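If you want to double-check such byte sequences, any interpreter with Unicode support will do; for example, in Python 3 (added here only as a quick verification aid, not part of the C++ answer):
print("\u219e".encode("utf-8"))      # b'\xe2\x86\x9e'  - three bytes for U+219E
print("\u219e".encode("utf-32-be"))  # b'\x00\x00!\x9e' - UTF-32 stores the code point value directly
print("\U00024b62".encode("utf-8"))  # b'\xf0\xa4\xad\xa2' - the four UTF-8 bytes of U+24B62 from the question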
Lastly, the \x notation is simply how you write arbitrary byte values inside a C/C++ quoted string. If I write, in C source code, "\xff", then that string in memory will be the two bytes 0xff 0x00 (since it automatically gets a null terminator).

Are Unicode and Ascii characters the same?

What exactly are unicode character codes? And how are they different from ascii characters?
Unicode is a way to assign unique numbers (called code points) to characters from nearly all languages in active use today, plus many other characters such as mathematical symbols. There are many ways to encode Unicode strings as bytes, such as UTF-8 and UTF-16.
ASCII assigns values only to 128 characters (a-z, A-Z, 0-9, space, some punctuation, and some control characters).
For every character that has an ASCII value, the Unicode code point and the ASCII value of that character are the same.
In most modern applications you should prefer to use Unicode strings rather than ASCII. This will for example allow you to have users with accented characters in their name or address, and to localize your interface to languages other than English.
The first 128 Unicode code points are the same as ASCII. Beyond those, Unicode defines another 100,000 or so.
There are two common encodings for Unicode: UTF-8, which uses 1-4 bytes per code point (so for the first 128 characters, UTF-8 is exactly the same as ASCII), and UTF-16, which uses 2 or 4 bytes.
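A quick illustration of that overlap (a Python 3 sketch, added for concreteness):
print(ord("A"))                 # 65 - the ASCII value and the Unicode code point U+0041 are the same
print("A".encode("utf-8"))      # b'A' - a single byte, identical to ASCII
print(hex(ord("€")))            # 0x20ac - the euro sign is outside ASCII entirely
print("€".encode("utf-8"))      # b'\xe2\x82\xac' - three bytes in UTF-8
print("€".encode("utf-16-be"))  # the two bytes 0x20 0xAC in UTF-16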

Is Encoding the same as Escaping?

I am interested in the theory of whether Encoding is the same as Escaping. According to Wikipedia:
an escape character is a character which invokes an alternative interpretation on subsequent characters in a character sequence.
My current thought is that they are different. Escaping is when you place an escape character in front of a metacharacter (or metacharacters) to mark it as behaving differently than it normally would.
Encoding, on the other hand, is all about transforming data into another form, and upon wanting to read the original content it is decoded back to its original form.
Escaping is a subset of encoding: you only encode certain characters, by prefixing them with a special character, instead of transforming (typically all or many) characters into another representation.
Escaping examples:
In an SQL statement: ... WHERE name='O\' Reilly'
In the shell: ls Thirty\ Seconds\ *
Many programming languages: "\"Test\" string" (or """Test""" in languages that double the quote character)
Encoding examples:
Replacing < with &lt; when outputting user input in HTML
The character encoding, like UTF-8
Using sequences that do not include the desired character, like \u0061 for a
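For instance (a Python 3 sketch of the examples above, added for illustration; html.escape is the standard-library entity encoder):
import html

print(html.escape("if a < b"))   # 'if a &lt; b' - the metacharacter is encoded as an entity
print("He said \"hi\"")          # He said "hi"  - the quote is escaped so it does not end the literal
print("é".encode("utf-8"))       # b'\xc3\xa9'   - a character encoding maps every character to bytes
print("\u0061")                  # a             - an escape sequence that avoids typing the character itself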
They're different, and I think you're getting the distinction correctly.
Encoding is when you transform a logical representation of text (a "logical string", e.g. Unicode) into a well-defined sequence of binary digits (a "physical string", e.g. ASCII, UTF-8, UTF-16). Escaping is the use of a special character (typically the backslash, '\') which initiates a different interpretation of the character(s) following it; escaping is necessary when you need to encode a larger number of symbols into a smaller number of distinct (and finite) bit sequences.
They are indeed different.
You pretty much got it right.