How to calculate URL encoding for characters outside the ASCII character set?

I know that for ASCII characters the URL encoding is just a percent sign followed by the hex number of the character. But for characters outside that range, the encoding consists of two or more %hex-number sequences.
For example, for the character with hex value 56CE, the URL encoding according to the standard .NET/Java APIs is not %56CE but "%e5%9b%8e".
So if we know the hex value of a character outside the ASCII range, how is the URL encoding calculated? In other words, how do e5, 9b, 8e come out of 56CE? I tried converting to binary and did see a pattern for the last two numbers (%9b, %8e), but I have no idea where the %e5 comes from.

You have to encode the Unicode code points into charset bytes first, and then you can URL-encode those bytes. In your example, E5 9B 8E are the UTF-8 encoded bytes of the Unicode code point U+56CE, and %E5%9B%8E is the URL-encoded form of those UTF-8 bytes.
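To make the arithmetic concrete, here is a small sketch in Java (the class name and the choice of language are just for illustration; any language with a UTF-8 encoder works the same way) that derives %E5%9B%8E from U+56CE by hand and then via the standard URLEncoder:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlEncodeDemo {
    public static void main(String[] args) throws Exception {
        // U+56CE in binary:             0101 0110 1100 1110
        // UTF-8 three-byte layout: 1110xxxx 10xxxxxx 10xxxxxx
        // Filling in the 16 bits:  11100101 10011011 10001110 = E5 9B 8E
        String s = "\u56CE";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        System.out.println(sb);                            // %E5%9B%8E
        System.out.println(URLEncoder.encode(s, "UTF-8")); // %E5%9B%8E
    }
}

This also shows where the E5 comes from: the first byte carries the three-byte marker bits 1110 plus only the top four bits of the code point, which is why it does not look like any part of 56CE at first glance.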

Related

UTF-16 big endian to UTF-8 conversion failed for 0xdcf0

I am trying to convert UTF-16 to UTF-8. For the string 0xdcf0, the conversion failed with "invalid multi byte sequence". I don't understand why the conversion fails. In the library I am using for the UTF-16 to UTF-8 conversion, there is a check:
if ((first_byte & 0xfc) == 0xdc) {
    return -1;
}
Can you please help me understand why this check is present?
Unicode characters in the DC00–DFFF range are "low" surrogates, i.e. are used in UTF-16 as the second part of a surrogate pair, the first part being a "high" surrogate character in the range D800–DBFF.
See e.g. Wikipedia article UTF-16 for more information.
The reason you cannot convert to UTF-8, is that you only have half a Unicode code point.
In UTF-16, the two-byte sequence
DCF0
cannot begin the encoding of any character at all.
The way UTF-16 works is that some characters are encoded in 2 bytes and some characters are encoded in 4 bytes. The characters that are encoded with two bytes use 16-bit sequences in the ranges:
0000 .. D7FF
E000 .. FFFF
All other characters require four bytes to be encoded in UTF-16. For these characters the first pair of bytes must be in the range
D800 .. DBFF
and the second pair of bytes must be in the range
DC00 .. DFFF
This is how the encoding scheme is defined. See the Wikipedia page for UTF-16.
Notice that the FIRST sixteen bits of an encoding of a character can NEVER be in DC00 through DFFF. It is simply not allowed in UTF-16. This is, if you follow the bitwise arithmetic in the code you found, exactly what is being checked for.
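As a rough illustration (a Java sketch with made-up names, not the library from the question), this is the same classification the C check performs, alongside the standard library's surrogate tests:

public class SurrogateCheck {
    // In UTF-16BE, a code unit in the low-surrogate range DC00..DFFF has a
    // first byte of 0xDC..0xDF; masking with 0xFC isolates exactly those values.
    static boolean startsWithLowSurrogate(int firstByte) {
        return (firstByte & 0xFC) == 0xDC;
    }

    public static void main(String[] args) {
        System.out.println(startsWithLowSurrogate(0xDC)); // true  -> invalid start, reject
        System.out.println(startsWithLowSurrogate(0xD8)); // false -> high surrogate, must be followed by a low one
        System.out.println(startsWithLowSurrogate(0x4E)); // false -> ordinary BMP character

        // The same ranges, checked via the standard library:
        System.out.println(Character.isLowSurrogate('\uDCF0'));  // true
        System.out.println(Character.isHighSurrogate('\uD83D')); // true
    }
}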

Is the ® symbol a 3-byte or 4-byte Unicode character? How can I tell?

Also known as \xAE
A Unicode character as such does not have any length in bytes. It is the character encoding that matters. You know the length of a character in bytes in a specific encoding from the definition of the encoding.
For example, in the ISO-8859-1 (ISO Latin 1) encoding, which encodes just a small subset of Unicode, including “®”, every character is 1 byte long.
In the UTF-16 encoding, all characters are either 2 or 4 bytes long, and characters in the range U+0000...U+FFFF, such as “®”, are 2 bytes long.
In the UTF-32 encoding, all characters are 4 bytes long.
In the UTF-8 encoding, characters take 1 to 4 bytes. A simple way to check this out is to use the Fileformat.info Character search (though this is not normative information, just a nice quick reference). E.g., the page about U+00AE shows the character in some encodings, including 0xC2 0xAE (that is, 2 bytes) in UTF-8.
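If you would rather check programmatically than look it up, a short sketch (Java here purely for illustration; note that UTF-32 is not among the guaranteed charsets, so its availability depends on your runtime) prints the length of "®" in each of the encodings mentioned above:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {
    public static void main(String[] args) {
        String reg = "\u00AE"; // ®
        System.out.println(reg.getBytes(StandardCharsets.UTF_8).length);      // 2
        System.out.println(reg.getBytes(StandardCharsets.UTF_16BE).length);   // 2
        System.out.println(reg.getBytes(StandardCharsets.ISO_8859_1).length); // 1
        // UTF-32 is not in StandardCharsets; assuming the runtime provides it:
        System.out.println(reg.getBytes(Charset.forName("UTF-32BE")).length); // 4
    }
}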
It is Unicode code point U+00AE. It's in the range [0x80, 0x7FF], so in UTF-8 it will be encoded as two bytes; the table at the top of the Wikipedia article explains this in more detail*.
If you were using UTF-16 it would also be two bytes, since U+00AE is in the Basic Multilingual Plane and needs no surrogate pair.
(* my summary though: one of the features of UTF-8 is that you can jump midway into a byte stream and synchronise with the text without generating any spurious characters, because you can tell whether any byte is a continuation character without further context.
An unavoidable side effect is that only the 7-bit ASCII characters fit into a single byte and everything else takes multiple bytes. 0xae is sufficiently close to the 7-bit range to require only one extra byte. See Wikipedia for specifics.)
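A minimal sketch of that self-synchronisation property (illustrative Java, not tied to any particular library): any byte of the form 10xxxxxx is a continuation byte, and everything else starts a character.

public class Utf8Sync {
    // A UTF-8 continuation byte always has the bit pattern 10xxxxxx.
    static boolean isContinuationByte(int b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of "®" is C2 AE: one lead byte plus one continuation byte.
        int[] bytes = {0xC2, 0xAE};
        for (int b : bytes) {
            System.out.printf("%02X -> %s%n",
                    b, isContinuationByte(b) ? "continuation" : "lead/ASCII");
        }
    }
}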

Are Unicode and ASCII characters the same?

What exactly are Unicode character codes? And how are they different from ASCII characters?
Unicode is a way to assign unique numbers (called code points) to characters from nearly all languages in active use today, plus many other characters such as mathematical symbols. There are many ways to encode Unicode strings as bytes, such as UTF-8 and UTF-16.
ASCII assigns values only to 128 characters (a-z, A-Z, 0-9, space, some punctuation, and some control characters).
For every character that has an ASCII value, the Unicode code point and the ASCII value of that character are the same.
In most modern applications you should prefer to use Unicode strings rather than ASCII. This will for example allow you to have users with accented characters in their name or address, and to localize your interface to languages other than English.
The first 128 Unicode code points are the same as ASCII. Beyond that, Unicode has a hundred thousand or so more.
There are two common encodings for Unicode: UTF-8, which uses 1 to 4 bytes for each code point (so for the first 128 characters, UTF-8 is exactly the same as ASCII), and UTF-16, which uses 2 or 4 bytes.
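To see that overlap concretely, here is a short Java sketch (illustrative only) comparing the bytes of a pure-ASCII string in both encodings:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiVsUtf8 {
    public static void main(String[] args) {
        String ascii = "Hello, world!"; // only code points below 128
        byte[] asAscii = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] asUtf8  = ascii.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(asAscii, asUtf8)); // true: identical bytes

        String accented = "café"; // é (U+00E9) is outside ASCII
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length); // 5 bytes for 4 characters
    }
}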

NSString unicode encoding problem

I'm having problems converting the string to something readable. I'm using
NSString *substring = [NSString stringWithUTF8String:[symbol.data cStringUsingEncoding:NSUTF8StringEncoding]];
but I can't convert \U7ab6\U51b1 into a '.
It shows as 窶冱, which is what I don't want; it should show as an '. Can anyone help me?
(More precisely, the character it should show is a ’.)
That's character U+2019 RIGHT SINGLE QUOTATION MARK.
What has happened is you've had the character sequence ’s submitted to you, in the UTF-8 encoding, which comes out as bytes:
’ -> E2 80 99
s -> 73
That byte sequence has then, incorrectly, been interpreted as if it were encoded in Windows code page 932 (Japanese; more or less Shift-JIS):
E2 80 -> 窶
99 73 -> 冱
So in this one particular case, you could recover the ’s string by firstly encoding the characters into cp932 bytes, and then decoding those bytes back to characters using UTF-8.
However, this will not solve your real problem, which is that the strings were read in incorrectly in the first place. You got 窶冱 in this case because the UTF-8 byte sequence resulting from encoding ’s happened also to be a valid Shift-JIS byte sequence. But that won't be the case for all possible UTF-8 byte sequences you might get. Many other characters will be unrecoverably mangled.
You need to find where bytes are being read into the system and decoded as Shift-JIS, and fix that to use UTF-8 instead.
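For what it's worth, the round trip described above can be reproduced outside Objective-C as well. Here is a Java sketch (assuming your runtime ships the windows-31j charset, Java's name for code page 932) showing both the corruption and the one-off recovery:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        Charset cp932 = Charset.forName("windows-31j"); // Windows code page 932

        // Reproduce the bug: UTF-8 bytes of "’s" (E2 80 99 73) decoded as cp932.
        byte[] utf8Bytes = "\u2019s".getBytes(StandardCharsets.UTF_8);
        String mangled = new String(utf8Bytes, cp932);
        System.out.println(mangled);   // 窶冱

        // One-off recovery: encode back to cp932 bytes, then decode them as UTF-8.
        String recovered = new String(mangled.getBytes(cp932), StandardCharsets.UTF_8);
        System.out.println(recovered); // ’s
    }
}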

utf8 and encoding

I have a string that in Unicode is "hao123--我的上网主页", while as a UTF-8 string in C++ it is "hao123锛嶏紞鎴戠殑涓婄綉涓婚〉", but I need to write it to a file in this format: "hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". How can I do that? I know little about these encodings. Can anyone help? Thanks!
You seem to mix up UTF-8 and UTF-16 (or possibly UCS-2). UTF-8 encoded characters have a variable length of 1 to 4 bytes. Contrary to this, you seem to want to write UTF-16 or UCS-2 to your files (I am guessing this from the \uxxxx character references in your file output string).
For an overview of these character sets, have a look at Wikipedia's article on UTF-8 and browse from there.
Here are some of the very basic basics (heavily simplified):
UCS-2 stores all characters as exactly 16 bits. It therefore cannot encode all Unicode characters, only the so-called "Basic Multilingual Plane".
UTF-16 stores the most frequently-used characters in 16 bits, but some characters must be encoded in 32 bits.
UTF-8 encodes characters with a variable length of 1 to 4 bytes. Only characters from the original 7-bit ASCII charset are encoded as 1 byte.
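As a sketch of the escaping step itself (shown in Java for brevity, with an illustrative helper name; the same loop translates directly to C++ once the text is held as UTF-16 code units), assuming the string has already been decoded correctly: keep ASCII as it is and write every other code unit as a backslash-u escape.

public class UnicodeEscape {
    // Convert a string to the escaped form from the question: ASCII characters
    // pass through unchanged; every other UTF-16 code unit is written out as a
    // backslash-u escape with four hex digits (code points outside the BMP
    // therefore produce two escapes, one per surrogate).
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875";
        System.out.println(toUnicodeEscapes(input));
        // prints hao123 followed by one backslash-u escape per remaining character
    }
}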