The usual method of URL-encoding a unicode character is to split it into 2 %HH codes. (\u4161 => %41%61)
But, how is unicode distinguished when decoding? How do you know that %41%61 is \u4161 vs. \x41\x61 ("Aa")?
Are 8-bit characters, that require encoding, preceded by %00?
Or, is the point that unicode characters are supposed to be lost/split?
According to Wikipedia:
Current standard
The generic URI syntax mandates that new URI schemes
that provide for the representation of
character data in a URI must, in
effect, represent characters from the
unreserved set without translation,
and should convert all other
characters to bytes according to
UTF-8, and then percent-encode those
values. This requirement was
introduced in January 2005 with the
publication of RFC 3986. URI schemes
introduced before this date are not
affected.
Not addressed by the current
specification is what to do with
encoded character data. For example,
in computers, character data manifests
in encoded form, at some level, and
thus could be treated as either binary
data or as character data when being
mapped to URI characters. Presumably,
it is up to the URI scheme
specifications to account for this
possibility and require one or the
other, but in practice, few, if any,
actually do.
Non-standard implementations
There exists a non-standard encoding
for Unicode characters: %uxxxx, where
xxxx is a Unicode value represented as
four hexadecimal digits. This behavior
is not specified by any RFC and has
been rejected by the W3C. The third
edition of ECMA-262 still includes an
escape(string) function that uses this
syntax, but also an encodeURI(uri)
function that converts to UTF-8 and
percent-encodes each octet.
So, it looks like it's entirely up to the person writing the unencode method... Aren't standards fun?
What I've always done is first UTF-8 encode a Unicode string to make it a series of 8-bit characters before escaping any of those with %HH.
P.S. - I can only hope the non-standard implementations (%uxxxx) are few and far between.
Since URIs were introduced before Unicode was around, or at least in wide use, I imagine this is a very implementation-specific question. UTF-8 encoding your text, then escaping that as normal, sounds like the best idea, since that's completely backwards compatible with any ASCII/ANSI systems in place, though you might get the odd weird character or two.
On the other end, to decode, you'd unescape your text and get a UTF-8 string. If someone using an older system tries to send you some data in ASCII/ANSI, there's no harm done; that's (almost) UTF-8 encoded already.
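A minimal sketch of the "UTF-8 first, then %HH" approach described above, using Python's standard urllib.parse module. The sample character is the one from the question; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

# U+4161 is first encoded as UTF-8 (three bytes), then each byte is escaped.
encoded = quote("\u4161")

# Decoding reverses the steps: unescape to bytes, then read them as UTF-8.
decoded = unquote(encoded)

# Under this scheme %41%61 is always the two ASCII characters "Aa",
# never the single character U+4161.
ascii_pair = unquote("%41%61")

print(encoded, decoded == "\u4161", ascii_pair)
```

So the ambiguity in the question does not arise as long as both ends agree that the escaped bytes are UTF-8.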
Is ASCII an (encoded) character set or an encoding? Some sources say it's a (7-bit) encoding, others say it's a character set.
What's correct?
It's an encoding, that only supports a certain set of characters.
Once upon a time, when computers or operating systems would often only support a single encoding it was sensible to refer to the set of characters it supported as a character set for obvious enough reasons.
From 1963 on, ASCII was a commonly-supported character set, and many other character sets were either variations on it, or 8-bit extensions of it.
But as well as defining a set of characters, it also assigned numerical values, so it was a coded character set.
And since it assigns a number to each character, it also provides a way to store those characters in sequences of bytes, as long as the byte size is 7 bits or more; hence it also defined an encoding.
So ASCII was used both to refer to the set of characters it supported, and the encoding rules by which those characters would be stored digitally.
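The distinction between the coded character set (character to number) and the encoding (number to bytes) can be seen in a short Python sketch. The variable names are illustrative only.

```python
# Coded character set: each character is assigned a number.
code_point = ord("A")

# Encoding: that number is stored as a sequence of bytes.
encoded = "A".encode("ascii")

# Characters outside the ASCII repertoire cannot be encoded at all.
try:
    "é".encode("ascii")
    outside_ok = True
except UnicodeEncodeError:
    outside_ok = False

print(code_point, encoded, outside_ok)
```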
These days most computers use the Universal Character Set. While there are encodings (UTF-8 and UTF-16 being the most prevalent) that can encode the entire UCS, there remain some uses for legacy encodings like ASCII that can only encode a small number.
So, ASCII can refer both to an encoding and the set of characters it supports, but in remaining modern use (especially in cases where an escape mechanism allows for other characters to be indirectly represented, such as character entity references) it's mostly referred to as an encoding. Conversely, though, character set (or the abbreviation charset) is sometimes still used to refer to encodings. So in common parlance the two are synonyms, as unfortunate (and technically inaccurate) as that may be.
You could say that ASCII is a character set that has two encodings: a 7-bit one called ASCII and an 8-bit one called ASCII.
The 7-bit one was sometimes paired with a parity bit scheme when text was sent over unreliable transports. Today, error detection and correction is handled on a separate layer so only the 8-bit encoding is used.
Terms change over time as concepts evolve and convolve. "Character" is currently a very ambiguous term. People often mean grapheme when they say character. Or they mean a particular data type in a specific language.
"ASCII" is a genericized brand and leads to a lot of confusion. The ASCII that I've described above is only used in very specialized contexts.
It looks like your question can currently not be answered correctly as "character set" is not defined properly.
https://en.wikipedia.org/wiki/Category:Character_sets
The category of character sets includes articles on specific character encodings (see the article for a precise definition, and for why the term "character set" should not be used).
Edit: in my opinion ASCII can only be seen as an encoding, or better, a code page. See for example Microsoft's listing of code pages:
20127 us-ascii
65001 utf-8
When using UTF-8, which character reference is better, or more widely supported worldwide on various browsers... using decimal references or hex references?
UPDATE
For instance, for replacing quotation marks...
&#34; or &#x22;
which one is better to use, and why?
All HTML entities use only the ASCII subset, so the fact that you encode your document in UTF-8, as opposed to any other byte oriented encoding which extends ASCII, is unrelated.
Anyway:
When using UTF-8, you can just copy and paste the relevant characters into the document, without references at all. E.g. StackOverflow does not convert this ⫅ to an entity (see the source of this page).
If you prefer using entities, then I would use the hex references, purely because this is the way Unicode code points are usually written in the charts. References are so widely supported that I do not think you will hit a compatibility problem with either hex or decimal references.
There is no functional difference between decimal references and hexadecimal references. Old browsers did not support the latter, but then we are talking about really old browsers like Netscape 4 and IE 4.
Hexadecimal references are usually more handy, because in character code standards and other reference works, characters are referred to by their code numbers in hexadecimal. Using them, you avoid the conversion from hexadecimal to decimal (and thereby may avoid some mistakes).
There is no reason to use either &#34; or &#x22; in text. (In attribute values, they, or &quot;, are needed in rare cases.)
This does not depend on the document encoding (UTF-8 or something else), except in the sense that when using UTF-8, you do not need the references (except for the markup-significant characters < and &). UTF-8 lets you enter any character as such, though you might still use references if you find that more comfortable than finding an editor that lets you enter the characters themselves.
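A quick way to check that the decimal and hexadecimal forms of a reference are interchangeable is to decode both. This sketch uses Python's standard html module and the quotation mark (U+0022, decimal 34) as the sample character.

```python
import html

# Decimal, hexadecimal and named references to the same character.
decimal_ref = html.unescape("&#34;")
hex_ref = html.unescape("&#x22;")
named_ref = html.unescape("&quot;")

# The rare case where a reference is needed: a quote inside an attribute value.
attribute_value = html.escape('say "hi"', quote=True)

print(decimal_ref == hex_ref == named_ref, attribute_value)
```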
I have a question as to how programs parse strings if they do not a priori know the encoding that is used.
As I understand it, the UTF-8 encoding stores ASCII characters with 1 byte, and all other characters with up to as many as 6 (I think it's 6) bytes. Thus, for example, two spaces would be stored in memory as 0x2020.
How then would a program be able to tell the difference between this string and the string 0x2020 encoded using UTF-16, which corresponds to a single character that evidently looks like the symbol sometimes used to denote the adjoint of an operator in mathematics (I just looked that up here)?
It seems as if the parser would always have to know the encoding of a string beforehand. If so, how is this implemented in practice? Is there a byte preceding each string which tells the parser what encoding is used or something?
In general, it is not possible to know for certain the exact encoding used based solely on the stream of bytes that can represent text. However, if there is a byte order mark somewhere, you can use it at least as a hint as to what encoding is being used.
But with no hints or some kind of contract/exchange of metadata between the producer and consumer of the text, you can't be 100% sure. You can try using a heuristic, but then you get these kinds of problems if you end up guessing wrong.
If you want to be really sure, set up some kind of protocol or contract between the producer and the consumer of the text so that both the text and the encoding scheme are known. You can hardcode the encoding scheme (for example, your program may parse UTF-8 and only UTF-8), or ensure the producer of the text always prepends a byte order mark or specially designed header bytes to communicate the encoding scheme.
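The ambiguity from the question is easy to reproduce. In this Python sketch the same two bytes decode without error under both encodings, which is why the bytes alone cannot settle the matter.

```python
raw = b"\x20\x20"

# Read as UTF-8: two space characters.
as_utf8 = raw.decode("utf-8")

# Read as big-endian UTF-16: the single character U+2020 (DAGGER).
as_utf16 = raw.decode("utf-16-be")

print(repr(as_utf8), repr(as_utf16))
```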
Does the language always store strings in a certain encoding so that
the display function could safely assume that the string was encoded,
say, using UTF-8?
It depends on the language.
In C#, yes. A char is defined by the language specification (8.2.1) as a UTF-16 code unit, and thus a string is always UTF-16. Just like Java.
In Ruby 1.9, a string is a byte array with an associated Encoding.
But in pre-Unicode languages like C (and badly-designed post-Unicode languages like PHP), a string is just a byte array with no encoding information. You have to rely on convention. It's a real interesting experience to write a program that uses both a library that assumes UTF-8 strings and another that assumes windows-1252 strings.
A question that's equally relevant to all languages is: How do you determine the encoding of a byte array that contains encoded text? There are several different approaches:
Encoding declarations.
In protocols that use MIME types (notably, SMTP and HTTP), you can declare Content-Type: text/html; charset=UTF-8. In HTML, you can use <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> or the newer <meta charset="UTF-8">. In XML, there's <?xml version="1.0" encoding="UTF-8"?>. In Python source code, there's # -*- coding: UTF-8 -*-.
Unfortunately, such declarations aren't always accurate. And they aren't available at all for locally-stored plain .txt files, so then a different approach must be used.
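As a rough illustration of reading such a declaration, the charset parameter of a Content-Type value can be extracted with Python's standard email package. The helper name declared_charset is made up for this sketch.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value it finds.
    return msg.get_content_charset()


print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/plain"))
```

A missing parameter comes back as None, which is the point at which one of the other approaches has to take over.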
Byte-order mark (BOM)
Putting the special character U+FEFF at the beginning of a file lets you distinguish between the various UTF encodings.
But it's not usable for legacy encodings like ISO-8859-x or Windows-125x, and not always used with UTF-8.
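A minimal BOM sniffer might look like the following Python sketch. The function name sniff_bom and the table layout are assumptions for illustration; the BOM constants come from the standard codecs module.

```python
import codecs

# Longer BOMs are checked first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the same bytes as the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, payload without BOM), or (None, data) if no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


print(sniff_bom(codecs.BOM_UTF8 + b"hi"))
print(sniff_bom(b"plain"))
```

Note that a file starting with FF FE 00 00 could in principle also be UTF-16-LE text beginning with U+0000, so even this is a hint rather than proof.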
Validation
Some encodings have strict rules about what makes a valid string. The best-known is UTF-8, with its rigid separation of leading/trailing bytes, prohibition of "overlong" encodings, etc. UTF-32 is even easier to recognize because the restriction of Unicode to 17 "planes" means that every code unit must have the form 00 {00-10} xx xx (or xx xx {00-10} 00 for little-endian).
So if text validates as being UTF-8 or UTF-32, you can safely assume that it is. There's a possibility of false positives, but it's very low.
However, this approach doesn't work well for UTF-16, where the false-positive rate is too high. (The only way for an even-length byte array to not be valid UTF-16 is to contain unpaired surrogates, or U+FFFE or U+FFFF.)
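Validation can be sketched in Python by simply attempting a strict decode. The helper name looks_like_utf8 is hypothetical; the strictness comes from Python's built-in UTF-8 codec, which rejects overlong forms and stray continuation bytes.

```python
def looks_like_utf8(data):
    """Return True if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("開発".encode("utf-8")))    # well-formed
print(looks_like_utf8(b"\xc0\x80"))               # overlong encoding of U+0000
print(looks_like_utf8("café".encode("cp1252")))   # lone 0xE9 byte
```

The same trick is far less useful for UTF-16, as described above: almost any even-length byte string decodes without error.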
Statistical analysis
Use character frequency tables of various language/encoding combinations. This is the approach used by chardet (in combination with BOM and validation).
Falling back on a default encoding
When all else fails, assume ISO-8859-1, windows-1252, or Encoding.Default.
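A simple fallback chain, sketched in Python under the assumption that UTF-8 is tried first and ISO-8859-1 is the last resort. The function name decode_with_fallback is made up for this example.

```python
def decode_with_fallback(data, fallback="latin-1"):
    """Decode as UTF-8 if the bytes validate, otherwise use the fallback."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so with the
        # default fallback this branch cannot raise.
        return data.decode(fallback)


print(decode_with_fallback("é".encode("utf-8")))
print(decode_with_fallback(b"caf\xe9"))
```

The fallback always produces some string, but whether it is the right string is still a guess.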
I was reading a few questions on SO about Unicode and there were some comments I didn't fully understand, like this one:
Dean Harding: UTF-8 is a
variable-length encoding, which is
more complex to process than a
fixed-length encoding. Also, see my
comments on Gumbo's answer: basically,
combining characters exist in all
encodings (UTF-8, UTF-16 & UTF-32) and
they require special handling. You can
use the same special handling that you
use for combining characters to also
handle surrogate pairs in UTF-16, so
for the most part you can ignore
surrogates and treat UTF-16 just like
a fixed encoding.
I'm a little confused by the last part ("for the most part"). If UTF-16 is treated as a fixed 16-bit encoding, what issues could this cause? What are the chances that there are characters outside of the BMP? If there are, what issues could this cause if you'd assumed two-byte characters?
I read the Wikipedia info on Surrogates but it didn't really make things any clearer to me!
Edit: I guess what I really mean is "Why would anyone suggest treating UTF-16 as fixed encoding when it seems bogus?"
Edit2:
I found another comment in "Is there any reason to prefer UTF-16 over UTF-8?" which I think explains this a little better:
Andrew Russell: For performance:
UTF-8 is much harder to decode than
UTF-16. In UTF-16 characters are
either a Basic Multilingual Plane
character (2 bytes) or a Surrogate
Pair (4 bytes). UTF-8 characters can
be anywhere between 1 and 4 bytes
This suggests the point being made was that UTF-16 would not have any three-byte characters, so by assuming 16bits, you wouldn't "totally screw up" by ending up one-byte off. But I'm still not convinced this is any different to assuming UTF-8 is single-byte characters!
UTF-16 includes all "base plane" characters. The BMP covers most of the current writing systems, and includes many older characters that one can practically encounter. Take a look at them and decide whether you really are going to encounter any characters from the extended planes: cuneiform, alchemical symbols, etc. Few people will really miss them.
If you still encounter characters that require extended planes, these are encoded by two code units (a surrogate pair), and you'll see two empty squares or question marks instead of the character. UTF-16 is self-synchronizing, so half of a surrogate pair never looks like a legitimate character. This allows things like string searches to work even if surrogates are present and you don't handle them.
Thus issues arising from treating UTF-16 as effectively UCS-2 are minimal, aside from the fact that you don't handle the extended characters.
EDIT: Unicode uses 'combining marks' that render in the space of the previous character, like accents, tilde, circumflex, etc. Sometimes a combination of a diacritic mark with a letter can be represented as a distinct code point, e.g. á can be represented as a single \u00e1 instead of a plain 'a' + accent, which is \u0061\u0301. Still, you can't represent unusual combinations like z̃ as one code point. This makes search and splitting algorithms a bit more complex. If you somehow make your string data uniform (e.g. only using plain letters and combining marks), search and splitting become simple again, but you still lose the 'one position is one character' property. A symmetrical problem happens if you're seriously into typesetting and want to explicitly store ligatures like ﬁ or ﬄ, where one code point corresponds to 2 or 3 characters. This is not a UTF issue, it's an issue of Unicode in general, AFAICT.
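The composed and decomposed spellings mentioned above can be compared with Python's standard unicodedata module. This is only a sketch; the variable names are illustrative.

```python
import unicodedata

# 'a' followed by COMBINING ACUTE ACCENT: two code points, one visible letter.
decomposed = "a\u0301"

# NFC replaces the pair with the precomposed code point U+00E1.
composed = unicodedata.normalize("NFC", decomposed)

# There is no precomposed "z with tilde", so NFC leaves two code points.
z_tilde = unicodedata.normalize("NFC", "z\u0303")

print(len(decomposed), len(composed), len(z_tilde))
```

Making the data uniform, as suggested above, amounts to running everything through one normalization form before searching or comparing.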
It is important to understand that even UTF-32 is fixed-length when it comes to code points, not characters. There are many characters that are composed from multiple code points, and therefore you can't really have a Unicode encoding where one number (code unit) corresponds to one character (as perceived by users).
To answer your question - the most obvious issue with treating UTF-16 as fixed-length encoding form would be to break a string in a middle of a surrogate pair so you get two invalid code points. It all really depends what you are doing with the text.
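To make the failure concrete, here is a Python sketch that cuts a UTF-16 byte string at a code-unit boundary which happens to fall inside a surrogate pair. The sample string uses U+10400, the same character as the \uD801\uDC00 pair.

```python
s = "a\U00010400b"                 # three code points, one outside the BMP
units = s.encode("utf-16-le")      # 61 00 | 01 D8 | 00 DC | 62 00

# Counting 16-bit units over-counts the characters.
code_unit_count = len(units) // 2

# Keep "the first two characters" under the fixed-width assumption.
cut = units[:4]
try:
    cut.decode("utf-16-le")
    cut_ok = True
except UnicodeDecodeError:
    # The slice ends with a lone high surrogate (D801).
    cut_ok = False

print(len(s), code_unit_count, cut_ok)
```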
I guess what I really mean is
"Why would anyone suggest treating
UTF-16 as fixed encoding when it seems
bogus?"
Two words: Backwards compatibility.
Unicode was originally intended to use a fixed-width 16-bit encoding (UCS-2), which is why early adopters of Unicode (e.g., Sun with Java and Microsoft with Windows NT) used a 16-bit character type. When it turned out that 65,536 characters weren't enough for everyone, UTF-16 was developed in order to allow these 16-bit character systems to represent the 16 new "planes".
This meant that characters were no longer fixed-width, so people created the rationalization that "that's OK because UTF-16 is almost fixed width."
But I'm still not convinced this is
any different to assuming UTF-8 is
single-byte characters!
Strictly speaking, it's not any different. You'll get incorrect results for things like "\uD801\uDC00".lower().
However, assuming UTF-16 is fixed width is less likely to break than assuming UTF-8 is fixed-width. Non-ASCII characters are very common in languages other than English, but non-BMP characters are very rare.
You can use the same special handling
that you use for combining characters
to also handle surrogate pairs in
UTF-16
I don't know what he's talking about. Combining sequences, whose constituent characters have an individual identity, are nothing at all like surrogate characters, which are only meaningful in pairs.
In particular, the characters within a combining sequence can be converted to a different encoding form one character at a time.
>>> 'a'.encode('UTF-8') + '\u0301'.encode('UTF-8')
b'a\xcc\x81'
But not surrogates:
>>> '\uD801'.encode('UTF-8') + '\uDC00'.encode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
UTF-16 is a variable-length encoding. The older UCS-2 is not. If you treat a variable-length encoding like fixed (constant length) you risk introducing error whenever you use "number of 16-bit numbers" to mean "number of characters", since the number of characters might actually be less than the number of 16-bit quantities.
The Unicode standard has changed several times along the way. For example, UCS-2 is not a valid encoding anymore. It has been deprecated for a while now.
As mentioned by user 9000, even in UTF-32, you have sequences of characters that are interdependent. The á is a good example, although this character can be canonicalized to \u00E1. So you can make it simple.
Unicode, even when using the UTF-32 encoding, supports up to 30 code points, one after the other, to represent the most complex characters. (The existing characters do not use that many, I think the longest in existence is currently 17 if I'm correct.)
For that reason, Unicode developed Normalization Forms. It actually considers five different forms:
Unnormalized -- a sequence you create manually, for example; text editors are expected to save properly normalized (NFC) code sequences
NFD -- Normalization Form Canonical Decomposition
NFKD -- Normalization Form Compatibility Decomposition
NFC -- Normalization Form Canonical Composition
NFKC -- Normalization Form Compatibility Canonical Composition
Although in most situations it does not matter much because long compositions are rare, even in languages that use them.
And in most cases, your code already deals with canonical compositions. However, if you create strings manually in your code, you are not unlikely to create an unnormalized string (assuming you use such long forms).
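The four normalized forms can be compared side by side with Python's unicodedata.normalize. The sample string (a ligature followed by a precomposed é) is an assumption chosen so that the canonical and compatibility forms differ.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI followed by U+00E9 (precomposed é).
sample = "\ufb01\u00e9"

forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFD", "NFC", "NFKD", "NFKC")
}

for name, text in forms.items():
    # Show each form as a list of code points.
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

The canonical forms (NFD, NFC) keep the ligature intact, while the compatibility forms (NFKD, NFKC) replace it with the plain letters f and i.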
Properly implemented servers on the Internet are expected to refuse strings that are not canonical compositions as per Unicode. Long forms are also forbidden over connections. For example, the UTF-8 encoding technically allows for ASCII characters to be encoded using 1, 2, 3, or 4 bytes (and the old encoding allowed up to 6 bytes!) but those encodings are not permitted.
Any comment on the Internet that contradicts the Unicode Normalization Form document is simply incorrect.
Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does but has limitations. I simply don't understand: are the limitations anything big/key, or not a big deal?
You can certainly store unicode strings in lua, as utf8. You can use these as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ascii characters, etc. As an example, if you have a string "開発.txt" and you search for "." in that string using string.find (string_var, "."), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct utf8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
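The same byte-level reasoning can be demonstrated outside Lua. In this sketch a Python bytes object stands in for a Lua string (both are plain byte sequences), and the file name is the one from the example above.

```python
data = "開発.txt".encode("utf-8")

# Split at an ASCII byte: both halves remain well-formed UTF-8.
stem, dot, ext = data.partition(b".")

# Upper-case only the ASCII letters a-z; bytes with the high bit set
# (all bytes of the non-ASCII characters) are passed through untouched.
upper = bytes((b - 32 if 0x61 <= b <= 0x7A else b) for b in data)

print(stem.decode("utf-8"), (dot + ext).decode("utf-8"), upper.decode("utf-8"))
```

This works because every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never be mistaken for an ASCII character.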
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in unicode-characters of a string, just count the number of characters with the high bit zero (ASCII characters), and the number of characters with the top two bits 11 ("leading bytes" for non-ASCII characters); the length is the sum of those two.
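The counting rule just described, sketched in Python over raw bytes. The function name utf8_length is hypothetical, and the input is assumed to be well-formed UTF-8.

```python
def utf8_length(data):
    """Count code points in well-formed UTF-8 by skipping continuation bytes.

    Bytes below 0x80 are ASCII; bytes at or above 0xC0 have the top two
    bits set and start a multi-byte sequence. Everything in between
    (10xxxxxx) is a continuation byte and is not counted.
    """
    return sum(1 for b in data if b < 0x80 or b >= 0xC0)


name = "開発.txt".encode("utf-8")
print(len(name), utf8_length(name))
```

Note that this counts code points, so a letter followed by a combining mark still counts as two.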
For more complex operations (e.g., case-conversion on non-ASCII characters), you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page.
Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)
If you want a short answer, it is 'yes and no' as put on the linked site.
Lua supports Unicode in the way that specifying, storing and querying arbitrary byte values in strings is supported, so you can store any kind of Unicode-encoding encoded string in a Lua string.
What is not supported is iteration by unicode character, there is no standard function for string length in unicode characters etc. So the higher-level kind of Unicode support (like what is available in Python with length, lower -> upper case conversion, encoding in arbitrary coding etc) is not available.
Lua 5.3 has now been released. It comes with a basic UTF-8 library.
You can use the utf8 library for UTF-8 operations such as getting the length of a UTF-8 string in characters (not the number of bytes, as string.len returns), matching individual characters (not bytes), etc.
Beyond the encoding itself it provides no native support, for example for questions like: is this character a Chinese character?
It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.