Character Sets, Locales, Fonts and Code Pages? - unicode

I can't figure out the relation between these terms. I'd like a brief explanation of each and, ideally, how they relate to one another.
Moreover, where does all this stuff reside? Where is it implemented? Is it the operating system's job to manage these things? If not, who is responsible for this job?

Character Sets describe the relationship between character codes and characters.
As an example, all (extended) ASCII character sets assign 0x41 (65 decimal) to 'A'.
Common character sets are ASCII, Unicode (UTF-8, UTF-16), Latin-1 and Windows-1252.
Code Pages are a representation of character sets, or rather a mechanism for selecting which character set is in use: there are the old DOS/computer-manufacturer code pages and the legacy support for them in Windows as the ANSI code page (ANSI is not to blame here) and the OEM code page.
If you have a choice, avoid them like the plague and go for Unicode, preferably UTF-8, though UTF-16 is an acceptable choice for the OS-facing part in Windows.
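To make the relationship concrete, here is a minimal Ruby sketch (Ruby is also the language of the examples further below): one character, one code point, and a different byte representation depending on the chosen encoding or code page.
p "é".codepoints                    # => [233]       the code point U+00E9
p "é".encode('Windows-1252').bytes  # => [233]       one byte in the Windows "ANSI" code page
p "é".encode('UTF-8').bytes         # => [195, 169]  two bytes in UTF-8
p "é".encode('UTF-16LE').bytes      # => [233, 0]    two bytes in UTF-16 (little-endian)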
Locales are collections of all the information needed to conform to local conventions for display of information. On systems which neither use code pages nor have a system-defined universal character set, they also determine the character set used (e.g. Unixoids).
Fonts are the graphics and ancillary information necessary for displaying text of a known encoding. Examples are "Times New Roman", "Verdana", "Arial" and "Wingdings".
Not all Fonts have symbols for all characters present in any specific character set.

Will precluding surrogate code points also impede entering Chinese characters?

I have a name input field in an app and would like to prevent users from entering emojis. My idea is to filter for any characters from the general categories "Cs" and "So" in the Unicode specification, as this would prevent the bulk of inappropriate characters but allow most characters for writing natural language.
But after reading the spec, I'm not sure if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points. (My understanding is still rough.)
Would excluding surrogates still leave most Chinese users with the characters they need to enter their names, or is the original Unicode space not big enough for that to be a reasonable expectation?
Your method would be both ineffective and overly broad.
Not all emoji are outside the Basic Multilingual Plane (those inside it don't require surrogates in the first place), and not all emoji belong to the general category So. Filtering out only these two groups of characters would leave the following emoji intact:
#️⃣ *️⃣ 0️⃣ 1️⃣ 2️⃣ 3️⃣ 4️⃣ 5️⃣ 6️⃣ 7️⃣ 8️⃣ 9️⃣ ‼️ ⁉️ ℹ️ ↔️ ◼️ ◻️ ◾️ ◽️ ⤴️ ⤵️ 〰️ 〽️
At the same time, this approach would also exclude about 79,000 (and counting) non-emoji characters covering several dozen scripts – many of them historic, but some with active user communities. The majority of all Han (Chinese) characters, for instance, are encoded outside the BMP. While most of these are of scholarly interest only, you will need to support them regardless, especially when you are dealing with personal names. You can never know how uncommon your users' names might be.
This whole ordeal also hinges on the technical details of your app. Removing surrogates would only work if the framework you are using encodes strings in a format that actually employs surrogates (i.e. UTF-16) and if that framework also exposes individual UTF-16 code units rather than whole characters (as Java and JavaScript do, for example). Surrogates are never treated as actual characters; they are reserved code points that exist for the sole purpose of allowing UTF-16 to represent characters in the higher planes. Other Unicode encodings aren't allowed to use them at all.
If your app is written in a language that either uses a different encoding like UTF-8 or is smart enough to process surrogates correctly, then removing Cs characters on input is never going to have any effect because no individual surrogates are ever being exposed to your program. How these characters are entered by the user does not matter because all your app gets to see is the finished product (the actual character codepoints).
If your goal is to remove all emoji and only emoji, then you will have to put a lot of effort into designing your code because the Unicode emoji spec is incredibly convoluted. Most emoji nowadays are constructed out of multiple characters, not all of which are categorised as emoji by themselves. There is no easy way to filter out just emoji from a string other than maintaining an explicit list of every single official emoji which would need to be steadily updated.
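A rough Ruby sketch of the proposed filter (the method name is just for illustration) shows both problems at once: in a UTF-8 language the \p{Cs} part can never match, category-So filtering leaves keycap emoji untouched, and rare Han name characters outside the BMP pass through regardless.
def naive_filter(text)               # hypothetical filter built from the question's idea
  text.gsub(/[\p{Cs}\p{So}]/, '')    # \p{Cs} never matches inside valid UTF-8 strings
end

p naive_filter("Anna 😀")    # => "Anna "     U+1F600 is So and gets removed
p naive_filter("Li 1️⃣")      # => "Li 1️⃣"     keycap emoji is Nd + Mn + Me, untouched
p naive_filter("丘 𠀉")       # => "丘 𠀉"      rare Han name character (Lo) is kept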
Will precluding surrogate code points also impede entering Chinese characters? […] if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points.
You cannot intercept how characters are entered, whether via an input method editor (IME), copy and paste, or dozens of other possibilities. You only get to see a character once it is complete (and the IME's work is done) or, depending on the widget toolkit, only after the text has been submitted. That leaves you with validation. Let's consider a realistic case. From Unihan_Readings.txt 12.0.0 (2018-11-09):
U+20009 ‹𠀉› (the same as U+4E18 丘) a hill; elder; empty; a name
U+22218 ‹𢈘› variant of 鹿 U+9E7F, a deer; surname
U+22489 ‹𢒉› a surname
U+224B9 ‹𢒹› surname
U+25874 ‹𥡴› surname
Assume the user enters 𠀉. Your unnamed – but hopefully Unicode-compliant – programming language must then consider the text at the grapheme level (1 grapheme cluster) or the character level (1 character), not at the code-unit level (the surrogate pair 0xD840 0xDC09). That means it is okay to exclude characters with the Cs property.
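In Ruby, for instance, the three levels for that same character look like this (the code-unit values are simply what UTF-16 mandates for U+20009):
s = "𠀉"                                   # U+20009, a rare Han character used in names
p s.grapheme_clusters.length                # => 1             grapheme level
p s.length                                  # => 1             character level
p s.codepoints.map { |c| c.to_s(16) }       # => ["20009"]     code point
p s.encode('UTF-16BE').unpack('n*').map { |u| u.to_s(16) }   # => ["d840", "dc09"]  UTF-16 code units (surrogate pair)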

ASCII - (encoded) character set or character encoding

Is ASCII an (encoded) character set or an encoding? Some sources say it's a (7-bit) encoding, others say it's a character set.
What's correct?
It's an encoding that only supports a certain set of characters.
Once upon a time, when computers or operating systems would often only support a single encoding, it was sensible to refer to the set of characters it supported as a character set, for obvious enough reasons.
From 1963 on, ASCII was a commonly supported character set, and many other character sets were either variations on it or 8-bit extensions of it.
But as well as defining a set of characters, it also assigned numerical values, so it was a coded character set.
And since it assigned a numerical value to each character, it also provided a way to store those characters in sequences of bytes (as long as the byte size is 7 bits or more), so it also defined an encoding.
So ASCII was used both to refer to the set of characters it supported, and the encoding rules by which those characters would be stored digitally.
These days most computers use the Universal Character Set. While there are encodings (UTF-8 and UTF-16 being the most prevalent) that can encode the entire UCS, there remain some uses for legacy encodings like ASCII that can only encode a small subset of it.
So, ASCII can refer both to an encoding and to the set of characters it supports, but in remaining modern use (especially in cases where an escape mechanism allows other characters to be represented indirectly, such as character entity references) it mostly refers to an encoding. Conversely, though, "character set" (or the abbreviation "charset") is still sometimes used to refer to encodings. So in common parlance the two are synonyms, as unfortunate (and technically inaccurate) as that may be.
You could say that ASCII is a character set that has two encodings: a 7-bit one called ASCII and an 8-bit one called ASCII.
The 7-bit one was sometimes paired with a parity bit scheme when text was sent over unreliable transports. Today, error detection and correction is handled on a separate layer so only the 8-bit encoding is used.
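As a small illustration (a Ruby sketch, not tied to any particular framework): viewed purely as an encoding, US-ASCII puts each supported character into one byte with the high bit clear, and anything outside that set simply cannot be encoded.
ascii = "Hello".encode('US-ASCII')
p ascii.bytes                            # => [72, 101, 108, 108, 111]
p ascii.bytes.all? { |b| b < 0x80 }      # => true, everything fits in the 7-bit range
begin
  "héllo".encode('US-ASCII')
rescue Encoding::UndefinedConversionError => e
  puts e.message                         # é is not in the ASCII character set
end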
Terms change over time as concepts evolve and convolve. "Character" is currently a very ambiguous term. People often mean grapheme when they say character. Or they mean a particular data type in a specific language.
"ASCII" is a genericized brand and leads to a lot of confusion. The ASCII that I've described above is only used in very specialized contexts.
It looks like your question cannot currently be answered definitively, as "character set" is not a precisely defined term.
https://en.wikipedia.org/wiki/Category:Character_sets
The category of character sets includes articles on specific character encodings (see the article for a precise definition, and for why the term "character set" should not be used).
Edit: in my opinion, ASCII can only be seen as an encoding, or better, a code page. See for example Microsoft's listing of code pages:
20127 us-ascii
65001 utf-8

What is this character: 🔖 ? Where can I see the similar characters?

🔖
I am not sure whether everyone can see the above character, but I can. I got it when I typed "booknote" in Chinese on my iPhone. To my surprise, this character seems to be "platform-insensitive": it can be seen on my phone, in Chrome on my laptop, and even in the macOS terminal.
Is it an ASCII character? I've never seen colorful characters like this before. Since when have these been around? And where can I get a list of similar characters?
Here: http://www.unicode.org/charts/nameslist/index.html
You put the character on an HTML page. All characters on an HTML page are from the Unicode character set. Characters that are not in the Unicode character set either soon will be or are too specialized to be of general use.
The Unicode Consortium occasionally publishes a new version of the character set. Since you ask about the kind of character, the common partitions of the character set are blocks, categories, and—stretching a bit—which version the character was added in. Some characters are in a script (for a language writing system), some are not. You see the block and category of 🔖 at http://www.fileformat.info/info/unicode/char/1f516/index.htm.
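If you want to poke at the character programmatically, a small Ruby sketch shows the same information the linked page reports:
c = "🔖"
p c.codepoints.map { |cp| format("U+%04X", cp) }   # => ["U+1F516"]
p c.match?(/\p{So}/)                               # => true, general category "Symbol, other"
p c.bytes                                          # => [240, 159, 148, 150]  its UTF-8 bytes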
The Unicode character set is published in text files called the Unicode Character Database (UCD), as well as many supplementary documents and webpages. The data includes important information about usage and relationships. For example, for applicable characters, which character is considered the uppercase form of another in a particular language.
To see any character, you have to use a font that presents it. This can be a problem for some characters. There is probably no one font that presents every Unicode character as it was meant to be.
You mentioned ASCII. Although it is used every day in HTTP headers and other specialized and historical applications, ASCII is such a limited character set that it hasn't been adequate for general use in decades.

Unicode usage in software

I have been tormented by the question of Unicode usage for a long time. Unicode makes it possible to accelerate and simplify the development of software (in terms of globalization), but I am concerned by the following factors:
increased memory and disk space usage;
reduced text-processing performance;
Asian languages being treated all alike, to the detriment of national specifics.
The first point is obvious... but I don't know whether the others are true or not. Has anyone faced the need to localize software for Asian countries and is willing to share the experience?
At the moment I try to use narrow, region-specific encodings (cp1251 for Russia, cp1254 for Turkey, etc.). Can anybody advise on this issue?
The impact on data size in bytes depends on the choice of Unicode encoding and on the type of data. For example, using UTF-8 (the only useful Unicode encoding on the web), English text has the same size as in 8-bit encodings, except for typographically correct punctuation marks, which may take three bytes each; for Turkish text, any non-ASCII letter is 2 bytes instead of 1 byte; for Russian text, any Cyrillic letter is 2 bytes. In most cases, this does not matter much.
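A quick Ruby comparison (the sample strings are arbitrary) illustrates those sizes:
p "hello".bytesize                           # => 5   ASCII range: 1 byte per letter in UTF-8
p "ğüş".encode('Windows-1254').bytesize      # => 3   legacy 8-bit Turkish code page
p "ğüş".bytesize                             # => 6   UTF-8: 2 bytes per non-ASCII letter
p "привет".encode('Windows-1251').bytesize   # => 6   legacy 8-bit Cyrillic code page
p "привет".bytesize                          # => 12  UTF-8: 2 bytes per Cyrillic letter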
Text processing performance depends on what you do and how you do that. The reasonable expectation is that there is no problem worth worrying about. If processing is fast enough, it hardly matters whether it would be 10% faster using an 8-bit encoding.
Unicode unification has its impact, but surely Asian languages are not treated all alike. The Unicode standard has a lot to say about specific treatment of characters in Asian scripts and languages. If you are referring to the different shapes of CJK characters in different languages, then the usual solution is to use fonts designed for the language used. (In addition, it can in principle at least also be handled within a font, when OpenType fonts are used.)
Check out the official Unicode FAQ. It has a lot to say about issues like these.
The first two points are very much negligible. You'd need a very specific use case where the difference in size and performance makes a discernible difference that justifies the headaches of mixed encodings.
Regarding the Unihan characters: they are grouped by the meaning of the character, but that character may be written slightly differently in different writing systems. This is a problem of properly marking up the language; it's not really an encoding problem. In HTML documents, you can mark the document with lang attributes and/or set specific fonts using CSS, which will alter the appearance of the character appropriately for the language. How to handle this correctly depends on the type of software (HTML, desktop app, etc.). I'd advise you to open a new, detailed question about that.
Increased text size: yes, but not dramatically. A character takes at most 4 bytes in UTF-8 (the original design allowed up to 6, but Unicode is now capped at U+10FFFF), and storage for text is hardly a problem nowadays.
Reduction of text-processing performance: in my opinion, no. A UTF-8 character may take up to 4 bytes, but when scanning through the text, the first byte of a UTF-8 sequence already tells us how many more bytes to read for the current character. So scanning performance most likely stays O(n), where n is the length of the text. To keep the best performance, try not to access the characters of a text by index (this is indeed a weak point of variable-width encodings). Java strings are not affected by random index access because a Java string is a sequence of 2-byte UTF-16 code units (though an index then gives you a code unit, not necessarily a whole character).
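A sketch of that scanning argument in Ruby (the helper name is made up for illustration, and it assumes well-formed UTF-8):
# The lead byte of a well-formed UTF-8 sequence tells you its total length,
# so a forward scan never has to back up.
def utf8_sequence_length(lead_byte)
  case lead_byte
  when 0x00..0x7F then 1
  when 0xC2..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  else raise ArgumentError, "not a lead byte of well-formed UTF-8"
  end
end

p utf8_sequence_length("a".bytes.first)    # => 1
p utf8_sequence_length("€".bytes.first)    # => 3   (€ is E2 82 AC)
p utf8_sequence_length("長".bytes.first)   # => 3   (長 is E9 95 B7)
p utf8_sequence_length("😀".bytes.first)   # => 4   (😀 is F0 9F 98 80)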
Asian languages treated all alike to the detriment of national specifics: in encoded text, all human languages are treated alike; a single-stroke letter 'i' and a 16-stroke character '長' are each just one character.
"Increased text size" and all of the following points are actually untrue.
They may be true for old-school Unicode encodings such as UTF-16, but UTF-8 is no larger or slower than ASCII for ASCII-only strings, and yet it can encode every Unicode code point. UTF-8 is also the de facto standard for Unicode in the marketplace today.
There is an extensive analysis of performance of different Unicode encodings in http://www.utf8everywhere.org, including for the Asian languages.

What's the difference between an "encoding," a "character set," and a "code page"?

I'm really trying to get better with this stuff. I'm pretty functional with internationalization concepts like this, but I need to get a better background on the theory behind it.
I've read Spolsky's article, but I'm still unclear because these three terms get used interchangeably a LOT -- even in that article. I think at least two of them are talking about the same thing.
I suspect a high percentage of developers flub their way through this stuff on a daily basis. I don't want to be one of those developers anymore.
A ‘character set’ is just what it says: a properly-specified list of distinct characters.
An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.
UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set(*).
The confusion comes about because most other well-known encodings (eg.: ISO-8859-1) started out as separate character sets. Then when Unicode came along as a superset of most of these character sets, it became possible to think of them as different (but partial) encodings of the same (Unicode) character set, rather than just isolated character sets. Looking at them this way allows you to convert between them through Unicode easily, which would not be possible if they were merely isolated character sets. But it still makes sense to refer to them as character sets, so either term could be used.
A ‘code page’ is a term stemming from IBM, where it chose which set of symbols would be displayed. The term continued to be used by DOS and then Windows, through to Unicode-aware Windows where it just acts as an encoding with a numbered identifier. Whilst a numbered ‘code page’ is an idea not inherently limited to Microsoft, today the term would almost always just mean an encoding that Windows knows about.
When one is talking of code page ‹some number› one is typically talking about a Windows-specific encoding, as distinct from an encoding devised by a standards body. For example code page 28591 would not normally be referred to under that name, but simply ‘ISO-8859-1’. The Windows-specific Western European encoding based on ISO-8859-1 (with a few extra characters replacing some of its control codes) would normally be referred to as ‘code page 1252’.
[*: All the UTFs are encodings not character sets, but this kind of thing isn't exclusive to Unicode. For example the Japanese standard JIS X 0208 defines a character set and two different byte encodings for it: the somewhat unpleasant high-byte-based encoding (‘Shift-JIS’), and the deeply horrific escape-switching-based encoding (‘JIS’).]
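A small Ruby illustration of the code-page point above: the euro sign exists in Windows code page 1252 and in ISO-8859-15 at different byte values, and not at all in plain ISO-8859-1; because all of them can be viewed as (partial) encodings of Unicode, conversion simply goes through the shared code point.
p "€".codepoints                     # => [8364]  U+20AC, the shared code point
p "€".encode('Windows-1252').bytes   # => [128]   0x80 in code page 1252
p "€".encode('ISO-8859-15').bytes    # => [164]   0xA4 in ISO-8859-15
begin
  "€".encode('ISO-8859-1')
rescue Encoding::UndefinedConversionError
  puts "no euro sign in ISO-8859-1"
end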
A Character Set is just that, a set of characters that can be used.
Each of these characters is mapped to an integer called code point.
How these code points are represented in memory is the encoding. An encoding is just a method to transform a code-point (U+0041 - Unicode code-point for the character 'A') into raw data (bits and bytes).
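For instance (a minimal Ruby sketch), the code point stays the same while the raw bytes depend on which encoding you pick:
p "A".codepoints                 # => [65]           the code point U+0041
p "A".encode('UTF-8').bytes      # => [65]           UTF-8: one byte
p "A".encode('UTF-16BE').bytes   # => [0, 65]        UTF-16 (big-endian): two bytes
p "A".encode('UTF-32BE').bytes   # => [0, 0, 0, 65]  UTF-32: four bytes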
I thought Joel's article was pretty much spot on - it is the history behind the evolution of character sets and storage which has brought this about.
FWIW, in my oversimplistic view
Character Sets (ASCII, EBCDIC, UNICODE) would be the numeric representation of characters, independent of storage considerations
Encoding would relate to the efficient storage of characters (ANSI, UTF-7, UTF-8, etc.), whether in a file, across the wire, etc.
Code Page would be the 'kluge' needed when the demand for the addition of new characters (without wanting to increase storage capacity) meant that (certain) characters were only knowable in the additional context of a code page.
IMHO Wikipedia currently doesn't help things by defining code page as 'another name for character encoding'
and redirecting 'character set' to 'character encoding'
A character set is a set of characters, i.e. "glyphs", i.e. visual symbols representing units of communication. The letter 'a' is a glyph, and so is '€' (the euro sign). Character sets usually map an integer (a codepoint) to each character, but it's the encoding that dictates the binary/byte-level representation of the character.
I'm a ruby programmer, so here are some examples to help you understand the concepts.
This reveals how the Unicode character set maps codepoints to characters, but not how each character is stored as bytes. (Ruby 1.9 defaults to Unicode strings.)
>> 'a'.codepoints.to_a
=> [97]
>> '€'.codepoints.to_a
=> [8364]
Since 8364 (base 10) is too large to fit in one byte, various encoding strategies exist to specify a translation from Unicode codepoints into one or many bytes. The UTF-8 encoding is probably the most popular of these encodings. (Wikipedia shows the UTF-8 encoding algorithm, if you want to delve into the implementation.) Note that the UTF-8 encoding only makes sense in the context of the Unicode character set.
The following reveals how the UTF-8 encoding stores each Unicode character as bytes (0 thru 255 in base-10). (Ruby 1.9's default encoding is UTF-8.)
>> 'a'.bytes.to_a
=> [97]
>> '€'.bytes.to_a
=> [226, 130, 172]
Here's the same thing in the ISO-8859-15 character set:
>> 'a'.encode('iso-8859-15').codepoints.to_a
=> [97]
>> '€'.encode('iso-8859-15').codepoints.to_a
=> [164]
And the ISO-8859-15 encoding:
>> 'a'.encode('iso-8859-15').bytes.to_a
=> [97]
>> '€'.encode('iso-8859-15').bytes.to_a
=> [164]
Notice that the ISO-8859-15 codepoints match the byte representation.
Here's a blog entry that might be helpful: http://graysoftinc.com/character-encodings/what-is-a-character-encoding. Entries 1 thru 3 are good if you don't want to get too ruby-specific.
The chapter on Unicode in this book, Advanced Perl Programming, contains the best description of encodings, character sets and the other entities of Unicode that I've come across. Unfortunately I don't think it's available for free online.