Unicode range of Chinese characters (minimum viable) - unicode

I'm making a game that needs to be translated into Chinese. I'm making custom bitmap fonts for Chinese. I need unicode ranges for the Chinese "alphabet". I want to compact this as much as possible. I don't need the symbol for "pen" during the Han dynasty.
I've got this so far:
0x4E00-0x62FF, 0x6300-0x77FF, 0x7800-0x8CFF, 0x8D00-0x9FCC
But it's enormous at 20k characters.
Can I compact this in any way?
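As a quick sanity check on the numbers above, here is a minimal Python sketch. The helper names `merge_ranges` and `count_code_points` are made up for illustration. It shows that the four listed ranges are contiguous, so they collapse into a single range, and it counts the code points they cover.

```python
def merge_ranges(ranges):
    """Merge overlapping or directly adjacent inclusive (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def count_code_points(ranges):
    """Count code points covered by non-overlapping inclusive ranges."""
    return sum(end - start + 1 for start, end in ranges)


# The ranges quoted in the question.
question_ranges = [
    (0x4E00, 0x62FF),
    (0x6300, 0x77FF),
    (0x7800, 0x8CFF),
    (0x8D00, 0x9FCC),
]

merged = merge_ranges(question_ranges)
print([(hex(s), hex(e)) for s, e in merged])  # [('0x4e00', '0x9fcc')]
print(count_code_points(merged))              # 20941
```

So the list is really one block, 0x4E00-0x9FCC, and "20k characters" is about right.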

Related

Why isn't there a font that contains all Unicode glyphs?

Pretty much as the title says. Rendering all of Unicode correctly, what with composite characters, characters that affect other characters, and ligatures, is really hard; I understand that. We have fonts that seem to be designed for maximum Unicode symbol support (Symbola, Code2001, others) and specialized fonts for certain planes or character ranges (BabelStone Han, others).
I don't know much about the underlying technical details for fonts. Is there a maximum size? Is it a copyright problem? Is essentially redrawing all ~110,000 extant glyphs too hard? I understand style concerns, but why not fall back to a 'default' font that had glyphs for everything? They're on unicode.org, redrawing them all would be pretty hard work but then you'd have a guaranteed fallback font for everything. If you got rights to some pre-existing fonts you could just composite them and that should help a lot. Such a font would be a great help to humanity and I can't see a good technical reason why it doesn't exist or at least an open-source effort to create it, so I presume an invisible-to-me reason why it can't be done.
What is that reason?
"Why would you even want that?" questions aside, from a programming perspective there's a very simple reason: the OpenType spec only affords an addressable glyph index space of one USHORT, so one font can only support 16 bits worth of glyphs identifiers, or 65,536 glyphs max. (And note the terminology: a "glyph" is not the same as a "character" or "letter")
The current version of Unicode, v8 as of this answer, contains 120,737 assigned code points, or almost twice as many as fit in a modern font (2021 edit: v13 upped this number to 143,859). In fact, Unicode hasn't been able to fit in a modern OpenType font since 2001, with the release of Unicode 3.1, which upped the number of code points from 49,259 to 94,205.
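The arithmetic behind that claim can be sketched in a few lines of Python. This is only an illustration: it assumes one glyph per code point (real fonts usually need more glyphs than that, since a glyph is not a character), and the name `fonts_needed` is made up here.

```python
import math

# One USHORT worth of glyph IDs, as described above.
GLYPH_INDEX_SPACE = 2 ** 16  # 65,536

# Code point counts quoted in the answer.
unicode_counts = {
    "before 3.1": 49_259,
    "3.1": 94_205,
    "8.0": 120_737,
    "13.0": 143_859,
}


def fonts_needed(code_points, capacity=GLYPH_INDEX_SPACE):
    """Lower bound on the number of fonts, assuming one glyph per code point."""
    return math.ceil(code_points / capacity)


for version, count in unicode_counts.items():
    fits = count <= GLYPH_INDEX_SPACE
    print(f"Unicode {version}: {count:,} code points, "
          f"fits in one font: {fits}, fonts needed: {fonts_needed(count)}")
```

Even under that generous assumption, everything from Unicode 3.1 onward needs at least two fonts, and Unicode 13 needs at least three.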
"So what about font collections?" I hear you ask. Why not use multiple fonts and support all unicode that way? Well now, you've just described Adobe's Sans Pro, and Google's Noto (which are the same font).
As for the "how hard can it be": a uniform style for all glyphs in Unicode, across 129 established written scripts on this planet, each with their own typesetting rules? Incredibly hard. You may think fonts are just files with pictures for letters, and someone types a letter, that picture shows up: that is not how fonts work, and isn't how fonts have worked since the late 1980's.
Modern fonts are the typographic equivalent of a game ROM: sure, it's not much use without the hardware or software to run that ROM on, but all the things that actually matter are in the ROM. Similarly, modern fonts contain all the information for typesetting. Not just pictures, they contain the metadata, the metrics, the positioning and substitution rules for arbitrary sequences, with separate rule sets for each written script that OpenType supports, mandatory and optional ligatures, language-specific character replacements for letters at the start/middle/final position in a word, or in isolation, character repositioning relative to arbitrarily complex sequences of other characters either before or after it, arbitrarily complex sequence replacements with other arbitrarily complex sequences, possible bitmap fallbacks for small-point rendering, hinting instructions on how to properly rasterize vector graphics that are inherently not aligned to any particular pixel grid, and more. A modern font is a ridiculously complex application that a font engine consults to figure out how to typeset sequences of code points.
Making a (set of) Unicode-encompassing font(s) that looks good for all contexts is a vast team effort.
So: "Why isn't there a font that contains all Unicode glyphs?", because that's been technically impossible since 2001. We can, and do, make font families that cover all of Unicode, but with 129 different scripts all with their own typesetting rules, it's a lot of work, and almost (almost) not worth the effort compared to only covering a subset of all languages.
And as for this:
Such a font would be a great help to humanity and I can't see a good technical reason why it doesn't exist or at least an open-source effort to create it, so I presume an invisible-to-me reason why it can't be done.
Just because you didn't know about them, doesn't mean they don't exist, with millions of people who are familiar with them. They exist =)
They're even open source, go out and thank the people who made them!
There is GNU Unifont. It aims to contain all Unicode, except Apple Emoji.
You will probably find what you are looking for at the following links.
Unicode Character Table
HTML Character Entity References
Huge List of Unicode Symbols
List of Unicode Characters of Category “Other Symbol”
This other one is fun when you are hunting for a particular character, since you can draw what you are searching for:
Unicode Character Recognition
Can't enter unicode character with Alt+ even with EnableHexNumpad
Basic Questions
Q: How many characters are in Unicode?
A: The short answer is that as of Version 13.0, the Unicode Standard contains 143,859 characters. The long answer is rather more complicated, because of all the different kinds of characters that people might be interested in counting.
Unicode font
A Unicode font is a computer font that maps glyphs to code points defined in the Unicode Standard. The vast majority of modern computer fonts use Unicode mappings, even those fonts which only include glyphs for a single writing system, or even only support the basic Latin alphabet.
Fonts which support a wide range of Unicode scripts and Unicode symbols are sometimes referred to as "pan-Unicode fonts", although as the maximum number of glyphs that can be defined in a TrueType font is restricted to 65,535, it is not possible for a single font to provide individual glyphs for all defined Unicode characters (143,859 characters, with Unicode 13.0).
...
No single "Unicode font" includes all the characters defined in the present revision of ISO 10646 (Unicode) standard, as more and more languages and characters are continually added to it, and common font formats cannot contain more than 65,535 glyphs (about half the number of characters encoded in Unicode).
As a result, font developers and foundries incorporate new characters in newer versions or revisions of a font, or in separate auxiliary fonts intended specifically for particular languages.
Enjoy!

What range of unicode characters should be kept in a @font-face web font for a US based website with a US audience?

As part of optimizing a web development project, we need to strip out unnecessary characters that are never going to be used to reduce the size of font files. I have searched Google and found nothing canonical on the subject of which characters are required and which are safe to remove.
I've found the following ranges that may be of interest:
0020 — 007F Basic Latin
00A0 — 00FF Latin-1 Supplement
0100 — 017F Latin Extended-A
0180 — 024F Latin Extended-B
0250 — 02AF IPA Extensions
02B0 — 02FF Spacing Modifier Letters
0300 — 036F Combining Diacritical Marks
27C0 — 27EF Miscellaneous Mathematical Symbols-A
It seems that the most aggressive approach would be to only keep "Basic Latin", 0020 — 007F, which provides upper and lower-case letters, numbers and a few basic symbols, like the $, +, (, ), etc.
Latin-1 Supplement contains some extra goodies like Trademark and Copyright symbols and fractions.
Latin Extended-A and -B contain letters with accent marks, and since our copy is in English, I'm not sure if these will ever be needed.
If we use only those ranges (0020-007F) and (00A0-00FF), will we run into problems down the line with missing characters, should some user decide to post a comment in Spanish (for example)? Or will the browser fall back to a default font for characters that aren't included in the web font?
The point of a web-font is to make the main bodies of text and headlines look pretty, which the basic latin set should cover, but I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font, etc.
What range of unicode characters should be kept in a @font-face web font for a US based website with a US audience? Are there any best practices or guidelines for stripping unnecessary characters from a font for web use?
I would recommend subsetting to one of the common "code page" definitions that support US/Western Europe. Most code page definitions pre-date Unicode and typically have the bits and pieces needed for various regional support without including entire Unicode blocks. Suggestions:
Windows Code Page 1252
ISO/IEC 8859-1 "Latin 1"*
ISO/IEC 8859-15
*This is the same as Unicode Ranges 0020-007F Basic Latin + 00A0-00FF Latin-1 Supplement
These include much more than is strictly required for US English, though as noted above, several accented characters commonly appear in English text (é, ñ, as well as other punctuation marks and symbols). These sets include those characters, so you should be in good shape for the vast majority of text for a U.S. audience. Note also that in most fonts, these characters are typically "composites", which means that they use a reference to the components (e.g. 'é' is built from references to 'e' and '´'); as such, they don't normally require as much size to store them, so retaining them usually won't incur a major size penalty.
If you might encounter European financial text, I'd suggest either Windows 1252 or ISO/IEC 8859-15 which include the Euro currency symbol.
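To see what each of these code pages actually covers, you can ask Python's built-in codecs. This is a rough sketch rather than a subsetting tool: the function name `repertoire` is made up, and it simply decodes every byte from 0x20 to 0xFF and drops control characters.

```python
import unicodedata


def repertoire(codec):
    """Return the non-control characters a single-byte codec can represent."""
    chars = set()
    for byte in range(0x20, 0x100):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the code page
        if unicodedata.category(ch) != "Cc":  # skip control characters
            chars.add(ch)
    return chars


latin1 = repertoire("latin-1")
latin9 = repertoire("iso8859-15")
cp1252 = repertoire("cp1252")

for name, chars in [("ISO 8859-1", latin1), ("ISO 8859-15", latin9), ("Windows-1252", cp1252)]:
    print(f"{name}: {len(chars)} characters, has euro sign: {'€' in chars}")

# Characters Windows-1252 adds on top of Latin 1 (curly quotes, dashes, euro, ...).
print("".join(sorted(cp1252 - latin1)))
```

The resulting sets can then be fed to whatever subsetting tool you use.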
I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font
Any characters that don't exist in the font you are using will fall back to any default font the browser can find with the characters in. This will likely be ugly when interleaved with other characters from your custom font, but modern OSes provide decent font coverage for commonly-used characters from the above blocks so typically it will still be readable.
So you should include characters based on whether you think they'll be used commonly enough that having them rendered in an ugly font is a deal-breaker. For what it's worth, a pretty minimal set I have used before for a similar purpose is ¡£°±²³¿ÉËÑéëñ‘’“”–—•€™, but your site's exact requirements may vary. (For example, if you coöpted the New-Yorker-style diaeresis you would certainly want äëïöü.)
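If you end up with a hand-picked character list like that, you will probably want it as code point ranges, for example for the CSS `unicode-range` descriptor or for a subsetting tool. A minimal sketch in Python (the function names are made up for illustration):

```python
def format_range(start, end):
    """Format one inclusive code point range as a unicode-range token."""
    if start == end:
        return f"U+{start:04X}"
    return f"U+{start:04X}-{end:04X}"


def to_unicode_range(chars):
    """Collapse a collection of characters into comma-separated range tokens."""
    code_points = sorted({ord(c) for c in chars})
    tokens = []
    start = prev = None
    for cp in code_points:
        if start is None:
            start = prev = cp
        elif cp == prev + 1:
            prev = cp  # still inside the current run
        else:
            tokens.append(format_range(start, prev))
            start = prev = cp
    if start is not None:
        tokens.append(format_range(start, prev))
    return ", ".join(tokens)


# Printable ASCII plus the minimal extra set mentioned above.
ascii_printable = "".join(chr(c) for c in range(0x20, 0x7F))
extras = "¡£°±²³¿ÉËÑéëñ‘’“”–—•€™"
print(to_unicode_range(ascii_printable + extras))
```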
(How exactly default fallback fonts work varies between browsers and was famously troublesome in older versions of IE, and IE Mobile. But the basic accented Latin letters are pretty safe.)

Unicode usage in software

I have been troubled for a long time by a question about the usage of Unicode. Unicode speeds up and simplifies software development (in terms of globalization), but I am concerned about the following factors:
increased memory and diskspace usage;
reduction of the text processing performance;
Asian languages treated all alike to the detriment of the national specificities.
The first point is obvious, but I don't know whether the others are true. Has anyone here had to localize software for Asian countries, and would you be willing to share the experience?
At the moment I try to use region-specific encodings (cp1251 for Russia, cp1254 for Turkey, etc.). Can somebody advise on this issue?
The impact on the size of data in bytes depends on the choice of Unicode encoding and on the type of data. For example, using UTF-8 (the only useful Unicode encoding on the web), English text has the same size as in 8-bit encodings, except for typographically correct punctuation marks, which take two or three bytes each (curly quotes and dashes take three); for Turkish text, any non-ASCII letter is 2 bytes instead of 1 byte; for Russian text, any Cyrillic letter is 2 bytes. In most cases, this does not matter much.
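You can measure this yourself. A small Python sketch follows; the sample strings and the helper name `encoded_sizes` are just for illustration.

```python
def encoded_sizes(text, legacy_codec):
    """Return (bytes in the legacy 8-bit codec, bytes in UTF-8) for a string."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


samples = [
    ("English", "plain ASCII text", "cp1252"),
    ("English, curly quotes", "“quoted”", "cp1252"),
    ("Turkish", "çığ öşü", "cp1254"),
    ("Russian", "яйца Фаберже", "cp1251"),
]

for label, text, codec in samples:
    legacy, utf8 = encoded_sizes(text, codec)
    print(f"{label}: {len(text)} chars, {legacy} bytes in {codec}, {utf8} bytes in UTF-8")
```

ASCII text stays the same size, Turkish and Cyrillic letters double, and the curly quotes go from one byte to three.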
Text processing performance depends on what you do and how you do that. The reasonable expectation is that there is no problem worth worrying about. If processing is fast enough, it hardly matters whether it would be 10% faster using an 8-bit encoding.
Unicode unification has its impact, but surely Asian languages are not treated all alike. The Unicode standard has a lot to say about specific treatment of characters in Asian scripts and languages. If you are referring to the different shapes of CJK characters in different languages, then the usual solution is to use fonts designed for the language used. (In addition, it can in principle at least also be handled within a font, when OpenType fonts are used.)
Check out the official Unicode FAQ. It has a lot to say about issues like these.
The first two points are very much negligible. You'd need to have a very specific use case where the difference in size and performance make a discernible difference that justifies the headaches of mixed encodings.
Regarding the Unihan characters: They are grouped by meaning of the character, but that character may be written slightly differently in different writing systems. This is a problem of properly marking up the language, it's not really an encoding problem. In HTML documents, you can mark the document with lang attributes and/or set specific fonts using CSS which will alter the appearance of the character for the language appropriately. How to handle this correctly depends on the type of software (HTML, desktop app, etc). I'd advise you open a new, detailed question about that.
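Increased text size: Yes. Text size may increase up to 4 times with UTF-8, which uses 1 to 4 bytes per code point (the original design allowed up to 6 bytes, but RFC 3629 limits sequences to 4). But storage for text is not a big problem nowadays.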
Reduction of text processing performance: In my opinion, no. A UTF-8 character may take up to 4 bytes, but when scanning through the text, the first byte of each character already tells us how many more bytes to read for it. So scanning most likely stays O(n), where 'n' is the length of the text. To keep the best performance, try not to access the characters in a text by index (yes, this is a weak point for performance). Java strings are not affected by random index access, because a Java string is a sequence of 2-byte UTF-16 code units (although characters outside the Basic Multilingual Plane occupy two of them).
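To illustrate the "first byte tells you the length" point, here is a minimal Python sketch. It assumes well-formed UTF-8 as defined by RFC 3629 (sequences of at most 4 bytes) and does not validate continuation bytes; the function names are made up.

```python
def utf8_sequence_length(lead_byte):
    """Return the length in bytes of the UTF-8 sequence starting with lead_byte."""
    if lead_byte < 0x80:
        return 1  # ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


def count_characters(data):
    """Count code points in well-formed UTF-8 bytes in a single O(n) pass."""
    index = 0
    count = 0
    while index < len(data):
        index += utf8_sequence_length(data[index])  # skip the whole sequence
        count += 1
    return count


text = "i長€😀"
data = text.encode("utf-8")
print(len(data), count_characters(data))  # 11 4
```

The flip side is that finding the n-th character means walking from the start, which is why index-based access is the thing to avoid.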
Asian languages treated all alike to the detriment of the national specificities: Yes, in text form all human languages are treated alike. Whether it is the letter 'i' with a single stroke or the character '長' with 8 strokes, each is just a character.
Increased text size, and all of the following are actually untrue.
They may be true, for old-school encodings of unicode, such as UTF-16. UTF-8 is not larger, or slower than ASCII for ASCII-only strings, and yet it allows encoding every Unicode code point. UTF-8 is also a de-facto standard of doing Unicode on the marketplace today.
There is an extensive analysis of performance of different Unicode encodings in http://www.utf8everywhere.org, including for the Asian languages.

Romanization of Unicode text

I am looking for a way to transliterate Unicode letter characters from any language into accented Latin letters. The intent is to allow foreigners to gain insight into the pronunciation of names and words written in any non-Latin script.
Examples:
Greek:Romanize("Αλφαβητικός") returns "Alphabētikós" (or "Alfavi̱tikós")
Japanese:Romanize("しんばし") returns "shimbashi" (or "sinbasi")
Russian:Romanize("яйца Фаберже") returns "yaytsa Faberzhe" (or "jajca Faberže")
It should ideally support characters in the following scripts: CJK, Indic, Cyrillic, Semitic, and Greek. It should be data driven and extensible, using data from either the Unicode Consortium, the USA, the EU or the UN. The code should be open source, written in .NET or Java.
Does such a library exist?
The problem is a lot more complex than you think.
Greek, Cyrillic, Indic scripts, Georgian -> trivial, you could program that in an hour
Thai, Japanese Kana -> doable with a bit more effort
Japanese Kanji, Chinese -> these are not alphabets/syllabaries, so you're not in fact transliterating, you're looking up the pronunciation of each symbol in a hopefully large dictionary (EDICT and CCDICT should work), and a lot of times you'll get it wrong unless you're also considering the context, especially in Japanese
Korean -> technically an alphabet, but computers can only handle the composed characters, so you need another large database, I'm not aware of any
Arabic, Hebrew -> these languages don't write down short vowels, so a lot of times your transliteration will be something unreadable like "bytlhm" (Bethlehem). I'm not aware of any large databases that map Arabic or Hebrew words to their pronunciation.
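For the "trivial" alphabetic cases at the top of that list, a table lookup really is most of the work. Here is a minimal Python sketch for Russian. The mapping is an informal, made-up-for-illustration romanization table (not any official standard), and `romanize_russian` is a hypothetical name.

```python
# Informal lowercase Cyrillic -> Latin table (an assumption, not a standard).
RUSSIAN_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def romanize_russian(text):
    """Replace each Cyrillic letter via the table; pass everything else through."""
    out = []
    for ch in text:
        latin = RUSSIAN_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)                   # not in the table: keep as-is
        elif ch.isupper():
            out.append(latin.capitalize())   # "Ж" -> "Zh"
        else:
            out.append(latin)
    return "".join(out)


print(romanize_russian("яйца Фаберже"))  # yaytsa Faberzhe
```

The dictionary-driven and vowel-less cases further down the list are exactly where this approach stops working.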
You can use Unidecode Sharp :
[a C#] port of Python Unidecode, which is itself a port of Perl unidecode.
(there are also PHP and Ruby implementations available)
Usage:
using BinaryAnalysis.UnidecodeSharp;
// ...
string _Greek = "Αλφαβητικός";
MessageBox.Show(_Greek.Unidecode());
string _Japan = "しんばし";
MessageBox.Show(_Japan.Unidecode());
string _Russian = "яйца Фаберже";
MessageBox.Show(_Russian.Unidecode());
I hope this works for you.
I am unaware of any open source solution here beyond ICU. If ICU works for you, great. If not, note that I am the CTO of a company that sells a commercial product for this purpose that can deal with the icky cases like Chinese words, Japanese multiple readings, and Arabic incomplete orthography.
The Unicode Common Locale Data Repository has some transliteration mappings you could use.

unicode code table combination to support most languages

I just coded the first version of an efficient glyph-to-texture function which takes ranges of unicode characters to store into one or more pow2 (power-of-two) textures, and I am searching for information regarding which code charts are used in which language. I know that the Unicode Consortium gives this per glyph, but that would take really long to check on my own.
I'd like to support as many European languages as possible; Cyrillic is not a necessity.
Edit: I can use every Latin chart, but I would like to save space by removing some extended charts such as Latin Extended-D. I'm pretty sure that the only extension I need to represent every character in my language's alphabet (Slovenian) is Latin-1 + Latin Extended-A, so I save ~600 characters
thanks
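A check like this is easy to automate per language. Here is a minimal Python sketch, with the block boundaries taken from the Unicode code charts and the 25-letter Slovenian alphabet typed in by hand; `block_of` and `blocks_used` are made-up names.

```python
# Inclusive code point ranges of the Latin blocks under consideration.
BLOCKS = {
    "Basic Latin": (0x0020, 0x007F),
    "Latin-1 Supplement": (0x00A0, 0x00FF),
    "Latin Extended-A": (0x0100, 0x017F),
    "Latin Extended-B": (0x0180, 0x024F),
}


def block_of(ch):
    """Return the name of the block containing ch, or None if it is not listed."""
    cp = ord(ch)
    for name, (start, end) in BLOCKS.items():
        if start <= cp <= end:
            return name
    return None


def blocks_used(text):
    """Return the set of block names needed to cover every character in text."""
    return {block_of(ch) for ch in text}


SLOVENIAN = "abcčdefghijklmnoprsštuvzž"
print(blocks_used(SLOVENIAN + SLOVENIAN.upper()))
```

For the Slovenian letters themselves this reports only Basic Latin and Latin Extended-A (č, š and ž live there); Latin-1 Supplement would still be wanted for punctuation and for letters of other Western European languages.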
This page might be helpful. Scroll down to the bottom for a list of codepoint ranges.
Found out about some lists.