As part of optimizing a web development project, we need to strip out unnecessary characters that are never going to be used to reduce the size of font files. I have searched Google and found nothing canonical on the subject of which characters are required and which are safe to remove.
I've found the following ranges that may be of interest:
0020 — 007F Basic Latin
00A0 — 00FF Latin-1 Supplement
0100 — 017F Latin Extended-A
0180 — 024F Latin Extended-B
0250 — 02AF IPA Extensions
02B0 — 02FF Spacing Modifier Letters
0300 — 036F Combining Diacritical Marks
27C0 — 27EF Miscellaneous Mathematical Symbols-A
It seems that the most aggressive approach would be to only keep "Basic Latin", 0020 — 007F, which provides upper and lower-case letters, numbers and a few basic symbols, like the $, +, (, ), etc.
Latin-1 Supplement contains some extra goodies like Trademark and Copyright symbols and fractions.
Latin Extended-A and -B contain letters with accent marks, and since our copy is in English, I'm not sure if these will ever be needed.
If we use only that ranges (0020 — 007F) and (00A0 — 00FF), will we run into problems down the line with missing characters, should some user decide to post a comment in Spanish (for example)? Or will the browser fall back to a default font for characters that aren't included the web font?
The point of a web-font is to make the main bodies of text and headlines look pretty, which the basic latin set should cover, but I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font, etc.
What range of unicode characters should be kept in a #font-face web
font for a US based website with a US audience? Are there any best practices or guidelines for striping unnecessary characters from a font for web use?
I would recommend subsetting to one of the common "code page" definitions that support US/Western Europe. Most code page definitions pre-date Unicode and typically have the bits and pieces needed for various regional support without including entire Unicode blocks. Suggestions:
Windows Code Page 1252
ISO/IEC 8859-1 "Latin 1"*
ISO/IEC 8859-15
*This is the same as Unicode Ranges 0020-007F Basic Latin + 00A0-00FF Latin-1 Supplement
These include much more than is strictly required for US English, though as noted above, several accented characters commonly appear in English text (é, ñ, as well as other punctuation marks and symbols). These sets include those characters, so you should be in good shape for the vast majority of text for a U.S. audience. Note also that in most fonts, these characters are typically "composites", which means that they use a reference to the components (e.g. 'é' is built from references to 'e' and '´'); as such, they don't normally require as much size to store them, so retaining them usually won't incur a major size penalty.
If you might encounter European financial text, I'd suggest either Windows 1252 or ISO/IEC 8859-15 which include the Euro currency symbol.
I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font
Any characters that don't exist in the font you are using will fall back to any default font the browser can find with the characters in. This will likely be ugly when interleaved with other characters from your custom font, but modern OSes provide decent font coverage for commonly-used characters from the above blocks so typically it will still be readable.
So you should include characters based on whether you think they'll be used commonly enough that having them rendered in an ugly font is a deal-breaker. For what it's worth, a pretty minimal set I have used before for a similar purpose is ¡£°±²³¿ÉËÑéëñ‘’“”–—•€™, but your site's exactly requirements may vary. (For example, if you coöpted the New-Yorker-style diaeresis you would certainly want äëïöü.)
(How exactly default fallback fonts work varies between browsers and was famously troublesome in older versions of IE, and IE Mobile. But the basic accented Latin letters are pretty safe.)
I am a developer who is working with Chinese characters. I am trying to convert part of my project into English. I am currently rewriting the project's internationalization module.
I am unfamiliar with the standards for English, so I don't know if non-ascii is used widely?
If it is: Tell me some characters they use frequently.
Standard English spelling uses en dash (–), curly quotation marks (“, ”, ‘, ’); American English also uses em dash (—). Depending on conventions and preferences, several non-Ascii letters may be used, too, especially in words of French or Latin origin, such as é, ë, ç, and æ. Moreover, even in nonspecialized texts, various special character such as superscript two (²), micro sign (µ), and degree sign (°) may be seen.
I'm giving a tech talk about Unicode and encoding in my company, in which I'm trying to make the point that strings are always encoded, and developers should never carelessly assume that everything is 0-127 ASCII.
I have numerous examples of problems caused by mis-encoded text, but I didn't find any example of simple English text with numbers that's encoded above Unicode code point 127.
The basic English alphabet is mapped in Unicode to the same numerical value as the plain old ASCII: The range A-Z is mapped to [65-90] (or [0x41-0x5a] in hex), and [a-z] is mapped to [97-122] (hex [0x61-0x7a]).
Does the English alphabet appear elsewhere in the code charts? I do not mean circumflex letters or other Latin variants, just the plain English alphabet.
CJK characters are generally monospaced in all fonts, since that's how those languages tend to be written.
When mixing CJK and English characters, however, you run into a problem: ASCII characters do not in general have the width of a CJK character. This means that if you use ASCII, you lose the monospaced property - which may not always be desirable.
For this purpose, fullwidth characters (U+FF00-FFEE, Wikipedia, Unicode code chart) may be used in place of "regular" characters. These have the property that they have the same width as a single CJK character.
Note, however, that fullwidth characters are virtually never used outside of a CJK context, and even in those contexts, plain ASCII is frequently used as well, when monospacing is considered unimportant.
Plenty of punctuation and symbols have code point values above U+007F:
“Hello.”
He had been given the comprehensive sixty-four-crayon Crayola box—including the gold and silver crayons—and would not let me look.
x ≠ y
The above examples use:
U+201C and U+201D — smart quotes
U+2014 — em-dash
U+2260 — not equal to
See the Unicode charts for more.
Well, if you just mean a-z and A-Z then no, there are no English characters above 127. But words like fiancé, resumé etc are sometimes spelled like that in English and use codepoints above 127.
Then there are various punctuation signs, currency symbols and so on that are above 127. Not sure if this counts as simple English text.
Is anyone aware of a way to add diacritics from different unicode blocks to say, latin letters (or latin diacritics to say, Devanagari letters)? For instance:
Oै
I tried the zero-width-joiner in between, but it had no effect. Any ideas?
I know, for instance, that the Arabic combining diacritics will work on latin letters, but Hebrew will not. Is this random?
Accoding to the Unicode Standard, Chapter 2, Section 2.11, “All combining characters can be applied to any base character and can, in principle, be used with any script.” So the Latin letter O followed by the Devanagari vowel sign ai U+0948 is permitted. But the standard adds: “This does not create an obligation on implementations to support all possible combinations equally well. Thus, while application of an Arabic annotation mark to a Han character or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering or to make much sense.”
So it is up to implementations. But there are some “cross-script” diacritics. For example, the acute accent has been unified with the Greek tonos mark, so the Latin letter é and the Greek letter έ, when decomposed, contain the same diacritic U+0301. Moreover, this combining mark can be placed after a Cyrillic letter, and this can be regarded as normal (though relatively rare) usage, so we can expect good implementations to render it properly.
Worked fine for me. I just typed in the characters. Probably depends on the program rendering the text.
Oै
U+4E00..U+9FFF is part of the complete set, but not all
The definitive list can be found at Unicode Character Code Charts; search the page for "CJK".
The "East Asian Script" document does mention:
Blocks Containing Han Ideographs
Han ideographic characters are found in five main blocks of the Unicode Standard, as
shown in Table 18-1
Table 18-1. Blocks Containing Han Ideographs
Block Range Comment
CJK Unified Ideographs 4E00-9FFF Common
CJK Unified Ideographs Extension A 3400-4DBF Rare
CJK Unified Ideographs Extension B 20000-2A6DF Rare, historic
CJK Unified Ideographs Extension C 2A700–2B73F Rare, historic
CJK Unified Ideographs Extension D 2B740–2B81F Uncommon, some in current use
CJK Unified Ideographs Extension E 2B820–2CEAF Rare, historic
CJK Unified Ideographs Extension F 2CEB0–2EBEF Rare, historic
CJK Unified Ideographs Extension G 30000–3134F Rare, historic
CJK Unified Ideographs Extension H 31350–323AF Rare, historic
CJK Compatibility Ideographs F900-FAFF Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement 2F800-2FA1F Unifiable variants
Note: this table is current as of Unicode 15.0. The block ranges can evolve over time: latest is in CJK Unified Ideographs.
There are also
CJK Radicals / Kangxi Radicals 2F00–2FDF
CJK Radicals Supplement 2E80–2EFF
which contain characters which may find their way into regular text, as well as
CJK Symbols and Punctuation 3000–303F
See also Wikipedia:
CJK Unified Ideographs Extension A
CJK Unified Ideographs Extension B
CJK Unified Ideographs Extension C
CJK Unified Ideographs Extension D
CJK Unified Ideographs Extension E
CJK Unified Ideographs Extension F (Unicode 10)
See also Unihan Database (which organizes information relating to the properties of CJK Unified Ideographs)
Unicode currently has 74605 CJK characters. CJK characters not only includes characters used by Chinese, but also Japanese Kanji, Korean Hanja, and Vietnamese Chu Nom. Some CJK characters are not Chinese characters.
1) 20941 characters from the CJK Unified Ideographs block.
Code points U+4E00 to U+9FCC.
U+4E00 - U+62FF
U+6300 - U+77FF
U+7800 - U+8CFF
U+8D00 - U+9FCC
2) 6582 characters from the CJKUI Ext A block.
Code points U+3400 to U+4DB5. Unicode 3.0 (1999).
3) 42711 characters from the CJKUI Ext B block.
Code points U+20000 to U+2A6D6. Unicode 3.1 (2001).
U+20000 - U+215FF
U+21600 - U+230FF
U+23100 - U+245FF
U+24600 - U+260FF
U+26100 - U+275FF
U+27600 - U+290FF
U+29100 - U+2A6DF
3) 4149 characters from the CJKUI Ext C block.
Code points U+2A700 to U+2B734. Unicode 5.2 (2009).
4) 222 characters from the CJKUI Ext D block.
Code points U+2B740 to U+2B81D. Unicode 6.0 (2010).
5) CJKUI Ext E block.
Coming soon
If the above is not spaghetti enough, take a look at known issues. Have fun =)
The exact ranges for Chinese characters (except the extensions) are [\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD].
[\u2e80-\u2fd5]
CJK Radicals Supplement is a Unicode block containing alternative,
often positional, forms of the Kangxi radicals. They are used headers
in dictionary indices and other CJK ideograph collections organized by
radical-stroke.
[\u3190-\u319f]
Kanbun is a Unicode block containing annotation characters used in
Japanese copies of classical Chinese texts, to indicate reading order.
[\u3400-\u4DBF]
CJK Unified Ideographs Extension-A is a Unicode block containing rare
Han ideographs.
[\u4E00-\u9FCC]
CJK Unified Ideographs is a Unicode block containing the most common
CJK ideographs used in modern Chinese and Japanese.
[\uF900-\uFAAD]
CJK Compatibility Ideographs is a Unicode block created to contain Han
characters that were encoded in multiple locations in other
established character encodings, in addition to their CJK Unified
Ideographs assignments, in order to retain round-trip compatibility
between Unicode and those encodings.
For the details please refer to here, and the extensions are provided in other answers.
Unicode version 11.0.0
In Unicode the Chinese, Japanese and Korean (CJK) scripts share a common background, collectively known as CJK characters.
These ranges often contain non-assigned or reserved code points(such as U+2E9A , U+2EF4 - 2EFF),
Chinese characters
bottom top reference (also have a look at wiki page) block name
4E00 9FEF http://www.unicode.org/charts/PDF/U4E00.pdf CJK Unified Ideographs
3400 4DBF http://www.unicode.org/charts/PDF/U3400.pdf CJK Unified Ideographs Extension A
20000 2A6DF http://www.unicode.org/charts/PDF/U20000.pdf CJK Unified Ideographs Extension B
2A700 2B73F http://www.unicode.org/charts/PDF/U2A700.pdf CJK Unified Ideographs Extension C
2B740 2B81F http://www.unicode.org/charts/PDF/U2B740.pdf CJK Unified Ideographs Extension D
2B820 2CEAF http://www.unicode.org/charts/PDF/U2B820.pdf CJK Unified Ideographs Extension E
2CEB0 2EBEF https://www.unicode.org/charts/PDF/U2CEB0.pdf CJK Unified Ideographs Extension F
3007 3007 https://zh.wiktionary.org/wiki/%E3%80%87 in block CJK Symbols and Punctuation
In CJK Unified Ideographs block, I notice many answers use upper bound 9FCC, but U+9FCD(鿍) is indeed a Chinese char. And all characters in this block are Chinese characters (also used in Japanese or Korean etc.).
Most of characters in CJK Unified Ideographs Ext (Except Ext F, only 17% in Ext F are Chinese characters), are traditional Chinese characters, which are rarely used in China.
〇 is the Chinese character form of zero and still in use today
Therefore the range is
[0x3007,0x3007],[0x3400,0x4DBF],[0x4E00,0x9FEF],[0x20000,0x2EBFF]
CJK characters but never used in Chinese
They are Common Han used only for compatibility.
It is almost impossible to see them appear in any Chinese books, articles, writings etc.
All characters here have one corresponding glyph-identical Chinese character,
such as 金(U+F90A) and 金(U+91D1), they are identical glyphs.
F900 FAFF https://www.unicode.org/charts/PDF/UF900.pdf CJK Compatibility Ideographs
2F800 2FA1F https://www.unicode.org/charts/PDF/U2F800.pdf CJK Compatibility Ideographs Supplement
CJK related symbols
2E80 2EFF http://www.unicode.org/charts/PDF/U2E80.pdf CJK Radicals Supplement
2F00 2FDF http://www.unicode.org/charts/PDF/U2F00.pdf Kangxi Radicals
2FF0 2FFF https://unicode.org/charts/PDF/U2FF0.pdf Ideographic Description Character
3000 303F https://www.unicode.org/charts/PDF/U3000.pdf CJK Symbols and Punctuation
3100 312f https://unicode.org/charts/PDF/U3100.pdf Bopomofo
31A0 31BF https://unicode.org/charts/PDF/U31A0.pdf Bopomofo Extended
31C0 31EF http://www.unicode.org/charts/PDF/U31C0.pdf CJK Strokes
3200 32FF https://unicode.org/charts/PDF/U3200.pdf Enclosed CJK Letters and Months
3300 33FF https://unicode.org/charts/PDF/U3300.pdf CJK Compatibility
FE30 FE4F https://www.unicode.org/charts/PDF/UFE30.pdf CJK Compatibility Forms
FF00 FFEF https://www.unicode.org/charts/PDF/UFF00.pdf Halfwidth and Fullwidth Forms
1F200 1F2FF https://www.unicode.org/charts/PDF/U1F200.pdf Enclosed Ideographic Supplement
some blocks such as Hangul Compatibility Jamo are excluded because
of no relation to Chinese.
Kangxi Radicals is not Chinese characters, they are graphical components of Chinese characters, used specially to express radicals,
.e.g. ⼻(U+2F3B) and 彳(U+5F73), ⻜(U+2EDC) and 飞 (U+98DE)
Other common punctuation appearing in Chinese
This is a wide range, some punctuation may be never used, some punctuations such as ……”“ are used so much in Chinese.
0000 007F https://unicode.org/charts/PDF/U0000.pdf C0 Controls and Basic Latin
2000 206F https://unicode.org/charts/PDF/U2000.pdf General Punctuation
……
There are also many Chinese-related symbols, such as Yijing Hexagram Symbols or Kanbun, but it's off-topic anyway. I write non-chinese-characters in CJK to have a better explanation of what Chinese characters are. And the ranges above already cover almost all the characters which appear in Chinese writing except math and other specialty notation.
Supplementary
CJK Symbols and Punctuation
、。〃〄々〆〇〈〉《》「」『』【】〒〓〔〕〖〗〘〙〚〛〜〝〞〟〠〡〢〣〤〥〦〧〨〩〪〭〮〯〫〬〰〱〲〳〴〵〶〷〸〹〺〻〼〽 〾 〿
Halfwidth and Fullwidth Forms
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆。「」、・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワン゙゚ᄀᄁᆪᄂᆬᆭᄃᄄᄅᆰᆱᆲᆳᆴᆵᄚᄆᄇᄈᄡᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ¢£¬ ̄¦¥₩│←↑→↓■○
Refer
https://zh.wikipedia.org/wiki/%E6%B1%89%E5%AD%97 (in chinese
language, notice the right side bar)
https://zh.wikipedia.org/wiki/%E4%B8%AD%E6%97%A5%E9%9F%93%E7%9B%B8%E5%AE%B9%E8%A1%A8%E6%84%8F%E6%96%87%E5%AD%97
(notice the bottom table)
http://www.unicode.org
The Unicode code blocks that the others answers gave certainly cover most of the Chinese Unicode characters, but check out some of these other code blocks, too.
CJK_UNIFIED_IDEOGRAPHS
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_C
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_D
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_E
CJK_COMPATIBILITY
CJK_COMPATIBILITY_FORMS
CJK_COMPATIBILITY_IDEOGRAPHS
CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT
CJK_RADICALS_SUPPLEMENT
CJK_STROKES
CJK_SYMBOLS_AND_PUNCTUATION
ENCLOSED_CJK_LETTERS_AND_MONTHS
ENCLOSED_IDEOGRAPHIC_SUPPLEMENT
KANGXI_RADICALS
IDEOGRAPHIC_DESCRIPTION_CHARACTERS
See my fuller discussion here. And this site is convenient for browsing Unicode.
Unicode continually evolves, with the current goal to have "A new major version of the standard will be released each year. Starting with Unicode 14.0, each of those releases is targeted for the third quarter of each year."
Without a single community wiki that someone regularly updates, if you want to maintain coverage for corrections and additional extensions, to stay up-to-date be sure to also double check the latest standard, always found at: https://www.unicode.org/versions/latest/ And look for the East Asia chapter (unless that one day gets split as well).
As of this initial writing, the latest is v14, and Ch 18 "presents scripts used in East Asia. This includes major writing systems associated with Chinese, Japanese, and Korean. It also includes several scripts for minority languages". The first table reviews Blocks Containing Han Ideographs where we see they've gone up to Extension G:
Block Range Comment
-----------------------------------------------------------
CJK Unified Ideographs 4E00–9FFF Common
CJK Unified Ideographs Extension A 3400–4DBF Rare
CJK Unified Ideographs Extension B 20000–2A6DF Rare, historic
CJK Unified Ideographs Extension C 2A700–2B73F Rare, historic
CJK Unified Ideographs Extension D 2B740–2B81F Uncommon, some in current use
CJK Unified Ideographs Extension E 2B820–2CEAF Rare, historic
CJK Unified Ideographs Extension F 2CEB0–2EBEF Rare, historic
CJK Unified Ideographs Extension G 30000–3134F Rare, historic
CJK Compatibility Ideographs F900–FAFF Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement 2F800–2FA1F Unifiable variants
The second table Small Extensions to CJK Blocks notes additions: "The repertoire in the CJK Unified Ideographs block has subsequently been extended with small sets of unified ideographs or ideographic components needed for interoperability with various standards, or
for other reasons, as shown in Table 18-2", some of which "have involved reserved ranges at the end of other CJK blocks."
For additional related blocks such as punctuation and other syllabaries (including for J+K) which should be more stable, check out that unicode chapter further as well as other answers around here, and https://en.wikipedia.org/wiki/Han_unification#Unicode_ranges. https://blog.miniasp.com/post/2019/01/02/Common-Regex-patterns-for-Unicode-characters has some interesting discussion as well even though it was written in 2019.
For fonts that try to render these, see https://en.wikipedia.org/wiki/List_of_CJK_fonts, but note that coverage information is sparse. You'll have to dig around to see those details, e.g. Adobe/Google's Source Han/Noto fonts don't cover all extensions or compatibility ideographs.
To summarize, it sounds like these are them:
var blocks = [
[0x3400, 0x4DB5],
[0x4E00, 0x62FF],
[0x6300, 0x77FF],
[0x7800, 0x8CFF],
[0x8D00, 0x9FCC],
[0x2e80, 0x2fd5],
[0x3190, 0x319f],
[0x3400, 0x4DBF],
[0x4E00, 0x9FCC],
[0xF900, 0xFAAD],
[0x20000, 0x215FF],
[0x21600, 0x230FF],
[0x23100, 0x245FF],
[0x24600, 0x260FF],
[0x26100, 0x275FF],
[0x27600, 0x290FF],
[0x29100, 0x2A6DF],
[0x2A700, 0x2B734],
[0x2B740, 0x2B81D]
]