Unicode fonts for Japanese - unicode

I am creating a game with some UI text. Recently we wanted to add a Japanese-language version, but I have a problem with fonts. I use stb_freetype to rasterize fonts and I support Unicode, so it should not be a problem. But most fonts don't seem to contain Japanese characters; on Windows I've found that Arial Unicode does. But its size is 26 MB, which is much more than our complete game!
I've seen "Unicode and fonts", but it doesn't cover my questions completely.
So basically I'm asking about 2 things:
Do Japanese fonts come in different typefaces? I mean, Western fonts have serif, sans-serif or more exotic versions. Does this also apply to Asian fonts?
I would probably use a system font rather than shipping such a big file myself. I know how to locate Arial Unicode on Windows, but our game also has versions for Mac OS X, Linux and iOS. Where can I find Unicode fonts (and which ones should I use) on those platforms? I'd be especially interested in Linux, because this is the least familiar platform for us.

most fonts don't seem to contain Japanese characters; on Windows I've found that Arial Unicode does. But its size is 26 MB, which is much more than our complete game!
Arial Unicode contains a lot more than just Japanese. It's also in general not a very good font: it is made to cover a lot of Unicode code points, but it is missing many features needed to actually render some languages properly. Not to mention it is not freely redistributable.
I suggest looking at the free Japanese fonts used by Linux distributions. For example VLGothic is 3.7MB and compresses down to just 2.2MB, which would be much more palatable. See also: Takao, Motoya, Togoshi.
Do Japanese fonts come in different typefaces? I mean, Western fonts have serif, sans-serif or more exotic versions. Does this also apply to Asian fonts?
Certainly. Japanese fonts (and Han-derived fonts in general) vary widely, just as Latin ones do. Generally fonts might be categorised as:
Gothic: typically unstressed, without line-endings, with little sign of the original brushed nature of the characters. Most similar to Latin ‘sans-serif’ fonts; indeed, the name ‘Gothic’ is taken from exactly that tradition.
Often used as default screen fonts as they render well in reduced detail. As well as square-ended Gothic Kaku, there's Gothic Maru which uses rounded features, matching well with Latin rounded sans.
Minchō: has serif-like endings stylised from the brush strokes, and strong vertical stress. Often formal in appearance. Most similar to Latin ‘serif’ fonts, typically paired with a transitional serif design. Often the default Japanese font for word processing, paired with Times New Roman.
Kyōkasho (‘textbook’): formal handwritten style, clear and readable, but less straight-edged than Minchō. Most similar to a legible Latin pen-written script font; might also usefully be paired with a more characterful serif.
Kaisho: traditional brushed style, but still regular and legible, somewhat formal. Not usually so good at low screen resolutions. Might be paired with a semi-serif or brushed script Latin face.
Gyōsho: cursive brushed style, less clear, typically for display purposes. Also Sōsho takes this further, to generally-illegible lengths.
Display fonts. There are some everyday handwritten-like styles, but typically fewer really wacky novelty fonts, presumably because the amount of work involved in creating a font to cover the huge number of common kanji makes it not worth it. You may also find novelty fonts that contain only the kana and Latin (rōmaji) characters, with few or no kanji.

The file size is not the only reason to avoid redistributing Microsoft technology to Linux or Mac!
1. Do Japanese fonts come in different typefaces? I mean, Western fonts have serif, sans-serif or more exotic versions. Does this also apply to Asian fonts?
CJK (Chinese, Japanese, Korean) fonts do have different typefaces. Some fonts are more calligraphic and others are more plain. It is roughly analogous to the difference between sans-serif and serif. (There is no traditional notion of italics in CJK typography, although fonts do come in heavier and lighter weights. I understand that there are other conventions for expressing emphasis, but I don't recall what they are.)
You can see these font differences on Windows by comparing Arial Unicode MS with MingLiU or MS Mincho. The Arial version is very "plain" in both Latin characters and CJK ideographs.
As with Latin characters, I believe the distinction in typefaces is purely for visual appeal, and does not strongly imply any difference in meaning.
I cannot help you with your second question.

Yes, you need to license a font for distribution. You can't simply use the one on your computer. To keep the file size down, you should pull only the specific code pages you need. Arial Unicode typically costs $3500 to embed in a game; however, that is the price for all 50,000 characters.
Source: Monotype Imaging
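To judge how much of a font a game really needs, it can help to list the code points its own text uses. The following is only a minimal sketch: the strings in UI_STRINGS are made-up examples, and the block table covers just a few ranges relevant to Japanese, so anything else is reported as "Other".

```python
# Hypothetical UI strings; a real game would load these from its
# localisation files instead.
UI_STRINGS = ["スタート", "設定", "ゲームをやめる", "Score: 100"]

# A few Unicode blocks relevant to Japanese text (first, last inclusive).
BLOCKS = {
    "Basic Latin": (0x0020, 0x007F),
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
}


def used_code_points(strings):
    """Return the sorted, de-duplicated code points used by all strings."""
    return sorted({ord(ch) for s in strings for ch in s})


def block_of(code_point):
    """Name the block a code point falls in, or "Other" if not listed."""
    for name, (first, last) in BLOCKS.items():
        if first <= code_point <= last:
            return name
    return "Other"


def count_by_block(code_points):
    """Count how many distinct code points fall into each block."""
    counts = {}
    for cp in code_points:
        name = block_of(cp)
        counts[name] = counts.get(name, 0) + 1
    return counts


needed = used_code_points(UI_STRINGS)
summary = count_by_block(needed)
print(len(needed), "distinct code points:", summary)
```

The resulting list is the set of characters that actually has to be rasterized or kept in a subset font; whether subsetting is permitted depends on the font's licence.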

Yes. As a simple example, just look at the differences between MS PGothic and MS PMincho, which should be available if you have Microsoft Word installed.
Unfortunately I have no experience with Unicode fonts on OSX or Linux, so I can't help you there.

Related

Font for "math bold script" unicode charset

I can't believe I have been stuck on this for an hour, but it seems fonts for extended Unicode characters are not easily available as TTF/OTF for use on computers, especially with graphics software where Unicode fallback doesn't work.
Specifically, I am looking for the so-called Math bold script,
something like: 𝓓𝓮𝓶𝓸 𝓯𝓸𝓷𝓽 𝓐𝓑𝓒𝓖𝓟 𝓮𝓻𝓽𝓷𝓭 (<- those are extended chars)
as in https://textfancy.com/font-converter/
as pictured at: https://snipboard.io/fNYd7w.jpg
(because I am not sure we all see the same glyphs)
Note: what I am looking for is a standard TTF font whose normal glyphs look like those extended glyphs, meaning that the A looks like the 𝓐, B like 𝓑, and so on, so that I could use the font as a normal font in any software.
The STIX math fonts support the Unicode Mathematical Alphanumeric Symbols block.
https://www.stixfonts.org/
https://github.com/stipub/stixfonts
(Note: the variable fonts don't include support for that block of characters; only the static fonts do.)
Please note the intended use of those Unicode characters, as pointed out in the STIX project:
The sans serif, fraktur, script, etc., alphabets in Plane 1 (U+1D400-U+1D4FF) are intended to be used only as technical symbols.
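For reference, the relationship between ordinary letters and the bold script characters in the question is a fixed offset: the 26 capitals start at U+1D4D0 and the 26 small letters at U+1D4EA. The sketch below shows that mapping in both directions; the function names are made up for illustration, and this only converts text. It does not produce the remapped font the question asks for.

```python
import unicodedata

BOLD_SCRIPT_CAPITAL_A = 0x1D4D0  # MATHEMATICAL BOLD SCRIPT CAPITAL A
BOLD_SCRIPT_SMALL_A = 0x1D4EA    # MATHEMATICAL BOLD SCRIPT SMALL A


def to_bold_script(text):
    """Replace ASCII letters with their Mathematical Bold Script forms."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_SCRIPT_CAPITAL_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_SCRIPT_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, spaces, punctuation stay as they are
    return "".join(out)


def from_bold_script(text):
    """Map Mathematical Bold Script letters back to plain ASCII letters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if BOLD_SCRIPT_CAPITAL_A <= cp < BOLD_SCRIPT_CAPITAL_A + 26:
            out.append(chr(ord("A") + cp - BOLD_SCRIPT_CAPITAL_A))
        elif BOLD_SCRIPT_SMALL_A <= cp < BOLD_SCRIPT_SMALL_A + 26:
            out.append(chr(ord("a") + cp - BOLD_SCRIPT_SMALL_A))
        else:
            out.append(ch)
    return "".join(out)


sample = to_bold_script("Demo font")
print(sample, [unicodedata.name(c) for c in sample[:2]])
```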

Is the Demotic script represented in Unicode?

Does Unicode have signs for Demotic script? Is there any font containing such signs?
Unicode has assigned 1072 characters for Egyptian hieroglyphs and for Hieratic (which is the parent system of Demotic and the cursive version of hieroglyphs), so I wonder whether there is any Unicode support for Demotic too.
Although Demotic is still not encoded, there are already texts encoded in rich-text documents (using specific fonts).
They are based on the Coptic script, with a few additions for the diacritical Yodh on some letters; this works with some ligatures and slightly modified letter forms, but this is not purely a "hack" because in fact the Coptic script was developed from Demotic (on its cursive form used in Thebes) with the simplified forms from Greek adapted for the Late Ancient Egyptian language (which was then transcribed in the same period and the same area of Thebes with BOTH the Demotic and Coptic scripts; while the Demotic script also coexisted with Hieratic, i.e. the cursive form of the complex hieroglyphs highly simplified).
You can see this here:
https://ucbclassics.dreamhosters.com/djm/demotic.html
This work is the working base for a future encoding of Demotic in Unicode, but many researchers can already use this font (and the keyboard input layout, which is based on Classical Greek with a few modifications) on macOS, Windows and now Linux as well, within several office word processors, and now also on the web (provided the web browsers support OpenType features and webfonts). It still does not allow plain text, but it works using the Coptic encoding (with just a few additional generic diacritics, plain text is possible and even directly readable by Egyptologists).
So the real question is: will Demotic be encoded separately, or will Unicode simply unify it with Coptic, with the few additions needed? Unicode already chose to unify Egyptian hieroglyphs with Egyptian Hieratic, but this is quite controversial, as Hieratic is very far from hieroglyphs (currently encoded in their monumental form carved in stone, which was used with lots of variants over two millennia) and much closer to Demotic.
So maybe Demotic will be encoded separately by Unicode (to avoid breaking the modern Coptic script still used today) but unified with Hieratic (which would be separated from hieroglyphs). This would create a Unicode "Hieratic-Demotic" script, i.e. "Late Egyptian Cursive" (not to be confused with "Egyptian Cursive Hieroglyphs", which are extremely similar to the older monumental hieroglyphs but were developed to be painted on papyrus instead of being carved in monumental stone, so their form is much less angular and somewhat simplified by the speed of drawing with a brush, the lower precision of the brush and the diffusion of ink on papyrus). For now this is not decided. But Egyptologists already have their tools to create documents easily and discuss them, using a rich-text form.
There are other existing fonts. However, most of them are not free. They initially required proprietary rich-text formats, but this is no longer the case with free office suites like LibreOffice and OpenOffice (which can also process MS Office formats, and which support the ODF format in place of the old MS formats). Note that ODF is easily convertible to HTML+CSS, which makes publication on the web possible as well.
Note that for Egyptian Demotic, you need far fewer characters than for Egyptian hieroglyphs and Egyptian Hieratic: using the Coptic set (mostly based on Greek) with a few diacritics (far fewer than those used in Classical Greek!) along with rich text and specific font designs is still the best choice today.
But the most important problem with borrowing the Coptic script for writing Demotic is the directionality (note that this is also a problem inside the Greek script for writing Ancient Greek...)
Also, Unicode still does not support boustrophedon correctly, and it does not provide a suitable model for the layout needed by the hieroglyphs that are already encoded, at the level to which Unicode adapted its model for Hangul square compositions and for the vertical rendering of sinographic scripts! This will also be a problem for other scripts still to be encoded (e.g. SignWriting, or chemical, mathematical and musical notations, all of them having modern use but requiring specific layouts that are still not representable in plain text with Unicode encoding alone).
So you can't do all you want with just Unicode plain text, and you need rich-text formats. A solution may be found with HTML+CSS, then supported by OpenType, long before Unicode decides to do something, or simply resigns itself to doing nothing for a long time (because most modern scripts are encoded, and fewer companies are interested in paying for the development of paleographic scripts and paying their membership to add and work on them), or before there are new proposals to encode complex text layouts better than basic directionality (and syllabic square layouts in Hangul, or Arabic-like and Brahmic conjoining layouts, all of them being fully supported by their specific properties).
Another source you may look at, for a candidate font is
http://paleography.atspace.com/
which introduces this set of 279 paleographic fonts for 30 old scripts, available at:
https://download.cnet.com/Paleofonts/3000-2190_4-10547504.html
or individually at:
https://github.com/reclaimed/paleofonts
(which is where all the archived fonts now reside).
However, this huge set contains only one "Demotic" font (in fact for the "Meroitic Demotic" script, not Egyptian Demotic, and with only partial coverage: just mappings on top of ASCII Latin letters, without the needed diacritics and ligatures). And this legacy font set does not have the quality we expect today: no OpenType features (only TrueType), no or incomplete Unicode mappings, partial coverage, poor metrics, no hinting. The fonts are just small enough to replace fallback fonts that would otherwise display mojibake in Unicode, or to serve legacy texts transliterated to other input scripts.
So many of these paleographic scripts will be developed by community efforts (e.g. within the open-source Noto project, with help from Unicode contributors and other open-source developers to work on them and to find and discuss the rare resources used by paleographers). You'll have to be very patient, or try to develop your own community of interest with the few linguists spread across universities around the world on very small budgets, who often have poor knowledge of the technical requirements for developing modern fonts.
However, there is now a renewal of efforts, because tools for developing fonts are easier and more reliable to use, and just a few people with good contacts (in various working languages) could seriously help develop this support, which many linguists and poor students would appreciate for their work to revive this important human heritage: Egyptian Demotic, with its 2600 years of active use and its real importance for the many cultures with which it has been in contact, is really a big gap we should fill. Unicode is just waiting for proposals, active experimentation and discussion (which should also involve other standards bodies like the W3C for CSS Text, OpenType for font designs, and various OS vendors). Of course, if this development requires encoding additional characters in the UCS so that these scripts can be used in plain text, ISO working groups will be involved too and will need to agree with Unicode (but we know that this can take many years after proposing to encode a new script or to disunify an existing one).

Where are the unicode characters on the disk and what's the mapping process?

Several Unicode-related questions have been confusing me for some time.
For the following reasons, I think Unicode characters exist on disk:
Execute echo "\u6211" in a terminal, and it will print the glyph corresponding to the Unicode code point U+6211.
There's the concept of the UCD (Unicode Character Database), and we can download its latest version: UCD latest
Some newer Unicode characters, like the latest emojis, cannot be displayed on my Mac until I upgrade macOS.
So if the Unicode characters do exist on disk, then:
Where is it ?
How can I upgrade it ?
What's the process of mapping the unicode code point to a glyph ?
If I use a specific font, then what's the process of mapping the unicode code point to a glyph ?
If not, then what's the process of mapping the unicode code point to a glyph ?
It would be much appreciated if someone could shed light on these problems.
Execute echo "\u6211" in a terminal, and it will print the glyph corresponding to the Unicode code point U+6211.
That's echo -e in bash.
› echo "\u6211"
\u6211
› echo -e "\u6211"
我
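What the escape does is turn a code point number into a character, which the terminal then receives as encoded bytes; the glyph only appears once a font is consulted. A minimal Python sketch of the first two steps, using only the standard library:

```python
import unicodedata

ch = "\u6211"                  # same escape as in the shell example
code_point = ord(ch)           # the number Unicode assigns to the character
utf8 = ch.encode("utf-8")      # the bytes actually written to the terminal

print(f"U+{code_point:04X}", unicodedata.name(ch), utf8)
```

Nothing in these values says what the character looks like; that part comes from whichever font the terminal uses.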
Where is it ?
In the font file.
Some newer Unicode characters, like the latest emojis, cannot be displayed on my Mac until I upgrade macOS.
How can I upgrade it ?
Installing/upgrading a suitable font with the emojis should be enough. I don't have macOS, so I cannot verify this.
I use "Noto Color Emoji" version 2.011/20180424, it works fine.
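The UCD mentioned in the question is a separate thing from fonts: it holds properties such as names and categories, not glyphs, and software ships its own copy of it. As one illustration, Python's standard unicodedata module bundles a UCD snapshot and reports its version; the describe helper below is a made-up name for this sketch.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this Python build.
print("UCD version:", unicodedata.unidata_version)


def describe(ch):
    """Return the character's name, or a placeholder if this UCD lacks it."""
    return unicodedata.name(ch, "<not in this Unicode version>")


print(describe("A"))
print(describe("\u0378"))  # a code point with no character assigned
```

So displaying a new emoji needs two things to be recent enough: the software's character data, and a font that has a glyph for it.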
What's the process of mapping the unicode code point to a glyph ?
The application (e.g. text editor) provides the font rendering subsystem (Quartz? on macOS) with Unicode text and a font name. The font renderer analyses the codepoints of the text and decides whether this is simple text (e.g. Latin, Chinese, stand-alone emojis) or complex text (e.g. Latin with many marks, Thai, Arabic, emojis with zero-width joiners). The renderer finds the corresponding outlines in the font file. If the file does not have the required glyph, the renderer may use a similar font, or use a configured fallback font for a poor substitute (white box, black question mark etc.). Then the outlines undergo shaping to compose a complex glyph and line-breaking. Finally, the font renderer hands off the result to the display system.
Apart from the shaping, very little of this has to do with Unicode or encoding. Font rendering already worked that way before Unicode existed; of course, font files and rendering were much simpler 30 years ago. Encoding only matters when someone wants to load or save text from an application.
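The lookup-with-fallback part of this process can be sketched with plain dictionaries standing in for font files. Everything here is a toy: the font names and glyph IDs are invented, and shaping, line-breaking and rasterization are left out entirely. The one real convention used is that glyph 0 is the "missing glyph" in TrueType/OpenType fonts.

```python
# Toy stand-ins for font files: each maps code points to glyph IDs,
# which is the job of a real font's character map. Names and IDs are
# made up for illustration.
FONTS = {
    "LatinOnly": {ord("A"): 1, ord("B"): 2},
    "CJKFallback": {0x6211: 17},
}

NOTDEF = 0  # glyph 0 is conventionally the "missing glyph" box


def pick_glyph(code_point, requested, fallbacks):
    """Try the requested font first, then each fallback font in order."""
    for name in [requested, *fallbacks]:
        glyph_id = FONTS[name].get(code_point)
        if glyph_id is not None:
            return name, glyph_id
    # No font has it: draw the requested font's missing-glyph box.
    return requested, NOTDEF


def lay_out(text, requested, fallbacks):
    """Resolve every character of the text to a (font, glyph ID) pair."""
    return [pick_glyph(ord(ch), requested, fallbacks) for ch in text]


print(lay_out("A\u6211?", "LatinOnly", ["CJKFallback"]))
```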
Summary: investigate
TrueType/OpenType font editing software, so you can see what's contained in the files
font renderers; on Linux, look at the libraries Pango and FreeType.
Generally speaking, operating system components that use text use the Unicode character set. In particular, font files use the Unicode character set. But, not all font files support all the Unicode codepoints.
When a codepoint is not supported by one font, the system might fall back to another that does support it. This is particularly true of web browsers. But ultimately, if the codepoint is not supported, an unfilled rectangle is rendered. (There is no character for that because it's not a character. In fact, if you were able to copy and paste it as text, it should be the original character that couldn't be rendered.)
In web development, the web page can either supply or give the location of fonts that should work for the codepoints it uses.
Other programs typically use the operating system's rendering facilities and therefore the fonts available through it. How to install a font in an operating system is not a programming question (unless you are including a font in an installer for your program). For more information on that, you could see if the question fits with the Ask Different (Apple) Stack Exchange site.

Why isn't there a font that contains all Unicode glyphs?

Pretty much as the title says. Rendering all of Unicode correctly, what with composite characters, characters that affect other characters, and ligatures, is really hard; I understand that. We have fonts that seem to be designed for maximum Unicode symbol support (Symbola, Code2001, others) and specialized fonts for certain planes or character ranges (BabelStone Han, others).
I don't know much about the underlying technical details for fonts. Is there a maximum size? Is it a copyright problem? Is essentially redrawing all ~110,000 extant glyphs too hard? I understand style concerns, but why not fall back to a 'default' font that had glyphs for everything? They're on unicode.org, redrawing them all would be pretty hard work but then you'd have a guaranteed fallback font for everything. If you got rights to some pre-existing fonts you could just composite them and that should help a lot. Such a font would be a great help to humanity and I can't see a good technical reason why it doesn't exist or at least an open-source effort to create it, so I presume an invisible-to-me reason why it can't be done.
What is that reason?
"Why would you even want that?" questions aside, from a programming perspective there's a very simple reason: the OpenType spec only affords an addressable glyph index space of one USHORT, so one font can only support 16 bits' worth of glyph identifiers, or 65,536 glyphs max. (And note the terminology: a "glyph" is not the same as a "character" or "letter".)
The current version of Unicode, v8 as of this answer, contains 120,737 assigned code points, or almost twice as many as fit in a modern font (2021 edit: v13 upped this number to 143,859). In fact, Unicode hasn't been able to fit in a modern OpenType font since 2001, with the release of Unicode 3.1, which upped the number of code points from 49,259 to 94,205.
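The arithmetic behind these numbers is easy to check. The sketch below assumes, as a simplification, exactly one glyph per assigned code point; real fonts usually need extra glyphs for ligatures and alternate forms, so the result is only a lower bound on the number of font files.

```python
import math

MAX_GLYPHS_PER_FONT = 2 ** 16  # glyph IDs are 16-bit, as described above

# Assigned code point counts quoted in the answer.
UNICODE_SIZES = {
    "3.0": 49_259,
    "3.1": 94_205,
    "8.0": 120_737,
    "13.0": 143_859,
}


def min_fonts_needed(code_points):
    """Lower bound on font files, assuming one glyph per code point."""
    return math.ceil(code_points / MAX_GLYPHS_PER_FONT)


for version, size in UNICODE_SIZES.items():
    print(f"Unicode {version}: {size} code points, "
          f"at least {min_fonts_needed(size)} font(s)")
```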
"So what about font collections?" I hear you ask. Why not use multiple fonts and support all of Unicode that way? Well now, you've just described Google's Noto and Adobe's Source families (whose CJK members, Source Han Sans and Noto Sans CJK, are the same typeface).
As for the "how hard can it be": a uniform style for all glyphs in Unicode, across 129 established written scripts on this planet, each with their own typesetting rules? Incredibly hard. You may think fonts are just files with pictures for letters, and someone types a letter, that picture shows up: that is not how fonts work, and isn't how fonts have worked since the late 1980s.
Modern fonts are the typographic equivalent of a game ROM: sure, it's not much use without the hardware or software to run that ROM on, but all the things that actually matter are in the ROM. Similarly, modern fonts contain all the information for typesetting. Not just pictures, they contain the metadata, the metrics, the positioning and substitution rules for arbitrary sequences, with separate rule sets for each written script that OpenType supports, mandatory and optional ligatures, language-specific character replacements for letters at the start/middle/final position in a word, or in isolation, character repositioning relative to arbitrarily complex sequences of other characters either before or after it, arbitrarily complex sequence replacements with other arbitrarily complex sequences, possible bitmap fallbacks for small-point rendering, hinting instructions on how to properly rasterize vector graphics that are inherently not aligned to any particular pixel grid, and more. A modern font is a ridiculously complex application that a font engine consults to figure out how to typeset sequences of code points.
Making a (set of) Unicode-encompassing font(s) that looks good for all contexts is a vast team effort.
So: "Why isn't there a font that contains all Unicode glyphs?", because that's been technically impossible since 2001. We can, and do, make font families that cover all of Unicode, but with 129 different scripts all with their own typesetting rules, it's a lot of work, and almost (almost) not worth the effort compared to only covering a subset of all languages.
And as for this:
Such a font would be a great help to humanity and I can't see a good technical reason why it doesn't exist or at least an open-source effort to create it, so I presume an invisible-to-me reason why it can't be done.
Just because you didn't know about them, doesn't mean they don't exist, with millions of people who are familiar with them. They exist =)
They're even open source, go out and thank the people who made them!
There is GNU Unifont. It aims to contain all Unicode, except Apple Emoji.
You will probably find what you are looking for at the following links.
Unicode Character Table
HTML Character Entity References
Huge List of Unicode Symbols
List of Unicode Characters of Category “Other Symbol”
This other is funny for particular character since you can draw what you search:
Unicode Character Recognition
Can't enter unicode character with Alt+ even with EnableHexNumpad
Basic Questions
Q: How many characters are in Unicode?
A: The short answer is that as of Version 13.0, the Unicode Standard contains 143,859 characters. The long answer is rather more complicated, because of all the different kinds of characters that people might be interested in counting.
Unicode font
A Unicode font is a computer font that maps glyphs to code points defined in the Unicode Standard. The vast majority of modern computer fonts use Unicode mappings, even those fonts which only include glyphs for a single writing system, or even only support the basic Latin alphabet.
Fonts which support a wide range of Unicode scripts and Unicode symbols are sometimes referred to as "pan-Unicode fonts", although as the maximum number of glyphs that can be defined in a TrueType font is restricted to 65,535, it is not possible for a single font to provide individual glyphs for all defined Unicode characters (143,859 characters, with Unicode 13.0).
...
No single "Unicode font" includes all the characters defined in the present revision of ISO 10646 (Unicode) standard, as more and more languages and characters are continually added to it, and common font formats cannot contain more than 65,535 glyphs (about half the number of characters encoded in Unicode).
As a result, font developers and foundries incorporate new characters in newer versions or revisions of a font, or in separate auxiliary fonts intended specifically for particular languages.
Enjoy!

What range of Unicode characters should be kept in a @font-face web font for a US-based website with a US audience?

As part of optimizing a web development project, we need to strip out unnecessary characters that are never going to be used to reduce the size of font files. I have searched Google and found nothing canonical on the subject of which characters are required and which are safe to remove.
I've found the following ranges that may be of interest:
0020 — 007F Basic Latin
00A0 — 00FF Latin-1 Supplement
0100 — 017F Latin Extended-A
0180 — 024F Latin Extended-B
0250 — 02AF IPA Extensions
02B0 — 02FF Spacing Modifier Letters
0300 — 036F Combining Diacritical Marks
27C0 — 27EF Miscellaneous Mathematical Symbols-A
It seems that the most aggressive approach would be to only keep "Basic Latin", 0020 — 007F, which provides upper and lower-case letters, numbers and a few basic symbols, like the $, +, (, ), etc.
Latin-1 Supplement contains some extra goodies like the copyright and registered-trademark symbols and fractions (the ™ sign itself is U+2122, which lies outside this block).
Latin Extended-A and -B contain letters with accent marks, and since our copy is in English, I'm not sure if these will ever be needed.
If we use only those ranges (0020 — 007F) and (00A0 — 00FF), will we run into problems down the line with missing characters, should some user decide to post a comment in Spanish (for example)? Or will the browser fall back to a default font for characters that aren't included in the web font?
The point of a web font is to make the main bodies of text and headlines look pretty, which the Basic Latin set should cover, but I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font, etc.
What range of Unicode characters should be kept in a @font-face web font for a US-based website with a US audience? Are there any best practices or guidelines for stripping unnecessary characters from a font for web use?
I would recommend subsetting to one of the common "code page" definitions that support US/Western Europe. Most code page definitions pre-date Unicode and typically have the bits and pieces needed for various regional support without including entire Unicode blocks. Suggestions:
Windows Code Page 1252
ISO/IEC 8859-1 "Latin 1"*
ISO/IEC 8859-15
*This is the same as Unicode Ranges 0020-007F Basic Latin + 00A0-00FF Latin-1 Supplement
These include much more than is strictly required for US English, though as noted above, several accented characters commonly appear in English text (é, ñ, as well as other punctuation marks and symbols). These sets include those characters, so you should be in good shape for the vast majority of text for a U.S. audience. Note also that in most fonts, these characters are typically "composites", which means that they use a reference to the components (e.g. 'é' is built from references to 'e' and '´'); as such, they don't normally require as much size to store them, so retaining them usually won't incur a major size penalty.
If you might encounter European financial text, I'd suggest either Windows 1252 or ISO/IEC 8859-15 which include the Euro currency symbol.
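Both claims here (that ISO 8859-1 lines up with the two Unicode ranges, and that only the other two code pages include the Euro sign) can be checked with the codecs in Python's standard library. This is just a quick sketch; the encodable helper is a made-up name.

```python
# Printable ISO 8859-1 bytes: 0x20-0x7E and 0xA0-0xFF.
printable = bytes(range(0x20, 0x7F)) + bytes(range(0xA0, 0x100))
decoded = printable.decode("latin-1")

# In ISO 8859-1, every byte value equals its Unicode code point.
same_as_unicode = all(ord(ch) == b for ch, b in zip(decoded, printable))
print("latin-1 bytes match Unicode code points:", same_as_unicode)


def encodable(ch, codec):
    """True if the character exists in the given code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for codec in ("latin-1", "cp1252", "iso8859-15"):
    print(codec, "has the Euro sign:", encodable("\u20ac", codec))
```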
I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font
Any characters that don't exist in the font you are using will fall back to any default font the browser can find with the characters in. This will likely be ugly when interleaved with other characters from your custom font, but modern OSes provide decent font coverage for commonly-used characters from the above blocks so typically it will still be readable.
So you should include characters based on whether you think they'll be used commonly enough that having them rendered in an ugly font is a deal-breaker. For what it's worth, a pretty minimal set I have used before for a similar purpose is ¡£°±²³¿ÉËÑéëñ‘’“”–—•€™, but your site's exact requirements may vary. (For example, if you coöpted the New-Yorker-style diaeresis you would certainly want äëïöü.)
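To see which characters of a given text would be left to the browser's fallback font, one can test them against the ranges kept in the subset. The sketch below uses the two ranges from the question as an assumed subset and runs the minimal set above through it; the function names are invented for illustration.

```python
# Ranges assumed to be kept in the subsetted web font (from the question):
# Basic Latin and Latin-1 Supplement, both inclusive.
KEPT_RANGES = [(0x0020, 0x007F), (0x00A0, 0x00FF)]

MINIMAL_SET = "¡£°±²³¿ÉËÑéëñ‘’“”–—•€™"


def is_kept(ch):
    """True if the character's code point lies in one of the kept ranges."""
    return any(first <= ord(ch) <= last for first, last in KEPT_RANGES)


def missing_from_subset(text):
    """Distinct characters that would need a fallback font, sorted."""
    return sorted({ch for ch in text if not is_kept(ch)})


for ch in missing_from_subset(MINIMAL_SET):
    print(f"U+{ord(ch):04X} {ch}")
```

The curly quotes, dashes, bullet, Euro and trademark signs are reported as missing, since they live outside the two Latin blocks; the accented letters are all covered.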
(How exactly default fallback fonts work varies between browsers and was famously troublesome in older versions of IE, and IE Mobile. But the basic accented Latin letters are pretty safe.)