Dynamically generating Ge'ez unicodes - unicode
Hi. If you look at the image above, you will see a set of very weird-looking characters displayed along with some Latin characters. The weird ones are Eritrean characters. They are the characters we use in my country. So, to go straight to the point, I am hoping to create even the simplest possible bit of software, or maybe even a batch file (if possible), to help me make these characters usable on the web and make PCs understand and display them when they are typed, just as Arabic, Hindi, Chinese... characters are used. I think that, since the question of 'creating a language' is rare, or because I may not know the correct term to use, when I searched the internet for a tutorial, a freelancer or anything else, all I got was... nothing. So if anyone can give me a step-by-step guide, or even just a clue about how to create this, that would be very helpful.
Thanks.
Your question asks "how to create a language", so I will describe all the pieces that need to be in place for a new language (or more accurately, writing system). You ask specifically about the Eritrean alphabet, so I will provide specific examples of how that is supported on modern systems, and try to provide you pointers for the pieces you are missing. The answer is long, and provides lots of links, to support the two explanations.
To work with a script like Ge'ez (also known as Ethiopic, the script used to write Amharic in Ethiopia and Tigrinya in Eritrea) you need a few things. The first is a way to encode the characters; a set of numbers representing each character, that the computer can use to represent the text. Luckily, Unicode has become widespread, and Unicode is designed to be a universal character set that includes all of the world's languages. Unicode 3.0 introduced Ethiopic in the range U+1200-U+137F, and later versions added supplements of more obscure characters in the ranges U+1380-U+139F, U+2D80-U+2DDF and U+AB00-U+AB2F. If you wanted to support a language that Unicode didn't yet support, you would either need to use the private use area and define your own mapping of characters to code points, or submit a proposal to have your script added to Unicode; for example, see the proposal for Ethiopic.
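As a small illustration of what "a set of numbers representing each character" means in practice, here is a minimal Python sketch that reports which Ethiopic block a character falls in. The function name `ethiopic_block` and the table layout are my own choices for illustration; the block boundaries are the ones in the published Unicode block list.

```python
import unicodedata

# Ethiopic blocks (start, end, block name), per the Unicode block list.
ETHIOPIC_BLOCKS = [
    (0x1200, 0x137F, "Ethiopic"),
    (0x1380, 0x139F, "Ethiopic Supplement"),
    (0x2D80, 0x2DDF, "Ethiopic Extended"),
    (0xAB00, 0xAB2F, "Ethiopic Extended-A"),
]


def ethiopic_block(ch):
    """Return the name of the Ethiopic block containing ch, or None."""
    cp = ord(ch)
    for start, end, name in ETHIOPIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None


if __name__ == "__main__":
    for ch in "ሀA":
        # unicodedata.name gives the official character name.
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", ethiopic_block(ch))
```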
Now, Unicode is just a character set; an abstract mapping between characters and numbers. To actually transmit these characters as a sequence of bytes, you use a character encoding. There are many encodings; some of them, like ASCII and ISO-8859-1 only cover a subset of the full Unicode character set, while others, like UTF-8 and UTF-16, cover the full range. For documents on the web, UTF-8 is the recommended character encoding; you should never use anything else if you can help it. In UTF-8, you can write Ge'ez directly in the document, for example: ኤርትራ. One thing to watch out for is that some programs (especially on Windows) will offer you "Unicode" as an encoding, when they mean UTF-16; you want to make sure to choose UTF-8, as it's more efficient and more compatible with a wider variety of software.
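To make the difference between the character set and its encodings concrete, here is a rough Python sketch that encodes the same text several ways. The sample strings are just illustrations. Note that bare Ethiopic text takes three bytes per character in UTF-8 and two in UTF-16; the size advantage of UTF-8 shows up once the text is mixed with ASCII markup, as it normally is on a web page.

```python
word = "ኤርትራ"            # four Ethiopic characters
page = "<p>ኤርትራ</p>"     # the same word inside ASCII markup

for label, text in (("word", word), ("page", page)):
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark
    print(label, len(text), "chars:", len(utf8), "bytes UTF-8,",
          len(utf16), "bytes UTF-16")

# An encoding that covers only a subset of Unicode cannot represent the word.
try:
    word.encode("ascii")
except UnicodeEncodeError as err:
    print("ASCII cannot encode it:", err.reason)
```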
If you are using encodings that don't cover the full range of Unicode, or you don't have a good way to type those characters, and you are writing HTML or XML, you can use numeric character references instead. To do this, you write the Unicode code point of the character you want to refer to between &# and ;. You can write the number in decimal, or in hexadecimal prefixed with an x. For example, ሀ can be written &#4608; or &#x1200; (the semicolon at the end is important; it wasn't working for you in the comments because you were missing it).
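If you want to generate such references rather than look them up by hand, a short Python sketch like the following would do it. The helper name `to_ncr` is hypothetical; `html.unescape` from the standard library is used only to check that the references decode back to the original text.

```python
import html


def to_ncr(text, hexadecimal=False):
    """Replace every non-ASCII character with a numeric character reference."""
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)                   # ASCII stays as it is
        elif hexadecimal:
            pieces.append(f"&#x{ord(ch):X};")   # e.g. &#x1200;
        else:
            pieces.append(f"&#{ord(ch)};")      # e.g. &#4608;
    return "".join(pieces)


if __name__ == "__main__":
    print(to_ncr("ሀ"))                          # decimal form
    print(to_ncr("ሀ", hexadecimal=True))        # hexadecimal form
    print(html.unescape(to_ncr("ሀ")) == "ሀ")    # round trip
```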
Now that you have a character set, and a way of encoding it, you need a way to display it. Some scripts are easier to display than others. For all scripts, you need a font; a file defining how each character looks. A font contains a collection of glyphs, or drawings of each character. Some scripts, like the Latin alphabet (the alphabet used for English and most European languages) are relatively simple; each character is a separate glyph, and how they are drawn doesn't depend on what characters come before or after (though diacritics and ligatures can make it a little more complicated). Others, like Arabic and Indic scripts, are written in cursive, where letters join to each other so how they are drawn can depend on the characters near them. These languages require special rendering support like Uniscribe or DirectWrite on Windows, Pango on Linux, or advanced font technology like Apple Advanced Typography or Graphite.
Luckily, Ge'ez is a fairly simple writing system that doesn't require any specialized rendering support or advanced font systems. Each of the characters is a separate glyph, and it doesn't require any reordering. So a normal OpenType font, displayed with the rendering systems already available on most computers, will do the job. But you still need the font in order to be able to display the characters. To create your own font, you can use FontForge (a free/open source tool), Fontographer, FontLab Studio, or other similar software.
For Ethiopic, you don't need to create your own. There are numerous fonts available that include the Ethiopic characters, but one that I would recommend is Abyssinica SIL from SIL (the Summer Institute of Linguistics), which does a lot of great work for minority languages and writing systems. Their fonts are available under a free license, that allows you to use the font, redistribute the font, and modify the font, so their fonts are quite flexible and can be used in a wide variety of situations. Windows ships with Nyala, which includes Ethiopic characters, since Windows Vista, and Ebrima, which added support for Ethiopic characters in Windows 8; so people on Windows Vista or later should be able to view Ethiopic characters already. Mac OS X ships with Kefa as of 10.6.
Once you have the font, you will be able to view Ethiopic characters. But other people reading your documents might not have those fonts (if they are using an older version of Windows or Mac OS X, if they didn't install all of the fonts that came with Windows, or the like), in which case the characters will probably show up as boxes or question marks on their machine. You could give those people a redistributable font like Abyssinica SIL, or they could buy a font that includes Ethiopic characters, but that can be inconvenient. For working with word processor documents or plain text, that's probably the best you can do; they will need the font installed on their computer to be able to display the text. If you create a PDF on your computer, it should embed the fonts that it needs to display the text, so creating a PDF can be a convenient way to include uncommon fonts with your document.
On a web page, you can use web fonts to link to a font from your stylesheet, allowing the user's web browser to load that font for that web page. Web fonts are supported all the way back to IE 4, and in recent versions of most other web browsers, so they are actually quite widely supported. Different web browsers support different font file formats (EOT, TTF, OpenType, SVG, and WOFF), and slightly different syntaxes for the CSS (older versions of IE are based on an older draft), so it can be a bit tricky to make a page that is compatible with all browsers. Luckily, people have automated that process. Some web fonts are available online from Google Web Fonts or FontSquirrel, but sadly, I couldn't find any Ethiopic fonts already hosted. However, you can upload a font to FontSquirrel, and it will convert it into all of the major formats, and provide example CSS that will work on all modern browsers. Note that you should only do this with fonts that allow web embedding; not all fonts do. Since Abyssinica SIL is available under the Open Font License, you can use it, and I've run it through FontSquirrel for you; you can see how it works (check out the Glyphs & Languages tab), or download the kit. To use it, just put the font files (.ttf, .eot, .svg, and .woff) on your server in the same directory as your CSS, and include the following in your CSS:
@font-face {
font-family: 'abyssinica_silregular';
src: url('abyssinicasil-r.eot');
src: url('abyssinicasil-r.eot?#iefix') format('embedded-opentype'),
url('abyssinicasil-r.woff') format('woff'),
url('abyssinicasil-r.ttf') format('truetype'),
url('abyssinicasil-r.svg#abyssinica_silregular') format('svg');
font-weight: normal;
font-style: normal;
}
Now that you know how to encode Ethiopic, view Ethiopic characters, and share documents containing Ethiopic characters, you are probably going to want to type them into documents. If you are using HTML, you could just type the numeric character reference described above. In other documents, you could just copy and paste the characters from a chart of all of them, like the Wikipedia page. But that would become pretty cumbersome. Depending on your system and settings, you can also use Unicode Hex Input to enter arbitrary Unicode characters, but that is also cumbersome.
To fully support typing a script on your computer, you need a keyboard layout or input method. Some scripts can be typed with a simple keyboard layout, which says which keys correspond to which characters. If a script has more characters than there are keys on the keyboard, Shift and Alt (or Option on the Mac) can be used to map to more characters. Dead keys can also be used to expand the range of characters that you type; dead keys are sequences of two or more keystrokes that produce a single glyph; for example, on Mac OS X, to type "á", you can type Option-E A. To create a keyboard layout on Windows, you can use the Microsoft Keyboard Layout Creator. Mac OS X uses an XML format for keyboard layouts, so you can create one directly, or use Ukelele from SIL to create one more easily. On systems using X11 (like Linux), you can create your own XKB layouts.
If you need more characters than can be supported with modifiers and dead keys, like typing Chinese or Japanese, then you need a full-fledged input method. An input method allows you to run arbitrary code to map what someone types into the text it produces; for example, in a Japanese input method, you may type a phonetic representation of what you are writing, and it will show you a drop down list of possible characters that match that representation, allowing you to choose the appropriate ones. Windows provides the Input Method Manager for writing input methods, Mac OS X the Input Method Kit, and X11 has a few ways to do it, such as SCIM and iBus.
The standard input method for Ethiopic makes extensive use of dead keys. It looks like the most popular existing input method for Ethiopic is Keyman, which is a commercial input method that works on Mac and Windows, and in addition there's a free variant, KMFL, that works on Linux. SIL has keyboard downloads for this input method; they also have a keyboard layout for Mac OS X which uses dead keys to achieve the same thing. Mac OS X has more extensive dead key support, so it doesn't require an input method to support this form of input, while on Windows you need to use an input method like Keyman to be able to enter input this way. Google has a free input method for Windows, Google Input Tools for Windows, which supports Amharic, and allows you to customize its input schemes; you could try adapting their Amharic support for Tigrinya.
If you just need to support input on a web site, you could do this in JavaScript, by writing an input method in JavaScript that transliterates from what someone types into Ethiopic. I do not know of any existing frameworks for doing this; however, I have found Korean and Japanese input methods implemented in JavaScript. You could take a look at how those are implemented. Upon looking further, I've found that Tavultesoft, who make Keyman, also have KeymanWeb, a JavaScript based input method that you can buy and embed in your site. MediaWiki also has an input method extension Narayam, that includes a JavaScript based input method for MediaWiki based sites like Wikipedia, which includes an experimental Amharic input method. There is also a draft W3C IME API, which helps provide an interface between web apps and native IMEs, as well as JavaScript based IMEs. Given that it's still a draft, I don't know if it is yet supported anywhere.
With all the above (a character set, encoding, fonts, rendering support, and an input method), you will be able to create, share, and view documents in your script. If that's all you need, great; the above will allow you to work with documents in a given script. But for full support for a language on your computer, not just its script or writing system, there are two more pieces that you need: a locale, and your software to be localized (translated and adapted) for your language.
A locale specifies how programs should manipulate text in a given script, language, culture, and/or encoding. There are many common text processing operations that programs do: displaying numbers, displaying dates and times, sorting strings or names, and so on. How these should work can differ based on the language, script, and culture of the person using the program; for instance, in Swedish "ü" is sorted along with "y", while in English and German it's sorted along with "u". Differences may not be based on language: both Mexico and Spain use Spanish, but in Mexico numbers are displayed with . as the decimal separator (1½ is written "1.5"), while in Spain , is used as the decimal separator (1½ is written "1,5"). A locale specifies all of these rules. Because the locale can vary based on language, culture, and sometimes other factors, the language and country are usually used to specify the locale, and other information can be used as well.
The most widely used standard for naming locales is RFC 4646 (BCP 47). Locales are usually specified as "ln-CC" with the language code ln and country code CC: US English is en-US, British English is en-GB, and French in France is fr-FR. If more information needs to be specified, it can be included. For instance, Serbian can be written with either Latin or Cyrillic, and so Serbian in Serbia can be either sr-Latn-RS or sr-Cyrl-RS. Tigrinya in Eritrea is written ti-ER.
There are a variety of different formats for defining the rules that a particular locale has. Windows uses NLP files, a custom format that can be created with Microsoft Locale Builder. POSIX (Unix/Linux) locales can be created using localedef. Many systems these days are moving towards the Unicode Common Locale Data Repository (CLDR), which specifies a standardized format for locale data as well as a comprehensive database of locales for many of the world's languages. ICU is a library for C and Java (and used by many other environments) for manipulating Unicode text according to Unicode rules and locale data; they have a good browser for the data from the CLDR and their own locale data. For example, take a look at their entry for ti-ER.
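To show how such tags break down, here is a deliberately simplified Python sketch that splits a tag into language, script and region. The names `LocaleTag` and `parse_tag` are hypothetical, and the sketch handles only the common "language[-Script][-REGION]" shape, not the full BCP 47 grammar (variants, extensions and private-use subtags are ignored).

```python
from typing import NamedTuple, Optional


class LocaleTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag):
    """Split a simple locale tag into (language, script, region)."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        # A script subtag is four letters and comes before the region.
        if len(part) == 4 and part.isalpha() and script is None and region is None:
            script = part.title()
        # A region subtag is two letters or three digits.
        elif region is None and (
            (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit())
        ):
            region = part.upper()
    return LocaleTag(language, script, region)


if __name__ == "__main__":
    for tag in ("ti-ER", "fr-FR", "sr-Latn-RS"):
        print(tag, "->", parse_tag(tag))
```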
Finally, for full support of a language, you need to translate the software itself into that language. There are, of course, many pieces of software, and each one contains many strings that need to be translated. Some software is not designed to be translated; it has not been internationalized. Some software can only be translated by whoever created it; the strings are built into the program and cannot be easily modified by a third party. But it is possible to localize some software, translating it to your language and culture. If the software has already been localized for several other languages and cultures, it is likely to be flexible enough to support a new language, and if it uses formats that are easily modifiable for localization information, it can be modified by third parties.
For instance, applications on Mac OS X store their localization data in separate files within the application bundle. There is a tool called AppleGlot (you need to register for the Mac Developer Program and go to the downloads area to find it) which can help you extract that data, provide a file with all of the strings which need to be translated, and allow you to combine the translations with the application again once you have translated them. For open source software, such as much software available on Linux, you can work with the developers to provide translation. Some software uses gettext for translation strings, which uses the PO file format that you can edit using poedit. Some uses Qt, for which you can use Qt Linguist. Or for dealing with a wide variety of formats, you can use a commercial offering like Swordfish or Transifex.
Of course, no one person can do all of the above; it takes many people working together to build support for a new language on modern computer systems. This is all intended to be a high-level tour of all of the components that go into language support for a given language, with references that will help you follow up on whichever aspects you would like to work on, as well as demonstrate what already works for Tigrinya and the Ge'ez script.
If they are Unicode characters they should be displayable just like characters of any other language. I googled it and found this, hopefully they're the same ones you're asking about:
የ ዩ ዪ ያ ዬ ይ ዮ
ዸ ዺ ዻ ዼ ዽ ዾ
See? No extra work required to display them on web browsers or other programs.
These are characters from the Unicode Ethiopic set (U+1200..U+137C), encoded in UTF-8:
Line 1:
የ = 0xE1 0x8B 0xA8 = U+12E8 = ETHIOPIC SYLLABLE YA
ዩ = 0xE1 0x8B 0xA9 = U+12E9 = ETHIOPIC SYLLABLE YU
ዪ = 0xE1 0x8B 0xAA = U+12EA = ETHIOPIC SYLLABLE YI
ያ = 0xE1 0x8B 0xAB = U+12EB = ETHIOPIC SYLLABLE YAA
ዬ = 0xE1 0x8B 0xAC = U+12EC = ETHIOPIC SYLLABLE YEE
ይ = 0xE1 0x8B 0xAD = U+12ED = ETHIOPIC SYLLABLE YE
ዮ = 0xE1 0x8B 0xAE = U+12EE = ETHIOPIC SYLLABLE YO
Line 2:
ዸ = 0xE1 0x8B 0xB8 = U+12F8 = ETHIOPIC SYLLABLE DDA
ዺ = 0xE1 0x8B 0xBA = U+12FA = ETHIOPIC SYLLABLE DDI
ዻ = 0xE1 0x8B 0xBB = U+12FB = ETHIOPIC SYLLABLE DDAA
ዼ = 0xE1 0x8B 0xBC = U+12FC = ETHIOPIC SYLLABLE DDEE
ዽ = 0xE1 0x8B 0xBD = U+12FD = ETHIOPIC SYLLABLE DDE
ዾ = 0xE1 0x8B 0xBE = U+12FE = ETHIOPIC SYLLABLE DDO
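A table like the one above does not have to be written by hand. The following Python sketch (the function name `describe` is my own) produces a line in the same format for any character, using the standard `unicodedata` module for the character name.

```python
import unicodedata


def describe(ch):
    """Format one character as: char = UTF-8 bytes = code point = name."""
    utf8 = " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))
    return f"{ch} = {utf8} = U+{ord(ch):04X} = {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in "የዩዪያዬይዮ":
        print(describe(ch))
```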
Using Ethiopian characters on web pages is mostly a matter of fonts these days. (You may also have a problem with entering them conveniently, but this depends on your authoring environment.) People using e.g. Windows 7 have at least one font containing them, but old computers typically lack such fonts. The following fonts contain them (there may be others):
Code 2000, was freeware, the author has disappeared, so the status is obscure
Unifont, a free bitmap font
FreeSerif, a free font
Nyala, distributed with some versions of Windows
SunExt-A, a free font
Fixedsys Excelsior, a free bitmap font I suppose (haven’t tested)
I would probably use FreeSerif as a downloadable font, with @font-face.
I just came across the same problem, but there is an easy solution: Google now provides web fonts for many languages, including Ethiopic:
http://www.google.com/fonts/earlyaccess
To write Amharic or Tigrigna in web forms you can simply use the Any Key Firefox add-on https://addons.mozilla.org/en-US/firefox/addon/any-key/ and there is one for Chrome too!
But to create an editor using JavaScript, you can look at this site http://www.lexilogos.com/keyboard/amharic.htm and try to figure out how they implemented it.
You probably want to look at
http://senamirmir.org/
which unless I am wrong has done what you want to do.
If you don't like their fonts SIL Abyssinica should be fine too (but it only includes one writing style).
The layout status will vary from system to system, to target *nix like systems you need a layout merged in
http://www.freedesktop.org/wiki/Software/XKeyboardConfig/
@Samaya, by now you have probably got the answer you were looking for, but let me add what I think. Based on your original question, I think you are trying to develop a small piece of software that can be selected as a utility (as a feature) and used to display Ge'ez letters without the need to install a separate Ge'ez application. For that, I reckon, the utility should be developed so that it can be selected as a language feature in an operating system (like Amharic in Windows, for instance). However, your subsequent comments seem to focus more on displaying Ge'ez characters on the web. As many have suggested, we already have that functionality. But if you still want to develop an application for it, I would suggest keeping an array of Unicode characters (U+1260 በ, for instance) and a matching array of keyboard transcriptions of your choice ("be" for በ, for instance). Your application would then look up what is typed in the transcription array and use the matching Unicode character to show the right Ge'ez letter. I am not sure I fully understood what you're looking for, but my colleagues and I did a project that included this type of work for a particular application. By the way, do you have to install Ge'ez software to view a website written in Tigrigna/Ge'ez? If so, check your browser version.
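The two-array idea can be sketched in a few lines of Python. This is only an illustration under a loud assumption: the Latin keys here are derived mechanically from the Unicode character names ("ETHIOPIC SYLLABLE BA" gives the key "ba"), which is not an established keyboard scheme and differs from the "be" for በ convention mentioned above. The names `build_table` and `transliterate` are hypothetical. Matching is greedy, always trying the longest key first.

```python
import unicodedata

PREFIX = "ETHIOPIC SYLLABLE "


def build_table():
    """Map lowercase Unicode name suffixes (e.g. 'ba') to Ethiopic syllables."""
    table = {}
    for cp in range(0x1200, 0x1380):
        name = unicodedata.name(chr(cp), "")   # "" for unassigned code points
        suffix = name[len(PREFIX):]
        # Skip multi-word names such as "GLOTTAL A" to keep keys simple.
        if name.startswith(PREFIX) and " " not in suffix:
            table[suffix.lower()] = chr(cp)
    return table


TABLE = build_table()
MAX_KEY = max(len(key) for key in TABLE)


def transliterate(text):
    """Greedy longest-match transliteration; unknown input passes through."""
    out = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_KEY, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += size
                break
        else:
            out.append(text[i])   # no key matched at this position
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("ba yayo"))
```

A real input method would also need to handle editing (for example replacing the previous character when another vowel key arrives), which this sketch leaves out.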
Related
Is the Demotic script represented in Unicode?
Does Unicode have signs for Demotic script? Is there any font containing such signs? The Unicode has assigned 1072 characters for Egyptian hieroglyphs and for Hieratic (which is the parent system for Demotic and the cursive version of hieroglyphs) - so I wonder if there is any Unicode support for Demotic too
Although Demotic is still not encoded, there are already texts encoded in rich-text documents (using specific fonts). They are based on the Coptic script, with a few additions for the diacritical Yodh on some letters; this works with some ligatures and slightly modified letter forms, but this is not purely a "hack" because in fact the Coptic script was developed from Demotic (in its cursive form used in Thebes) with the simplified forms from Greek adapted for the Late Ancient Egyptian language (which was then transcribed in the same period and the same area of Thebes with BOTH the Demotic and Coptic scripts; while the Demotic script also coexisted with Hieratic, i.e. the highly simplified cursive form of the complex hieroglyphs). You can see this here: https://ucbclassics.dreamhosters.com/djm/demotic.html This work is the working base for a future encoding of Demotic in Unicode, but many researchers can use this font (and the keyboard input layout, which is based on Classical Greek, with a few modifications) on MacOS, Windows and now Linux as well, within several office word processors, and now on the web too (provided the web browsers support OpenType features and web fonts). It still does not allow plain text, but this works, using the Coptic encoding (with just a few additional generic diacritics, plain text is possible and even directly readable by Egyptologists). So the real question is: will Demotic be encoded separately, or will Unicode just unify it with Coptic with the few additions needed? Unicode already chose to unify Egyptian hieroglyphs with Egyptian Hieratic, but this is quite controversial, as Hieratic is very far from hieroglyphs (currently encoded in their monumental form carved on stone, which was used with lots of variants over 2 millennia), and much nearer to Demotic.
So maybe Demotic will be encoded separately by Unicode (to avoid breaking the modern Coptic script still used today) but unified with Hieratic (which will be separated from Hieroglyphs). This would create a Unicode "Hieratic-Demotic" script, i.e. "Late Egyptian Cursive" (not to be confused with "Egyptian Cursive Hieroglyphs", which are extremely similar to the older monumental Hieroglyphs, but were developed to be painted on papyrus instead of being carved in monumental stones, so their form is much less angular and a bit simplified by the speed of drawing with a brush, as well as by the lower precision of the brush and the diffusion of ink on papyrus). For now it is not decided. But Egyptologists already have their tools to create documents easily and discuss them... using a rich-text form. There are other existing fonts. However, most of them are not free. They initially required proprietary rich-text formats, but this is no longer the case with free office suites like LibreOffice and OpenOffice (which can also process MS Office formats, all supporting the ODF format as well as the old MS formats). Note that ODF is easily convertible to HTML+CSS: this makes publication on the web possible as well. Note that for Egyptian Demotic, you need far fewer characters than for Egyptian Hieroglyphs and Egyptian Hieratic: using the Coptic set (mostly based on Greek) with a few diacritics (far fewer than those used in Classical Greek!) along with rich text and specific font designs is still the best choice today. But the most important problem with borrowing the Coptic script for writing Demotic is the directionality (note that this is also a problem inside the Greek script for writing Ancient Greek...)
Also, Unicode still does not support boustrophedon correctly and does not provide a suitable model for the layout needed for the hieroglyphs that are encoded, at the level at which Unicode adapted its model for Hangul square compositions and for the vertical rendering of sinographic scripts! This will also be a problem for other scripts still to be encoded (e.g. SignWriting, or chemical and mathematical notations, or musical notations; all of them having modern use but requiring specific layouts that are still not representable in plain text with just Unicode encoding alone). So you can't do all you want with just Unicode plain text, and you need rich-text formats: a solution may be found with HTML+CSS, then supported by OpenType, long before Unicode decides to do something, or just resigns itself to doing nothing for a long time (because most modern scripts are encoded and there are fewer companies interested in paying for the development of paleographic scripts, and paying their membership to add them and work on them), or there are some new proposals to encode complex text layouts better than just basic directionality (and syllabic square layouts in Hangul, or Arabic-like and Brahmic conjoining layouts, all of them being fully supported by their specific properties)! Another source you may look at for a candidate font is http://paleography.atspace.com/ which introduces this set of 279 paleographic fonts for 30 old scripts, available at: https://download.cnet.com/Paleofonts/3000-2190_4-10547504.html or individually at: https://github.com/reclaimed/paleofonts (which is where all the archived fonts now reside). However, this huge set only contains one "Demotic" font (in fact for the "Meroitic Demotic" script, not the Egyptian Demotic, which has partial coverage with just mappings on top of ASCII Latin letters and not the needed diacritics and necessary ligatures).
And this legacy font set does not have the quality that we find today: no OpenType features (only TrueType), no or incomplete Unicode mappings, partial coverage, poor metrics, no hinting: they are just small enough to replace fallback fonts that would just display mojibake in Unicode, or for legacy texts transliterated to other input scripts. So many of these paleographic scripts will be developed by community efforts (e.g. within the Noto open source project, and with the help of Unicode contributors and other open source developers to work on them and to find and discuss the rare resources used by paleographers). You'll have to be very patient or try to develop your own community of interest with the rare linguists spread across universities around the world with very small budgets, who often have poor knowledge of the technical requirements for developing modern fonts. However, there's now a renewal of efforts, because tools to develop fonts are easier and more reliable to use, and just a few persons with good contacts (in various working languages) could seriously help develop this support, which many linguists and poor students would appreciate for their work to revive this important human heritage: Egyptian Demotic, with its 2600 years of active use and its real importance for many cultures with which it has been in contact, is really a big gap we should fill. Unicode is just waiting for proposals and active experimentation and talks (which should also involve other standards bodies like W3C for CSS Text, and OpenType for font designs, and various OS vendors). Of course, if this development requires encoding additional characters in the UCS for usage of these scripts in plain text, ISO working groups will be involved too and will need to agree with Unicode (but we know that this can take many years after proposing the encoding of new scripts or the disunification of any existing script).
Why Julia returns "\uf8ff" when I use (Apple logo) unicode?
I thought Julia supports raw unicode input, such as:
julia> test = "π£¢∞§"
"π£¢∞§"
julia> 😘 = 1 ;
julia> print(😘 )
1
However, it seems Julia does not support (Apple logo).
julia> = 123
ERROR: syntax: invalid character ""
julia> test = ""
"\uf8ff"
I wonder what's the underlying reason for that, and whether there is a way I can use the character in Julia?
I believe this link more properly explains the case of the unicode character that you see as Apple's logo. The problem is that the unicode value used is one of several that are set aside for private use. That means that each operating system, or application, or implementation is free to use those unicode characters for anything they want. It just so happens that Apple has chosen to use unicode character U+F8FF (decimal value 63743, or on the web as either &#63743; or &#xF8FF;) as the Apple logo. But some Windows fonts put in a Windows logo. And some other fonts put in a Klingon Mummification glyph. Or elven script. Or anything they want. And if it isn't defined in your local font, you'll just see a square. My opinion is that Julia simply doesn't use this special value for anything. This also explains why your "π£¢∞§" characters work nicely - they are proper unicode characters, more widely supported by different platforms. As a side note, I too see a simple square instead of the Apple logo in this instance. Edit: Here is a list of unicode characters supported by Julia.
To expand on Alex's answer... Apple's logo isn't an official Unicode symbol. I think there are very few commercial logos and symbols in the main Unicode tables. However, Unicode provides some 'anything goes' areas (called PUAs - private use areas) that companies and individuals can fill with their own symbols, so that their users can access certain special glyphs. The main PUA is U+E000 to U+F8FF. Depending on which font you're using, you'll find all kinds of stuff assigned to these codes. On a Mac, I can usually get the Apple logo at "\uf8ff", with the right font selected, but not the Ubuntu symbol or the Windows logo, unless I choose another font. (There's also a fallback mechanism, whereby if you request a code point that the current font doesn't have, the OS will find a suitable substitute in another font and use that.) In Julia, you can only use certain Unicode characters for variable names. Julia wouldn't allow anything from the private use area anyway, unless some fonts were distributed to every computer and everyone agreed on who had which Unicode point. (Mathematica makes extensive use of PUA symbols in their notebooks, because they can and do install their own fonts, and can then access various glyphs from the PUA in the notebook with guaranteed results.) You are allowed to use emoji characters as variable names, so you could try the Emoji apple, rather than the Apple apple.
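You can check the private-use status of a code point yourself. The sketch below is in Python rather than Julia, and it says nothing about Julia's own identifier rules; it only shows that the Unicode database gives U+F8FF the general category "Co" (private use) and no name, whereas π is an ordinary named letter. The helper name `is_private_use` is my own.

```python
import unicodedata


def is_private_use(ch):
    """True if the Unicode general category of ch is Co (private use)."""
    return unicodedata.category(ch) == "Co"


if __name__ == "__main__":
    for ch in ("\uf8ff", "π"):
        print(
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch, "<no name>"),   # PUA code points have no name
            "private use" if is_private_use(ch) else "assigned character",
        )
    # Python's identifier rules happen to reject the PUA character as well.
    print("\uf8ff".isidentifier(), "π".isidentifier())
```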
How UTF8/Unicode adapt to new writing systems?
An example to clarify my question: The Hongkongers' native language is Cantonese, however, we all write in a different language: Madarin Chinese. Two languages are kindof similar, and Hongkongers are educated to write in Madarin Chinese language. Cantonese doesn't have a writing system. Though we are still happy with Madarin as our writing language, however, in case one day Hongkongers decided to develop a 'Cantonese script' which contains not-yet-existing characters, how should UTF8/Unicode/fonts change, to adapt these new characters? I mean, who will change the UTF8/Unicode/fonts standard? How exactly Linux/Windows OS have to be modified, in order to display these newly created characters? (The example is just to make my question clear. We're not talking about politics ;D )
The Unicode coding space has over 1,000,000 code points, and only about 10% of them have been allocated, so there is a lot of room for new characters (even though some areas of the coding space have been set apart for uses other than added characters). The Unicode Consortium, working in close cooperation with the relevant body at ISO, assigns code points to new characters on the basis of proposals that demonstrate actual usage or, in some cases, plans with a solid basis and widespread support. Thus, if a new script were designed and there was a large community that would seriously use it, it would be added, with its characters, into Unicode after due proposals and discussion. It would then be up to font manufacturers to add glyphs for such characters. This might take a long time, but if there is strong enough need, new fonts and enhancements to existing fonts would emerge. No change to UTF-8 or other Unicode transfer encodings would be needed. They already encode the entire coding space, whether code points are assigned to characters or not. Rendering software would need no modifications, unless there are some specialties in the writing system. Normal characters would be rendered just fine, as soon as suitable fonts are available. However, if the characters added were outside the Basic Multilingual Plane (BMP), the “16-bit subset of Unicode”, both rendering and processing (and input) would be problematic. Many programming languages and programs effectively treat Unicode as if it were a 16-bit code and run into problems (possibly solvable, but still) when characters outside the BMP are used. If the writing system had, say, 10,000 characters, it is quite possible that it would have to be allocated outside the BMP.
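To make the point about transfer encodings and the BMP concrete, here is a minimal Python sketch. The helper name `describe` and the sample characters are my own illustrative choices, not anything from the answer above. It shows that UTF-8 and UTF-16 can already serialize any code point, including one that (as far as I know) is currently unassigned, and that code points beyond the BMP need two UTF-16 code units, which is where "16-bit" software tends to trip.

```python
# Sketch: how code points inside and outside the BMP are serialized.
# The sample characters below are illustrative assumptions.

def describe(ch: str) -> dict:
    """Summarize how a single character is encoded."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "in_bmp": cp <= 0xFFFF,                              # BMP is U+0000..U+FFFF
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-be")) // 2,     # 2 bytes per code unit
    }

samples = [
    "A",           # ASCII
    "\u1200",      # Ethiopic syllable HA, inside the BMP
    "\u4e2d",      # a common CJK ideograph, inside the BMP
    "\U00020000",  # CJK Extension B, outside the BMP
    chr(0x50000),  # plane 5: believed unassigned, yet it still encodes
]

for ch in samples:
    print(describe(ch))
```

The last sample is the interesting one for the question: nothing in the encoding layer cares whether a code point has a character assigned to it yet.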
The Unicode committee adds new characters as they see fit. Then fonts add support for the new characters. Operating systems should not require changes simply to display the new characters. Typing the characters would generally require updates or plug-ins to an operating system's input methods.
Using unicode / utf-8 in programmers editors
There are a lot of programmers editors that claim to support unicode / utf-8. I've tried a number of them (UltraEdit, jedit, emedit) but none of them tell you how to actually enter unicode characters into a file. Some of them tell you how to change the default file encoding to utf-8 or how to select a font that has good support for utf-8, but not how to enter utf-8 into a file using their editor. The Go language (and some others) support utf-8 and I like the idea of using the actual utf-8 symbols for variables instead of variables with names like omega. I haven't found a programmers editor yet that actually allows you to do this, though. The only editor / word processor that I've found that lets you enter unicode is Microsoft Word. Type the unicode value and Alt+X and Word converts it. To get the Greek letter omega, type "03c9" followed by Alt+X. UltraEdit will let you copy utf-8 from a web page into it, but their docs don't say how to actually enter utf-8 in a file, and their tech support people don't know either. This should be simple, but seems to be completely undocumented. Is there some key combination convention that lets you enter unicode into these editors that supposedly support unicode, the way that Ctrl-F is widely used for search? Thanks.
The standard programmer’s editor vim(1) supports Unicode input even if your operating system should be too broken to do so (are there any such, still?). Just enter ^VuXXXX, where XXXX represents exactly four hex digits. That will allow you to enter the ~6% of the Unicode code space allocated to the Basic Multilingual Plane. For code points beyond the BMP, Vim also accepts ^VUXXXXXXXX, with a capital U and up to eight hex digits. Otherwise, just use your mouse.
A few techniques I use if an editor is lacking: use the Windows charmap.exe utility to select characters and paste them into a document; install an input method editor (IME) to write in a particular language; or use Windows Alt key codes.
Better to set your keyboard to generate Unicode characters across all Windows applications than to rely on a single application's custom input feature IMO. Use the EnableHexNumpad feature and you can type any character in the Basic Multilingual Plane using Alt+numpad-plus,hexcode. (May not be of much use on a laptop without a numpad though.) Or if there are particular characters you want to type a lot, find a keyboard layout that allows you to type them directly. For example eurokb might cover it, or you can make your own with MSKLC.
Old question, but you can type a lot of unicode in GNU Emacs or Vim. GNU Emacs: M-x set-input-method RET TeX (or C-x RET C-\ TeX) will let you type \omega to generate ω. Vim: digraphs can generate unicode; C-k w * in insert mode gives you ω.
deceze hit the nail on the head. (S)he just didn't elaborate. bobince gave a bit more. And I'm hazarding a guess that you're a developer or tester working on L10N or I18N. I'm also guessing you need to do more than just a few characters here or there, or you'd be satisfied with pasting from another app. So, I'll share some advice. (note: here, "you" refers to the next person to look here. I'm sure the original poster doesn't care anymore by now. :-)) If you're on Windows 10, install an appropriate keyboard driver that lets you input the characters you want into any application. I'm sure Linux has support for the same sort of thing. E.g. I'm teaching myself Hindi (हिंदी), so I installed Windows' Hindi (Devanagari) support. I typed "Hindi" in Hindi using that support, then I switched back to US English to do the rest of this post. If all you need are accented characters from Western European languages, you can install the INTL English support and type directly in español or français or whatever. Don't look at entering Unicode characters as entering some sort of special data amidst your English text. It's just someone else's language. Use their keyboard. Type their language. I'm writing a flashcard app to help my learning. I'm using the Hindi keyboard support to type characters into Word, WordPad, Excel, and the Visual Studio editor. And that Hindi keyboard support works exactly the same way in all of those apps, as I'd expect it to work in just about any text editor that supports Unicode. And as you saw above, it also works in a simple text edit control in Chrome. No copy and paste. No remembering special codes. It's as ubiquitous as ctrl-F.
It looks like the unicode support in programmers editors (except for some Microsoft products) is mostly read-only. They can open a file with unicode and display the characters, but typing unicode into a file is a different story. If you want to enter unicode in a programmers editor you can copy it from somewhere else (a web page or Microsoft Word or Notepad) and paste it into the editor, but the editors make typing unicode difficult or impossible. UltraEdit tech support referred me to this web page which explains a lot. Unfortunately none of the solutions worked with UltraEdit. Microsoft Word and Notepad support unicode entry. Type the unicode value followed by Alt+X and it converts the hexadecimal and displays it. You can then copy and paste it into UltraEdit or one of the other programmers editors. As others have mentioned unicode support depends on support within the operating system as well as the editor. What got me interested in using unicode in source code files is Mark Summerfield's book Programming in Go. He includes an example .go file that uses unicode. It would be great to use unicode Greek characters for variable names instead of variables named "omega" or "theta". Using unicode in source code is a bad idea, however. Support for unicode in programmers editors is lousy, and developers would have to save or convert their source code files to utf-8 instead of ASCII. Developer's tools are just not ready to write code in unicode no matter how neat the idea sounds.
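For what it's worth, the Alt+X trick mentioned above is just a hexadecimal code point lookup, and at least some languages accept the resulting letters as identifiers. A minimal Python 3 sketch follows; the helper name `from_hex` and the variable names are hypothetical and only for illustration. It assumes the source file is saved as UTF-8, which is Python 3's default source encoding. Whether your editor makes typing these comfortable is a separate matter.

```python
import unicodedata

def from_hex(code: str) -> str:
    """Convert a hex code point string such as '03c9' into its character."""
    return chr(int(code, 16))

omega = from_hex("03c9")
print(omega, unicodedata.name(omega))

# Python 3 accepts letters such as these in identifiers (see PEP 3131).
ω = 2.0   # hypothetical angular frequency
θ = 0.5   # hypothetical angle
phase = ω * θ
print(phase)
```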
Understanding the terms - Character Encodings, Fonts, Glyphs
I am trying to understand this stuff so that I can effectively work on internationalizing a project at work. I have just started and would very much like to know from your expertise whether I've understood these concepts correctly. So far, here is the dumbed-down version (for my understanding) of what I've gathered from the web: Character Encodings -> Sets of rules that tell the OS how to store characters, e.g. ISO8859-1, MSWIN1252, UTF-8, UCS-2, UTF-16. These rules are also called Code Pages/Character Sets, which map individual characters to numbers. Apparently unicode handles this a bit differently than others, i.e. instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs. [ http://www.joelonsoftware.com/articles/Unicode.html ] Fonts -> These are implementations of character encodings. They are files of different formats (True Type, Open Type, Post Script) that contain a mapping for each character in an encoding to a number. Glyphs -> These are visual representations of characters stored in the font files. And based on the above understanding I have the below questions. 1) For the OS to understand an encoding, should it be installed separately? Or would installing a font that supports an encoding suffice? Is it okay to compare an encoding to a protocol, say TCP, used in a network, as it is just a set of rules? (Which of course raises the question: how does the OS understand these network protocols when I do not install them :-p) 2) Will a font always have the complete implementation of a code page or just part of it? Is there a tool that I can use to see each character in a font (.TTF file)? [Windows font viewer shows what a style of the font looks like but doesn't give information regarding the list of characters in the font file] 3) Does a font file support multiple encodings? Is there a way to know which encoding(s) a font supports?
I apologize for asking too many questions, but I had these in my mind for some time and I couldn't find any site that is simple enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.
If you want to learn more, of course I can point you to some resources: Unicode, writing systems, etc. The best source of information would probably be this book by Jukka: Unicode Explained. If you were to follow the link, you'd also find these books: CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail, but to me it seems quite hard to read. Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic. Internationalization: If you want to learn about i18n, I can mention countless resources. But let's start with a book that will save you a great deal of time (you won't become an i18n expert overnight, you know): Developing International Software - it might be 8 years old but it is still worth every cent you're going to spend on it. The programming examples may be Windows-specific (C++ and .Net), but the i18n and L10n knowledge is really there. A colleague of mine said once that it saved him about 2 years of learning. As far as I can tell, he wasn't overstating. You might be interested in some blogs or web sites on the topic: Sorting it all out - Michael Kaplan's blog, often on i18n support on the Windows platform. Global by design - John Yunker is actively posting bits of i18n knowledge to this site. Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff. Java Internationalization: I am afraid that I am not aware of many up-to-date resources on that topic (publicly available ones, that is). The only current resource I know is the Java Internationalization trail. Unfortunately, it is fairly incomplete. JavaScript Internationalization: If you are developing web applications, you probably also need something related to i18n in js. Unfortunately, the support is rather poor, but there are a few libraries which help deal with the problem.
The most notable examples would be Dojo Toolkit and Globalize. The former is a bit heavy, although it supports many aspects of i18n; the latter is lightweight, but unfortunately a lot is missing. If you choose to use Globalize, you might be interested in Jukka's latest book: Going Global with JavaScript & Globalize.js - I read this and as far as I can tell, it is great. It doesn't cover the topics you were originally asking about, but it is still worth reading, even just for the hands-on examples of how to use Globalize.
Apparently unicode handles this a bit differently than others. ie., instead of a direct mapping from a number(code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs. In the Unicode Character Encoding Model, there are 4 levels: Abstract Character Repertoire (ACR) — The set of characters to be encoded. Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points. Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units. Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes. For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES. In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code. 1) For the OS to understand an encoding, should it be installed separately?. The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level. 2)Will a font always have the complete implementation of a code page or just part of it?. In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.
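The 𝄞 example above can be checked with a few lines of Python. This is only a sketch using the standard `str.encode` codecs; the helper `utf16_code_units` is a name I made up for illustration.

```python
# Walk the character 𝄞 (MUSICAL SYMBOL G CLEF) through the encoding levels.
ch = "\U0001D11E"

# CCS: character -> integer code point
code_point = ord(ch)

def utf16_code_units(s: str) -> list[int]:
    """CEF: split the UTF-16 form of s into 16-bit code units."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

units = utf16_code_units(ch)

# CES: code units -> serialized bytes, here little-endian without a BOM
le_bytes = ch.encode("utf-16-le")

print(f"U+{code_point:X}")               # U+1D11E
print([f"{u:04X}" for u in units])       # ['D834', 'DD1E']
print(le_bytes.hex(" ").upper())         # 34 D8 1E DD
print(ch.encode("utf-8").hex(" ").upper())  # F0 9D 84 9E
```

The last line adds the UTF-8 form for comparison: a different encoding form and scheme for the same code point.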
I'll provide you with short answers to your questions. 1) It's generally not the OS that supports an encoding but the applications. Encodings are used to convert a stream of bytes to a sequence of characters. For example, in C#, reading UTF-8 data as a string converts it to UTF-16, because C# strings are UTF-16 internally regardless of the source encoding. When you want to, for example, print a string from a foreign encoding, it is converted to UTF-16 first, and then the corresponding characters are looked up in the character tables (fonts) to show the glyphs. 2) I don't recall ever seeing a complete font. I don't have much experience working with fonts either, so I cannot give you an answer for this one. 3) The answer to this one is in #1, but a short summary: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font encoding you'll be fine. Bonus answer: On "how does the OS understand network protocols it doesn't know?": again it's not the OS that handles them but the application. As long as the OS knows where to redirect the traffic (which application) it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, to allow the OS to know where to send the data. This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!
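To illustrate the "bytes in, characters out" step from answer 1, here is a small Python sketch. Python is used only for illustration; its internal string representation differs from the C# UTF-16 strings described above, and the byte strings are my own sample data. The point is that the application picks the decoder, and the same text comes out no matter which encoding the bytes arrived in, provided the right decoder is chosen.

```python
# The same text, "café", stored under three different encodings.
raw_inputs = {
    "latin-1":   b"caf\xe9",
    "utf-8":     b"caf\xc3\xa9",
    "utf-16-le": b"c\x00a\x00f\x00\xe9\x00",
}

# Decoding with the matching codec yields identical strings.
decoded = {enc: data.decode(enc) for enc, data in raw_inputs.items()}
for enc, text in decoded.items():
    print(enc, repr(text))

# Decoding with the wrong codec does not fail here; it silently
# produces different characters (classic mojibake).
wrong = raw_inputs["utf-8"].decode("latin-1")
print(repr(wrong))
```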