Is there such a thing as an "Encoding Language"? - encoding

I know that there are at least two types of coding languages: markup and programming. HTML is an example of the former, Python an example of the latter.
Is there such a thing as an encoding language? An example of this could be Unicode.
Here's a concept tree I made to help illustrate my point:

Unicode and ASCII are character sets, not languages; they only define the set of symbols you can use and display.
For the other two (Markup and Programming Languages), it depends on your definition of "language". Maybe this is interesting for you: formal languages
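If it helps to make the "character set, not language" point concrete, here is a tiny Python sketch (my own illustration, not standard terminology): a character set is just a numbered inventory of symbols, with no syntax or semantics of its own.
print(ord("A"))                      # 65 - both ASCII and Unicode assign 'A' the number 65
print(chr(0x20AC))                   # € - Unicode code point U+20AC (not in ASCII at all)
print(chr(65) + chr(66) + chr(67))   # ABC - just symbols being looked up, not a program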

Related

Understanding the terms - Character Encodings, Fonts, Glyphs

I am trying to understand this stuff so that I can effectively work on internationalizing a project at work. I have just started and would very much like to know from your expertise whether I've understood these concepts correctly. So far, here is the dumbed-down version (for my understanding) of what I've gathered from the web:
Character Encodings -> Sets of rules that tell the OS how to store characters. E.g., ISO8859-1, MSWIN1252, UTF-8, UCS-2, UTF-16. These rules are also called code pages/character sets, which map individual characters to numbers. Apparently Unicode handles this a bit differently than the others, i.e., instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs. [ http://www.joelonsoftware.com/articles/Unicode.html ]
Fonts -> These are implementations of character encodings. They are files of different formats (TrueType, OpenType, PostScript) that contain a mapping for each character in an encoding to a number.
Glyphs -> These are the visual representations of characters stored in the font files.
And based on the above understanding I have the below questions,
1) For the OS to understand an encoding, should it be installed separately? Or would installing a font that supports an encoding suffice? Is it okay to use the analogy of a protocol, say TCP used in a network, to an encoding, as it is just a set of rules? (Which of course begs the question: how does the OS understand those network protocols when I do not install them? :-p)
2) Will a font always have the complete implementation of a code page, or just part of it? Is there a tool that I can use to see each character in a font (.TTF file)? [Windows Font Viewer shows what a style of the font looks like, but doesn't give information regarding the list of characters in the font file.]
3) Does a font file support multiple encodings? Is there a way to know which encoding(s) a font supports?
I apologize for asking too many questions, but I had these in my mind for some time and I couldn't find any site that is simple enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.
If you want to learn more, of course I can point you to some resources:
Unicode, writing systems, etc.
The best source of information would probably be this book by Jukka:
Unicode Explained
If you were to follow the link, you'd also find these books:
CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail but to me it seems quite hard to read.
Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic.
Internationalization
If you want to learn about i18n, I can mention countless resources. But let's start with a book that will save you a great deal of time (you won't become an i18n expert overnight, you know):
Developing International Software - it might be 8 years old, but it is still worth every cent you're going to spend on it. The programming examples may be Windows-specific (C++ and .NET), but the i18n and L10n knowledge is really there. A colleague of mine once said that it saved him about two years of learning. As far as I can tell, he wasn't overstating.
You might be interested in some blogs or web sites on the topic:
Sorting it all out - Michael Kaplan's blog, often on i18n support on Windows platform
Global by design - John Yunker is actively posting bits of i18n knowledge to this site
Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff.
Java Internationalization
I am afraid that I am not aware of many up-to-date resources on that topic (publicly available ones, that is). The only current resource I know is the Java Internationalization trail. Unfortunately, it is fairly incomplete.
JavaScript Internationalization
If you are developing web applications, you probably also need something related to i18n in JavaScript. Unfortunately, the support is rather poor, but there are a few libraries that help deal with the problem. The most notable examples would be the Dojo Toolkit and Globalize.
The former is a bit heavy, although it supports many aspects of i18n; the latter is lightweight, but unfortunately quite a lot is missing. If you choose to use Globalize, you might be interested in Jukka's latest book:
Going Global with JavaScript & Globalize.js - I have read it and, as far as I can tell, it is great. It doesn't cover the topics you were originally asking about, but it is still worth reading, even just for the hands-on examples of how to use Globalize.
Apparently unicode handles this a bit differently than others. ie., instead of a direct mapping from a number(code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs.
In the Unicode Character Encoding Model, there are 4 levels:
Abstract Character Repertoire (ACR) — The set of characters to be encoded.
Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points.
Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units.
Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes.
For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES.
In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code.
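To make the four levels concrete, here is a small Python sketch of the 𝄞 example above (Python is just an assumption here; the hex separator argument needs Python 3.8 or later):
ch = "\U0001D11E"                       # the abstract character U+1D11E, chosen from the repertoire (ACR)
print(hex(ord(ch)))                     # 0x1d11e - the CCS maps the character to this code point
print(ch.encode("utf-16-be").hex(" "))  # d8 34 dd 1e - the UTF-16 CEF yields the code units D834 DD1E
print(ch.encode("utf-16-le").hex(" "))  # 34 d8 1e dd - the UTF-16LE CES serializes them as these bytes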
1) For the OS to understand an encoding, should it be installed separately?
The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level.
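For example, here is a minimal sketch of application-level conversion. It uses Python's built-in codecs rather than ICU or libiconv, but the idea is the same: decode the bytes you received from whatever encoding they are in into an internal text representation, then re-encode as needed.
data = "déjà vu".encode("iso-8859-1")   # bytes that arrived in a legacy encoding
text = data.decode("iso-8859-1")        # decode at the application level
print(text.encode("utf-8"))             # re-encode for whatever the OS or file format expects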
2) Will a font always have the complete implementation of a code page, or just part of it?
In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.
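If you want to inspect a particular .TTF yourself (your question 2), one option is the third-party fontTools library; the sketch below is just an illustration with a placeholder file name, not the only tool for the job.
from fontTools.ttLib import TTFont      # pip install fonttools

font = TTFont("SomeFont.ttf")           # hypothetical path to the font you want to inspect
cmap = font.getBestCmap()               # dict mapping Unicode code points to glyph names
print(len(cmap), "characters mapped")
for cp in sorted(cmap)[:10]:            # show the first few mapped characters
    print(hex(cp), chr(cp), cmap[cp])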
I'll provide you with short answers to your questions.
It's generally not the OS that supports an encoding, but the applications. Encodings are used to convert a stream of bytes to lists of characters. For example, in C#, reading UTF-8 bytes will automatically turn them into UTF-16 if you tell it to treat them as a string.
No matter what encoding you use, C# will simply use UTF-16 internally, and when you want to, for example, print a string from a foreign encoding, it will convert it to UTF-16 first, then look up the corresponding characters in the character tables (fonts) and show the glyphs.
I don't recall ever seeing a complete font. I don't have much experience with working with fonts either, so I cannot give you an answer for this one.
The answer to this one is in #1, but a short summary: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font encoding you'll be fine.
Bonus answer: on "how does the OS understand network protocols it doesn't know?": again, it's not the OS that handles them but the application. As long as the OS knows where to redirect the traffic (to which application), it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, so that the OS knows where to send the data.
This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!

Programming in a language other than English

I was having a discussion on Twitter about adding the ability for Ruby to use λ instead of lambda, and more generally about Unicode support. I realized that all the languages I know work only with English reserved words and mostly assume a US-English keyboard (for example, using $ instead of £ or ¥). While some languages are now starting to have some support for Unicode in their string functions, there are still so many conventions based on English or the Latin-style character set. For example, Ruby requires class names to begin with an upper-case letter, but upper and lower case is not a property of glyphs in most scripts.
So the question is: "Are there programming languages that work in a large set of languages, and how do they do it?"
You can have a look at the APL programming language, for example.
Some languages define very simple syntaxes and few or no keywords. For example, LISPs and languages that function like them (Tcl, etc.), where everything is "command arg1 ... argn". These languages, since there are no keywords per se, are language-agnostic.
For example, in Tcl, you can rename the various commands to use whatever language you want and everything should work perfectly.
Python 3 is completely Unicode-based, so identifiers can be constructed out of any Unicode letters/digits etc.
It's still not a good idea to use characters for function names that programmers from other nations don't have easy access to on their keyboards.
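For illustration, here is a small sketch of what PEP 3131 allows in Python 3 (the identifiers are arbitrary examples of my own):
π = 3.14159                  # Greek letter as an identifier
радиус = 2.0                 # "radius" in Russian
面积 = π * радиус ** 2        # "area" in Chinese
print(面积)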
In the 3.0.0 release of the Parrot VM, they added support for a language, Ωη;)XD, that is named using Unicode, which caused all kinds of breakage for the VM. It might be worth taking a look at.

If Ascii operators are definable, why not Unicode Symbols?

I'm sure I join many in being glad there's finally a powerful language tied tightly to a mainstream GUI/Database/Communication framework.
I haven't been sure where to post this, but here seems the best spot.
I need to use Unicode symbol characters either as operators or as function names. I'd like syntactic sugar, but I don't need it.
Guy Steele pointed out in Communications of the ACM that "*" was a forced choice when it was adopted from Ascii as multiply, but my software works in Unicode, so I'm not tethered to Ascii anymore.
!$%&*+-./<=>?,#^|~:
Part of localization includes local programmers. Why limit the set of operators that can be defined in F#? It isn't orthogonal to C#'s and F#'s acceptance of many Unicode IsLetter characters in identifiers.
Also, F# is likely to be used for symbolic manipulation of problems from logic, math, physics, etc. It makes the work much easier if there is a direct mapping into the language for the basic operators. F# and C# already accept many Unicode IsLetter characters (as well as IsDigit) in identifiers; this is a request to also allow Unicode IsSymbol characters as operators, with the precedence of, for example, *. Or, since "+" is both a unary and a binary operator, I could put up with the precedence of + and make up the difference with parenthesized groupings.
Consider the domain-specific needs of logicians, mathematicians, physicists, etc. I’d rather write a symbolic differentiator or integrator using math symbols than Ascii permutations of already-taken operators.
Logic: ∀ ∃ ⇒
Math: ∑ ∫ ∂
Group theory: ≤ ≥ ∈ ∉
Set Theory: ⊆ ⊇ ⊃ ∪ ∩
Tensors: ⊗
I’ve written many languages in other languages, but because F# is tightly .Net-integrated, this issue poses special challenges without language support:
It’s trivial to cobble up a translator that takes Unicode-operator F# source and maps it, line-by-line, to Ascii-operator F# source.
But when debugging, how do I make sure the programmer still sees their untranslated source? And that they can see variable values.
Operators and converts them is trivial. But how do I ensure the translation is what gets compiled, while the programmer sees their own source? If I map line-for-line correctly, how do I ensure they can still point at a variable and see its value?
There is a math (Unicode) symbol extension for F# available in the Visual Studio Gallery.
This allows you to define Unicode symbols, e.g.:
let inline (~∑) xs = xs |> Seq.sum
let total = ∑myList
You may be interested in Project Fortress which is a new functional programming language that embraces the Unicode character set (among many other features). In particular, see the Mathematical Syntax in Fortress page which contains some sample code.
For an interesting discussion on this check: http://cs.hubfs.net/forums/thread/9690.aspx
Other languages, such as Scala, do permit operators from outside the ASCII range -- mathematical symbols (Sm) and other symbols (So).

What do I need to know about Unicode? [closed]

Being an application developer, do I need to know about Unicode?
Unicode is a standard that defines numeric codes for the characters used in written communication. Or, as they put it themselves:
The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium.
There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.
First, go to the source for authoritative, detailed information and implementation guidelines.
As mentioned by others, Joel Spolsky has a good list of these errors.
I also like Elliotte Rusty Harold's Ten Commandments of Unicode.
Developers should also watch out for canonical representation attacks.
Some of the key concepts you should be aware of are (illustrated in the short sketch after this list):
Glyphs—concrete graphics used to represent written characters.
Composition—combining glyphs to create another glyph.
Encoding—converting Unicode code points to a stream of bytes.
Collation—locale-sensitive comparison of Unicode strings.
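A short standard-library Python sketch of the composition, encoding, and collation concepts above (the locale call assumes your environment has a locale configured; this is an illustration, not a recipe):
import locale
import unicodedata

s = "e\u0301"                                  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", s)     # composition: two code points become one 'é'
print(composed, len(s), len(composed))         # é 2 1

print(composed.encode("utf-8"))                # encoding: code points to bytes, b'\xc3\xa9'

locale.setlocale(locale.LC_COLLATE, "")        # collation: locale-sensitive comparison
print(sorted(["zebra", "étude", "apple"], key=locale.strxfrm))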
At the risk of just adding another link, unicode.org is a spectacular resource.
In short, it's a replacement for ASCII that's designed to handle, literally, every character ever used by humans. Unicode has several encoding schemes to handle all those characters. UTF-8, which is more or less the standard these days, keeps ASCII text at a single byte per character and is identical to ASCII for the first 128 code points.
(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)
Unicode is an industry-agreed standard for consistently representing text, with the capacity to represent the world's character systems. All developers need to know about it, as globalization is a growing concern.
One (open) source of code for handling Unicode is ICU - Internationalization Components for Unicode. It includes ICU4J for Java and ICU4C for C and C++ (presents C interface; uses C++ compiler).
You don't need to learn the whole of Unicode to use it; it's a hellishly complex standard. You just need to know the main issues and how your programming tools deal with it. To learn that, check Galwegian's link and your programming language and IDE documentation.
E.g.:
You can convert any character from Latin-1 to Unicode, but it doesn't work the other way around for all characters (see the sketch after this answer).
PHP lets you know that some functions (like stristr) do not work with Unicode.
Python declares Unicode strings this way: u"Hello World" (in Python 3, all strings are Unicode by default).
That's the kind of thing you must know.
Knowing that, if you do not have a GOOD reason not to use Unicode, then just use it.
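A quick sketch of the Latin-1 point above (Python assumed): every byte value decodes as Latin-1, but most Unicode text has no Latin-1 form.
text = bytes(range(256)).decode("latin-1")   # all 256 byte values are valid Latin-1 characters
print(len(text))                             # 256

try:
    "日本語".encode("latin-1")                # the reverse direction fails for most characters
except UnicodeEncodeError as err:
    print("cannot represent in Latin-1:", err)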
Unicode is a character set that, unlike ASCII (which contains only letters for English, 128 characters, one third of them actually being non-printable control characters), has room for over a million characters, including characters of every language known (Chinese, Russian, Greek, Arabic, etc.) and some languages you have probably never even heard of (even lots of dead-language symbols not in use anymore, but useful for archiving ancient documents).
So instead of dealing with dozens of different character encodings, you have one encoding for all of them (which also makes it easier to mix characters from different languages within a single text string, as you don't need to switch the encoding somewhere in the middle of the string). Actually there is still plenty of room left; we are far from having the whole code space in use, and the Unicode Consortium could easily add symbols for another 100 languages without even starting to fear running out of symbol space.
Pretty much any book in any language you can find in a library today can be expressed in Unicode. Unicode is the name of the character set itself; how it is expressed as "bytes" is a different issue. There are several ways to write Unicode characters, like UTF-8 (one to four bytes represent a single character, depending on the character number; English is almost always one byte, other Latin-script languages might be two or three, Chinese/Japanese might be more), UTF-16 (most characters are two bytes, some rarely used ones are four bytes) and UTF-32, where every character is four bytes. There are others, but these are the dominant ones.
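If you want to see those size differences yourself, here is a tiny Python sketch comparing the three encodings on a few sample characters (the characters are arbitrary examples):
for ch in ["A", "é", "中", "\U0001F600"]:      # ASCII letter, accented letter, CJK, emoji
    print(ch,
          len(ch.encode("utf-8")),             # 1 to 4 bytes, depending on the character
          len(ch.encode("utf-16-le")),         # 2 bytes, or 4 for characters outside the BMP
          len(ch.encode("utf-32-le")))         # always 4 bytes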
Unicode is the default encoding for many newer OSes (in Mac OS X almost everything is Unicode) and programming languages (Java uses Unicode as its default string encoding, usually UTF-16; I hear Python does as well, and will use or already does use UTF-32). If you ever plan to write an app that should display, store, or process anything other than plain English text, you'd better get used to Unicode, the sooner the better.
Unicode is a standard that enumerates characters, and gives them unique numeric IDs (called "code points"). It includes a very large, and growing, set of characters for most modern written languages, and also a lot of exotic things like ancient Greek musical notation.
Unlike other character encoding schemes (like ASCII or the ISO-8859 standards), Unicode does not say anything about representing these characters in bytes; it just gives a universal set of IDs to characters. So it is wrong to say that Unicode is "a 16-bit replacement for ASCII".
There are various encoding schemes that can represent arbitrary Unicode characters in bytes, including UTF-8, UTF-16, and others.

What are the experiences with using unicode in identifiers

These days, more languages are using Unicode, which is a good thing. But it also presents a danger. In the past there were troubles distinguishing between 1 and l, and 0 and O. But now we have a complete new range of similar characters.
For example:
ì, î, ï, ı, ι, ί, ׀ ,أ ,آ, ỉ, ﺃ
With these, it is not that difficult to create some very hard-to-find bugs.
At my work, we have decided to stay with the ANSI characters for identifiers. Is there anybody out there using unicode identifiers and what are the experiences?
Besides the similar-character bugs you mention and the technical issues that might arise when using different editors (with BOM, without BOM, different encodings in the same file from copy-pasting, which is only a problem when there are actually characters that cannot be encoded in ASCII, and so on), I find that it's not worth using Unicode characters in identifiers. English has become the lingua franca of development, and you should stick to it while writing code.
This I find particularly true for code that may be seen anywhere in the world by any developer (open source, or code that is sold along with the product).
My experience with using Unicode in C# source files was disastrous, even though it was Japanese (so there was nothing to confuse with an "i"). Source Safe doesn't like Unicode, and when you find yourself manually fixing corrupted source files in Word, you know something isn't right.
I think your ANSI-only policy is excellent. I can't really see any reason why that would not be viable (as long as most of your developers are English, and even if they're not the world is used to the ANSI character set).
I think it is not a good idea to use the entire ANSI character set for identifiers. No matter which ANSI code page you're working in, your ANSI code page includes characters that some other ANSI code pages don't include. So I recommend sticking to ASCII, no character codes higher than 127.
In experiments I have used a wider range of ANSI characters than just ASCII, even in identifiers. Some compilers accepted it. Some IDEs needed options to be set for fonts that could display the characters. But I don't recommend it for practical use.
Now on to the difference between ANSI code pages and Unicode.
In experiments I have stored source files in Unicode and used Unicode characters in identifiers. Some compilers accepted it. But I still don't recommend it for practical use.
Sometimes I have stored source files in Unicode and used escape sequences in some strings to represent Unicode character values. This is an important practice and I recommend it highly. I especially had to do this when other programmers used ANSI characters in their strings, and their ANSI code pages were different from other ANSI code pages, so the strings were corrupted and caused compilation errors or defective results. The way to solve this is to use Unicode escape sequences.
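As a small illustration of that practice (Python syntax assumed; other languages have equivalent \u escapes):
greeting_raw = "héllo"                    # correctness depends on editor and compiler agreeing on the file's encoding
greeting_escaped = "h\u00e9llo"           # unambiguous regardless of code page, since the source stays ASCII
print(greeting_raw == greeting_escaped)   # True when the file was decoded as intended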
I would also recommend using ASCII for identifiers. Comments can stay in a non-English language if the editor/IDE/compiler etc. are all locale-aware and set up to use the same encoding.
Additionally, some case-insensitive languages change identifiers to lowercase before using them, and that causes problems if the active system locale is Turkish or Azerbaijani. See here for more info about the Turkish locale problem. I know that PHP does this, and it has a long-standing bug.
This problem is also present in any software that compares strings using Turkish locales, not only the language implementations themselves, just to point out. It causes many headaches.
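A small Python sketch of the Turkish-I problem mentioned above; Python's own str.lower() is locale-independent, but it shows why locale-aware lowercasing of identifiers goes wrong:
print("I".lower())   # 'i'  - what ASCII-minded code expects
print("İ".lower())   # 'i̇'  - dotted capital İ (U+0130) lowercases to 'i' plus a combining dot
print("ı".upper())   # 'I'  - dotless ı (U+0131) uppercases to a plain I
# In a Turkish locale, a locale-aware lowercasing maps "I" to "ı", so two
# spellings of "the same" identifier no longer compare equal.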
It depends on the language you're using. In Python, for example, it is easier for me to stick to Unicode, as my applications need to work in several languages. So when I get a file from someone (something) that I don't know, I assume Latin-1 and translate it to Unicode.
Works for me, as I'm in Latin America.
Actually, once everything is ironed out, the whole thing becomes a smooth ride.
Of course, this depends on the language of choice.
I haven't ever used Unicode for identifier names. But what comes to my mind is that Python allows Unicode identifiers in version 3: PEP 3131.
Another language that makes extensive use of unicode is Fortress.
Even if you decide not to use Unicode, the problem resurfaces when you use a library that does. So you have to live with it to a certain extent.