What is Unicode, and how does encoding work? [closed]

A few hours ago I was reading a C programming book when I came across the terms character encoding and Unicode. I started googling for information about Unicode and learned that the Unicode character set contains characters from every language, and that UTF-8, UTF-16, and UTF-32 can encode the characters listed in it.
But I was not able to understand how it works.
Does Unicode depend on the operating system?
How is it related to software and programs?
Is UTF-8 a piece of software that was installed on my computer along with the operating system, or is it related to hardware?
And how does a computer encode things?
I have found all this very confusing. Please answer in detail, and keep in mind that I am new to these things.
Thank you.

I have written about this extensively in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text. Here are some highlights:
Encodings are plentiful; an encoding defines how a "character" like "A" can be encoded as bits and bytes.
Most encodings only specify this for a small number of selected characters, for example all (or at least most) characters needed to write English or Czech; single-byte encodings typically support a set of up to 256 characters.
Unicode is one large standardization effort which has catalogued and specified a number ⟷ character relationship for virtually all characters and symbols of every major language in use, which amounts to hundreds of thousands of characters.
UTF-8, UTF-16 and UTF-32 are different sub-standards for how to encode this ginormous catalogue of numbers to bytes, each with different size trade-offs (see the sketch after these highlights).
Software needs to specifically support Unicode and its UTF-* encodings, just like it needs to support any other kind of specialized encoding; most of that work is done by the OS these days, which exposes supporting functions to applications.
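To make the number ⟷ bytes distinction concrete, here is a minimal sketch in plain Python 3 (standard library only; the sample characters are arbitrary):

    # Each character has one catalog number (code point); the UTF encodings
    # turn that number into byte sequences of different lengths.
    for ch in ("A", "é", "€", "𝄞"):             # Latin, accented Latin, currency sign, musical symbol
        codepoint = ord(ch)                      # the Unicode catalog number
        utf8  = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")           # "-be" variant so no BOM is prepended
        utf32 = ch.encode("utf-32-be")
        print(f"{ch!r}  U+{codepoint:04X}  "
              f"UTF-8: {len(utf8)} byte(s)  UTF-16: {len(utf16)}  UTF-32: {len(utf32)}")

Running it shows the trade-off mentioned above: "A" is one byte in UTF-8 but four in UTF-32, while the musical symbol needs four bytes in every encoding.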

Related

Why are there multiple versions of Unicode? Why isn't everything UTF-8? [closed]

Again and again, I keep asking myself: Why do they always insist on over-complicating everything?!
I've tried to read up about and understand Unicode many times over the years. When they start talking about endians and BOMs and all that stuff, my eyes just "zone out". I physically cannot keep reading and retain what I'm seeing. I fundamentally don't get their desire for over-complicating everything.
Why do we need UTF-16 and UTF-32 and "big endian" and "little endian" and BOMs and all this nonsense? Why wasn't Unicode just defined as "compatible with ASCII, but you can also use multiple bytes to represent all these further characters"? That would've been nice and simple, but nooo... let's have all this other stuff, so that Microsoft chose UTF-16 for Windows NT and nothing is easy or straightforward!
As always, there probably is a reason, but I doubt it's good enough to justify all this confusion and all these problems arising from insisting on making it so complex and difficult to grasp.
Unicode started out as a 16-bit character set, so naturally every character was simply encoded as two consecutive bytes. However, it quickly became clear that this would not suffice, so the limit was increased. The problem was that some programming languages and operating systems had already started implementing Unicode as 16-bit and they couldn’t just throw out everything they had already built, so a new encoding was devised that stayed backwards-compatible with these 16-bit implementations while still allowing full Unicode support. This is UTF-16.
UTF-32 represents every character as a sequence of four bytes, which is utterly impractical and virtually never used to actually store text. However, it is very useful when implementing algorithms that operate on individual codepoints – such as the various mechanisms defined by the Unicode standard itself – because all codepoints are always the same length and iterating over them becomes trivial, so you will sometimes find it used internally for buffers and such.
UTF-8 meanwhile is what you actually want to use to store and transmit text. It is compatible with ASCII and self-synchronising (unlike the other two) and it is quite space-efficient (unlike UTF-32). It will also never produce eight binary zeroes in a row (unless you are trying to represent the literal NULL character) so UTF-8 can safely be used in legacy environments where strings are null-terminated.
Endianness is simply an intrinsic property of data types whose units are larger than one byte: computers don't all agree on the order in which to read a sequence of bytes. For Unicode, this problem can be circumvented by including a Byte Order Mark in the text stream, because if you read its byte representation in the wrong order in UTF-16 or UTF-32, it produces a code point that is guaranteed never to be assigned, so you know that that particular order cannot be the right one.
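A small illustration of the byte order, BOM, and no-embedded-zero-bytes points, as a sketch in Python 3 (3.8+, since bytes.hex() is called with a separator):

    text = "Hi"

    print(text.encode("utf-16-le").hex(" "))   # 48 00 69 00        little endian
    print(text.encode("utf-16-be").hex(" "))   # 00 48 00 69        big endian
    print(text.encode("utf-16").hex(" "))      # ff fe 48 00 69 00  BOM + native order (little-endian machine shown)

    # U+FEFF read with its bytes swapped is U+FFFE, a code point that is
    # guaranteed never to be assigned, which is how a reader detects the order.

    print(b"\x00" in "Hello, world".encode("utf-8"))    # False: safe for null-terminated strings
    print(b"\x00" in "Hello, world".encode("utf-16"))   # True: every ASCII letter gets a zero byte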

Does the Unicode Consortium intend to make UTF-16 run out of characters? [closed]

The current version of UTF-16 is only capable of encoding 1,112,064 different numbers (code points), in the range 0x0–0x10FFFF.
Does the Unicode Consortium intend to make UTF-16 run out of characters?
i.e. make a code point > 0x10FFFF
If not, why would anyone write a UTF-8 parser that can accept 5- or 6-byte sequences? It would only add unnecessary instructions to their function.
Isn't 1,112,064 enough? Do we actually need MORE characters? I mean: how quickly are we running out?
As of 2011 we have consumed 109,449 characters, and have set aside 137,468 code points for application (private) use (6,400 + 131,068),
leaving room for over 860,000 unused characters; plenty for CJK Extension E (~10,000 chars) and 85 more sets just like it, so that in the event of contact with the Ferengi culture, we should be ready.
In November 2003 the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 encoding: a UTF-8 parser should not accept 5- or 6-byte sequences that would overflow the UTF-16 range, or 4-byte sequences encoding values greater than 0x10FFFF.
Please add edits here listing character sets that pose a threat to the Unicode code point limit, if they are over one third the size of CJK Extension E (~10,000 chars):
CJK Extension E (~10,000 chars)
Ferengi culture characters (~5,000 chars)
At present time, the Unicode standard doesn't define any characters above U+10FFFF, so you would be fine to code your app to reject characters above that point.
Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission critical glyphs. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix it when it actually becomes an issue.
Cutting to the chase:
It is indeed intentional that the encoding system only supports code points up to U+10FFFF
It does not appear that there is any real risk of running out any time soon.
There is no reason to write a UTF-8 parser that supports 5–6 byte sequences, except to support legacy systems that actually used them. The current official UTF-8 specification does not allow 5–6 byte sequences, in order to accommodate 100% lossless conversion to and from UTF-16. If there is ever a time when Unicode has to support new code points above U+10FFFF, there will be plenty of time to devise new encoding formats for the higher bit counts. Or maybe by then memory and computational power will be plentiful enough that everyone will just switch to UTF-32 for everything, whose 32-bit code units could in principle cover values up to 0xFFFFFFFF, over 4 billion characters.
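A quick back-of-the-envelope check of those numbers, plus a demonstration that a stock UTF-8 decoder already enforces the RFC 3629 limits; a sketch in plain Python 3:

    total_codepoints = 0x10FFFF + 1          # 1,114,112 values in U+0000..U+10FFFF
    surrogates       = 0xDFFF - 0xD800 + 1   # 2,048 values reserved for UTF-16 surrogate pairs
    print(total_codepoints - surrogates)     # 1112064 encodable code points

    # Code points above U+10FFFF simply do not exist in the current standard:
    try:
        chr(0x110000)
    except ValueError as e:
        print(e)                             # chr() arg not in range(0x110000)

    # A 5-byte sequence (lead byte 0xF8) is invalid UTF-8 under RFC 3629:
    try:
        bytes([0xF8, 0x88, 0x80, 0x80, 0x80]).decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)                             # can't decode byte 0xf8 ... invalid start byte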

How are non-English programming/scripting languages developed? [closed]

How are non-English programming/scripting languages developed?
Do you need to be a computer scientist?
You need to understand how Unicode works to build a parser for a language written in a non-English script, and yes, you either need to be a CS major or be able to teach yourself compiler design.
Study Unicode; learn to use ICU, or a language with good Unicode support.
Decide on and build a VM (or use an existing one).
Write a lexer/parser, or use something like ANTLR (Java-based); a toy Unicode-aware lexer sketch follows this list.
Decide on an AST.
Generate the instruction stream for the VM.
Check out "Principles of Compiler Design".
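As a toy illustration of the lexer step, here is a minimal sketch in Python 3; the token names and the sample identifiers are made up for the example, and a real language would of course define its own grammar:

    import re

    # Python 3's \w and related classes are Unicode-aware by default, so
    # identifiers may be written in any script without extra work.
    TOKEN = re.compile(r"\s*(?:(?P<num>\d+)|(?P<ident>[^\W\d]\w*)|(?P<op>[=+\-*/()]))")

    def tokenize(source):
        pos = 0
        while pos < len(source):
            m = TOKEN.match(source, pos)
            if not m:
                raise SyntaxError(f"unexpected character at {pos}: {source[pos]!r}")
            pos = m.end()
            kind = m.lastgroup
            yield (kind, m.group(kind))

    # Identifiers in Russian, Greek and German all tokenize the same way:
    print(list(tokenize("счёт = αριθμός + zähler * 2")))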
You use a character set capable of encoding extended characters, such as UTF-8. Code points above the 8-bit range are written as two-byte code units in UTF-16 or four-byte code units in UTF-32. The complication that arises is byte order, or endianness: different machines do not agree on the order in which to store the bytes of a multi-byte code unit, so a reader needs to know which order a text stream uses. One solution is to state the byte order inside the stream itself, with a byte order mark placed before the character data; another is to use a more specific variant of the encoding whose byte order is fixed by name: UTF-16BE (big endian) mandates that the most significant byte comes first, and UTF-16LE (little endian) the opposite. (This is a separate issue from bidirectional, right-to-left text, which Unicode handles at the character level rather than the byte level.)
There is also the UCS, the Universal Character Set. The term is still used, but it is dated, as it does not by itself pin down how characters whose encoding takes more than one code unit are handled. For information about the differences between UCS and Unicode please read this: http://en.wikipedia.org/wiki/Universal_Character_Set#Differences_between_ISO_10646_and_Unicode
Some examples are the following:
IRI - RFC 3987 - http://www.ietf.org/rfc/rfc3987.txt - mandates UTF8 encoding
Mail Markup Language - http://mailmarkup.org/ - mandates UTF16BE encoding

What do I need to know about Unicode? [closed]

Being an application developer, do I need to know about Unicode?
Unicode is a standard that defines numeric codes for the characters used in written communication. Or, as they put it themselves:
The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium.
There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.
First, go to the source for authoritative, detailed information and implementation guidelines.
As mentioned by others, Joel Spolsky has a good list of these errors.
I also like Elliotte Rusty Harold's Ten Commandments of Unicode.
Developers should also watch out for canonical representation attacks.
Some of the key concepts you should be aware of are:
Glyphs—concrete graphics used to represent written characters.
Composition—combining glyphs to create another glyph.
Encoding—converting Unicode code points to a stream of bytes.
Collation—locale-sensitive comparison of Unicode strings (a short sketch of composition and collation follows this list).
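A short sketch of the composition and collation points in Python 3 (standard library only):

    import unicodedata

    # Composition: "é" can be one code point or two, and the two spellings
    # compare unequal until they are normalized.
    composed   = "\u00E9"        # é as a single precomposed code point
    decomposed = "e\u0301"       # e followed by COMBINING ACUTE ACCENT

    print(composed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(len(composed), len(decomposed))                        # 1 2

    # Collation is locale-sensitive; naive code point order is often not
    # what users expect ("Z" sorts before "a" by code point, for instance):
    print(sorted(["apple", "Zebra", "éclair"]))                  # ['Zebra', 'apple', 'éclair']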
At the risk of just adding another link, unicode.org is a spectacular resource.
In short, it's a replacement for ASCII that's designed to handle, literally, every character ever used by humans. Unicode has several encoding schemes to handle all those characters; UTF-8, which is more or less the standard these days, stays at a single byte per character wherever it can and is identical to ASCII for the first 128 code points.
(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)
Unicode is an industry-agreed standard for consistently representing text, with the capacity to represent all of the world's writing systems. All developers need to know about it, as globalization is a growing concern.
One open-source library for handling Unicode is ICU (International Components for Unicode). It includes ICU4J for Java and ICU4C for C and C++ (it presents a C interface and uses a C++ compiler).
You don't need to learn all of Unicode to use it; it is a hellishly complex standard. You just need to know the main issues and how your programming tools deal with them. To learn that, check Galwegian's link and your programming language and IDE documentation.
E.g.:
You can convert any character from Latin-1 to Unicode, but it doesn't work the other way around for all characters.
PHP lets you know that some functions (like stristr) do not work with Unicode.
Python 2 declares Unicode strings this way: u"Hello World".
That's the kind of thing you must know.
Knowing that, if you do not have a GOOD reason not to use Unicode, then just use it. (A quick sketch of the Latin-1 point follows below.)
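To make the Latin-1 point concrete, a quick sketch in plain Python 3:

    # Every Latin-1 byte maps to a Unicode code point, but not every Unicode
    # character fits back into Latin-1.
    print(len(bytes(range(256)).decode("latin-1")))   # 256: every possible byte decodes

    print("café".encode("latin-1"))                   # b'caf\xe9' - fine, é is U+00E9

    try:
        "€5".encode("latin-1")                        # the euro sign U+20AC is outside Latin-1
    except UnicodeEncodeError as e:
        print(e)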
Unicode is a character set which, unlike ASCII (which contains only letters for English: 128 characters, about a quarter of them non-printable control characters), has room for over a million characters, including characters of every language known (Chinese, Russian, Greek, Arabic, etc.) and some languages you have probably never even heard of (even lots of dead-language symbols no longer in use, but useful for archiving ancient documents).
So instead of dealing with dozens of different character encodings, you have one character set for all of them (which also makes it easier to mix characters from different languages within a single text string, as you don't need to switch the encoding somewhere in the middle). There is still plenty of room left; we are far from having all of those code points in use, and the Unicode Consortium could easily add symbols for another 100 languages without even starting to fear running out of space.
Pretty much any book in any language you can find in a library today can be expressed in Unicode. Unicode is the name of the character set itself; how it is expressed as bytes is a different issue. There are several ways to write Unicode characters, such as UTF-8 (one to four bytes represent a single character, depending on the code point; English is almost always one byte, other Roman-alphabet languages might be two or three, Chinese/Japanese might be more), UTF-16 (most characters are two bytes, some rarer ones are four) and UTF-32 (every character is four bytes). There are others, but these are the dominant ones.
Unicode is the default for many newer OSes (in Mac OS X almost everything is Unicode) and programming languages (Java uses Unicode internally, usually as UTF-16; I hear Python does as well). If you ever plan to write an app that should display, store, or process anything other than plain English text, you'd better get used to Unicode, the sooner the better.
Unicode is a standard that enumerates characters, and gives them unique numeric IDs (called "code points"). It includes a very large, and growing, set of characters for most modern written languages, and also a lot of exotic things like ancient Greek musical notation.
Unlike other character encoding schemes (like ASCII or the ISO-8859 standards), Unicode does not say anything about representing these characters in bytes; it just gives a universal set of IDs to characters. So it is wrong to say that Unicode is "a 16-bit replacement for ASCII".
There are various encoding schemes that can represent arbitrary Unicode characters as bytes, including UTF-8, UTF-16, and others.

Need a list of languages that are supported completely by ASCII encoding [closed]

I am writing an article on Unicode and discussing the advantages of this encoding scheme over outdated methods like ASCII.
As part of my research I am looking for a reference that lists the languages that can be fully represented using only the characters supported by ASCII. I haven't had much luck tracking it down with Google, so I thought I'd tap the collective knowledge of SO to see if anyone has a reasonable list.
Key points:
All languages listed must be able to be completely represented using the character set available in ASCII.
I know this won't be comprehensive, but I am mostly interested in the most common written languages.
There are no natural languages that I know of that can be fully represented in ASCII. Even American English, the language for which ASCII was invented, doesn't work: for one, there are a lot of foreign words that have been integrated into the American English language that cannot be represented in ASCII, like resumé, naïve or a word that probably every programmer uses regularly, schönfinkeln.
And two, ASCII is missing pretty much all typographic characters like “quotation marks”, dashes of various lengths (– and —), ellipses (…), thin and wide spaces and so on, all of which are used in American English.
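For what it's worth, checking a given piece of text against that criterion is a one-liner; a small sketch in Python 3 (str.isascii() needs 3.7+):

    # Loanword accents and typographic punctuation both fail the ASCII test.
    for word in ["resume", "resumé", "naïve", "don't", "don\u2019t"]:
        print(f"{word!r:12} ascii: {word.isascii()}")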
IIRC from my Latin classes, the macrons in Latin are later additions by people studying meter in Latin poetry; they wouldn't have been used in everyday writing. So you've got Latin.
Given loan words, I don't think there are any such languages. Even ugly Americans know the difference between "resume" and "résumé".
I assume you mean natural languages and only 7-bit ASCII?
In that case the list is quite small: mostly English.
Some constructed languages such as Interlingua and Ido are designed to use only ASCII characters. ‘Real’ languages in everyday use tend to use characters outside the ASCII range, at the very least for loanwords.
Not a widely used language, but Rotokas can be written using only ASCII letters. See http://en.wikipedia.org/wiki/Rotokas_alphabet