Is there a character set other than EBCDIC that is not a superset of 7-bit ASCII?
Yes. JIS X 0208 is not a superset of ASCII. Some versions of this standard include most of the ASCII characters, but not all of them.
A related fact is that a file encoded with UTF-16 or UTF-32 is not byte-equivalent to an ASCII file of the same characters, but since those are not character sets, and since Unicode is certainly a superset of ASCII, they do not qualify as answers to your question.
There were some ISO 646 derivatives that were basically ASCII, but replaced various punctuation marks with various accented characters. There were also Greek, Cyrillic, Arabic and Hebrew sets that laid those characters over the Latin characters; see http://en.wikipedia.org/wiki/ISO_646 for details.
Related
I have read Joel's article about encodings. As I understand in case of unicode:
unicode is a charater set - mapping between integer value and character
utf-8 is an encoding which is used for unicode integers to present them in binary view
What about iso-8859-1? Is it encoding or character set or both?
ISO 8859-1 (Latin-1) is a single-byte encoding. It represents the first 256 Unicode characters. So, as long as it is subset of Unicode character set, I suppose it could be treated as both encoding and character set.
What about iso-8859-1? Is it encoding or character set or both?
Historically, it was described as a coded character set: it defined both a set of characters, and a mapping of those characters to byte values — what we would today call an encoding, but it was not explicitly described in those terms.
When Unicode was created, it was designed to encompass (nearly) all characters in widely-used character sets, and hence it recast the byte stream defined by the ISO-8859-1 coded character set as an encoding of the wider Universal Character Set.
So if you are working in a modern Unicode environment you would consider ISO-8859-1 to be an encoding. But it can't really be said to be wrong to consider it also a character set.
(There are other encodings which are definitely not character sets: for example the UTFs, and multibyte encodings like Shift-JIS, which was itself defined as an encoding for the JIS X 0208 character set prior to Unicode's extend-and-embrace.)
Was reading Joel Spolsky's 'The Absolute Minimum' about character encoding.
It is my understanding that ASCII is a Code-point + Encoding scheme, and in modern times, we use Unicode as the Code-point scheme and UTF-8 as the Encoding scheme. Is this correct?
In modern times, ASCII is now a subset of UTF-8, not its own scheme. UTF-8 is backwards compatible with ASCII.
Yes, except that UTF-8 is an encoding scheme. Other encoding schemes include UTF-16 (with two different byte orders) and UTF-32. (For some confusion, a UTF-16 scheme is called “Unicode” in Microsoft software.)
And, to be exact, the American National Standard that defines ASCII specifies a collection of characters and their coding as 7-bit quantities, without specifying a particular transfer encoding in terms of bytes. In the past, it was used in different ways, e.g. so that five ASCII characters were packed into one 36-bit storage unit or so that 8-bit bytes used the extra bytes for checking purposes (parity bit) or for transfer control. But nowadays ASCII is used so that one ASCII character is encoded as one 8-bit byte with the first bit set to zero. This is the de facto standard encoding scheme and implied in a large number of specifications, but strictly speaking not part of the ASCII standard.
Unicode and ASCII are both Codepoints + Encoding scheme
Unicode(UTF-8) is a superset of ASCII as its backward compatible with ASCII.
Conversion and Representation(in binary/hexadecimal) of String:
String := sequence of Graphemes(character is a "kind of" its subset).
Sequence of graphemes(characters) is converted into Codepoints (also using Encoding scheme)
Codepoints are Encoded(converted) to binary/hex also using Encoding Schemes
for Graphemes its UTF-8/UTF-32(aka Unicodes), for Character its ASCII.
Unicode(UTF-8) supports 1,112,064 valid character codepoints(covers most of the graphemes from different languages)
ASCII supports 128 character codepoints(mostly english)
I have been attending a lecture on XML where it was written "ISO-8859-1 is a Unicode format". It sounds wrong to me, but as I research on it, I struggle understanding precisely what Unicode is.
Can you call ISO-8859-1 a Unicode format ? What can you actually call Unicode ?
ISO 8859-1 is not Unicode
ISO 8859-1 is also known as Latin-1. It is not directly a Unicode format.
However, it does have the unique privilege that its code points 0x00 .. 0xFF map one-to-one to the Unicode code points U+0000 .. U+00FF. So, the first 256 code points of Unicode, treated as 1 byte unsigned integers, map to ISO 8859-1.
Control characters
Peregring-lk observes that ISO 8859-1 does not define the control codes. The Unicode charts for U+0000..U+007F and U+0080..U+00FF suggest that the C0 controls found in positions U+0000..U+001F and U+007F come from ISO/IEC 6429:1992 and the C1 controls found in positions U+0080..U+9F likewise. Wikipedia on the C0 and C1 controls suggests that the standard is ISO/IEC 2022 instead. Note that three of the C1 controls do not have a formal name.
In general parlance, the control code points of the ISO 8859-1 code set are assumed to be the C0 and C1 controls from ISO 6429 (or 2022).
ISO-8859-1 contains a subset of UTF-8 Unicode, which substantially overlaps with ASCII.
All ASCII is UTF-8 Unicode.
All the ISO 8859-1 (ISO Latin 1) characters below codes 7f hex are ASCII compatible and UTF-8 compatible in one byte. The ligatures and characters with diacritics use multi-byte Unicode UTF-8 representations, and use Unicode compatibility codepoints.
All UTF-8 single-byte character are contained in ASCII.
UTF-8 also contains multi-byte sequences, some of which are collatable (i.e. sortable) equivalents - composed equivalents - of the characters represented by compatibility codepoints, and some of which are the characters represented by all other characters sets other than ASCII and ISO Latin 1.
No, ISO 8859-1 is not a Unicode charset, simply because ISO 8859-1 does not provide encoding for all Unicode characters, only a small subset thereof. The word “charset” is sometimes used loosely (and therefore often best avoided), but as a technical term, it means a character encoding.
Loosening the definition so that “Unicode charset” would mean an encoding that covers part of Unicode would be pointless. Then every encoding would be a “Unicode charset”.
No. ISO/IEC 8859-1 is older than Unicode. For example, you won't find € in it. Unicode is compatible to ISO 8859-1 up to some point. For the coding of characters in Unicode look at UCS / UTF8 / UTF16.
If you look at code formats you have something like
Abstract letters - The letters you are using
Code table - Bring the letters in some form (like alphabetic ordering)
Code format - Say which position in the code table is which letter, (that is the UTF8 or UTF16 encoding)
Code schema - If you use more words for accessing a code position, in which order are they? (Big Endian, Little Endian in UTF16)
[character encoding of steering instruction (e.g. < in XML)]
It depends on how you define "Unicode format."
I think most people would take it to mean an encoding capable of representing any codepoint in Unicode's range (U+0000 - U+10FFFF).
In that case, no, ISO 8859-1 is not a Unicode format.
However some other definitions might be 'a character set that is a subset of the Unicode character set,' or 'an encoding that can be considered to contain Unicode data (not necessarily arbitrary Unicode data).' ISO 8859-1 meets both of these definitions.
Unicode is a number of things. It contains a character set, in which 'characters' are assigned codepoint values. It defines properties for characters and provides a database of characters and their properties. It defines many algorithms for doing various things with Unicode text data, such as ways of comparing strings, of dividing strings into grapheme clusters, words, etc. It defines a few special encodings that can encode any Unicode codepoint and have some other useful properties. It defines mappings between Unicode codepoints and codepoints of legacy character sets.
Here you can find a more complete answer: Unicode.org
What exactly are unicode character codes? And how are they different from ascii characters?
Unicode is a way to assign unique numbers (called code points) to characters from nearly all languages in active use today, plus many other characters such as mathematical symbols. There are many ways to encode Unicode strings as bytes, such as UTF-8 and UTF-16.
ASCII assigns values only to 128 characters (a-z, A-Z, 0-9, space, some punctuation, and some control characters).
For every character that has an ASCII value, the Unicode code point and the ASCII value of that character are the same.
In most modern applications you should prefer to use Unicode strings rather than ASCII. This will for example allow you to have users with accented characters in their name or address, and to localize your interface to languages other than English.
The first 128 Unicode code points are the same as ASCII. Then they have a 100,000 or so more.
There are two common formats for Unicode, UTF-8 which uses 1-4 bytes for each value (so for the first 128 characters, UTF-8 is exactly the same as ASCII) and UTF-16, which uses 2 or 4 bytes.
Where can I get a list of ASCII codes corresponding to Japanese kanji, hiragana and katakana characters. I am doing a java function and Javascript which determines wether it is a Japanese character. What is its range in the ASCII code?
ASCII stands for American Standard Code for Information Interchange, only includes 128 characters (not all of them even printable), and is based on the needs of American use circa 1960. It includes nothing related to any Japanese characters.
I believe you want the Unicode code points for some characters, which you can lookup in the charts provided by unicode.org.
Please see my similar question regarding Kanji/Kana characters. As #coobird mentions it may be tricky to decide what range you want to check against since many Kanji overlap with Chinese characters.
In short, the Unicode ranges for hiragana and katakana are:
Hiragana: Unicode: 3040-309F
Katakana: Unicode: 30A0–30FF
If you find this answer useful please upvote #coobird's answer to my question as well.
がんばって!
Well it has been a while, but here's a link to tables of hiragana, katakana, kanji etc and their Unicodes...
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
BUT, as you probably know Unicodes are hexadecimal. You can translate them into decimal numbers using Windows Calc in programmer mode and then input that number as an ASCII code and it will produce the character you want, well depending on what you're putting it into. It will in MS Wordpad and Word(not Notepad).
For example the hiragana ぁ is 3041 in Unicode. 3041 is hexadecimal and translates to 12353 in decimal. If you enter 12353 as an ASCII code into Wordpad or Word i.e hold Alt, enter 12353 on the number-pad then release Alt, it will print ぁ. The range of Japanese characters seems to be Hiragana:3040 - 309f(12352-12447 in ASCII), Katakana:30a0 - 30ff(12448-12543 in ASCII), Kanji: 4e00-4DB5(19968-19893 ASCII), so there are several ranges. There's also a half-width katakana range on that chart.
Japanese characters won't be in the ASCII range, they'll be in Unicode. What do you want, just the char value for each character?
I won't rehash the ASCII part. Just have a look at the Unicode Code Charts.
Kanji will have a Unicode "Script" property of Hani, hiragana will have a "Script" property of Hira, and katakana have a "Script" property of Kana. In Java, you can determine the "Script" property of a character using the Character.UnicodeScript class: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeScript.html I don't know if you can determine a character's "Script" property in Javascript.
Of course, most kanji are characters that are also used in Chinese; given a character like 猫, it is impossible to tell whether it's being used as a Chinese character or a Japanese character.
I think what you mean by ASCII code for Japanese is the SBCS (Single Byte Character Set) equivalent in Japanese. For Japanese you only have a MBCS (Multi-Byte Character Sets) that has a combination of single byte character and multibyte characters. So for a Japanese text file saved in MBCS you have non-Japanese characters (english letters and numbers and common non-alphanumeric characters) saved as one byte and Japanese characters saved as two bytes.
Assuming that you are not referring to UNICODE which is a uniform DBCS (Double Byte Character Set) where each character is exactly two bytes. Actually to be more correct lately UNICODE also has multiple DBCS because the character set could not accomodate other character anymore. Some UNICODE character consiste of 4 bytes already having the first two bytes as leading character.
If you are referring to The first one (MBCS) that and not UNICODE then there are a lot of Japanese character set like Shift-JIS (the more popular one). So I suggest that you search Shift-JIS character map. Although there are other Japanese character set map aside from Shift-JIS.