I have a piece of Java code that is checking it is between two unicode characters:
LA(2) >= '\u0003' && LA(2) <= '\u00ff'
I understand that \u0003 represents END OF TEXT and \u00ff is LATIN SMALL LETTER Y WITH DIAERESIS, but what lies between these points? (what is it checking that LA(2) is?)
e.g. is it all Latin characters, or number characters, or characters with accents, all ascii characters, or something else?
It's Latin 1 minus the code points U+0000, U+0001 and U+0002. This includes the usual stuff that can be found on the US keyboard, plenty of control characters (below U+0020 and between U+007F and U+009F) and a few other Latin characters that can be used to write the majority of Western European languages.
The following ranges are declared:
0000 - 007F C0 Controls and Basic Latin
0080 - 00FF C1 Controls and Latin-1 Supplement
To check out which unicode value represents which character, I advise to have a look at one of the following links:
http://en.wikipedia.org/wiki/List_of_Unicode_characters
http://unicode.org/
It's the basic latin1 character set except the first 3 codes.
0x0000 - 0x007F : Basic Latin (128)
0x0080 - 0x00FF : Latin-1 Supplement (128)
The code probably checks whether the character can be output as a single byte char (latin1 encoded).
Related
Türkish chars 'ÇçĞğİıÖöŞşÜü' are not handled correctly in utf-8 encoding altough they all seem to be defined. Charcodes of all of them is 65533 (replacemnt character, possibly for error display) in usage and a question mark or box is displayed depending on the selected font. In some cases 0/null is returned as charcode. On the internet, there are lots of tools which give utf-8 definitions of them but I am not sure if tools use any defined (real/international) registry or dynamicly create the definition with known rules and calculations. Fonts for them are well-defined and no problem to display them when we enter code points manually. This proves that they are defined in utf-8. But on the other hand they are not handled in encodings or tranaformations such as ajax requests/responses.
So the base question is "HOW CAN WE DEFINE A CODEPOINT FOR A CHAR"?
The question may be tailored as follows to prevent mis-conception. Suppose we have prepared the encoding data for "Ç" like this ->
Character : Ç
Character name : LATIN CAPITAL LETTER C WITH CEDILLA
Hex code point : 00C7
Decimal code point : 199
Hex UTF-8 bytes : C387
......
Where/How can we save this info to be a standard utf-8 char?
How can we distribute/expose it (make ready to be used by others) ?
Do we need any confirmation by anybody/foundation (like unicode/utf-8 consortium)
How can we detect/fixup errors if they are already registered but not working correctly?
Can we have custom-utf8 configuration? If yes how?
Note : No code snippet is needed here as it is not mis-usage problem.
The charcters you mention are present in Unicode. Here are their character codes in hexadecimal and how they are encoded in UTF-8:
Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü
Code: 00c7 00e7 011e 011f 0130 0131 00d6 00f6 015e 015f 00dc 00fc
UTF8: c3 87 c3 a7 c4 9e c4 9f c4 b0 c4 b1 c3 96 c3 b6 c5 9e c5 9f c3 9c c3 bc
This means that if you write for example the bytes 0xc4 0x9e into a file you have written the character Ğ, and any software tool that understands UTF-8 must read it back as Ğ.
Update: For correct alphabetic order and case conversions in Turkish you have to use a library that understand locales, just like for any other natural language. For example in Java:
Locale tr = new Locale("TR","tr"); // Turkish locale
print("ÇçĞğİıÖöŞşÜü".toUpperCase(tr)); // ÇÇĞĞİIÖÖŞŞÜÜ
print("ÇçĞğİıÖöŞşÜü".toLowerCase(tr)); // ççğğiıööşşüü
Notice how i in uppercase becomes İ, and I in lowercase becomes ı. You don't say which programming language you use but surely its standard library supports locales, too.
Unicode defines the code points and certain properties for each character (for example, if it's a digit or a letter, for a letter if it's uppercase, lowercase, or titlecase), and certain generic algorithms for dealing with Unicode text (e.g. how to mix right-to-left text and left-to-right text). Alphabetic order and correct case conversion are defined by national standardization bodies, like Institute of Languages of Finland in Finland, Real Academia Española in Spain, independent of Unicode.
Update 2:
The test ((ch&0x20)==ch) for lower case is broken for most languages in the world, not just Turkish. So is the algorithm for converting upper case to lower case you mention. Also, the test for being a letter is incorrect: in many languages Z is not the last letter of the alphabet. To work with text correctly you must use library functions that have been written by people who know what they are doing.
Unicode is supposed to be universal. Creating national and language specific variants of encodings is what lead us to the mess that Unicode is trying to solve. Unfortunately there is no universal standard for ordering characters. For example in English a = ä < z, but in Swedish a < z < ä. In German Ü is equivalent to U by one standard, and to UE by another. In Finnish Ü = Y. There is no way to order code points so that the ordering would be correct in every language.
I have been attending a lecture on XML where it was written "ISO-8859-1 is a Unicode format". It sounds wrong to me, but as I research on it, I struggle understanding precisely what Unicode is.
Can you call ISO-8859-1 a Unicode format ? What can you actually call Unicode ?
ISO 8859-1 is not Unicode
ISO 8859-1 is also known as Latin-1. It is not directly a Unicode format.
However, it does have the unique privilege that its code points 0x00 .. 0xFF map one-to-one to the Unicode code points U+0000 .. U+00FF. So, the first 256 code points of Unicode, treated as 1 byte unsigned integers, map to ISO 8859-1.
Control characters
Peregring-lk observes that ISO 8859-1 does not define the control codes. The Unicode charts for U+0000..U+007F and U+0080..U+00FF suggest that the C0 controls found in positions U+0000..U+001F and U+007F come from ISO/IEC 6429:1992 and the C1 controls found in positions U+0080..U+9F likewise. Wikipedia on the C0 and C1 controls suggests that the standard is ISO/IEC 2022 instead. Note that three of the C1 controls do not have a formal name.
In general parlance, the control code points of the ISO 8859-1 code set are assumed to be the C0 and C1 controls from ISO 6429 (or 2022).
ISO-8859-1 contains a subset of UTF-8 Unicode, which substantially overlaps with ASCII.
All ASCII is UTF-8 Unicode.
All the ISO 8859-1 (ISO Latin 1) characters below codes 7f hex are ASCII compatible and UTF-8 compatible in one byte. The ligatures and characters with diacritics use multi-byte Unicode UTF-8 representations, and use Unicode compatibility codepoints.
All UTF-8 single-byte character are contained in ASCII.
UTF-8 also contains multi-byte sequences, some of which are collatable (i.e. sortable) equivalents - composed equivalents - of the characters represented by compatibility codepoints, and some of which are the characters represented by all other characters sets other than ASCII and ISO Latin 1.
No, ISO 8859-1 is not a Unicode charset, simply because ISO 8859-1 does not provide encoding for all Unicode characters, only a small subset thereof. The word “charset” is sometimes used loosely (and therefore often best avoided), but as a technical term, it means a character encoding.
Loosening the definition so that “Unicode charset” would mean an encoding that covers part of Unicode would be pointless. Then every encoding would be a “Unicode charset”.
No. ISO/IEC 8859-1 is older than Unicode. For example, you won't find € in it. Unicode is compatible to ISO 8859-1 up to some point. For the coding of characters in Unicode look at UCS / UTF8 / UTF16.
If you look at code formats you have something like
Abstract letters - The letters you are using
Code table - Bring the letters in some form (like alphabetic ordering)
Code format - Say which position in the code table is which letter, (that is the UTF8 or UTF16 encoding)
Code schema - If you use more words for accessing a code position, in which order are they? (Big Endian, Little Endian in UTF16)
[character encoding of steering instruction (e.g. < in XML)]
It depends on how you define "Unicode format."
I think most people would take it to mean an encoding capable of representing any codepoint in Unicode's range (U+0000 - U+10FFFF).
In that case, no, ISO 8859-1 is not a Unicode format.
However some other definitions might be 'a character set that is a subset of the Unicode character set,' or 'an encoding that can be considered to contain Unicode data (not necessarily arbitrary Unicode data).' ISO 8859-1 meets both of these definitions.
Unicode is a number of things. It contains a character set, in which 'characters' are assigned codepoint values. It defines properties for characters and provides a database of characters and their properties. It defines many algorithms for doing various things with Unicode text data, such as ways of comparing strings, of dividing strings into grapheme clusters, words, etc. It defines a few special encodings that can encode any Unicode codepoint and have some other useful properties. It defines mappings between Unicode codepoints and codepoints of legacy character sets.
Here you can find a more complete answer: Unicode.org
I'm giving a tech talk about Unicode and encoding in my company, in which I'm trying to make the point that strings are always encoded, and developers should never carelessly assume that everything is 0-127 ASCII.
I have numerous examples of problems caused by mis-encoded text, but I didn't find any example of simple English text with numbers that's encoded above Unicode code point 127.
The basic English alphabet is mapped in Unicode to the same numerical value as the plain old ASCII: The range A-Z is mapped to [65-90] (or [0x41-0x5a] in hex), and [a-z] is mapped to [97-122] (hex [0x61-0x7a]).
Does the English alphabet appear elsewhere in the code charts? I do not mean circumflex letters or other Latin variants, just the plain English alphabet.
CJK characters are generally monospaced in all fonts, since that's how those languages tend to be written.
When mixing CJK and English characters, however, you run into a problem: ASCII characters do not in general have the width of a CJK character. This means that if you use ASCII, you lose the monospaced property - which may not always be desirable.
For this purpose, fullwidth characters (U+FF00-FFEE, Wikipedia, Unicode code chart) may be used in place of "regular" characters. These have the property that they have the same width as a single CJK character.
Note, however, that fullwidth characters are virtually never used outside of a CJK context, and even in those contexts, plain ASCII is frequently used as well, when monospacing is considered unimportant.
Plenty of punctuation and symbols have code point values above U+007F:
“Hello.”
He had been given the comprehensive sixty-four-crayon Crayola box—including the gold and silver crayons—and would not let me look.
x ≠ y
The above examples use:
U+201C and U+201D — smart quotes
U+2014 — em-dash
U+2260 — not equal to
See the Unicode charts for more.
Well, if you just mean a-z and A-Z then no, there are no English characters above 127. But words like fiancé, resumé etc are sometimes spelled like that in English and use codepoints above 127.
Then there are various punctuation signs, currency symbols and so on that are above 127. Not sure if this counts as simple English text.
I am confused about the text encoding and charset. For many reasons, I have to
learn non-Unicode, non-UTF8 stuff in my upcoming work.
I find the word "charset" in email headers as in "ISO-2022-JP", but there's no
such a encoding in text editors. (I looked around the different text editors.)
What's the difference between text encoding and charset? I'd appreciate it
if you could show me some use case examples.
Basically:
charset is the set of characters you can use
encoding is the way these characters are stored into memory
Every encoding has a particular charset associated with it, but there can be more than one encoding for a given charset. A charset is simply what it sounds like, a set of characters. There are a large number of charsets, including many that are intended for particular scripts or languages.
However, we are well along the way in the transition to Unicode, which includes a character set capable of representing almost all the world's scripts. However, there are multiple encodings for Unicode. An encoding is a way of mapping a string of characters to a string of bytes. Examples of Unicode encodings include UTF-8, UTF-16 BE, and UTF-16 LE . Each of these has advantages for particular applications or machine architectures.
Throwing more light for people visiting henceforth, hopefully it would be helpful.
Character Set
There are characters in each language and collection of those characters form the “character set” of that language. When a character is encoded then it is assigned a unique identifier or a number called as code point. In computer, these code points will be represented by one or more bytes.
Examples of character set: ASCII (covers all English characters), ISO/IEC 646, Unicode (covers characters from all living languages in the world)
Coded Character Set
A coded character set is a set in which a unique number is assigned to each character. That unique number is called as "code point".
Coded character sets are sometimes called code pages.
Encoding
Encoding is the mechanism to map the code points with some bytes so that a character can be read and written uniformly across different system using same encoding scheme.
Examples of encoding: ASCII, Unicode encoding schemes like UTF-8, UTF-16, UTF-32.
Elaboration of above 3 concepts
Consider this - Character 'क' in Devanagari character set has a decimal code point of 2325 which will be represented by two bytes (09 15) when using the UTF-16 encoding
In “ISO-8859-1” encoding scheme “ü” (this is nothing but a character in Latin character set) is represented as hexa-decimal value of FC while in “UTF-8” it represented as C3 BC and in UTF-16 as FE FF 00 FC.
Different encoding schemes may use same code point to represent different characters, for example in “ISO-8859-1” (also called as Latin1) the decimal code point value for the letter ‘é’ is 233. However, in ISO 8859-5, the same code point represents the Cyrillic character ‘щ’.
On the other hand, a single code point in the Unicode character set can actually be mapped to different byte sequences, depending on which encoding was used for the document. The Devanagari character क, with code point 2325 (which is 915 in hexadecimal notation), will be represented by two bytes when using the UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four bytes with UTF-32 (00 00 09 15)
A character encoding consists of:
The set of supported characters
A mapping between characters and integers ("code points")
How code points are encoded as a series of "code units" (e.g., 16-bit units for UTF-16)
How code units are encoded into bytes (e.g., big-endian or little-endian)
Step #1 by itself is a "character repertoire" or abstract "character set", and #1 + #2 = a "coded character set".
But back before Unicode became popular and everyone (except East Asians) was using a single-byte encoding, steps #3 and #4 were trivial (code point = code unit = byte). Thus, older protocols didn't clearly distinguish between "character encoding" and "coded character set". Older protocols use charset when they really mean encoding.
A character set, or character repertoire, is simply a set (an unordered collection) of characters. A coded character set assigns an integer (a "code point") to each character in the repertoire. An encoding is a way of representing code points unambiguously as a stream of bytes.
Googled for it.
http://en.wikipedia.org/wiki/Character_encoding
The difference seems to be subtle. The term charset actually doesn't apply to Unicode. Unicode goes through a series of abstractions.
abstract characters -> code points -> encoding of code points to bytes.
Charsets actually skip this and directly jump from characters to bytes.
sequence of bytes <-> sequence of characters
In short,
encoding : code points -> bytes
charset: characters -> bytes
A charset is just a set; it either contains, e.g. the Euro sign, or else it doesn't. That's all.
An encoding is a bijective mapping from a character set to a set of integers. If it supports the Euro sign, it must assign a specific integer to that character and to no other.
In my opinion, a charset is part of an encoding (a component), encoding has a charset attribute, so a charset can be used in many encodings. For example, Unicode is a charset used in encodings like UTF-8, UTF-16 and so on. See illustration here:
The char in charset doesn't mean the char type in the programming world. It means a character in the real world. In English it maybe the same, but in other languages not, like in Chinese, '我' is a inseparable 'char' in charsets (Unicode, GB [used in GBK and GB2312]), 'a' is also a char in charsets (ASCII, ISO-8859, Unicode).
In my opinion the word "charset" should be limited to identifying the parameter used in HTTP, MIME, and similar standards to specify a character encoding (a mapping from a series of text characters to a sequence of bytes) by name. For example:charset=utf-8.
I'm aware, though, that MySQL, Java, and other places may use the word "charset" to mean a character encoding.
An encoding is a mapping between bytes and characters from a character set, so it will be helpful to discuss and understand the difference between between bytes and characters.
Think of bytes as numbers between 0 and 255, whereas characters are abstract things like "a", "1", "$" and "Ä". The set of all characters that are available is called a character set.
Each character has a sequence of one or more bytes that are used to represent it; however, the exact number and value of the bytes depends on the encoding used and there are many different encodings.
Most encodings are based on an old character set and encoding called ASCII which is a single byte per character (actually, only 7 bits) and contains 128 characters including a lot of the common characters used in US English.
For example, here are 6 characters in the ASCII character set that are represented by the values 60 to 65.
Extract of ASCII Table 60-65
╔══════╦══════════════╗
║ Byte ║ Character ║
╠══════╬══════════════║
║ 60 ║ < ║
║ 61 ║ = ║
║ 62 ║ > ║
║ 63 ║ ? ║
║ 64 ║ # ║
║ 65 ║ A ║
╚══════╩══════════════╝
In the full ASCII set, the lowest value used is zero and the highest is 127 (both of these are hidden control characters).
However, once you start needing more characters than the basic ASCII provides (for example, letters with accents, currency symbols, graphic symbols, etc.), ASCII is not suitable and you need something more extensive. You need more characters (a different character set) and you need a different encoding as 128 characters is not enough to fit all the characters in. Some encodings offer one byte (256 characters) or up to six bytes.
Over time a lot of encodings have been created. In the Windows world, there is CP1252, or ISO-8859-1, whereas Linux users tend to favour UTF-8. Java uses UTF-16 natively.
One sequence of byte values for a character in one encoding might stand for a completely different character in another encoding, or might even be invalid.
For example, in ISO 8859-1, â is represented by one byte of value 226, whereas in UTF-8 it is two bytes: 195, 162. However, in ISO 8859-1, 195, 162 would be two characters, Ã, ¢.
When computers store data about characters internally or transmit it to another system, they store or send bytes. Imagine a system opening a file or receiving message sees the bytes 195, 162. How does it know what characters these are?
In order for the system to interpret those bytes as actual characters (and so display them or convert them to another encoding), it needs to know the encoding used. That is why encoding appears in XML headers or can be specified in a text editor. It tells the system the mapping between bytes and characters.
Where can I get a list of ASCII codes corresponding to Japanese kanji, hiragana and katakana characters. I am doing a java function and Javascript which determines wether it is a Japanese character. What is its range in the ASCII code?
ASCII stands for American Standard Code for Information Interchange, only includes 128 characters (not all of them even printable), and is based on the needs of American use circa 1960. It includes nothing related to any Japanese characters.
I believe you want the Unicode code points for some characters, which you can lookup in the charts provided by unicode.org.
Please see my similar question regarding Kanji/Kana characters. As #coobird mentions it may be tricky to decide what range you want to check against since many Kanji overlap with Chinese characters.
In short, the Unicode ranges for hiragana and katakana are:
Hiragana: Unicode: 3040-309F
Katakana: Unicode: 30A0–30FF
If you find this answer useful please upvote #coobird's answer to my question as well.
がんばって!
Well it has been a while, but here's a link to tables of hiragana, katakana, kanji etc and their Unicodes...
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
BUT, as you probably know Unicodes are hexadecimal. You can translate them into decimal numbers using Windows Calc in programmer mode and then input that number as an ASCII code and it will produce the character you want, well depending on what you're putting it into. It will in MS Wordpad and Word(not Notepad).
For example the hiragana ぁ is 3041 in Unicode. 3041 is hexadecimal and translates to 12353 in decimal. If you enter 12353 as an ASCII code into Wordpad or Word i.e hold Alt, enter 12353 on the number-pad then release Alt, it will print ぁ. The range of Japanese characters seems to be Hiragana:3040 - 309f(12352-12447 in ASCII), Katakana:30a0 - 30ff(12448-12543 in ASCII), Kanji: 4e00-4DB5(19968-19893 ASCII), so there are several ranges. There's also a half-width katakana range on that chart.
Japanese characters won't be in the ASCII range, they'll be in Unicode. What do you want, just the char value for each character?
I won't rehash the ASCII part. Just have a look at the Unicode Code Charts.
Kanji will have a Unicode "Script" property of Hani, hiragana will have a "Script" property of Hira, and katakana have a "Script" property of Kana. In Java, you can determine the "Script" property of a character using the Character.UnicodeScript class: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeScript.html I don't know if you can determine a character's "Script" property in Javascript.
Of course, most kanji are characters that are also used in Chinese; given a character like 猫, it is impossible to tell whether it's being used as a Chinese character or a Japanese character.
I think what you mean by ASCII code for Japanese is the SBCS (Single Byte Character Set) equivalent in Japanese. For Japanese you only have a MBCS (Multi-Byte Character Sets) that has a combination of single byte character and multibyte characters. So for a Japanese text file saved in MBCS you have non-Japanese characters (english letters and numbers and common non-alphanumeric characters) saved as one byte and Japanese characters saved as two bytes.
Assuming that you are not referring to UNICODE which is a uniform DBCS (Double Byte Character Set) where each character is exactly two bytes. Actually to be more correct lately UNICODE also has multiple DBCS because the character set could not accomodate other character anymore. Some UNICODE character consiste of 4 bytes already having the first two bytes as leading character.
If you are referring to The first one (MBCS) that and not UNICODE then there are a lot of Japanese character set like Shift-JIS (the more popular one). So I suggest that you search Shift-JIS character map. Although there are other Japanese character set map aside from Shift-JIS.