Why this elaborate RTF encoding of an apostrophe? - unicode

Scrivener produces RTF files with this elaborate apostrophe encoding:
They didn\loch\af0\hich\af0\dbch\af0\uc1\u8217\'92t do it.
Unicode 8217 is "Right Single Quotation Mark". Okay, but this RTF has that unicode character and \'92 as well. What's going on here?

That RTF breaks down to the following:
They didn - plain text
\loch - The text consists of single-byte low-ANSI (0x00–0x7F) characters
\af0 - Associated Font Number 0
\hich - The text consists of single-byte high-ANSI (0x80–0xFF) characters
\af0 - Associated Font Number 0
\dbch - The text consists of double-byte characters
\af0 - Associated Font Number 0
\uc1 - number of bytes corresponding to a given \uN Unicode character
\u8217 - a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page
\'92 - A hexadecimal value, based on the specified character set (may be used to identify 8-bit values).
t do it. - plain text
Some of that is superfluous in this context and can be ignored; it is just font information. What is important is that \u8217 represents the apostrophe in Unicode, \'92 represents an equivalent apostrophe in ANSI, and \uc1 indicates that the ANSI fallback following \u8217 (here \'92) takes up 1 byte. A Unicode-enabled RTF reader will handle \u8217 and ignore \'92. A non-Unicode RTF reader will ignore \u8217 and handle \'92. This is stated in the RTF spec for Unicode RTF:
\uN
This keyword represents a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.
...
An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.
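To make the two reader behaviours concrete, here is a minimal Python sketch of a reader for just the control words discussed above. It is not a real RTF parser: it ignores groups, font tables and everything else, and the function name rtf_to_text, its parameters and the choice of cp1252 as the ANSI code page are assumptions made for illustration.

```python
import re

# One alternative per token kind. Order matters: the generic control-word
# pattern would otherwise also swallow \ucN and \uN.
_TOKEN = re.compile(
    r"\\uc(?P<uc>\d+) ?"             # \ucN: number of fallback bytes after \uN
    r"|\\u(?P<u>-?\d+) ?"            # \uN: Unicode character, signed 16-bit decimal
    r"|\\'(?P<hex>[0-9a-fA-F]{2})"   # \'hh: one byte in the current code page
    r"|\\[a-z]+-?\d* ?"              # any other control word (ignored here)
    r"|(?P<text>[^\\]+)"             # plain text
)


def rtf_to_text(rtf, unicode_aware=True, codepage="cp1252"):
    """Hypothetical helper: decode a single run of RTF text."""
    out = []
    uc = 1      # the RTF default when no \ucN has been seen
    skip = 0    # fallback characters still to be skipped after a \uN
    for m in _TOKEN.finditer(rtf):
        if m.group("uc") is not None:
            uc = int(m.group("uc"))
        elif m.group("u") is not None:
            if unicode_aware:
                n = int(m.group("u"))
                out.append(chr(n + 0x10000 if n < 0 else n))
                skip = uc            # ignore the ANSI fallback that follows
            # a legacy reader simply ignores the unknown \uN keyword
        elif m.group("hex") is not None:
            if skip:
                skip -= 1
            else:
                out.append(bytes([int(m.group("hex"), 16)]).decode(codepage))
        elif m.group("text") is not None:
            text = m.group("text")
            out.append(text[skip:])
            skip = max(0, skip - len(text))
    return "".join(out)


sample = r"They didn\loch\af0\hich\af0\dbch\af0\uc1\u8217\'92t do it."
print(rtf_to_text(sample))                       # Unicode-aware reader
print(rtf_to_text(sample, unicode_aware=False))  # legacy reader
```

Both calls should print the same sentence with a right single quotation mark: the first takes it from \u8217 and skips \'92, the second ignores \u8217 and decodes byte 0x92 through the code page.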

Related

Character set vs codepage layout

1) Can anyone explain to me why the ASCII and Latin-1 tables appear once in the chapter Character Set and once under Code page layout? I am fine if both terms are used interchangeably, but this is still inconsistent, or am I missing something?
2) Are ASCII and Latin-1 fully compatible? 0x00 to 0x1F don't seem to be defined in Latin-1, why?
A character set is a set of notional writing system concepts, such as capital Fraktur Z, line feed, or bicycle symbol. These include typographic style variations that have significant contexts for usage (e.g. mathematics) but not typical typeface (font) variations.
Each codepoint in a character set is an element in a mapping between the "character" and an integer.
A character encoding is an algorithm to convert between a codepoint in the character set and a sequence of one or more code units in the character encoding. Code units are integers. Integers wider than one byte have a byte order (endianness). A code unit is serialized to a sequence of bytes for streaming or storage. Character encoding functions often map both steps at once: between a codepoint and bytes.
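As a small illustration of the chain codepoint, code units, bytes, the following Python sketch takes the bicycle symbol and shows each level. The variable names are arbitrary, and UTF-16 is chosen only because its code units are wider than one byte, so byte order becomes visible.

```python
import struct

ch = "\U0001F6B2"  # BICYCLE

# Character set level: the codepoint is just an integer.
codepoint = ord(ch)

# Encoding level: UTF-16 maps this codepoint to two 16-bit code units.
utf16_be = ch.encode("utf-16-be")
code_units = struct.unpack(">%dH" % (len(utf16_be) // 2), utf16_be)

# Serialization level: the same code units in the other byte order.
utf16_le = ch.encode("utf-16-le")

# UTF-8 has one-byte code units, so code units and bytes coincide.
utf8 = ch.encode("utf-8")

print(hex(codepoint))                 # 0x1f6b2
print([hex(u) for u in code_units])   # ['0xd83d', '0xdeb2']
print(utf16_be.hex(), utf16_le.hex()) # d83ddeb2 3dd8b2de
print(utf8.hex())                     # f09f9ab2
```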
Many character sets have only one character encoding, and many character encodings have single-byte code units. This makes them easy to present with the concepts of codepoint, code unit and byte collapsed into one, and with character set and character encoding collapsed as well.
This all has a long history. Terminology, focus and standards have evolved. The context can be a clue as to what is meant. "Code page" is/was often used when identifying a particular extension to ASCII. In some original standards, only the differences or extensions were documented. Vendor libraries often filled in gaps in the character sets so they would be completely defined over 256 codepoints. When the Unicode character set was being developed, transcoding tables between Unicode and other character sets were accepted from vendors. This effectively standardized some character sets to 256 codepoints. (You can see the Unicode codepoint in hexadecimal in your tables.)
ASCII and Latin-1 (effectively the same as ISO 8859-1) are compatible in a limited sense:
The first 128 codepoints and code unit values are the same. ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429. Nobody likes a mess like that. That's why the members of Unicode just took the character sets as they were used in the field when creating mappings between Unicode and other character sets.
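The limited compatibility can be checked with a few lines of Python. This is only a sketch using Python's built-in "ascii" and "latin-1" codecs; note that Python's latin-1 codec is the fully populated 256-entry variant, with the control codes filled in.

```python
low = bytes(range(0x00, 0x80))    # the 128 values ASCII defines
high = bytes(range(0x80, 0x100))  # the values only Latin-1 defines

# Bytes 0x00-0x7F decode to the same characters under both codecs.
same_low = low.decode("ascii") == low.decode("latin-1")

# Bytes 0x80-0xFF are not ASCII at all, but are valid Latin-1.
try:
    high.decode("ascii")
    ascii_accepts_high = True
except UnicodeDecodeError:
    ascii_accepts_high = False

latin1_high = high.decode("latin-1")

print(same_low)            # True
print(ascii_accepts_high)  # False
print(len(latin1_high))    # 128
```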

Encoding and character set for iso-8859-1

I have read Joel's article about encodings. As I understand in case of unicode:
unicode is a character set - mapping between integer value and character
utf-8 is an encoding which is used for unicode integers to present them in binary view
What about iso-8859-1? Is it encoding or character set or both?
ISO 8859-1 (Latin-1) is a single-byte encoding. It represents the first 256 Unicode characters. So, as long as it is subset of Unicode character set, I suppose it could be treated as both encoding and character set.
What about iso-8859-1? Is it encoding or character set or both?
Historically, it was described as a coded character set: it defined both a set of characters, and a mapping of those characters to byte values — what we would today call an encoding, but it was not explicitly described in those terms.
When Unicode was created, it was designed to encompass (nearly) all characters in widely-used character sets, and hence it recast the byte stream defined by the ISO-8859-1 coded character set as an encoding of the wider Universal Character Set.
So if you are working in a modern Unicode environment you would consider ISO-8859-1 to be an encoding. But it can't really be said to be wrong to consider it also a character set.
(There are other encodings which are definitely not character sets: for example the UTFs, and multibyte encodings like Shift-JIS, which was itself defined as an encoding for the JIS X 0208 character set prior to Unicode's extend-and-embrace.)

Is ISO-8859-1 a Unicode charset?

I have been attending a lecture on XML where it was written "ISO-8859-1 is a Unicode format". It sounds wrong to me, but as I research on it, I struggle understanding precisely what Unicode is.
Can you call ISO-8859-1 a Unicode format ? What can you actually call Unicode ?
ISO 8859-1 is not Unicode
ISO 8859-1 is also known as Latin-1. It is not directly a Unicode format.
However, it does have the unique privilege that its code points 0x00 .. 0xFF map one-to-one to the Unicode code points U+0000 .. U+00FF. So, the first 256 code points of Unicode, treated as 1 byte unsigned integers, map to ISO 8859-1.
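The one-to-one mapping is easy to verify in Python. The sketch below relies on Python's "iso-8859-1" codec, which covers all 256 byte values including the control positions.

```python
all_bytes = bytes(range(256))

# Decode every possible byte value as ISO 8859-1 ...
decoded = all_bytes.decode("iso-8859-1")

# ... and each byte value N comes out as Unicode code point U+00NN.
code_points = [ord(c) for c in decoded]

# The mapping is reversible for exactly these 256 code points.
round_trip = decoded.encode("iso-8859-1")

print(code_points == list(range(256)))  # True
print(round_trip == all_bytes)          # True
```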
Control characters
Peregring-lk observes that ISO 8859-1 does not define the control codes. The Unicode charts for U+0000..U+007F and U+0080..U+00FF suggest that the C0 controls found in positions U+0000..U+001F and U+007F come from ISO/IEC 6429:1992 and the C1 controls found in positions U+0080..U+009F likewise. Wikipedia on the C0 and C1 controls suggests that the standard is ISO/IEC 2022 instead. Note that three of the C1 controls do not have a formal name.
In general parlance, the control code points of the ISO 8859-1 code set are assumed to be the C0 and C1 controls from ISO 6429 (or 2022).
ISO-8859-1 contains a subset of UTF-8 Unicode, which substantially overlaps with ASCII.
All ASCII is UTF-8 Unicode.
All the ISO 8859-1 (ISO Latin 1) characters below codes 7f hex are ASCII compatible and UTF-8 compatible in one byte. The ligatures and characters with diacritics use multi-byte Unicode UTF-8 representations, and use Unicode compatibility codepoints.
All UTF-8 single-byte characters are contained in ASCII.
UTF-8 also contains multi-byte sequences, some of which are collatable (i.e. sortable) equivalents - composed equivalents - of the characters represented by compatibility codepoints, and some of which are the characters represented by all other characters sets other than ASCII and ISO Latin 1.
No, ISO 8859-1 is not a Unicode charset, simply because ISO 8859-1 does not provide encoding for all Unicode characters, only a small subset thereof. The word “charset” is sometimes used loosely (and therefore often best avoided), but as a technical term, it means a character encoding.
Loosening the definition so that “Unicode charset” would mean an encoding that covers part of Unicode would be pointless. Then every encoding would be a “Unicode charset”.
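A quick way to see that ISO 8859-1 covers only a small subset of Unicode is to try encoding a few characters. In this Python sketch, encodable is a hypothetical helper name, and the sample characters are arbitrary.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+00E9 (e with acute), U+20AC (euro sign), U+3041 (hiragana small a)
samples = ["\u00e9", "\u20ac", "\u3041"]

for s in samples:
    print(s, encodable(s, "iso-8859-1"), encodable(s, "utf-8"))
```

Only the first sample should be encodable as ISO 8859-1, while UTF-8 accepts all three.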
No. ISO/IEC 8859-1 is older than Unicode. For example, you won't find € in it. Unicode is compatible with ISO 8859-1 up to a point. For the coding of characters in Unicode, look at UCS / UTF-8 / UTF-16.
If you look at code formats you have something like
Abstract letters - The letters you are using
Code table - Bring the letters in some form (like alphabetic ordering)
Code format - Say which position in the code table is which letter, (that is the UTF8 or UTF16 encoding)
Code schema - If you use more words for accessing a code position, in which order are they? (Big Endian, Little Endian in UTF16)
[character encoding of steering instruction (e.g. < in XML)]
It depends on how you define "Unicode format."
I think most people would take it to mean an encoding capable of representing any codepoint in Unicode's range (U+0000 - U+10FFFF).
In that case, no, ISO 8859-1 is not a Unicode format.
However some other definitions might be 'a character set that is a subset of the Unicode character set,' or 'an encoding that can be considered to contain Unicode data (not necessarily arbitrary Unicode data).' ISO 8859-1 meets both of these definitions.
Unicode is a number of things. It contains a character set, in which 'characters' are assigned codepoint values. It defines properties for characters and provides a database of characters and their properties. It defines many algorithms for doing various things with Unicode text data, such as ways of comparing strings, of dividing strings into grapheme clusters, words, etc. It defines a few special encodings that can encode any Unicode codepoint and have some other useful properties. It defines mappings between Unicode codepoints and codepoints of legacy character sets.
Here you can find a more complete answer: Unicode.org

Are Unicode and Ascii characters the same?

What exactly are unicode character codes? And how are they different from ascii characters?
Unicode is a way to assign unique numbers (called code points) to characters from nearly all languages in active use today, plus many other characters such as mathematical symbols. There are many ways to encode Unicode strings as bytes, such as UTF-8 and UTF-16.
ASCII assigns values only to 128 characters (a-z, A-Z, 0-9, space, some punctuation, and some control characters).
For every character that has an ASCII value, the Unicode code point and the ASCII value of that character are the same.
In most modern applications you should prefer to use Unicode strings rather than ASCII. This will for example allow you to have users with accented characters in their name or address, and to localize your interface to languages other than English.
The first 128 Unicode code points are the same as ASCII. Unicode then defines 100,000 or so more.
There are two common formats for Unicode, UTF-8 which uses 1-4 bytes for each value (so for the first 128 characters, UTF-8 is exactly the same as ASCII) and UTF-16, which uses 2 or 4 bytes.
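The byte counts mentioned above can be observed directly. This Python sketch encodes four arbitrary sample characters and records how many bytes each takes in UTF-8 and UTF-16 (little-endian, without a byte order mark).

```python
# U+0041 'A', U+00E9 (e with acute), U+3041 (hiragana small a), U+1F6B2 (bicycle)
samples = ["A", "\u00e9", "\u3041", "\U0001F6B2"]

# Map each character to (bytes in UTF-8, bytes in UTF-16).
sizes = {
    s: (len(s.encode("utf-8")), len(s.encode("utf-16-le")))
    for s in samples
}

for s, (n8, n16) in sizes.items():
    print("U+%04X  utf-8: %d  utf-16: %d" % (ord(s), n8, n16))

# For ASCII characters the UTF-8 bytes are identical to the ASCII bytes.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```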

Japanese ASCII Code

Where can I get a list of ASCII codes corresponding to Japanese kanji, hiragana and katakana characters? I am writing a Java function and JavaScript that determine whether a character is Japanese. What is its range in the ASCII code?
ASCII stands for American Standard Code for Information Interchange, only includes 128 characters (not all of them even printable), and is based on the needs of American use circa 1960. It includes nothing related to any Japanese characters.
I believe you want the Unicode code points for some characters, which you can lookup in the charts provided by unicode.org.
Please see my similar question regarding Kanji/Kana characters. As @coobird mentions, it may be tricky to decide what range you want to check against, since many Kanji overlap with Chinese characters.
In short, the Unicode ranges for hiragana and katakana are:
Hiragana: Unicode: 3040-309F
Katakana: Unicode: 30A0-30FF
If you find this answer useful please upvote @coobird's answer to my question as well.
がんばって!
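The hiragana and katakana ranges given above translate into a simple range check. The sketch is in Python rather than Java or JavaScript, and kana_kind is a hypothetical name, but the same comparison on the code point value works in either of those languages.

```python
# Unicode blocks from the answer above (end values are exclusive in range()).
HIRAGANA = range(0x3040, 0x30A0)  # U+3040..U+309F
KATAKANA = range(0x30A0, 0x3100)  # U+30A0..U+30FF


def kana_kind(ch):
    """Return 'hiragana', 'katakana', or None for a single character."""
    cp = ord(ch)
    if cp in HIRAGANA:
        return "hiragana"
    if cp in KATAKANA:
        return "katakana"
    return None


for ch in ["\u3041", "\u30a2", "A", "\u732b"]:
    print(ch, kana_kind(ch))
```

Kanji are deliberately left out here, since they need their own ranges and cannot be told apart from Chinese usage by code point alone.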
Well it has been a while, but here's a link to tables of hiragana, katakana, kanji etc and their Unicodes...
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
BUT, as you probably know, the Unicode values in those tables are hexadecimal. You can translate them into decimal numbers using Windows Calc in programmer mode and then input that number as an Alt code, and it will produce the character you want, depending on what you're putting it into. It will in MS Wordpad and Word (not Notepad).
For example, the hiragana ぁ is 3041 in Unicode. 3041 is hexadecimal and translates to 12353 in decimal. If you enter 12353 as an Alt code in Wordpad or Word (i.e. hold Alt, enter 12353 on the number pad, then release Alt), it will print ぁ. The ranges of Japanese characters seem to be Hiragana: 3040-309F (12352-12447 in decimal), Katakana: 30A0-30FF (12448-12543 in decimal), Kanji: 4E00-9FFF (19968-40959 in decimal), so there are several ranges. There's also a half-width katakana range on that chart.
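The hexadecimal to decimal conversion does not need a calculator; a few lines of Python reproduce it. This is only a sketch, with arbitrary variable names, covering the ぁ example and the hiragana and katakana bounds.

```python
# Hexadecimal 3041 (hiragana small a) as a decimal number, and back to a character.
cp = int("3041", 16)
ch = chr(cp)
print(cp, ch)  # 12353 ぁ

# Decimal equivalents of the hiragana and katakana range bounds.
bounds = {"Hiragana": ("3040", "309F"), "Katakana": ("30A0", "30FF")}
decimal_bounds = {
    name: (int(lo, 16), int(hi, 16)) for name, (lo, hi) in bounds.items()
}
print(decimal_bounds)
```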
Japanese characters won't be in the ASCII range, they'll be in Unicode. What do you want, just the char value for each character?
I won't rehash the ASCII part. Just have a look at the Unicode Code Charts.
Kanji will have a Unicode "Script" property of Hani, hiragana will have a "Script" property of Hira, and katakana have a "Script" property of Kana. In Java, you can determine the "Script" property of a character using the Character.UnicodeScript class: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeScript.html I don't know if you can determine a character's "Script" property in Javascript.
Of course, most kanji are characters that are also used in Chinese; given a character like 猫, it is impossible to tell whether it's being used as a Chinese character or a Japanese character.
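For comparison, Python's standard library does not expose the "Script" property directly, as far as I know. A rough approximation, sketched below, is to look at the character name from the unicodedata module. The function name rough_script is hypothetical, and matching on name prefixes is an assumption that will misclassify some edge cases (for example combining marks and CJK ideographs outside the "CJK UNIFIED IDEOGRAPH" naming pattern).

```python
import unicodedata


def rough_script(ch):
    """Approximate the Script property from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "Hira"
    if name.startswith("KATAKANA"):
        return "Kana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Hani"
    return "other"


for ch in ["\u3041", "\u30a2", "\u732b", "A"]:
    print(ch, rough_script(ch))
```

As with the Java approach, a result of "Hani" for 猫 says nothing about whether the text is Chinese or Japanese.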
I think what you mean by ASCII code for Japanese is the SBCS (Single Byte Character Set) equivalent in Japanese. For Japanese you only have an MBCS (Multi-Byte Character Set), which has a combination of single-byte and multi-byte characters. So for a Japanese text file saved in MBCS, you have non-Japanese characters (English letters, numbers and common non-alphanumeric characters) saved as one byte and Japanese characters saved as two bytes.
This assumes that you are not referring to Unicode, which is often thought of as a uniform DBCS (Double Byte Character Set) where each character is exactly two bytes. To be more precise, that describes the original 16-bit design; the character set outgrew 16 bits, so in UTF-16 some characters now consist of 4 bytes, with the first two bytes acting as a leading unit.
If you are referring to the first one (MBCS) and not Unicode, then there are a lot of Japanese character sets, such as Shift-JIS (the more popular one). So I suggest that you search for a Shift-JIS character map, although there are other Japanese character set maps aside from Shift-JIS.
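The mix of one-byte and two-byte characters in an MBCS such as Shift-JIS can be seen with Python's built-in "shift_jis" codec. This is a small sketch with an arbitrary sample string of two Latin characters, one hiragana and one katakana.

```python
# 'A', hiragana a (U+3042), '1', katakana a (U+30A2)
text = "A\u30421\u30a2"

encoded = text.encode("shift_jis")

# Number of bytes each character occupies in Shift-JIS.
per_char = [(ch, len(ch.encode("shift_jis"))) for ch in text]

print(encoded.hex(" "))  # 41 82 a0 31 83 41
print(per_char)
```

The Latin letter and digit keep their one-byte ASCII values, while each kana takes two bytes.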