How to write Chinese / multi-byte characters in ESC/POS? - encoding

I would like to know how to write Chinese / multi-byte characters in ESC/POS.
There is a reference table here:
https://reference.epson-biz.com/modules/ref_charcode_en/index.php?content_id=110
And a guide to how to read the table:
https://reference.epson-biz.com/modules/ref_charcode_en/index.php?content_id=4
which has this diagram:
with the text:
The first column shows the character code of the first character in the row. The first row shows the value to be added for the character code in the column.
However, I am struggling to understand how to read this table, even with this diagram.
For example how do I write this symbol:
It can be found here:
https://reference.epson-biz.com/modules/ref_charcode_en/index.php?content_id=111
on the first row on the 2nd column.
the JIS is: 30-20
the S-JIS is: 88-9E
and the additional value is: 2
So what is the byte value of the chracter.

The value of multibyte characters is 16-bit big endian.
In JIS, the code for 唖 is 0x3020 + 0x0002, which is 0x3022, and in ShiftJIS, 0x889E + 0x0002, which is 0x88A0.
The byte array is 0x30, 0x22 in JIS and 0x88, 0xA0 in Shift JIS, respectively.
By the way, the table and character code are Japanese, not Chinese.

Related

Why this elaborate RTF encoding of an apostrophe?

Scrivener produces RTF files with this elaborate apostrophe encoding:
They didn\loch\af0\hich\af0\dbch\af0\uc1\u8217\'92t do it.
Unicode 8217 is "Right Single Quotation Mark". Okay, but this RTF has that unicode character and \'92 as well. What's going on here?
That RTF breaks down to the following:
They didn - plain text
\loch - The text consists of single-byte low-ANSI (0x00–0x79) characters
\af0 - Associated Font Number 0
\hich - The text consists of single-byte high-ANSI (0x80–0xFF) characters
\af0 - Associated Font Number 0
\dbch - The text consists of double-byte characters
\af0 - Associated Font Number 0
\uc1 - number of bytes corresponding to a given \uN Unicode character
\u8217 - a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page
\'92 - A hexadecimal value, based on the specified character set (may be used to identify 8-bit values).
t do it. - plain text
Some of that is superfluous in this context and can be ignored, it is just font information. What is important is that \u8217 represents the apostrophe in Unicode, \'92 represents an equivalent apostrophe in ANSI, and \uc1 indicates the \'92 takes up 1 character. A Unicode-enabled RTF reader will handle \u8217 and ignore \'92. A non-Unicode RTF reader will ignore \u8217 and handle \'92. This is stated in the RTF spec for Unicode RTF:
\uN
This keyword represents a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.
...
An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.

Character set vs codepage layout

1) Can anyone explain me why the ASCII and Latin-1 table is once in the chapter Character Set and once under Code page layout? I am fine if both terms are interchangbly used, but this is still inconsistent, or am I missing something?
2) Are ASCII and Latin-1 fully compatible? 0x00 to 0x1F don't seem to be defined in Latin-1, why?
A character set is a set of notional writing system concepts, such as capital Fraktur Z, line feed, or bicycle symbol. These include typographic style variations that have significant contexts for usage (e.g. mathematics) but not typical typeface (font) variations.
Each codepoint in a character set is an element in a mapping between the "character" and an integer.
A character encoding is an algorithm to convert between a codepoint in the character set and a sequence of one or more code units in the character encoding. Code units are integers. Integers wider than one byte have a byte order (endianness). A code unit is serialized to a sequence of bytes for streaming or storage. Character encoding functions often map both steps at once: between a codepoint and bytes.
Many character sets have one character encoding. Many character encodings have single-byte code units. This makes them easy to present with the concepts of codepoint, code unit and byte collapses as well as character set and character encoding collapsed.
This all has a long history. Terminology, focus and standards have evolved. The context can be a clue as to what is meant. "Code page" is/was often used when identifying a particular extension to ASCII. In some original standards, only the differences or extensions were documented. Vendor libraries often filled in gaps in the character sets so they would be completely defined over 256 codepoints. When the Unicode character set was being developed, transcoding tables between Unicode and other character set were accepted from vendors. This effectively standardized some character set to 256 codepoints. (You can see the Unicode codepoint in hexadecimal in your tables.)
ASCII and Latin-1 (effectively the same as ISO 8859-1) are compatible in a limited sense:
The first 128 codepoints and code unit values are the same. ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429. Nobody likes a mess like that. That's why the members of Unicode just took the characters sets as they were used in the field when creating mappings between Unicode and other character sets.

How to print escaped hexadecimal in a string in C++?

I have questions related to Unicode, printing escaped hexadecimal values in const char*.
From what I have understood, utf-8 includes 2-, 3- or 4-byte characters, ranging from pound symbol to Kanji characters. Within strings these are represented in hexadecimal values using \u as escape sequence. Also I have understood while using hexadecimal escape in a string, the characters whose value can be included in the escape will be included. For example say "abc\x0f0dab" will include 0f0dab to be considered within \x as hex even though you want only 0f0d to be considered.
Now while writing a Unicode string, say you want to write "abc𤭢def₤ghi", where Unicode for 𤭢 is 0x24B62 and ₤ is 0x00A3. So I will have to compose the string as "abc0x24B62def0x00A3ghi". The 0x will consider all values that can be included in it. So if you want to print "abc𤭢62" the string will be "abc0x24B6262". Won't the entire string be taken as a 4-byte unicode (0x24B6262) value considered within 0x? How to solve this? How to print "abc𤭢62" and not abc(0x24B6262)?
I have a string const char* tmp = "abc\x0fdef";. When I print using printf("\n string = %s", tmp); then it will print abcdef. Where is 0f here? I know the decimal value of \x0f will be stored in the string, i.e. 15, so when we try to print, 15 should be printed right? I mean, it should be "abc15def" but it prints only "abcdef".
I think you may be unfamiliar with the concept of encodings, from reading your post.
For instance, you say "unicode of ... ₤ is 0x00A3". That is true - unicode codepoint U+00A3 is the pound sign. But 0x00A3 is not how you represent the pound sign in, for example, UTF-8 (a particular common encoding of Unicode). Take a look here to see what I mean. As you can see, the UTF-8 encoding of U+00A3 is the two bytes is 0xc2, 0xa3 (in that order).
There are several things that happen between your call to printf() and when something appears on your screen.
First, your program runs the code printf("abc\x0fdef"), and that means that the following bytes in order, are written to stdout for your program:
0x61, 0x62, 0x63, 0x0f, 0x64, 0x65, 0x66
Note: I'm assuming your source code is ASCII (or UTF-8), which is very common. Technically, the interpretation of your source code's character set is implementation-defined, I believe.
Now, in order to see output, you will typically be running this program inside some kind of shell, and it has to eventually transform those bytes into visual characters. It does this by using an encoding. Again, something ASCII-compatible is common, such as UTF-8. On Windows, CP1252 is common.
And if that is the case, you get the following mapping:
0x61 - a
0x62 - b
0x63 - c
0x0f - the 'shift in' ASCII control code
0x64 - d
0x65 - e
0x66 - f
This prints out as "abcdef" because the 'shift in' control code is a non-printing character.
Note: The above can change depending on what exact character sets are involved, but ASCII or UTF-8 is very likely what you're dealing with unless you have an exotic setup.
If you have a UTF-8 compatible terminal, the following should print out "abc₤def", just as an example to get you started:
printf("abc\xc2\xa3def");
Make sense?
Update: To answer the question from your comment: you need to distinguish between a codepoint and the byte values for an encoding of that codepoint.
The Unicode standard defines 'codepoints' which are numerical values for characters. These are commonly written as U+XYZ where XYZ is a hexidecimal value.
For instance, the character U+219e is LEFTWARDS TWO HEADED ARROW.
This might also be written 0x219e. You would know from context that the writer is talking about a codepoint.
When you need to encode that codepoint (to print, or save to file, etc), you use an encoding, such as UTF-8. Note, if you used, for example, the UTF-32 encoding, every codepoint corresponds exactly to the encoded value. So in UTF-32, the codepoint U+219e would indeed be encoded simply as 0x219e. But other encodings will do things differently. UTF-8 will encode U+219e as the three bytes 0xE2 0x86 0x9E.
Lastly, the \x notation is simply how you write arbitrary byte values inside a C/C++ quoted string. If I write, in C source code, "\xff", then that string in memory will be the two bytes 0xff 0x00 (since it automatically gets a null terminator).

Japanese ASCII Code

Where can I get a list of ASCII codes corresponding to Japanese kanji, hiragana and katakana characters. I am doing a java function and Javascript which determines wether it is a Japanese character. What is its range in the ASCII code?
ASCII stands for American Standard Code for Information Interchange, only includes 128 characters (not all of them even printable), and is based on the needs of American use circa 1960. It includes nothing related to any Japanese characters.
I believe you want the Unicode code points for some characters, which you can lookup in the charts provided by unicode.org.
Please see my similar question regarding Kanji/Kana characters. As #coobird mentions it may be tricky to decide what range you want to check against since many Kanji overlap with Chinese characters.
In short, the Unicode ranges for hiragana and katakana are:
Hiragana: Unicode: 3040-309F
Katakana: Unicode: 30A0–30FF
If you find this answer useful please upvote #coobird's answer to my question as well.
がんばって!
Well it has been a while, but here's a link to tables of hiragana, katakana, kanji etc and their Unicodes...
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
BUT, as you probably know Unicodes are hexadecimal. You can translate them into decimal numbers using Windows Calc in programmer mode and then input that number as an ASCII code and it will produce the character you want, well depending on what you're putting it into. It will in MS Wordpad and Word(not Notepad).
For example the hiragana ぁ is 3041 in Unicode. 3041 is hexadecimal and translates to 12353 in decimal. If you enter 12353 as an ASCII code into Wordpad or Word i.e hold Alt, enter 12353 on the number-pad then release Alt, it will print ぁ. The range of Japanese characters seems to be Hiragana:3040 - 309f(12352-12447 in ASCII), Katakana:30a0 - 30ff(12448-12543 in ASCII), Kanji: 4e00-4DB5(19968-19893 ASCII), so there are several ranges. There's also a half-width katakana range on that chart.
Japanese characters won't be in the ASCII range, they'll be in Unicode. What do you want, just the char value for each character?
I won't rehash the ASCII part. Just have a look at the Unicode Code Charts.
Kanji will have a Unicode "Script" property of Hani, hiragana will have a "Script" property of Hira, and katakana have a "Script" property of Kana. In Java, you can determine the "Script" property of a character using the Character.UnicodeScript class: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeScript.html I don't know if you can determine a character's "Script" property in Javascript.
Of course, most kanji are characters that are also used in Chinese; given a character like 猫, it is impossible to tell whether it's being used as a Chinese character or a Japanese character.
I think what you mean by ASCII code for Japanese is the SBCS (Single Byte Character Set) equivalent in Japanese. For Japanese you only have a MBCS (Multi-Byte Character Sets) that has a combination of single byte character and multibyte characters. So for a Japanese text file saved in MBCS you have non-Japanese characters (english letters and numbers and common non-alphanumeric characters) saved as one byte and Japanese characters saved as two bytes.
Assuming that you are not referring to UNICODE which is a uniform DBCS (Double Byte Character Set) where each character is exactly two bytes. Actually to be more correct lately UNICODE also has multiple DBCS because the character set could not accomodate other character anymore. Some UNICODE character consiste of 4 bytes already having the first two bytes as leading character.
If you are referring to The first one (MBCS) that and not UNICODE then there are a lot of Japanese character set like Shift-JIS (the more popular one). So I suggest that you search Shift-JIS character map. Although there are other Japanese character set map aside from Shift-JIS.

What multi-byte character set starts with 0x7F and is 4 bytes long?

I'm trying to get some legacy code to display Chinese characters properly. One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long (including the 0x7F byte). Does anyone know what kind of encoding this is and where I can find information for it? Thanks..
UPDATE:
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long. It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application. However, if any other locale other than Japanese is selected, I cannot even view the filenames properly. So I'm guessing this encoding is not Unicode. Anyone know what it is? Is it ANSI? Is it Shift JIS?
For the Chinese one, I've tested it with Unicode and UTF-8 characters and I'm getting the same pattern; 0x7F followed by three bytes. Are Unicode and UTF-8 the same?
One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long
What are the other bytes? Do you have any Latin text in this encoding?
If it's “0x7f 0x... 0x00 0x00” you are looking at UTF-32LE. It could also be two UTF-16 (either LE or BE) characters.
Most East Asian encodings use 0x80-0xFF as lead bytes for non-ASCII characters; there is none I know of that would use a leading 0x7F as anything other than an ASCII delete.
ETA:
are there supposed to be Byte Order Marks?
There doesn't need to be a BOM if there is an out-of-band way of signalling that the encoding is ‘UTF-32LE’ (possibly one that is lost before it gets to you).
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long.
That's surely UTF-8. Sequence 0xE3 0x... 0x... would result in a character between U+3000 and U+4000, which is where the hiragana/katakana live.
It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application.
Then chances are your application is is one of the regrettable horde of non-Unicode-compliant apps, still using ‘A’(*) versions of the Win32 interfaces inside of the ‘W’-suffixed ones. Whether you can read in the string according to its real encoding is moot: a non-Unicode-compliant app will never be able to display an East Asian ideograph on a Western locale.
(*: named for “ANSI”, which is Windows's misleading term for “whatever the system codepage is set to at the moment”. That's why changing your locale affected it.)
ETA(2):
OK, cracked it. It's not any standardised encoding I've met before, but it's relatively easy to decipher if you assume the premise that Unicode code points are being encoded.
0x00-0x7E: plain ASCII
0x7F A B C: Unicode character
The character encoded in a Unicode escape can be calculated by taking the index in a key string of A, B and C and adding together:
A*0x1000 + B*0x40 + C
That is, it's a base-64 character set, but it's not the usual Base64 standard. A little experimentation gives a key string of:
.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
The ‘.’ and ‘_’ characters are guesses, since none of the characters you posted uses them. We'd need more data to find out the exact string.
So, for example:
0x7F 3 u g
A=4 B=58 C=44
4*0x1000 + 58*0x40 + 44 = 0x4EAC
U+4EAC = 京
ETA(3):
Yeah, it should be easy to create a native Unicode string by sucking out each code point manually and joining as a character. Not quite sure what's available on whatever platform you're using, but any Unicode-capable platform should be able to make a string from codepoints simply (and hopefully without having to manually re-encode to UTF-16LE bytes).
I figured it must be Unicode codepoints by noticing that the three example characters had first escape-characters in the same general range, and in the same numerical order as their Unicode codepoints. The other two characters seemed to change randomly, so it was very likely a big-endian encoding of the code point, and probably a base-64 encoding as 6 is as many bits as you can get out of readable ASCII.
Standard Base64 itself starts with letters, which would put something starting with a number too far up to be in the Basic Multilingual Plane. So I started guessing with ‘0123456789ABCDEFG...’ which would be the other obvious choice of key string. That got resulting numbers that were close to the code points for the given characters, but a bit too low. Inserting an extra character at the start of the key string (so digit ‘0’ doesn't map to number 0) got one of the characters right and the other two very close; the one that was right had no lower-case letters, so to change only the lower-case letters I inserted another character between the upper and lower cases. This came up with the right numbers.
It's not guaranteed that this is actually right, but (apart from the arbitrary choice of inserted characters) it's very likely to be it.
You might want to look at chinese character encoding page on Wikipedia. The only encoding in there that I can see that is always 4 bytes is UTF-32.
GB 18030 is the current standard Chinese character set, but it can be 1 to 4 bytes long.
Try chardet. It does a good job of guessing the character encoding of a string of bytes.
Are Unicode and UTF-8 the same?
No. UTF-8 is just one way to represent Unicode characters as a sequence of bytes. Unicode is the full standard, assigning numeric and human-readable identifiers to each character, as well as lots of metadata about the characters.
It might be a valid unicode encoding, such as a utf-8 or UTF16 surrogate pair.
Yes, the Chinese one is UTF-8, a implementation (encoding) of Unicode.
The UTF-8 is 1 byte long for ASCII characters and up to 4 bytes for others.