SMS data field character - unicode

I'm looking up the format of GSM SMS.
When PDU mode is used, the TP-UD field is said to use one of three encodings: 7-bit for ASCII-style text, 8-bit for data, and UCS-2 for Unicode text such as Japanese.
There is an example where "Hello!" has the TP-UD field C8 32 9B FD 0E 01. Why? It doesn't look like ASCII, nor like the GSM 03.38 basic character set.
And what if the user data is a mix of ASCII characters and Japanese; is it Unicode for all of it?
Thanks.

The short message content encoding (7-bit, 8-bit, 16-bit, etc.) is chosen by looking at the data coding scheme (DCS) parameter value. If the message content mixes GSM default alphabet characters with other Unicode characters (e.g. Russian, Arabic, Japanese), the data coding scheme must be set to 16-bit (UCS-2).
The GSM 7-bit default alphabet covers English and several European languages. A limited number of languages, like Portuguese, Spanish and Turkish, may use 7-bit encoding with the national language shift tables defined in 3GPP TS 23.038. 8-bit encoding is dedicated to binary short messages. With the 7-bit alphabet, the septets are packed into octets (so 8 characters fit in 7 bytes), which is exactly why "Hello!" ends up as C8 32 9B FD 0E 01; see the sketch below.
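Those bytes come straight from the 7-bit packing. Below is a minimal Java sketch of GSM 03.38 septet packing (class and method names are illustrative; it assumes each character's GSM code equals its ASCII code, which holds for "Hello!" but not for the whole alphabet, and it ignores shift tables and the fill-bit rule):

public class Gsm7BitPack {
    // Each character is a 7-bit septet; consecutive septets are packed into
    // octets, least significant bits first.
    static byte[] pack(String text) {
        byte[] out = new byte[(text.length() * 7 + 7) / 8];
        int bitPos = 0;
        for (char c : text.toCharArray()) {
            int septet = c & 0x7F;                 // GSM code of the character (ASCII-equal here)
            int byteIdx = bitPos / 8;
            int shift = bitPos % 8;
            out[byteIdx] |= (byte) (septet << shift);
            if (shift > 1) {                       // the septet spills into the next octet
                out[byteIdx + 1] |= (byte) (septet >> (8 - shift));
            }
            bitPos += 7;
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte b : pack("Hello!")) {
            System.out.printf("%02X ", b & 0xFF);  // prints: C8 32 9B FD 0E 01
        }
        System.out.println();
    }
}

Six 7-bit characters occupy 42 bits, which is why the packed TP-UD is 6 octets: C8 32 9B FD 0E 01, the value from the question.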
Try the Cloudhopper Java SMPP API's CharsetUtil utility class:
msgChars = CharsetUtil.encode("öàß", CharsetUtil.CHARSET_GSM);
msgChars = CharsetUtil.encode("Точно так и было!", CharsetUtil.CHARSET_UCS_2);

Related

How to define/declare utf-8 code points for Turkish special chars (non-ascii) to use them as standard utf-8 encoding?

Turkish chars 'ÇçĞğİıÖöŞşÜü' are not handled correctly in UTF-8 encoding although they all seem to be defined. The char code of all of them is 65533 (the replacement character, possibly used for error display) in practice, and a question mark or box is displayed depending on the selected font. In some cases 0/null is returned as the char code. On the internet there are lots of tools which give UTF-8 definitions for them, but I am not sure whether the tools use any defined (real/international) registry or dynamically create the definition from known rules and calculations. Fonts for them are well defined and there is no problem displaying them when we enter the code points manually. This proves that they are defined in UTF-8. But on the other hand they are not handled correctly in encodings or transformations such as AJAX requests/responses.
So the base question is "HOW CAN WE DEFINE A CODEPOINT FOR A CHAR"?
The question may be tailored as follows to prevent misconception. Suppose we have prepared the encoding data for "Ç" like this ->
Character : Ç
Character name : LATIN CAPITAL LETTER C WITH CEDILLA
Hex code point : 00C7
Decimal code point : 199
Hex UTF-8 bytes : C387
......
Where/how can we save this info so that it becomes a standard UTF-8 char?
How can we distribute/expose it (make it ready to be used by others)?
Do we need confirmation from anybody/any foundation (like the Unicode consortium)?
How can we detect/fix errors if they are already registered but not working correctly?
Can we have a custom UTF-8 configuration? If yes, how?
Note: No code snippet is needed here as this is not a misuse problem.
The characters you mention are present in Unicode. Here are their character codes in hexadecimal and how they are encoded in UTF-8:
Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü
Code: 00c7 00e7 011e 011f 0130 0131 00d6 00f6 015e 015f 00dc 00fc
UTF8: c3 87 c3 a7 c4 9e c4 9f c4 b0 c4 b1 c3 96 c3 b6 c5 9e c5 9f c3 9c c3 bc
This means that if you write for example the bytes 0xc4 0x9e into a file you have written the character Ğ, and any software tool that understands UTF-8 must read it back as Ğ.
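If you want to check this from code, here is a minimal Java sketch (the class name is just for illustration, and the source must be compiled as UTF-8 so the string literal survives) that prints each character's code point and UTF-8 bytes, reproducing the table above:

import java.nio.charset.StandardCharsets;

public class TurkishUtf8 {
    public static void main(String[] args) {
        // Note: compile with -encoding UTF-8 so the string literal stays intact.
        for (char c : "ÇçĞğİıÖöŞşÜü".toCharArray()) {
            StringBuilder hex = new StringBuilder();
            for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
                hex.append(String.format("%02x ", b & 0xff));
            }
            // e.g. "Ğ  U+011E  UTF-8: c4 9e"
            System.out.printf("%c  U+%04X  UTF-8: %s%n", c, (int) c, hex.toString().trim());
        }
    }
}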
Update: For correct alphabetic ordering and case conversion in Turkish you have to use a library that understands locales, just like for any other natural language. For example, in Java:
Locale tr = new Locale("tr", "TR"); // Turkish locale
System.out.println("ÇçĞğİıÖöŞşÜü".toUpperCase(tr)); // ÇÇĞĞİIÖÖŞŞÜÜ
System.out.println("ÇçĞğİıÖöŞşÜü".toLowerCase(tr)); // ççğğiıööşşüü
Notice how i in uppercase becomes İ, and I in lowercase becomes ı. You don't say which programming language you use but surely its standard library supports locales, too.
Unicode defines the code points and certain properties for each character (for example, if it's a digit or a letter, for a letter if it's uppercase, lowercase, or titlecase), and certain generic algorithms for dealing with Unicode text (e.g. how to mix right-to-left text and left-to-right text). Alphabetic order and correct case conversion are defined by national standardization bodies, like Institute of Languages of Finland in Finland, Real Academia Española in Spain, independent of Unicode.
Update 2:
The test ((ch&0x20)==ch) for lower case is broken for most languages in the world, not just Turkish. So is the algorithm for converting upper case to lower case you mention. Also, the test for being a letter is incorrect: in many languages Z is not the last letter of the alphabet. To work with text correctly you must use library functions that have been written by people who know what they are doing.
Unicode is supposed to be universal. Creating national and language-specific variants of encodings is what led us to the mess that Unicode is trying to solve. Unfortunately there is no universal standard for ordering characters. For example in English a = ä < z, but in Swedish a < z < ä. In German Ü is equivalent to U by one standard, and to UE by another. In Finnish Ü = Y. There is no way to order code points so that the ordering would be correct in every language.
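To see that ordering is a locale matter rather than a Unicode matter, here is a rough Java sketch using java.text.Collator; the exact output depends on the JDK's collation tables, but Swedish rules typically place 'ä' after 'z' while English rules treat it as a variant of 'a':

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        String[] words = { "zebra", "\u00E4ngel" /* "ängel" */, "apple" };

        String[] english = words.clone();
        Arrays.sort(english, Collator.getInstance(Locale.ENGLISH));
        System.out.println("en: " + Arrays.toString(english)); // 'ä' sorts like 'a': ängel, apple, zebra

        String[] swedish = words.clone();
        Arrays.sort(swedish, Collator.getInstance(new Locale("sv", "SE")));
        System.out.println("sv: " + Arrays.toString(swedish)); // 'ä' is a letter after 'z': ängel sorts last
    }
}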

Is ISO-8859-1 a Unicode charset?

I have been attending a lecture on XML where it was written that "ISO-8859-1 is a Unicode format". It sounds wrong to me, but as I research it, I struggle to understand precisely what Unicode is.
Can you call ISO-8859-1 a Unicode format? What can you actually call Unicode?
ISO 8859-1 is not Unicode
ISO 8859-1 is also known as Latin-1. It is not directly a Unicode format.
However, it does have the unique privilege that its code points 0x00 .. 0xFF map one-to-one to the Unicode code points U+0000 .. U+00FF. So, the first 256 code points of Unicode, treated as 1 byte unsigned integers, map to ISO 8859-1.
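A minimal Java sketch of that one-to-one mapping, decoding every byte value 0x00..0xFF as ISO 8859-1 (the class name is illustrative):

import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Decode every byte value 0x00..0xFF as ISO 8859-1: the resulting code point
        // is always equal to the byte value, i.e. U+0000..U+00FF.
        for (int i = 0; i < 256; i++) {
            String s = new String(new byte[] { (byte) i }, StandardCharsets.ISO_8859_1);
            if (s.codePointAt(0) != i) {
                System.out.println("mismatch at byte " + i);  // never happens
            }
        }
        System.out.println("Bytes 0x00..0xFF map one-to-one to U+0000..U+00FF");
    }
}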
Control characters
Peregring-lk observes that ISO 8859-1 does not define the control codes. The Unicode charts for U+0000..U+007F and U+0080..U+00FF suggest that the C0 controls found in positions U+0000..U+001F and U+007F come from ISO/IEC 6429:1992, and likewise the C1 controls found in positions U+0080..U+009F. Wikipedia on the C0 and C1 controls suggests that the standard is ISO/IEC 2022 instead. Note that three of the C1 controls do not have a formal name.
In general parlance, the control code points of the ISO 8859-1 code set are assumed to be the C0 and C1 controls from ISO 6429 (or 2022).
ISO-8859-1 covers only a small subset of the Unicode character set, a subset which substantially overlaps with ASCII.
All ASCII text is also valid UTF-8.
All the ISO 8859-1 (ISO Latin 1) characters with codes below 7F hex are ASCII-compatible and encode to a single byte in UTF-8. The accented characters and other symbols in the upper half of ISO 8859-1 map to the Unicode code points U+00A0..U+00FF and take two bytes in UTF-8.
Every single-byte UTF-8 sequence is an ASCII character.
UTF-8 also contains multi-byte sequences, which cover the rest of ISO Latin 1 as well as the characters of every other character set beyond ASCII and ISO Latin 1, including precomposed characters and their canonically equivalent decomposed (combining-mark) forms.
No, ISO 8859-1 is not a Unicode charset, simply because ISO 8859-1 does not provide encoding for all Unicode characters, only a small subset thereof. The word “charset” is sometimes used loosely (and therefore often best avoided), but as a technical term, it means a character encoding.
Loosening the definition so that “Unicode charset” would mean an encoding that covers part of Unicode would be pointless. Then every encoding would be a “Unicode charset”.
No. ISO/IEC 8859-1 is older than Unicode. For example, you won't find € in it. Unicode is compatible with ISO 8859-1 up to a point. For the coding of characters in Unicode, look at UCS / UTF-8 / UTF-16.
If you look at code formats you have something like
Abstract letters - the letters you are using
Code table - brings the letters into some order (like alphabetic ordering)
Code format - says which position in the code table is which letter (that is the UTF-8 or UTF-16 encoding)
Code schema - if more than one word is used to access a code position, in which order do they come? (big endian, little endian in UTF-16)
[character encoding of markup instructions (e.g. < in XML)]
It depends on how you define "Unicode format."
I think most people would take it to mean an encoding capable of representing any codepoint in Unicode's range (U+0000 - U+10FFFF).
In that case, no, ISO 8859-1 is not a Unicode format.
However some other definitions might be 'a character set that is a subset of the Unicode character set,' or 'an encoding that can be considered to contain Unicode data (not necessarily arbitrary Unicode data).' ISO 8859-1 meets both of these definitions.
Unicode is a number of things. It contains a character set, in which 'characters' are assigned codepoint values. It defines properties for characters and provides a database of characters and their properties. It defines many algorithms for doing various things with Unicode text data, such as ways of comparing strings, of dividing strings into grapheme clusters, words, etc. It defines a few special encodings that can encode any Unicode codepoint and have some other useful properties. It defines mappings between Unicode codepoints and codepoints of legacy character sets.
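As a small illustration of the properties-and-database part, java.lang.Character exposes data from the Unicode Character Database; a minimal sketch (sample code points chosen arbitrarily, Character.getName requires Java 7+):

public class UnicodeProperties {
    public static void main(String[] args) {
        // Samples: 'A', 'ä' (U+00E4), '7', '€' (U+20AC), '२' DEVANAGARI DIGIT TWO (U+0968)
        int[] codePoints = { 'A', 0x00E4, '7', 0x20AC, 0x0968 };
        for (int cp : codePoints) {
            System.out.printf("U+%04X %-28s letter=%-5b digit=%-5b lowercase=%b%n",
                    cp,
                    Character.getName(cp),      // name from the Unicode Character Database
                    Character.isLetter(cp),
                    Character.isDigit(cp),
                    Character.isLowerCase(cp));
        }
    }
}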
Here you can find a more complete answer: Unicode.org

What string of characters should a source send to disambiguate the byte-encoding they are using?

I'm decoding bytestreams into unicode characters without knowing the encoding that's been used by each of a hundred or so senders.
Many of the senders are not technically astute, and will not be able to tell me what encoding they are using. It will be determined by the happenstance of the toolchains they are using to generate the data.
The senders are, for the moment, all UK/English based, using a variety of operating systems.
Can I ask all the senders to send me a particular string of characters that will unambiguously demonstrate what encoding each sender is using?
I understand that there are libraries that use heuristics to guess at the encoding - I'm going to chase that up too, as a runtime fallback, but first I'd like to try and determine what encodings are being used, if I can.
(Don't think it's relevant, but I'm working in Python)
A full answer to this question depends on a lot of factors, such as the range of encodings used by the various upstream systems, and how well your users will comply with instructions to type magic character sequences into text fields, and how skilled they will be at the obscure keyboard combinations to type the magic character sequences.
There are some very easy character sequences which only some users will be able to type. Only users with a Cyrillic keyboard and encoding will find it easy to type "Ильи́ч" (Ilyich), and so you only have to distinguish between the Cyrillic-capable encodings like UTF-8, UTF-16, iso8859_5, and koi8_r. Similarly, you could come up with Japanese, Chinese, and Korean character sequences which distinguish between users of Japanese, simplified Chinese, traditional Chinese, and Korean systems.
But let's concentrate on users of western European computer systems, and the common encodings like ISO-8859-15, Mac_Roman, UTF-8, UTF-16LE, and UTF-16BE. A very simple test is to have users enter the Euro character '€', U+20AC, and see what byte sequence gets generated:
byte ['\xa4'] means iso-8859-15 encoding
bytes ['\xe2', '\x82', '\xac'] mean utf-8 encoding
bytes ['\x20', '\xac'] mean utf-16be encoding
bytes ['\xac', '\x20'] mean utf-16le encoding
byte ['\x80'] means cp1252 ("Windows ANSI") encoding
byte ['\xdb'] means macroman encoding
iso-8859-1 won't be able to represent the Euro character at all. iso-8859-15 is the Euro-supporting successor to iso-8859-1.
U.S. users probably won't know how to type a Euro character. (OK, that's too snarky. 3% of them will know.)
You should check that each of these byte sequences, interpreted as any of the possible encodings, is not a character sequence that users would be likely to type themselves. For instance, the '\xa4' of the iso-8859-15 Euro symbol could also be the iso-8859-1 or cp1252 or UTF-16LE encoding of '¤', the macroman encoding of '§', or the first byte of any of thousands of UTF-16 characters, such as the U+A4xx Yi Syllables or U+01A4 LATIN CAPITAL LETTER P WITH HOOK. It would not be a valid first byte of a UTF-8 sequence. If some of your users submit text in Yi, you might have a problem.
The Python 3.x documentation, 7.2.3. Standard Encodings, lists the character encodings which the Python standard library can easily handle. The following Python 2 session (matching the byte-list notation above) shows how the test character is encoded into bytes by various encodings:
>>> euro = u'\u20ac'  # the Euro sign
>>> for e in ['iso-8859-1', 'iso-8859-15', 'utf-8', 'utf-16be', 'utf-16le',
...           'cp1252', 'macroman']:
...     print e, list(euro.encode(e, 'backslashreplace'))
So, as an expedient, satisficing hack, consider telling your users to type a '€' as the first character of a text field, if there are any problems with encoding. Then your system should interpret any of the above byte sequences as an encoding clue, and discard them. If users want to start their text content with a Euro character, they start the field with '€€'; the first gets swallowed, the second remains part of the text.

What's the difference between encoding and charset?

I am confused about the text encoding and charset. For many reasons, I have to
learn non-Unicode, non-UTF8 stuff in my upcoming work.
I find the word "charset" in email headers as in "ISO-2022-JP", but there's no
such a encoding in text editors. (I looked around the different text editors.)
What's the difference between text encoding and charset? I'd appreciate it
if you could show me some use case examples.
Basically:
charset is the set of characters you can use
encoding is the way these characters are stored into memory
Every encoding has a particular charset associated with it, but there can be more than one encoding for a given charset. A charset is simply what it sounds like, a set of characters. There are a large number of charsets, including many that are intended for particular scripts or languages.
However, we are well along the way in the transition to Unicode, which includes a character set capable of representing almost all the world's scripts. There are, though, multiple encodings for Unicode. An encoding is a way of mapping a string of characters to a string of bytes. Examples of Unicode encodings include UTF-8, UTF-16BE, and UTF-16LE. Each of these has advantages for particular applications or machine architectures.
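A small Java sketch of "one character set, several encodings": the same two Unicode characters come out as different byte sequences under UTF-8, UTF-16BE and UTF-16LE (class and method names are just for illustration):

import java.nio.charset.StandardCharsets;

public class OneCharsetManyEncodings {
    static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ": ");
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim());
    }

    public static void main(String[] args) {
        String text = "A\u20AC";  // "A€": two characters from the Unicode character set

        dump("UTF-8   ", text.getBytes(StandardCharsets.UTF_8));     // 41 e2 82 ac
        dump("UTF-16BE", text.getBytes(StandardCharsets.UTF_16BE));  // 00 41 20 ac
        dump("UTF-16LE", text.getBytes(StandardCharsets.UTF_16LE));  // 41 00 ac 20
    }
}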
To shed some more light for people visiting henceforth; hopefully it will be helpful.
Character Set
Each language has its characters, and the collection of those characters forms the "character set" of that language. When a character is encoded, it is assigned a unique identifier or number called a code point. In a computer, these code points are represented by one or more bytes.
Examples of character set: ASCII (covers all English characters), ISO/IEC 646, Unicode (covers characters from all living languages in the world)
Coded Character Set
A coded character set is a set in which a unique number is assigned to each character. That unique number is called a "code point".
Coded character sets are sometimes called code pages.
Encoding
Encoding is the mechanism that maps code points to bytes so that a character can be read and written uniformly across different systems using the same encoding scheme.
Examples of encoding: ASCII, Unicode encoding schemes like UTF-8, UTF-16, UTF-32.
Elaboration of the above 3 concepts
Consider this: the character 'क' in the Devanagari character set has a decimal code point of 2325, which is represented by two bytes (09 15) when using the UTF-16 encoding.
In the "ISO-8859-1" encoding scheme, "ü" (a character in the Latin character set) is represented as the hexadecimal value FC, while in "UTF-8" it is represented as C3 BC and in UTF-16 as FE FF 00 FC (a byte order mark followed by 00 FC).
Different encoding schemes may use the same code point (byte value) to represent different characters; for example, in "ISO-8859-1" (also called Latin-1) the decimal value 233 stands for the letter 'é', whereas in ISO 8859-5 the same value represents the Cyrillic character 'щ'.
On the other hand, a single code point in the Unicode character set can be mapped to different byte sequences, depending on which encoding was used for the document. The Devanagari character क, with code point 2325 (0915 in hexadecimal notation), is represented by two bytes when using the UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four bytes with UTF-32 (00 00 09 15).
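A quick Java sketch to verify those byte sequences for क (assuming the JDK's extended charset provider supplies UTF-32BE, which the reference JDK does; names are illustrative):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DevanagariKa {
    static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ": ");
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim());
    }

    public static void main(String[] args) {
        String ka = "\u0915";  // क, code point 2325 decimal / 0x0915

        dump("UTF-16BE", ka.getBytes(StandardCharsets.UTF_16BE));   // 09 15
        dump("UTF-8   ", ka.getBytes(StandardCharsets.UTF_8));      // e0 a4 95
        dump("UTF-32BE", ka.getBytes(Charset.forName("UTF-32BE"))); // 00 00 09 15
    }
}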
A character encoding consists of:
The set of supported characters
A mapping between characters and integers ("code points")
How code points are encoded as a series of "code units" (e.g., 16-bit units for UTF-16)
How code units are encoded into bytes (e.g., big-endian or little-endian)
Step #1 by itself is a "character repertoire" or abstract "character set", and #1 + #2 = a "coded character set".
But back before Unicode became popular and everyone (except East Asians) was using a single-byte encoding, steps #3 and #4 were trivial (code point = code unit = byte). Thus, older protocols didn't clearly distinguish between "character encoding" and "coded character set". Older protocols use charset when they really mean encoding.
A character set, or character repertoire, is simply a set (an unordered collection) of characters. A coded character set assigns an integer (a "code point") to each character in the repertoire. An encoding is a way of representing code points unambiguously as a stream of bytes.
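Here is a rough Java sketch of steps #2 to #4 for a code point outside the BMP (U+1F600, chosen arbitrarily), where code points, code units and byte order all become visibly different things:

import java.nio.charset.StandardCharsets;

public class EncodingLayers {
    public static void main(String[] args) {
        int codePoint = 0x1F600;                          // step #2: a Unicode code point (GRINNING FACE)

        char[] codeUnits = Character.toChars(codePoint);  // step #3: UTF-16 code units (a surrogate pair)
        System.out.printf("code units: %04x %04x%n", (int) codeUnits[0], (int) codeUnits[1]);
        // prints: code units: d83d de00

        byte[] bytes = new String(codeUnits).getBytes(StandardCharsets.UTF_16BE);  // step #4: big-endian bytes
        for (byte b : bytes) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();                             // prints: d8 3d de 00
    }
}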
Googled for it.
http://en.wikipedia.org/wiki/Character_encoding
The difference seems to be subtle. The term charset actually doesn't apply to Unicode. Unicode goes through a series of abstractions.
abstract characters -> code points -> encoding of code points to bytes.
Charsets actually skip this and directly jump from characters to bytes.
sequence of bytes <-> sequence of characters
In short,
encoding : code points -> bytes
charset: characters -> bytes
A charset is just a set; it either contains, e.g. the Euro sign, or else it doesn't. That's all.
An encoding is a bijective mapping from a character set to a set of integers. If it supports the Euro sign, it must assign a specific integer to that character and to no other.
In my opinion, a charset is part of an encoding (a component): an encoding has a charset attribute, so a charset can be used in many encodings. For example, Unicode is a charset used in encodings like UTF-8, UTF-16 and so on.
The char in charset doesn't mean the char type in the programming world. It means a character in the real world. In English they may be the same, but in other languages they are not; in Chinese, '我' is an inseparable 'char' in charsets (Unicode, GB [used in GBK and GB2312]), and 'a' is also a char in charsets (ASCII, ISO-8859, Unicode).
In my opinion the word "charset" should be limited to identifying the parameter used in HTTP, MIME, and similar standards to specify a character encoding (a mapping from a series of text characters to a sequence of bytes) by name. For example: charset=utf-8.
I'm aware, though, that MySQL, Java, and other places may use the word "charset" to mean a character encoding.
An encoding is a mapping between bytes and characters from a character set, so it will be helpful to discuss and understand the difference between bytes and characters.
Think of bytes as numbers between 0 and 255, whereas characters are abstract things like "a", "1", "$" and "Ä". The set of all characters that are available is called a character set.
Each character has a sequence of one or more bytes that are used to represent it; however, the exact number and value of the bytes depends on the encoding used and there are many different encodings.
Most encodings are based on an old character set and encoding called ASCII which is a single byte per character (actually, only 7 bits) and contains 128 characters including a lot of the common characters used in US English.
For example, here are 6 characters in the ASCII character set that are represented by the values 60 to 65.
Extract of ASCII Table 60-65
╔══════╦═══════════╗
║ Byte ║ Character ║
╠══════╬═══════════╣
║  60  ║     <     ║
║  61  ║     =     ║
║  62  ║     >     ║
║  63  ║     ?     ║
║  64  ║     @     ║
║  65  ║     A     ║
╚══════╩═══════════╝
In the full ASCII set, the lowest value used is zero and the highest is 127 (both of these are hidden control characters).
However, once you start needing more characters than basic ASCII provides (for example, letters with accents, currency symbols, graphic symbols, etc.), ASCII is not suitable and you need something more extensive. You need more characters (a different character set) and you need a different encoding, as 128 characters is not enough to fit them all in. Some encodings use one byte per character (at most 256 characters), others use up to four (historically even six) bytes per character.
Over time a lot of encodings have been created. In the Windows world, there is CP1252, or ISO-8859-1, whereas Linux users tend to favour UTF-8. Java uses UTF-16 natively.
One sequence of byte values for a character in one encoding might stand for a completely different character in another encoding, or might even be invalid.
For example, in ISO 8859-1, â is represented by one byte of value 226, whereas in UTF-8 it is two bytes: 195, 162. However, in ISO 8859-1, 195, 162 would be two characters, Ã, ¢.
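That round trip is easy to reproduce; a minimal Java sketch (the class name is illustrative):

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "\u00E2";  // "â": one character, byte value 226 in ISO 8859-1

        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println((utf8[0] & 0xff) + ", " + (utf8[1] & 0xff));     // 195, 162

        // Decoding those UTF-8 bytes with the wrong encoding (ISO 8859-1) yields two characters:
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));  // Ã¢
    }
}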
When computers store data about characters internally or transmit it to another system, they store or send bytes. Imagine a system opening a file or receiving message sees the bytes 195, 162. How does it know what characters these are?
In order for the system to interpret those bytes as actual characters (and so display them or convert them to another encoding), it needs to know the encoding used. That is why encoding appears in XML headers or can be specified in a text editor. It tells the system the mapping between bytes and characters.

How are unicode allocated for different languages?

It seems the most confusing issue to me.
How is the beginning of a new character recognized?
How are the codepoints allocated?
Let's take Chinese characters for example.
What range of code points is allocated to them,
and why is it allocated that way? Is there any reason?
EDIT:
Please describe it in your own words, not by citation.
Or could you recommend a book that talks about Unicode systematically, which you think has made it clear? (That's the most important.)
The Unicode Consortium is responsible for the codepoint allocation. If you want a new character or a code page allocated, you can apply there. See the proposal pipeline for examples.
Chapter 2 of the Unicode specification defines the general structure of Unicode, including what ranges are allocated for what kind of characters.
Take a look here for a general overview of Unicode that might be helpful: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses)
Unicode is a standard specified by the Unicode Consortium. The specification defines Unicode's character set, the Universal Character Set (UCS), and several encodings for those characters, the Unicode Transformation Formats UTF-7, UTF-8, UTF-16 and UTF-32.
How is the beginning of a new character recognized?
It depends on the encoding that's been used. UTF-32 uses a fixed code word length (32 bits per code point). UTF-7, UTF-8 and UTF-16 use a variable length: UTF-8 takes one to four bytes per code point, and UTF-16 takes one 16-bit code unit for characters in the Basic Multilingual Plane and a surrogate pair (two 16-bit units) for everything else.
How are the codepoints allocated? Let's take Chinese character for example. What range of codepoints are allocated to them, and why is it thus allocated,any reason?
The UCS is divided into blocks of related characters (which in turn live in 17 planes of 65,536 code points each). The first block is Basic Latin (U+0000–U+007F, matching ASCII), the next is Latin-1 Supplement (U+0080–U+00FF, matching ISO 8859-1), and so on. Chinese characters are mostly allocated in the CJK Unified Ideographs block (U+4E00–U+9FFF), with further extension blocks elsewhere; the allocation simply keeps related characters in contiguous ranges.
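If you want to see which block a particular character was allocated to, java.lang.Character.UnicodeBlock can report it; a small sketch with arbitrarily chosen sample characters:

public class BlockLookup {
    public static void main(String[] args) {
        // Samples: 'A', '中' (U+4E2D), 'क' (U+0915), '😀' (U+1F600)
        int[] codePoints = { 'A', 0x4E2D, 0x0915, 0x1F600 };
        for (int cp : codePoints) {
            System.out.printf("U+%04X  block: %s%n", cp, Character.UnicodeBlock.of(cp));
        }
        // Expected block names as reported by the JDK:
        // BASIC_LATIN, CJK_UNIFIED_IDEOGRAPHS, DEVANAGARI, EMOTICONS
    }
}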
It is better to say Character Encoding instead of Codepage
A Character Encoding is a way to map some character to some data (and also vice-versa!)
As Wikipedia says:
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers
The most popular character encodings are ASCII, UTF-16 and UTF-8.
ASCII
The first character encoding widely used in computers. ASCII uses just one byte (in fact only 7 bits) per character, so it can represent only a very limited set of characters (English letters, digits, basic punctuation, ...).
As I said, ASCII was widely used in old operating systems like MS-DOS. But ASCII is not dead and is still used. When you have a txt file with 10 characters and it is 10 bytes, you have an ASCII file!
UTF-16
In UTF-16, each character is encoded as one or two 16-bit code units (two or four bytes): the 65,536 code points of the Basic Multilingual Plane fit in a single unit, and everything outside it uses a surrogate pair.
Microsoft Windows uses UTF-16 internally.
UTF-8
UTF-8 is another popular way of encoding characters. It uses a variable number of bytes (1 to 4) per character. It is also compatible with ASCII because it uses 1 byte for the ASCII characters.
Most Unix-based systems use UTF-8.
Programming languages do not depend on code pages, although a specific implementation of a programming language may not support code pages (like Turbo C++).
You can use any code page in modern programming languages. They also have tools for converting between code pages.
There are different Unicode encodings, like UTF-7, UTF-8, ... You can read about them here (recommended!) and maybe, for more formal details, here.