How to define/declare UTF-8 code points for Turkish special characters (non-ASCII) so they are handled as standard UTF-8?

Turkish characters 'ÇçĞğİıÖöŞşÜü' are not handled correctly in UTF-8 encoding, although they all seem to be defined. The character code of each of them comes out as 65533 (the replacement character, presumably used for error display), and a question mark or box is displayed depending on the selected font. In some cases 0/null is returned as the character code. There are lots of tools on the internet that give UTF-8 definitions of these characters, but I am not sure whether those tools use a defined (real/international) registry or dynamically create the definitions from known rules and calculations. Fonts for the characters are well defined, and there is no problem displaying them when we enter the code points manually. This proves that they are defined in UTF-8. But on the other hand, they are not handled correctly in encodings or transformations such as Ajax requests/responses.
So the base question is: "HOW CAN WE DEFINE A CODE POINT FOR A CHAR?"
The question may be tailored as follows to prevent misconceptions. Suppose we have prepared the encoding data for "Ç" like this:
Character : Ç
Character name : LATIN CAPITAL LETTER C WITH CEDILLA
Hex code point : 00C7
Decimal code point : 199
Hex UTF-8 bytes : C387
......
Where/how can we save this info so that it becomes a standard UTF-8 character?
How can we distribute/expose it (make it ready to be used by others)?
Do we need confirmation from any body/foundation (like the Unicode Consortium)?
How can we detect/fix errors if the characters are already registered but not working correctly?
Can we have a custom UTF-8 configuration? If yes, how?
Note: no code snippet is needed here, as this is not a misusage problem.

The characters you mention are present in Unicode. Here are their character codes in hexadecimal and how they are encoded in UTF-8:
Char:  Ç     ç     Ğ     ğ     İ     ı     Ö     ö     Ş     ş     Ü     ü
Code:  00c7  00e7  011e  011f  0130  0131  00d6  00f6  015e  015f  00dc  00fc
UTF8:  c387  c3a7  c49e  c49f  c4b0  c4b1  c396  c3b6  c59e  c59f  c39c  c3bc
This means that if you write, for example, the bytes 0xC4 0x9E into a file, you have written the character Ğ, and any software tool that understands UTF-8 must read it back as Ğ.
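To make that round trip concrete, here is a minimal Java sketch (the language choice is arbitrary, and the class name is made up) that decodes exactly those two bytes back into a string:

import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC4, (byte) 0x9E};      // UTF-8 bytes for Ğ
        String s = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(s);                          // Ğ
        System.out.println((int) s.charAt(0));          // 286 = 0x011E, the code point
    }
}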
Update: For correct alphabetic order and case conversions in Turkish you have to use a library that understands locales, just as for any other natural language. For example, in Java:
Locale tr = new Locale("tr", "TR"); // Turkish locale (language, country)
System.out.println("ÇçĞğİıÖöŞşÜü".toUpperCase(tr)); // ÇÇĞĞİIÖÖŞŞÜÜ
System.out.println("ÇçĞğİıÖöŞşÜü".toLowerCase(tr)); // ççğğiıööşşüü
Notice how i in uppercase becomes İ, and I in lowercase becomes ı. You don't say which programming language you use, but surely its standard library supports locales, too.
Unicode defines the code points and certain properties for each character (for example, whether it is a digit or a letter, and for a letter whether it is uppercase, lowercase, or titlecase), and certain generic algorithms for dealing with Unicode text (e.g. how to mix right-to-left text and left-to-right text). Alphabetic order and correct case conversion are defined by national standardization bodies, like the Institute for the Languages of Finland in Finland and the Real Academia Española in Spain, independent of Unicode.
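Most standard libraries expose those per-character properties directly. A small Java sketch (class name invented for illustration):

public class CharProps {
    public static void main(String[] args) {
        int cp = 'Ç'; // U+00C7
        System.out.println(Character.getName(cp));      // LATIN CAPITAL LETTER C WITH CEDILLA
        System.out.println(Character.isLetter(cp));     // true
        System.out.println(Character.isUpperCase(cp));  // true
    }
}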
Update 2:
The test ((ch & 0x20) == ch) for lower case is broken for most languages in the world, not just Turkish, and so is the upper-to-lower-case conversion algorithm you mention. The test for being a letter is also incorrect: in many languages Z is not the last letter of the alphabet. To work with text correctly you must use library functions that have been written by people who know what they are doing.
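To see how that kind of bit trick fails, here is a short Java sketch; the bitwise line mirrors the ASCII-only logic being criticised, not any real library call:

public class CaseBits {
    public static void main(String[] args) {
        char ch = 'ı'; // U+0131 LATIN SMALL LETTER DOTLESS I
        // ASCII trick: clearing bit 0x20 "uppercases" only a-z.
        System.out.println((char) (ch & ~0x20));  // đ (U+0111), nonsense
        // A locale-aware conversion gives the correct answer:
        System.out.println("ı".toUpperCase(new java.util.Locale("tr", "TR"))); // I
    }
}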
Unicode is supposed to be universal. Creating national and language-specific variants of encodings is what led us to the mess that Unicode is trying to solve. Unfortunately there is no universal standard for ordering characters. For example in English a = ä < z, but in Swedish a < z < ä. In German Ü is equivalent to U by one standard, and to UE by another. In Finnish Ü = Y. There is no way to order code points so that the ordering would be correct in every language.
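The ordering point is easy to demonstrate with a locale-aware collator. A small Java sketch (the output comments reflect typical collation data and may vary slightly by JDK):

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        String[] words = {"zebra", "äpple", "apple"};
        Arrays.sort(words, Collator.getInstance(Locale.ENGLISH));
        System.out.println(Arrays.toString(words)); // [apple, äpple, zebra]
        Arrays.sort(words, Collator.getInstance(new Locale("sv", "SE")));
        System.out.println(Arrays.toString(words)); // [apple, zebra, äpple]
    }
}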

Related

How do you determine the byte width of a UTF-16 character?

What are the rules for reading a UTF-16 byte stream, to determine how many bytes a character takes up? I've read the standards, but based on empirical observations of real-world UTF-16 encoded streams, it looks like there are certain cases where the standards don't hold true (or there's an aspect of the standards that I'm missing).
From reading the UTF-16 standard, https://www.rfc-editor.org/rfc/rfc2781:
Value of leading 2 bytes    Resulting character length (bytes)
0x0000-0xD7FF               2
0xD800-0xDBFF               4
0xDC00-0xDFFF               Invalid sequence (RFC 2781, 2.2.2)
0xE000-0xFFFF               2
In practice, this appears to hold true, for some cases at least. Using an ad-hoc SQL script (SQL Server 2019; UTF-16 collation), but also verified with an online decoder:
Character   Unicode Name                  ISO 10646   UTF-16 encoding (hex, big endian)   Size (bytes)
A           LATIN CAPITAL LETTER A        U+0041      00 41                               2
Б           CYRILLIC CAPITAL LETTER BE    U+0411      04 11                               2
ァ           KATAKANA LETTER SMALL A       U+30A1      30 A1                               2
🐰          RABBIT FACE                   U+1F430     D8 3D DC 30                         4
However when encoding the following ISO 10646 character into UTF-16, it appears to be 4 bytes, but reading the leading 2 bytes appears to give no indication that it will be this long:
Character   Unicode Name            UTF-16 encoding (hex, big endian)   Size (bytes)
⚕️          STAFF OF AESCULAPIUS    26 95 FE 0F                         4
Whilst I'd rather keep my question software-agnostic, the following SQL will reproduce this behaviour on Microsoft SQL Server 2019, with default collation and default language. (Note that SQL Server is little-endian.)
select cast(N'⚕️' as varbinary);
----------
0x95260FFE
Quite simply: how/why do you read 0x2695 and think "I'll need to read in the next word for this character"? Why doesn't this appear to align with the published UTF-16 standard?
The formal definition of all of this is called an "extended grapheme cluster," and it's defined in the Unicode Text Segmentation report. As Joachim Sauer notes, it's wise to be careful with the term "character" in Unicode.
Code points are what the "U+...." syntax refers to; a code point attempts to capture one "unit" of written language, for example "an acute accent." But what a reader would think of as a character (for example "an e with an acute accent") is a "grapheme cluster," made up of one or more code points. What is ultimately rendered to the screen is a "glyph," which is both context- and font-dependent.
Grapheme clusters in Unicode are actually more subtle than this. Unicode attempts to define them in a "neutral" way. (There's really no such thing as "neutral" when thinking about languages, but Unicode does try.) For example, in Slovak, ch, dz, and dž are each one letter, but are considered two grapheme clusters in Unicode. (Try to count the "letters" in a Slovak word. There are words that contain the letter dz and other words that have the letter d followed by the letter z. Oh human writing systems. I love you so much.)
The mapping of grapheme clusters to glyphs is also complex. For example, in Arabic, the single glyph لا is actually two grapheme clusters, ل (ARABIC LETTER LAM) followed by ا (ARABIC LETTER ALEF). If you use your mouse to select the glyph, you'll see there are two selectable pieces, and if you copy and paste them to another window you'll see them transform into their component parts. (Just to make things even more complicated, Unicode also defines a single code point for the ligature, ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM: ﻻ. If you try to select part of that one, you'll find you can't. It's one "character.")
Your specific case is a bit more special. The Variation Selector predates emoji, and is mostly designed to handle different variations of Han (Chinese) characters. However, as with every Unicode feature, it eventually has come to be used primarily for emoji. VS-16 is the "emoji" presentation form. The most famous example is the red heart, which is HEAVY BLACK HEART ❤, followed by VS-16: ❤️.
Similarly, your character U+2695 STAFF OF AESCULAPIUS is a single code point, and it looks like this by default (text style): ⚕. When you add VS-16, it is rendered in "emoji style": ⚕️. In some ways it's the same "character." Or is it? Depends on what you're using it for.
Emoji style is typically a bit larger and centered in its block, sometimes adding color. Notice where the period after the staff is drawn in each case (there are no extra spaces in the second example; the glyph is just much wider).
There are other combining systems as well:
U+0031: 1
U+0031 U+20e3: 1⃣ (+ COMBINING ENCLOSING KEYCAP, default text style)
U+0031 U+20e3 U+fe0f: 1⃣️ (+ VARIATION SELECTOR-16, emoji style)
All of these predate emoji. Modern emoji is dramatically more complicated, and includes several combining systems of its own (including two that are currently just used for flags).
But luckily, to your actual question, your wife is correct, and you can generally just consume all trailing code points that are marked "combining" to form an extended grapheme cluster, and that is kind of a "character" for some broad enough definition of "character."
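If you want to try that consumption rule without writing it yourself, Java's BreakIterator gives an approximation of user-perceived characters (ICU4J's BreakIterator tracks the Unicode segmentation rules more closely; this is just a sketch with an invented class name):

import java.text.BreakIterator;

public class Graphemes {
    public static void main(String[] args) {
        String s = "1\u20e3\ufe0fe\u0301"; // keycap sequence, then e + combining acute
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println("cluster: " + s.substring(start, end));
        }
    }
}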
All of your assertions are completely correct; your interpretation of the UTF-16 standard is correct and complete.
In your empirical observations, however, you've assumed that you only have one character. In actuality, you've run into a nuance of the Unicode implementation: your "character" is actually two (albeit technically, not visually): U+2695 "STAFF OF AESCULAPIUS" followed by U+FE0F "VARIATION SELECTOR-16". The second character is a non-spacing mark which combines with the base character for the purpose of rendering a character variant.
This results in the byte sequence 26 95 FE 0F; however, as you note, neither of the words falls within the UTF-16 surrogate range. That is because neither of them requires the UTF-16 4-byte extension: they are simply two discrete Unicode characters.
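In code, the only test the leading word supports is the surrogate check; every non-surrogate word is a complete 2-byte unit on its own. A minimal Java sketch of that rule (illustrative, not a full decoder):

public class Utf16Width {
    // 4 bytes if the leading code unit is a high surrogate, else 2.
    static int byteWidth(char leadingUnit) {
        return Character.isHighSurrogate(leadingUnit) ? 4 : 2;
    }

    public static void main(String[] args) {
        String s = "A\u2695\ufe0f\uD83D\uDC30"; // A, ⚕, VS-16, 🐰
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            System.out.printf("U+%04X -> %d bytes%n", s.codePointAt(i), byteWidth(s.charAt(i)));
        }
        // U+2695 and U+FE0F each report 2 bytes: two separate code points.
    }
}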
From 7.9 Combining Marks in ISO 10646: Universal Coded Character Set (UCS):
Combining marks are a special class of characters in the Unicode Standard that are intended to combine with a preceding character, called their base.
Combining marks usually have a visible glyphic form... a combining mark may interact graphically with neighbouring characters in various ways.
http://unicode.org/L2/L2010/10038-fcd10646-main.pdf
To explain why I'm answering my own question: I had my SO question all ready to fire off. My wife came into my office; after looking over my shoulder she whispered into my ear, "You know combining characters are a thing, right?" I've still asked the question and answered it myself, in case my wife's sweet nothings help another member of the community.

How to identify encoding from hex values?

I have text on a website that displays like this: o¨ instead of ö.
I extracted the text out of the CMS and analysed its hex values:
the ö's that are displayed correctly have c3 b6 - UTF-8
the ö's that are displayed incorrectly have 6f cc 88
I couldn't find out what encoding this is. What's a good way to identify the encoding?
6F is the UTF-8 (ASCII) encoding of "o", nothing spectacular.
CC 88 is the UTF-8 encoding of U+0308, COMBINING DIAERESIS.
You're simply looking at the decomposed form of the o-umlaut. A combining diaeresis should visually be rendered, well, combined with the previous character. If your system doesn't do that, it means it doesn't treat Unicode correctly, and/or the font you have chosen is somewhat broken. Perhaps you have to normalise your strings into the composed Unicode form instead for your system to handle them correctly.
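If normalisation is the fix, the standard-library route looks like this in Java (a sketch; PHP's intl Normalizer class and ICU offer the same NFC operation):

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String decomposed = "o\u0308"; // o + COMBINING DIAERESIS = 6f cc 88 in UTF-8
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed);          // ö
        System.out.println(composed.length()); // 1 (single code point U+00F6)
    }
}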

SMS data field character

I'm looking up the format of GSM SMS.
When PDU mode is used, the TP-UD field is said to be encoded in one of three ways: 7-bit for ASCII-like symbols, 8-bit for data, and UCS-2 for Unicode, like Japanese.
In an example, "Hello!" has the TP-UD field C8 32 9B FD 0E 01. Why? It's not ASCII, and it's not the GSM 03.38 basic character set either.
And what if the user data is a mix of ASCII characters and Japanese: is it Unicode for all of it?
Thanks.
The short message content encoding (7-bit, 8-bit, 16-bit, etc.) is chosen by looking at the data coding scheme (DCS) parameter value. If the message content mixes characters from the GSM default alphabet and Unicode (e.g. Russian, Arabic, Japanese), the data coding scheme must be set to 16-bit (UCS-2).
The GSM 7-bit default alphabet covers English and several European languages. A limited number of languages, like Portuguese, Spanish and Turkish, may use 7-bit encoding with a national language shift table defined in 3GPP TS 23.038. 8-bit encoding is dedicated to binary short messages.
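As for why "Hello!" becomes C8 32 9B FD 0E 01: these six characters happen to share their code values with ASCII in the GSM default alphabet, and the septets are then packed eight-to-seven-octets, least significant bits first (3GPP TS 23.038). A Java sketch of that packing (class name invented, illustrative only):

public class Gsm7BitPack {
    // Pack 7-bit septets into octets, LSB first.
    static byte[] pack(int[] septets) {
        byte[] out = new byte[(septets.length * 7 + 7) / 8];
        int bitPos = 0;
        for (int s : septets) {
            int idx = bitPos / 8, shift = bitPos % 8;
            out[idx] |= (byte) (s << shift);                        // low bits here
            if (shift > 1) out[idx + 1] |= (byte) (s >> (8 - shift)); // overflow bits
            bitPos += 7;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] hello = {0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x21}; // "Hello!"
        for (byte b : pack(hello)) System.out.printf("%02X ", b);
        System.out.println(); // prints: C8 32 9B FD 0E 01
    }
}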
Try the Cloudhopper Java SMPP API's CharsetUtil utility class:
byte[] msgChars = CharsetUtil.encode("öàß", CharsetUtil.CHARSET_GSM);
msgChars = CharsetUtil.encode("Точно так и было!", CharsetUtil.CHARSET_UCS_2);

What character encoding is c3 82 c2 bf?

I have a source of text data that includes the byte sequence c3 82 c2 bf. In context I think it's supposed to be a capital Greek Phi symbol (Φ).
Anyway I can't figure out what encoding is being used; I'm writing a Python script to process this data into a database that expects Unicode, and it throws an exception on this particular sequence of data.
Any suggestions on how to handle it?
Interpreted as UTF-8, c3 82 is “Â” (U+00C2) and c2 bf is “¿” (U+00BF), which does not make much sense, but it’s technically valid UTF-8 data, so it should not be reported as a character-level data error. Interpreted as UTF-16, it’s Hangul syllables and possibly a CJK ideograph, depending on endianness, but still formally valid data, though most probably not what was meant.
This sounds like the result of double conversion, but it’s difficult to make educated guesses. If it stands for Φ, then the UTF-16 form is 03 A6 or A6 03 and the UTF-8 form is CE A6, which don’t really resemble the actual data. Information about the origin of the data might help in guessing what transcodings may have happened.
It's probably the result of a double conversion.
0xC3 0x82 is the UTF-8 encoding of Â (Latin-1 0xC2), and 0xC2 0xBF is the UTF-8 encoding of ¿ (Latin-1 0xBF).
In other words: the text contained ¿ (U+00BF), already encoded in UTF-8 as 0xC2 0xBF. Those two bytes were then reinterpreted as the two Latin-1 characters Â and ¿ and converted to UTF-8 a second time, which yields exactly 0xC3 0x82 0xC2 0xBF.
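The double conversion is easy to reproduce. A Java sketch (Latin-1 stands in for whatever single-byte charset the faulty step assumed):

import java.nio.charset.StandardCharsets;

public class DoubleEncoding {
    public static void main(String[] args) {
        byte[] utf8 = "¿".getBytes(StandardCharsets.UTF_8);             // C2 BF
        String misread = new String(utf8, StandardCharsets.ISO_8859_1); // "Â¿"
        byte[] doubled = misread.getBytes(StandardCharsets.UTF_8);      // C3 82 C2 BF
        for (byte b : doubled) System.out.printf("%02x ", b);
        System.out.println();
        // Repair: undo the extra round trip.
        String fixed = new String(new String(doubled, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        System.out.println(fixed); // ¿
    }
}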
FWIW, I ended up with c3 82 c2 bf from . I did not dig into the transformations because I was able to simply throw that part of the code away. Suffice it to say that it was in an HTML email template that was processed by a WordPress (PHP) plugin.
I don't know the reason, but maybe there is a possible scenario.
A byte 10xxxxxx (0x80-0xBF) is converted to 0xC2 10xxxxxx, and a byte 11xxxxxx (0xC0-0xFF) is converted to 0xC3 10xxxxxx (the same byte with bit 6 cleared); this is just the Latin-1-to-UTF-8 rule applied to bytes that are already UTF-8.
So lots of C2s and C3s get added.
Where does this happen? Send non-ASCII characters in a URL query string for an Ajax call, and the Flask server will do this.
I received the character \xc3\x82 from an external UTF-16 document after conversion to UTF-8 using $str = mb_convert_encoding($content, "UTF-8", "UTF-16LE"); (PHP)
The original sequence was 0xA0 0x00, which the converter presumably turned into what was meant to be a NBSP; it was the thousands separator in a currency amount. NBSP is \xc2\xa0, so right now my thousands-separator removal is:
$price = str_replace(["\xc2\xa0","\xc3\x82"], '', $price);

Unicode characters between \u0003 and \u00ff

I have a piece of Java code that checks whether a character is between two Unicode characters:
LA(2) >= '\u0003' && LA(2) <= '\u00ff'
I understand that \u0003 represents END OF TEXT and \u00ff is LATIN SMALL LETTER Y WITH DIAERESIS, but what lies between these points? (What is it checking that LA(2) is?)
E.g. is it all Latin characters, or number characters, or characters with accents, all ASCII characters, or something else?
It's Latin 1 minus the code points U+0000, U+0001 and U+0002. This includes the usual stuff that can be found on the US keyboard, plenty of control characters (below U+0020 and between U+007F and U+009F) and a few other Latin characters that can be used to write the majority of Western European languages.
The following ranges are declared:
0000 - 007F C0 Controls and Basic Latin
0080 - 00FF C1 Controls and Latin-1 Supplement
To check which Unicode value represents which character, I advise having a look at one of the following links:
http://en.wikipedia.org/wiki/List_of_Unicode_characters
http://unicode.org/
It's the Latin-1 character set except for the first three codes.
0x0000 - 0x007F : Basic Latin (128)
0x0080 - 0x00FF : Latin-1 Supplement (128)
The code probably checks whether the character can be output as a single-byte char (Latin-1 encoded).
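A quick Java sketch of the same test (the class and method names are made up for illustration):

public class Latin1Check {
    // Mirrors the original condition: Latin-1 minus U+0000..U+0002.
    static boolean inRange(char ch) {
        return ch >= '\u0003' && ch <= '\u00ff';
    }

    public static void main(String[] args) {
        System.out.println(inRange('A'));      // true  (Basic Latin)
        System.out.println(inRange('ÿ'));      // true  (U+00FF, top of Latin-1)
        System.out.println(inRange('Ğ'));      // false (U+011E, outside Latin-1)
        System.out.println(inRange('\u0002')); // false (excluded control)
    }
}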