I have a source of text data that includes the byte sequence c3 82 c2 bf. In context I think it's supposed to be a capital Greek Phi symbol (Φ).
Anyway I can't figure out what encoding is being used; I'm writing a Python script to process this data into a database that expects Unicode, and it throws an exception on this particular sequence of data.
Any suggestions on how to handle it?
Interpreted as UTF-8, c3 82 is “Â” U+00C2 and c2 bf is “¿” U+00BF, which does not make much sense, but it’s technically valid UTF-8 data, so it should not be reported as a character-level data error. Interpreted as UTF-16, it’s Hangul syllables and possibly a CJK ideograph, depending on endianness, but still formally valid data, though most probably not what was meant.
This sounds like the result of double conversion, but it’s difficult to make educated guesses. If it stands for Φ, then the UTF-16 form is 03 A6 or A6 03 and the UTF-8 form is CE A6, which don’t really resemble the actual data. Information about the origin of the data might help in guessing what transcodings may have happened.
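If you want to test the double-conversion theory in the Python script mentioned in the question, here is a minimal sketch (assuming the bytes really are UTF-8 that was misread as Latin-1 and re-encoded). Note that it recovers ¿ rather than Φ, which supports the suspicion that something else also went wrong upstream:

raw = b"\xc3\x82\xc2\xbf"
text = raw.decode("utf-8")                       # 'Â¿' (U+00C2, U+00BF)
undone = text.encode("latin-1").decode("utf-8")  # undo one round of double encoding
print(undone)                                    # '¿' (U+00BF), not 'Φ'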
It's probably a double conversion involving the Ñ character.
The Ñ character in UTF-8 is 0xc3 0x91.
If you take an Ñ that is already encoded in UTF-8 and run it through a LATIN-1 to UTF-8 conversion again, you'll get: 0xc382c2bf.
Why?
0xc382 is the UTF-8 translation of the LATIN-1 0xc3 character Ã (A with tilde).
0xc2bf is the ¿ character, which is what you get when a character can't be converted from LATIN-1 (0x91 is an invalid character in LATIN-1).
FWIW, I ended up with c3 82 c2 bf from . I did not dig into the transformations because I was able to simply throw that part of the code away. Suffice it to say that it was in an HTML email template processed by a WordPress (PHP) plugin.
I don't know the reason, but here is a possible scenario:
A byte of the form 10xxxxxx (0x80-0xBF) is converted to 0xC2 followed by the same byte.
A byte of the form 11xxxxxx (0xC0-0xFF) is converted to 0xC3 followed by 10xxxxxx (the same byte with bit 6 cleared).
So lots of c2 and c3 bytes get added.
Where does this happen? Send non-ASCII characters in a URL query string for an AJAX call, and the Flask server will do this.
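A minimal Python sketch of the mechanism described above (assuming plain Latin-1-to-UTF-8 re-encoding of bytes that were already UTF-8): every high byte gains a 0xC2 or 0xC3 prefix.

for b in (0x9F, 0xBF, 0xC0, 0xF1):
    ch = bytes([b]).decode("latin-1")                 # treat the byte as Latin-1
    print(hex(b), "->", ch.encode("utf-8").hex(" "))
# 0x9f -> c2 9f
# 0xbf -> c2 bf
# 0xc0 -> c3 80
# 0xf1 -> c3 b1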
I received the character \xc3\x82 from an external UTF-16 document after converting it to UTF-8 using $str = mb_convert_encoding($content, "UTF-8", "UTF-16LE"); (PHP).
The original sequence was 0xA0 0x00, which the converter probably turned into what was meant to be an NBSP; it was the thousands-separator character in a currency amount. NBSP is \xc2\xa0, so right now I remove the thousands separators with:
$price = str_replace(["\xc2\xa0","\xc3\x82"], '', $price);
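For anyone checking this in Python rather than PHP, a quick sketch confirming the byte arithmetic behind that workaround (the stray \xc3\x82 is consistent with the C2 byte itself having been re-encoded at some point, which is an assumption here):

nbsp = b"\xa0\x00".decode("utf-16-le")   # U+00A0 NO-BREAK SPACE
print(nbsp.encode("utf-8").hex(" "))     # c2 a0
# If those UTF-8 bytes are later re-read as Latin-1 and re-encoded,
# the 0xc2 byte becomes the stray "Â" being stripped above:
print("\xc2".encode("utf-8").hex(" "))   # c3 82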
I have a problem and I hope you can help me. I have just started studying UTF-8 and Unicode. The professor wrote a text file containing "ciaò" and showed us its content, displaying each character in hexadecimal (for example, 'c' is 0063, 'i' is 0069, 'a' is 0061). The problem is the 'ò' character, which is formed by 2 bytes in UTF-8: c3 b2 (hex). The exercise he gave us is to verify that in UTF-8 the 'ò' character is written exactly like that (for the solution he advised us to look at the Unicode website).
I tried to do the exercise this way: I saw that the character 'ò' in hex is 00F2, I converted it to binary (11110010) and formed the two UTF-8 bytes by filling them in: |110|11110| and |10|010000|. The problem is that this way I get the following values: DE (instead of c3 for the first byte) and 90 (instead of b2 for the second byte). Can someone explain to me where I went wrong, please?
For the character "ò", its UTF-16 representation is 00F2, and its UTF-8 representation is C3B2. I don't think you can use 00F2 to represent it in UTF-8.
To verify that C3B2 is "ò", you can check an online Unicode/UTF-8 lookup site, or if you are using a Linux-like terminal you can run:
echo -e "\xC3\xB2"
Which should simply print "ò".
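As for where the working in the question went wrong: U+00F2 has to be padded to the 11 payload bits of a two-byte sequence (000 1111 0010), not split from its bare 8-bit form, and the split is 5 bits into 110xxxxx and 6 bits into 10xxxxxx. A small Python sketch of that packing, just to check the arithmetic:

cp = 0x00F2                           # code point of 'ò'
b1 = 0b11000000 | (cp >> 6)           # 110xxxxx with the top 5 payload bits
b2 = 0b10000000 | (cp & 0x3F)         # 10xxxxxx with the low 6 payload bits
print(hex(b1), hex(b2))               # 0xc3 0xb2
print("ò".encode("utf-8").hex(" "))   # c3 b2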
I have text on a website that displays as o¨ instead of ö.
I extracted the text out of the CMS and analysed its hex values:
the ö's that are displayed correctly have c3 b6 - UTF-8
the ö's that are displayed incorrectly have 6f cc 88
I couldn't find out what encoding this is. What's a good way to identify the encoding?
6F is the UTF-8 (ASCII) encoding of "o", nothing spectacular.
CC 88 is the UTF-8 encoding of U+0308, COMBINING DIAERESIS.
You're simply looking at the decomposed form of the o-umlaut. A combining diaeresis character should visually be rendered, well, combined with the previous character. If your system doesn't do that, it means it doesn't treat Unicode correctly, and/or the font you have chosen is somewhat broken. Perhaps you have to normalise your strings into the composed Unicode form instead for your system to handle them correctly.
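If the CMS output passes through Python at any point, that normalisation is a one-liner; a minimal sketch using the standard unicodedata module:

import unicodedata

decomposed = "o\u0308"                        # the bytes 6f cc 88: o + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)
print(composed)                               # ö
print(composed.encode("utf-8").hex(" "))      # c3 b6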
Turkish characters 'ÇçĞğİıÖöŞşÜü' are not handled correctly in UTF-8 encoding, although they all seem to be defined. In use, the character code of each of them comes back as 65533 (the replacement character, possibly shown for errors) and a question mark or box is displayed depending on the selected font. In some cases 0/null is returned as the character code. On the internet there are lots of tools that give UTF-8 definitions of them, but I am not sure whether those tools use a defined (real/international) registry or dynamically create the definitions with known rules and calculations. Fonts for them are well defined, and there is no problem displaying them when we enter the code points manually. This proves that they are defined in UTF-8. But on the other hand they are not handled in encodings or transformations such as AJAX requests/responses.
So the base question is "HOW CAN WE DEFINE A CODEPOINT FOR A CHAR"?
The question may be tailored as follows to prevent misconceptions. Suppose we have prepared the encoding data for "Ç" like this:
Character : Ç
Character name : LATIN CAPITAL LETTER C WITH CEDILLA
Hex code point : 00C7
Decimal code point : 199
Hex UTF-8 bytes : C387
......
Where/How can we save this info to be a standard utf-8 char?
How can we distribute/expose it (make ready to be used by others) ?
Do we need any confirmation from some body/foundation (like the Unicode Consortium)?
How can we detect/fixup errors if they are already registered but not working correctly?
Can we have custom-utf8 configuration? If yes how?
Note: no code snippet is needed here, as this is not a misusage problem.
The characters you mention are all present in Unicode. Here are their character codes in hexadecimal and how they are encoded in UTF-8:
Char  Code   UTF-8
Ç     00c7   c3 87
ç     00e7   c3 a7
Ğ     011e   c4 9e
ğ     011f   c4 9f
İ     0130   c4 b0
ı     0131   c4 b1
Ö     00d6   c3 96
ö     00f6   c3 b6
Ş     015e   c5 9e
ş     015f   c5 9f
Ü     00dc   c3 9c
ü     00fc   c3 bc
This means that if you write for example the bytes 0xc4 0x9e into a file you have written the character Ğ, and any software tool that understands UTF-8 must read it back as Ğ.
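A quick way to confirm this, sketched in Python (any UTF-8-aware tool would do):

print(b"\xc4\x9e".decode("utf-8"))    # Ğ
print("Ğ".encode("utf-8").hex(" "))   # c4 9e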
Update: For correct alphabetic ordering and case conversion in Turkish you have to use a library that understands locales, just as for any other natural language. For example, in Java:
Locale tr = new Locale("tr", "TR"); // Turkish locale
System.out.println("ÇçĞğİıÖöŞşÜü".toUpperCase(tr)); // ÇÇĞĞİIÖÖŞŞÜÜ
System.out.println("ÇçĞğİıÖöŞşÜü".toLowerCase(tr)); // ççğğiıööşşüü
Notice how i in uppercase becomes İ, and I in lowercase becomes ı. You don't say which programming language you use but surely its standard library supports locales, too.
Unicode defines the code points and certain properties for each character (for example, if it's a digit or a letter, for a letter if it's uppercase, lowercase, or titlecase), and certain generic algorithms for dealing with Unicode text (e.g. how to mix right-to-left text and left-to-right text). Alphabetic order and correct case conversion are defined by national standardization bodies, like Institute of Languages of Finland in Finland, Real Academia Española in Spain, independent of Unicode.
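For example, the code points and properties of these letters can be queried straight from the Unicode Character Database; a small Python sketch using the standard unicodedata module:

import unicodedata

for ch in "Çı":
    print(ch, "U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))
# Ç U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA Lu
# ı U+0131 LATIN SMALL LETTER DOTLESS I Ll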
Update 2:
The test ((ch&0x20)==ch) for lower case is broken for most languages in the world, not just Turkish. So is the algorithm for converting upper case to lower case you mention. Also, the test for being a letter is incorrect: in many languages Z is not the last letter of the alphabet. To work with text correctly you must use library functions that have been written by people who know what they are doing.
Unicode is supposed to be universal. Creating national and language-specific variants of encodings is what led us to the mess that Unicode is trying to solve. Unfortunately there is no universal standard for ordering characters. For example in English a = ä < z, but in Swedish a < z < ä. In German Ü is equivalent to U by one standard, and to UE by another. In Finnish Ü = Y. There is no way to order code points so that the ordering would be correct in every language.
I'm having problems converting the string to something readable. I'm using
NSString *substring = [NSString stringWithUTF8String:[symbol.data cStringUsingEncoding:NSUTF8StringEncoding]];
but I can't convert \U7ab6\U51b1 into '.
It shows as 窶冱, which is what I don't want; it should show as an '. Can anyone help me?
it is shown as a ’
That's character U+2019 RIGHT SINGLE QUOTATION MARK.
What has happened is you've had the character sequence ’s submitted to you, in the UTF-8 encoding, which comes out as bytes:
’ s
E2 80 99 73
That byte sequence has then, incorrectly, been interpreted as if it were encoded in Windows code page 932 (Japanese; more or less Shift-JIS):
E2 80 99 73
窶 冱
So in this one particular case, you could recover the ’s string by firstly encoding the characters into cp932 bytes, and then decoding those bytes back to characters using UTF-8.
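A quick Python sketch of that one-off recovery (it only works because, as noted below, this particular UTF-8 byte sequence also happens to be valid cp932):

mangled = "窶冱"
print(mangled.encode("cp932").decode("utf-8"))   # ’s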
However, this will not solve your real problem, which is that the strings were read in incorrectly in the first place. You got 窶冱 in this case because the UTF-8 byte sequence resulting from encoding ’s happened also to be a valid Shift-JIS byte sequence. But that won't be the case for all possible UTF-8 byte sequences you might get. Many other characters will be unrecoverably mangled.
You need to find where bytes are being read into the system and decoded as Shift-JIS, and fix that to use UTF-8 instead.
Quick & dirty Q: Can I safely assume that a byte of a UTF-8, UTF-16 or UTF-32 codepoint (character) will not be an ASCII whitespace character (unless the codepoint is representing one)?
I'll explain:
Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, linefeed etc - Unicode defines some more whitespace characters, but forget about them).
So what I do is that I loop through the string and check if any of the bytes match the bytes that define whitespace characters. Take e.g. 0D (hex) for carriage return. Note that we are talking bytes here, not characters.
Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else - and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D - and this codepoint does not represent a carriage return?
UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don't know the details well enough to say for sure.
As for UTF-16 and UTF-32 I doubt it'll work at all, but I barely know anything about the details of these, so feel free to surprise me there...
The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I'm hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.
For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high-bit set and all ASCII characters have the high bit unset.
Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0' (unless you typically work in another language).
In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never the second or third byte of a multibyte UTF-8 sequence.
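A minimal sketch of that byte-level check in Python (assuming the input is valid, standard UTF-8):

ASCII_WHITESPACE = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20}

def has_ascii_whitespace(data: bytes) -> bool:
    # In valid UTF-8 every byte of a multibyte sequence has the high bit set,
    # so any byte below 0x80 can only be a genuine ASCII character.
    return any(b in ASCII_WHITESPACE for b in data)

print(has_ascii_whitespace("naïve text".encode("utf-8")))  # True (the space)
print(has_ascii_whitespace("naïve".encode("utf-8")))       # False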
However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it, you get the single character U+0020, which is a space. (Any time you find the byte C0 or C1, it's the first byte of an overlong two-byte encoding of an ASCII character.) I've seen this done to encode strings that were originally assumed to be single words, but later requirements grew to allow the value to have spaces. In order not to break existing code (which used stuff like strtok and sscanf to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8.
You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.
Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.
For UTF-8, any continuation byte always starts with the bits 10, making it greater than 0x7f, so there's no chance it could be mistaken for an ASCII space.
You can see this in the following table:
Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx
U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx
U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx
U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx
You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
See wikipedia UTF-8 for more detail.
UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20 byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).
For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800 through 0xdfff, there's no danger that these surrogate pair components could be mistaken for spaces either.
See wikipedia UTF-16 for more detail.
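If you do have to inspect UTF-16 data in Python, iterate over 16-bit units rather than bytes; a small sketch (the array type code "H" is assumed to be an unsigned 16-bit integer, which it is on common platforms):

import array

data = "a \U0001F600b".encode("utf-16-le")   # includes a space and an astral character
units = array.array("H")                     # 16-bit code units
units.frombytes(data)

# Surrogates occupy 0xD800-0xDFFF, so a unit equal to 0x0020 is always a real space.
print([hex(u) for u in units])               # ['0x61', '0x20', '0xd83d', '0xde00', '0x62']
print(any(u == 0x20 for u in units))         # True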
Finally, UTF-32 is big enough to represent every Unicode code point in a single unit, so no special encoding is required. See wikipedia UTF-32 for more detail.
It is strongly suggested not to work against bytes when dealing with Unicode. The two major platforms (Java and .NET) support Unicode natively and also provide mechanisms for determining these kinds of things. For example, in Java you can use the Character class's isSpace()/isSpaceChar()/isWhitespace() methods for your use case.
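Since the original question mentions Python, the analogous character-level (not byte-level) check there is equally simple; note that str.isspace() also matches Unicode whitespace such as U+00A0, which is broader than the ASCII-only check asked about:

text = "no\u00a0break here"
print(any(ch.isspace() for ch in text))        # True (U+00A0 and the ASCII space both count)
print(any(ch in " \t\n\r\v\f" for ch in text)) # True, but only because of the ASCII space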