I am working with a remote application that seems to do some magic with the encoding. The application renders clear responses (which I'll refer to as True and False) depending on user input. I know two valid values that will render 'True'; all others should be 'False'.
What I found (accidentally) interesting is that submitting a corrupted value also leads to 'True'.
Example input:
USER10 //gives True
USER11 //gives True
USER12 //gives False
USER.. //gives False
OTHERTHING //gives False
So basically only the first two values render the True response.
What I noticed is that USER˱0 (hex-wise \x55\x53\x45\x52\xC0\xB1\x30) is, surprisingly, accepted as True.
I did check other hex bytes, with no such success. This leads me to the conclusion that \xC0\xB1 is somehow being translated into 0x31 (= '1').
My question is: how could this happen? Is the application performing some weird conversion from UTF-16 (or something else) to UTF-8?
I'd appreciate any comments/ideas/hints.
C0 is an invalid lead byte for a two-byte UTF-8 sequence (anything it starts is an overlong encoding), but if a buggy UTF-8 decoder accepts it anyway, C0 B1 decodes to ASCII 31h (the character '1').
Quoting Wikipedia:
...(C0 and C1) could only be used for an invalid "overlong encoding" of ASCII characters (i.e., trying to encode a 7-bit ASCII value between 0 and 127 using two bytes instead of one....
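For illustration, here is a minimal Python sketch (not the application's actual code, just an assumption about how a lenient decoder behaves) showing why C0 B1 collapses to '1' once the prefix bits are stripped without validation:

# The reported bytes: "USER" + C0 B1 + "0"
data = bytes([0x55, 0x53, 0x45, 0x52, 0xC0, 0xB1, 0x30])

# A strict decoder rejects the overlong sequence outright:
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict decoder:", err)            # invalid start byte 0xc0

# A naive decoder that just strips the 110xxxxx/10xxxxxx prefix bits accepts it:
lead, cont = 0xC0, 0xB1
code_point = ((lead & 0x1F) << 6) | (cont & 0x3F)
print(hex(code_point), chr(code_point))      # 0x31 '1' -> the string becomes "USER10"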
Related
I have a problem and I hope you can help me. I just started learning about UTF-8 and Unicode. The professor wrote "ciaò" in a text file and showed us its content, displaying each character in hexadecimal (for example, 'c' is 0063, 'i' is 0069, 'a' is 0061). The problem is the 'ò' character, which takes 2 bytes in UTF-8: C3 B2 (hex). The exercise he gave us is to verify that in UTF-8 the 'ò' character is written exactly like that (for the solution he advised us to look at the Unicode website).
I tried to do the exercise this way: I saw that the character 'ò' in hex is 00F2, I converted it to binary (11110010), and I formed the two UTF-8 bytes by filling them in: |110|11110| and |10|010000|. The problem is that this way I get the following values: DE (instead of C3 for the first byte) and 90 (instead of B2 for the second byte). Can someone explain where I went wrong, please?
For the character 'ò', its UTF-16 representation is 00F2 (the same as its Unicode code point), and its UTF-8 representation is C3 B2. I don't think you can use 00F2 to represent it in UTF-8.
To verify that C3 B2 is 'ò', you can check a website like this one, or, if you are using a Linux-like terminal, you can run:
echo -e "\xC3\xB2"
Which should simply print "ò".
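As a cross-check, here is a small Python sketch of the derivation the exercise asks for (the trick is to pad the code point to 11 bits first, then distribute them into the 110xxxxx 10xxxxxx pattern, rather than splitting the raw 8 bits):

cp = 0x00F2                               # code point of 'ò'
bits = format(cp, "011b")                 # '00011110010' -- 11 payload bits
first = 0b11000000 | int(bits[:5], 2)     # 110 + 00011  -> 0xC3
second = 0b10000000 | int(bits[5:], 2)    # 10  + 110010 -> 0xB2
print(hex(first), hex(second))            # 0xc3 0xb2
print("ò".encode("utf-8").hex())          # c3b2 -- same answer from the built-in encoder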
Some charsets don't have all of the first 128 characters identical to ASCII, but are A to Z and a to z always in the same positions?
I had a plan to set Apache's default charset to something odd in my test environment, to make it easy to detect sites that don't tell the browser what encoding they're sending.
But so far, I can't find one that makes A to Z show up as something else.
There is another question close to this subject, but that's about all 128 ASCII characters:
Are ASCII characters always encoded the same way in all character encodings?
No, EBCDIC from IBM is the famous exception.
Another test case is UTF-16 Big Endian, which encodes "A" (U+0041) as the two bytes 00 41. ASCII-oriented code would treat the leading 00 as a NUL, which is often interpreted as end-of-string.
In short: no. The encoding mentioned in the other answer, EBCDIC, has a very different layout, to pick just one example.
Most encodings you find in the wild on the web today are probably ASCII compatible. But there are encodings other than ASCII which are entirely incompatible too.
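If you want to see the difference concretely, a quick Python check (cp037 is one of the EBCDIC code pages shipped with Python; any EBCDIC variant would make the same point) shows where 'A' lands in each encoding:

for enc in ("ascii", "cp037", "utf-16-be"):
    print(enc, "A".encode(enc).hex())
# ascii     41
# cp037     c1    -- EBCDIC puts the Latin letters somewhere else entirely
# utf-16-be 0041  -- the leading 00 byte is what ASCII code would read as NUL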
ASCII is an 8-bit value. Unicode may be an 8-, 16- or 32-bit value. If I define the subclass as character, how does WTX know whether it is an 8-bit or a 16-bit character?
Setting an item's subclass to character is only one half of the solution. You also have to set the language (it defaults to "Western") and, more importantly, the character set. If you choose UTF-8 (-16, -32), the parser is capable of recognizing multi-byte characters and will read them properly (provided, of course, that the document being parsed is encoded in the type tree's encoding).
Quick & dirty Q: Can I safely assume that a byte of a UTF-8, UTF-16 or UTF-32 codepoint (character) will not be an ASCII whitespace character (unless the codepoint is representing one)?
I'll explain:
Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, linefeed etc - Unicode defines some more whitespace characters, but forget about them).
So what I do is that I loop through the string and check if any of the bytes match the bytes that define whitespace characters. Take e.g. 0D (hex) for carriage return. Note that we are talking bytes here, not characters.
Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else - and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D - and this codepoint does not represent a carriage return?
UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don't know the details well enough to say for sure.
As for UTF-16 and UTF-32 I doubt it'll work at all, but I barely know anything about the details of these, so feel free to surprise me there...
The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I'm hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.
For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high bit set, and all ASCII characters have the high bit unset.
Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
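A minimal Python sketch of the byte-for-byte scan from the question; under the assumption that the input is valid UTF-8, multi-byte characters can never produce a false hit:

WS = frozenset(b" \t\n\r\v\f")            # the ASCII whitespace bytes we care about

def has_ascii_whitespace(buf: bytes) -> bool:
    # Every byte of a multi-byte UTF-8 character is >= 0x80, so it can't be in WS.
    return any(b in WS for b in buf)

print(has_ascii_whitespace("no\u00A0break\u2003here".encode("utf-8")))  # False
print(has_ascii_whitespace("a b".encode("utf-8")))                      # True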
You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0' (unless you typically work in another language).
In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never the second or third byte of a multibyte UTF-8 sequence.
However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it, you get the one-byte value 20, which is a space. (Any time you find the byte C0 or C1, it's the first byte of an overlong two-byte encoding of an ASCII character.) I've seen this done to encode strings that were originally assumed to be single words, but whose requirements later grew to allow the value to contain spaces. In order not to break existing code (which used things like strtok and sscanf to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8.
You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.
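A tiny sketch of the overlong trick described above (purely illustrative, not something you should emit yourself): any 7-bit value can be smuggled into a two-byte sequence that a lenient decoder will read back as the original ASCII character:

def overlong2(cp: int) -> bytes:
    # Pack a 7-bit code point into the two-byte 110xxxxx 10xxxxxx pattern (invalid per RFC 3629).
    assert cp < 0x80
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

print(overlong2(0x20).hex())   # c0a0 -- a lenient decoder turns this back into a space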
Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.
For UTF-8, all continuation bytes start with the bits 10, making them greater than 0x7F, so there's no chance they could be mistaken for an ASCII space.
You can see this in the following table:
Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx
You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
See wikipedia UTF-8 for more detail.
UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20 byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).
For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800 through 0xdfff, there's no danger that these surrogate pair components could be mistaken for spaces either.
See wikipedia UTF-16 for more detail.
Finally, UTF-32 (wikipedia link here) is big enough to represent all of the Unicode code points so no special encoding is required.
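To make the UTF-16 pitfall above concrete, here is a small Python check: U+2020 (DAGGER) encodes in UTF-16-BE as the bytes 20 20, which a byte-level scan would misread as two spaces, while its UTF-8 form has no such collision:

print("\u2020".encode("utf-16-be").hex())   # 2020   -- both bytes look like ASCII space
print("\u2020".encode("utf-8").hex())       # e280a0 -- every byte has the high bit set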
It is strongly suggested not to work against bytes when dealing with Unicode. The two major platforms (Java and .NET) support Unicode natively and also provide mechanisms for determining these kinds of things. For example, in Java you can use the Character class's isSpace()/isSpaceChar()/isWhitespace() methods for your use case.
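The same idea in Python, for comparison (decode to characters first, then test them; note that the platform's notion of whitespace is broader than the ASCII set):

text = b"tab\tand\xc2\xa0nbsp".decode("utf-8")
print([ch for ch in text if ch.isspace()])   # ['\t', '\xa0'] -- str.isspace also matches U+00A0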
I'm trying to understand what the input requirements are for base64 encoding. Nicholas Zakas, whom I have tremendous respect for, has an article here where he quotes a specification saying that an error should be thrown if the input contains any character with a code higher than 255: Zakas Article on base64
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters. Since base64 encoding requires eight bits per input character, any character with a code higher than 255 cannot be accurately represented. The specification indicates that an error should be thrown in this case:
if (/([^\u0000-\u00ff])/.test(text)){
    throw new Error("Can't base64 encode non-ASCII characters.");
}
He provides a link in another part of the article to RFC 3548, but I don't see any input requirements there other than:
Implementations MUST reject the encoding if it contains characters
outside the base alphabet when interpreting base encoded data, unless
the specification referring to this document explicitly states
otherwise.
I'm not sure what "base alphabet" means, but perhaps this is what Zakas is referring to. By saying implementations must reject the encoding, it seems to imply that this applies to something that has already been encoded, as opposed to the input (of course, if the input is invalid it will also show up in the encoding, so perhaps the point is moot).
A bit confused on what the standard is.
Fundamentally, it's a mistake to talk about "base64 encoding a string" where "string" is meant in terms of text.
Base64 encoding is applied to binary data (a sequence of bytes, or octets if you want to be even more picky), and the result is text. Every character in the output is printable ASCII text. The whole point of base64 is to provide a safe way of converting arbitrary binary data into a text format which can be reliably embedded in other text, transported etc. ASCII is compatible with almost all character sets, so you're very unlikely to be unable to encode ASCII text as part of something else.
When someone talks about "base64 encoding a string" they're really talking about encoding text as binary using some existing encoding (e.g. UTF-8), then applying a base64 encoding to the result. When decoding, you'd need to decode the base64 back to binary, and then decode that binary data with the original encoding, to get the original text.
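A short Python sketch of that round trip (the choice of UTF-8 here is just an assumption; the two sides merely have to agree on the text encoding):

import base64

text = "héllo"
b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")   # text -> bytes -> base64 text
print(b64)                                                     # aMOpbGxv
print(base64.b64decode(b64).decode("utf-8"))                   # héllo -- decode base64, then UTF-8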
For me the (first) linked article has a fundamental problem:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters
You don't base64 encode strings. You base64 encode byte sequences. And when you're dealing with any kind of encoding work, it's extremely important to keep in mind this difference.
Also, his check for 'ASCII' actually lets through everything from 80 to ff, which aren't ASCII - ASCII is only 00 to 7f.
Now, if you have a string which you have checked is pure ASCII, you can then safely treat it as a byte sequence of the ASCII values of the characters in it - but this is a separate earlier step, nothing strictly to do with the act of base64 encoding.
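As a sketch of that separate "check it's really ASCII" step (shown in Python here, since the article's check was JavaScript; str.isascii needs Python 3.7+), it is the code-point range 00 to 7F you want to test:

s = "café"
print(all(ord(ch) < 0x80 for ch in s))   # False -- 'é' is U+00E9, outside 00-7F
print(s.isascii())                       # same test via the built-in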
(I should say that I do like his repeated urging for the reader to note that base64 encoding is not in any shape or form encryption)