What can explain this bad character-encoding? - unicode

What "stack" of bad encoding would produce the following bytes of weirdness for the string "cinéma télédiffusion"? (I left out the space character, hex: 20)
cinÃ%ma
in HEX: 63 69 6E C3 83 25 6D 61
mapped: c i n ---�---- m a
tÃclÃcdiffusion
in HEX: 74 C3 83 63 6C C3 83 63 64 69 66 66 75 73 69 6F 6E
mapped: t ---�---- l ---�---- d i f f u s i o n
The ---�---- parts represent the bytes that aren't right.
I considered the idea "What if it was a messed up transcoding? How about a double encoding?", but, looking at http://www.fileformat.info/info/unicode/char/00e9/charset_support.htm (and the code page edition, too), I noted that there are no encodings that could possibly encode é with the hex bytes %25 or %63. It doesn't even look like double UTF-8 encoding at this point, because http://en.wikipedia.org/wiki/UTF-8 clarified that any byte following a %C3 would need to have its top two bits set to 10 (10xxxxxx).
How could some program have turned the accented é into an "Ã followed by %" as well as "Ã followed by c"? I want to trace back the history of the misencoding so that I can try to come up with something that can take steps at repairing the mangled strings.
There also exists the possibility that the é weren't ever é to begin with, but I can't fathom what kind of typo someone could have made in the same phrase to get two different versions of é that eventually get misencoded into two completely different sets of bytes.
Extra context details: I find these mangled strings inside of an XML file. The file has no <?xml version="1.0"?> header, so it's presumed to be UTF-8. Some nodes contain phrases with perfectly good é characters, while other nodes in the same file contain phrases with mangled é characters.
iconv-and-family don't do anything at all to help this situation, as far as I've attempted.
A couple of trailing considerations that I now hold are: Should I suspect MySQL and its infamously lazy character set transcodings? Could it be somebody's really badly written custom encoding function as they exported the XML?

The encoding looks a bit strange:
Taking the é from cinéma, UTF-8 encoding gives:
é = C3 A9
whereas you got:
C3 83 25
If it had been double encoded, the following should have happened:
c3: Ã -> c3 83
a9: © -> c2 a9
But this does not explain the 25 in your result.
25: %
So the question is whether it was encoded once, then unknown characters like © were replaced by %, and then it was encoded a second time?
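The double-encoding hypothesis can be checked mechanically. Below is a minimal Python sketch, assuming the intermediate misreading was ISO-8859-1 (Latin-1); the question does not confirm which single-byte charset was involved, and the helper name double_encode is made up for illustration.

```python
def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread those bytes as Latin-1, then encode as UTF-8 again."""
    once = text.encode("utf-8")
    misread = once.decode("latin-1")  # assumption: the wrong charset was Latin-1
    return misread.encode("utf-8")


clean = "é".encode("utf-8")
doubled = double_encode("é")
observed = bytes.fromhex("C38325")  # bytes reported in the question for cinÃ%ma

print(clean.hex(" "))    # c3 a9
print(doubled.hex(" "))  # c3 83 c2 a9

# The observed bytes share the C3 83 prefix with the double-encoded form,
# but the C2 A9 that should follow is replaced by a single 0x25 ('%').
print(observed[:2] == doubled[:2])  # True
```

So a plain double encoding accounts for the C3 83 part only; some extra substitution step would still be needed to explain the % and the c.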

Related

Parser is not encoding string correctly

Text I'm trying to get:
przełącznica
This is what I actually have (browser might not view it properly - there are two squares instead of "łą"):
przecznica
BLOB:
70 72 7A 65 C5 82 C4 85 63 7A 6E 69 63 61
EDIT: This is what I get from parser
70 72 7A 65 1A 1A 63 7A 6E 69 63 61
ESQL used to parse BLOB:
DECLARE blobMsg BLOB InputRoot.BLOB.BLOB ;
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg DOMAIN ('XMLNSC') NAME 'XMLNSC';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg.XMLNSC PARSE(blobMsg OPTIONS FolderBitStream CCSID 1208 FORMAT 'XMLNSC');
I have tried CCSIDs: 1208 (UTF8), 912 (ISO-8859-2), 1200(UTF16 I guess):
https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_71/nls/rbagsccsidcdepgscharsets.htm
EDIT: Working code:
DECLARE blobMsg BLOB InputRoot.BLOB.BLOB;
DECLARE remove BLOB X'EFBBBF';
DECLARE message BLOB REPLACE(InputRoot.BLOB.BLOB, remove, CAST('' AS BLOB));
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg DOMAIN ('XMLNSC') NAME 'XMLNSC';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg.XMLNSC PARSE(message OPTIONS FolderBitStream CCSID 05348 FORMAT 'XMLNSC');
Firstly, przełącznica by itself is not valid XML, so you'll get an exception when you try to invoke the XMLNSC parser using the code you have outlined. You need to do a CAST instead.
I generated a little test Application/MsgFlow in IIB 10 to illustrate CASTing the BLOB.
The code in ConvertAndParse is
CREATE COMPUTE MODULE ConvertAndParse
CREATE FUNCTION Main() RETURNS BOOLEAN
BEGIN
DECLARE blobMsg BLOB X'70727A65C582C485637A6E696361';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg DOMAIN 'XMLNSC';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg.XMLNSC NAME 'AsUtf8' VALUE CAST(blobMsg AS CHAR CCSID 1208);
CREATE LASTCHILD OF OutputRoot DOMAIN 'XMLNSC';
CREATE LASTCHILD OF OutputRoot.XMLNSC.EncodingResponse NAME 'AsUtf8InTag' VALUE CAST(blobMsg AS CHAR CCSID 1208);
CREATE LASTCHILD OF OutputRoot.XMLNSC.EncodingResponse NAME CAST(blobMsg AS CHAR CCSID 1208) VALUE 'As a tag name';
RETURN TRUE;
END;
END MODULE;
When I run a debug session the value put into the LocalEnvironment tree looks like.
And the result of invoking the flow from a browser.
Now let's deal with which encoding we are looking at. Looking at what I assume is the input BLOB, let's see if it matches up with UTF-8.
70 72 7A 65 C5 82 C4 85 63 7A 6E 69 63 61
UTF-8 is a variable-width character encoding that sets the high order bit of a byte to indicate that it belongs to a sequence of two or more bytes. We also want a page that shows the common code points for UTF-8, such as the "Complete Character List for UTF-8". Note it's not actually complete.
Looking at the first 4 bytes none of them have the high order bit on
70 72 7A 65
And the aforementioned Character List says that's prze, so far so good.
Then we hit C5, which has the high order bit on. Doing a bit of visual parsing, we get two probable two-byte character pairs
C5 82
C4 85
Referring to the Character List our two candidate pairs do in fact match the two characters we want and the next six characters which do not have their high order bits on translate to cznica. Looking really good.
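The visual parsing above can also be done in code. The following Python sketch labels each byte by its UTF-8 role based only on its leading bits; the function name classify_utf8_bytes and the label strings are made up for this illustration.

```python
def classify_utf8_bytes(data: bytes) -> list[str]:
    """Label each byte by the role its leading bits give it in UTF-8."""
    kinds = []
    for b in data:
        if b < 0x80:                # 0xxxxxxx: single-byte (ASCII) character
            kinds.append("ascii")
        elif b & 0xC0 == 0x80:      # 10xxxxxx: continuation byte
            kinds.append("continuation")
        elif b & 0xE0 == 0xC0:      # 110xxxxx: lead byte of a 2-byte sequence
            kinds.append("lead-2")
        elif b & 0xF0 == 0xE0:      # 1110xxxx: lead byte of a 3-byte sequence
            kinds.append("lead-3")
        elif b & 0xF8 == 0xF0:      # 11110xxx: lead byte of a 4-byte sequence
            kinds.append("lead-4")
        else:
            kinds.append("invalid")
    return kinds


blob = bytes.fromhex("70727A65C582C485637A6E696361")
for byte, kind in zip(blob, classify_utf8_bytes(blob)):
    print(f"{byte:02X} {kind}")
```

For this BLOB the labels come out as four ascii bytes, then lead-2/continuation twice (C5 82 and C4 85), then six more ascii bytes, which is the shape expected for przełącznica in UTF-8.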
Now to eliminate the other candidate encodings, if we can.
UTF-16 uses 2 or 4 bytes to represent each character, in the byte order indicated by the Byte Order Mark, with prze encoded as
UTF-16BE - CP 1200 - 00 70 00 72 00 7A 00 65
UTF-16LE - CP 1202 - 70 00 72 00 7A 00 65 00
Given that there are not lots and lots of null characters 00 it is reasonable to discount UTF-16.
ISO-8859-2 - CP 912 is a single byte character set and the C5 and C4 code points do not match the two desired characters and thus we can eliminate it.
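The same elimination can be reproduced outside the broker. This is a small Python sketch using Python's own codec names (utf-8, iso8859_2, utf-16-be, utf-16-le), which are assumed here to correspond to CCSIDs 1208 and 912 and to the two UTF-16 byte orders; it is only a cross-check, not ESQL.

```python
blob = bytes.fromhex("70727A65C582C485637A6E696361")
target = "przełącznica"

as_utf8 = blob.decode("utf-8")
as_latin2 = blob.decode("iso8859_2")   # every byte becomes exactly one character

print(as_utf8 == target)               # True
print(as_latin2 == target)             # False
print(len(as_utf8), len(as_latin2))    # 12 14

# UTF-16 would interleave a 00 byte with every ASCII letter.
prze_be = "prze".encode("utf-16-be")
prze_le = "prze".encode("utf-16-le")
print(prze_be.hex(" "))                # 00 70 00 72 00 7a 00 65
print(prze_le.hex(" "))                # 70 00 72 00 7a 00 65 00
print(0x00 in blob)                    # False
```

Only the UTF-8 interpretation yields the expected 12-character word; the single-byte interpretation produces 14 characters, and the BLOB contains no null bytes at all.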

Send Concatenating SMS with Pdu Format

Sir,
I have sent an SMS in PDU format through AT commands.
AT+CMGS=18
0011000C912933634241140000AA04D370DA0C
The message was sent successfully. But when I try to send a message with a UDH and UDHL, I use the following AT command and it shows me an error:
AT+CMGS=24
0011000C912933634241140000AA05000303020104D370DA0C
What is wrong in my code? Please help me.
Is it 7-bit encoded? I'm trying to solve this problem myself, and the issue is that with a UDH present, the message part (04D370DA0C) has to be padded (at least in my case).
The text below is from https://en.wikipedia.org/wiki/Concatenated_SMS#PDU_Mode_SMS
the UDH is a total of (number of octets x bit size of octets) 6 x 8 = 48 bits long. Therefore, a single bit of padding has to be prepended to the message. The UDH is therefore (bits for UDH / bits per septet) = (48 + 1)/7 = 7 septets in length.
With a message of "Hello world", the [message] is encoded as
90 65 36 FB 0D BA BF E5 6C 32
as you need to prepend the least significant bits of the next 7bit
character whereas without padding, the [message] would be
C8 32 9B FD 06 DD DF 72 36 19
and the UDL is 7 (header septets) + 11 (message septets) = 18 septets.
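The padding arithmetic in the quoted passage can be reproduced with a short Python sketch. The function name pack_septets is made up here, and the sketch assumes that every character's GSM 7-bit code equals its ASCII code, which holds for the letters and the space in "Hello world" but not for the whole GSM alphabet.

```python
def pack_septets(text: str, pad_bits: int = 0) -> bytes:
    """Pack 7-bit character codes into octets, least significant bit first.

    pad_bits zero bits are inserted before the first septet, as needed when
    a UDH precedes the text. Assumption: GSM code == ASCII code for `text`.
    """
    bits = 0
    for i, ch in enumerate(text):
        code = ord(ch)
        if code > 0x7F:
            raise ValueError(f"{ch!r} does not fit in 7 bits")
        bits |= code << (7 * i + pad_bits)
    total_bits = 7 * len(text) + pad_bits
    return bits.to_bytes((total_bits + 7) // 8, "little")


udh_bits = 6 * 8                         # 6 octets of header, UDHL included
pad_bits = (7 - udh_bits % 7) % 7        # bits needed to reach a septet boundary
header_septets = (udh_bits + pad_bits) // 7
udl = header_septets + len("Hello world")

print(pad_bits, header_septets, udl)                       # 1 7 18
print(pack_septets("Hello world").hex(" ").upper())        # C8 32 9B FD 06 DD DF 72 36 19
print(pack_septets("Hello world", pad_bits).hex(" ").upper())  # 90 65 36 FB 0D BA BF E5 6C 32
```

Treating the whole message as one little-endian bit string keeps the sketch short; a production encoder would also need the real GSM alphabet table and the escape characters.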

perl perlpacktut not making sense for me

I am REALLY confused about pack and unpack definition for perl.
Below is an excerpt from perldoc.perl.org
The pack function converts values to a byte sequence containing
representations according to a given specification, the so-called
"template" argument. unpack is the reverse process, deriving some values
from the contents of a string of bytes.
So I get the idea that pack takes human-readable things (such as A) and turns them into a binary format. Am I wrong in this interpretation?
That is my interpretation, but the same doc then immediately gives this example, which seems to do exactly the opposite of what I understood.
my( $hex ) = unpack( 'H*', $mem );
print "$hex\n";
What am I missing?
The pack function puts one or more things together in a single string. It represents things as octets (bytes) in a way that it can unpack reliably in some other program. That program might be far away (like, the distance to Mars far away). It doesn't matter if it starts as something human readable or not. That's not the point.
Consider some task where you have a numeric ID that's up to about 65,000 and a string that might be up to six characters.
print pack 'S A6', 137, $ARGV[0];
It's easier to see what this is doing if you run it through a hex dumper as you run it:
$ perl test.pl Snoopy | hexdump -C
00000000 89 00 53 6e 6f 6f 70 79 |..Snoopy|
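The first column counts the position in the output, so ignore that. Then the first two octets represent the S (short, 'word', whatever, but two octets) format. I gave it the number 137 (0x0089) and it stored that as the two octets 89 00, least significant octet first, because S uses the machine's native byte order and this machine is little-endian. Then it stored 'Snoopy' in the next six octets.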
Now try it with a shorter name:
$ perl test.pl Linus | hexdump -C
00000000 89 00 4c 69 6e 75 73 20 |..Linus |
Now there's a space character at the end (0x20). The packed data still has six octets. Try it with a longer name:
$ perl test.pl 'Peppermint Patty' | hexdump -C
00000000 89 00 50 65 70 70 65 72 |..Pepper|
Now it truncates the string to fit the six available spaces.
Consider the case where you immediately send this through a socket or some other way of communicating with something else. The thing on the other side knows it's going to get eight octets. It also knows that the first two will be the short and the next six will be the name. Suppose the other side stored that in $tidy_little_package. It gets the separate values by unpacking them:
my( $id, $name ) = unpack 'S A6', $tidy_little_package;
That's the idea. You can represent many values of different types in a binary format that's completely reversible. You send that packed string wherever it needs to be used.
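For readers who want to compare with another language, roughly the same round trip can be sketched with Python's standard struct module. This is an analogy rather than a translation of the Perl: the helper names pack_record and unpack_record are made up, the byte order is fixed to little-endian ('<H') to match the hex dumps above, and the space padding that Perl's A format does automatically is done by hand here.

```python
import struct

RECORD = "<H6s"  # little-endian unsigned short, then exactly six bytes


def pack_record(ident: int, name: str) -> bytes:
    """Pack an ID and a name into eight octets, like pack 'S A6' on a little-endian machine."""
    raw = name.encode("ascii")[:6].ljust(6, b" ")  # truncate or space-pad to six bytes
    return struct.pack(RECORD, ident, raw)


def unpack_record(data: bytes) -> tuple[int, str]:
    """Recover the ID and the name, dropping the trailing space padding."""
    ident, raw = struct.unpack(RECORD, data)
    return ident, raw.decode("ascii").rstrip(" ")


for name in ("Snoopy", "Linus", "Peppermint Patty"):
    packed = pack_record(137, name)
    print(packed.hex(" "), unpack_record(packed))
```

Each packed record is always eight octets, so the receiving side can split it back into the ID and the name without any delimiter.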
I have many more examples of pack in Learning Perl and Programming Perl.

Base64 Encoding and Decoding

I would appreciate if someone could please explain this to me.
I came across this post (not important just reference) and saw a token encoded with base64 where the guy decoded it.
EYl0htUzhivYzcIo+zrIyEFQUE1PQkk= -> t3+(:APPMOBI
I then tried to encode t3+(:APPMOBI again using base64 to see if I would get the same result, but was very surprised to get:
t3+(:APPMOBI - > dDMrKDpBUFBNT0JJ
A completely different token.
I then tried to decode the original token EYl0htUzhivYzcIo+zrIyEFQUE1PQkk= and got t3+(:APPMOBI with random characters in between. (I got ◄ëtå╒3å+╪═┬(√:╚╚APPMOBI; this could be wrong, I did it quickly off the top of my head.)
What is the reason for the difference in the tokens? Weren't they supposed to be the same?
The whole purpose of base64 encoding is to encode binary data into a text representation so that it can be transmitted over the network or displayed without corruption. Ironically, corruption is exactly what happened in the original post you were referring to:
EYl0htUzhivYzcIo+zrIyEFQUE1PQkk= does NOT decode to t3+(:APPMOBI
instead, it contains some binary bytes(not random btw) that you correctly showed. So the problem was due to the original post where either the author or the tool/browser that she/he used "cleaned up", or rather corrupted the decoded binary data.
There is always a one-to-one relationship between encoded and decoded data (provided the same "base" is used, i.e. the same set of characters is used for the encoded text).
t3+(:APPMOBI indeed will be encoded into dDMrKDpBUFBNT0JJ
The problem is in the encoding that displayed the output to you, or in the encoding that you used to input the data to base64. This is actually the problem that base64 encoding was invented to help solve.
Instead of trying to copy and paste the non-ASCII characters, save the output as a binary file, then examine it. Then, encode the binary file. You'll see the same base64 string.
c:\TEMP>type b.txt
EYl0htUzhivYzcIo+zrIyEFQUE1PQkk=
c:\TEMP>base64 -d b.txt > b.bin
c:\TEMP>od -t x1 b.bin
0000000 11 89 74 86 d5 33 86 2b d8 cd c2 28 fb 3a c8 c8
0000020 41 50 50 4d 4f 42 49
c:\TEMP>base64 -e b.bin
EYl0htUzhivYzcIo+zrIyEFQUE1PQkk=
od is a tool (octal dump) that, when run with -t x1 as here, outputs binary data in hexadecimal notation and shows each of the bytes.
EDIT:
You asked about a different string in your comments, dDMrKDpBUFBNT0JJ, and why does that decode to the same thing? Well, it doesn't decode to the same thing. It decodes to this string of bytes: 74 33 2b 28 3a 41 50 50 4d 4f 42 49. Your original string decoded to this string of bytes: 11 89 74 86 d5 33 86 2b d8 cd c2 28 fb 3a c8 c8 41 50 50 4d 4f 42 49.
Notice the differences: your original string decoded to 23 bytes, your second string decoded to only 12 bytes. The original string included non-ASCII bytes like 11, d5, d8, cd, c2, fb, c8, c8. These bytes don't print the same way on every system. You referred to them as "random bytes", but they're not. They're part of the data, and base64 is designed to make sure they can be transmitted.
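The same comparison can be run in Python with the standard base64 module. The filtering step in this sketch is only a guess at what kind of "clean-up" might have produced t3+(:APPMOBI in the original post (keeping printable ASCII bytes and dropping the rest); nothing in the thread confirms that this is what actually happened.

```python
import base64

original = "EYl0htUzhivYzcIo+zrIyEFQUE1PQkk="
raw = base64.b64decode(original)

print(len(raw))        # 23
print(raw.hex(" "))    # 11 89 74 86 d5 33 86 2b d8 cd c2 28 fb 3a c8 c8 41 50 50 4d 4f 42 49

# Re-encoding the untouched bytes gives back the original token.
print(base64.b64encode(raw).decode("ascii") == original)   # True

# Hypothetical clean-up: keep only printable ASCII bytes (0x20..0x7E).
printable = bytes(b for b in raw if 0x20 <= b <= 0x7E)
print(printable)                                           # b't3+(:APPMOBI'
print(base64.b64encode(printable).decode("ascii"))         # dDMrKDpBUFBNT0JJ
```

Dropping the 11 non-printable bytes leaves exactly the 12 bytes of t3+(:APPMOBI, which is why re-encoding the displayed text yields a shorter, different token.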
I think to understand why these strings are different, you need to first understand the nature of character data, what base64 is, and why it exists. Remember that computers work only on numbers, but people need to work with familiar concepts like letters and digits. So ASCII was created as an "encoding" standard that represents a little number (we call this little number a "byte") as a letter or a digit, so that we humans can read it. If we line up a group of bytes, we can spell out a message. 41 50 50 4d 4f 42 49 are the bytes that represent the word APPMOBI. We call a group of bytes like this a "string".
Every letter from A-Z and every digit from 0-9 has a number specified in ASCII that represents it. But there are many extra numbers that are not in the standard, and not all of those represent visible or sensible letters or digits. We say they're non-printable. Your longer message includes many bytes that aren't printable (you called them random.)
When a computer program like email is dealing with a string, if the bytes are printable ASCII characters, it's easy. The email program knows what to do with them. But if your bytes instead represent a picture, the bytes could have values that aren't ASCII, and various email programs won't know what to do with them. Base64 was created to take all kinds of bytes, both printable and non-printable bytes, and translate them into a string of bytes representing only printable letters. Because they're all printable, a program like email or a web server can easily handle them, even if it doesn't know that they actually contain a picture.
Here's the decode of your new string:
c:\TEMP>type c.txt
dDMrKDpBUFBNT0JJ
c:\TEMP>base64 -d c.txt
t3+(:APPMOBI
c:\TEMP>base64 -d c.txt > c.bin
c:\TEMP>od -t x1 c.bin
0000000 74 33 2b 28 3a 41 50 50 4d 4f 42 49
0000014
c:\TEMP>type c.bin
t3+(:APPMOBI
c:\TEMP>

Base64 decoding gives different result

I'm working on a little Streamserve project (google it :P) where I get some Base64 encoded content. I've tried to decode the base64 string with multiple decoders and all return the correct result.. except the Base64DecodeString method in Streamserve.
The encoded string is: 'VABlAHMAdABpAG4AZwAgAGIAYQBzAGUANgA0AA==' The expected result is: 'Testing base64'
However within Streamserve the result is: 'Tsig ae4'
It simply skips every other letter. Now I know most people don't know Streamserve, but I have a hunch that this might be a character encoding problem, and was hoping someone has a clue what might be happening here.
I can without any problem encode/decode strings within streamserve.. just not strings I get as input
The issue is that you're encoding in UTF-16 and decoding back to ASCII or UTF8. Change your string encoding to UTF8 before encoding the string to base64 and it should work fine.
Here's the hex dump of that base64 blob:
54 00 65 00 73 00 74 00 69 00 6e 00 67 00 20 00 62 00 61 00 73 00 65 00 36 00 34 00
If you remove the null bytes, you get this:
54 65 73 74 69 6e 67 20 62 61 73 65 36 34
Which translates to the following ASCII text:
Testing base64
The result of decoding base64 is binary data (and likewise the input when encoding it is binary data). To go from binary data to a string or vice versa, you need to apply an encoding such as UTF-8 or UTF-16. You need to find out which encoding Streamserve is using, and use the same encoding when you convert your text data to binary data to start with, before base64-encoding it.
It sounds like you might want to use UTF-16 to encode your text to start with, although in that case I'm surprised you're not just getting garbage out... it looks like it's actually ignoring every other byte in the decoded base64, rather than taking it as the high byte in a UTF-16 code unit.