UCS-2 unknown characters - unicode

On the page linked below I can see some unknown characters in UCS-2. What are they?
Why are they unknown? Does that mean we cannot decode them?
http://www.columbia.edu/kermit/ucs2.html
Basically, a user is sending a UCS-2 (DCS 8) message to our router, but when I decode it I get junk characters. For example, D83E DD13 is printed as ? or some junk. How can I print and view such characters with their proper value in a text file?
Thanks & regards,
Ashwini
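For what it's worth, D83E DD13 looks like a UTF-16 surrogate pair (one character above U+FFFF stored as two 16-bit units) rather than two separate UCS-2 characters, which would explain why a strict UCS-2 decoder prints junk. A minimal Python 3 sketch, assuming the payload can be treated as big-endian UTF-16; the helper names and the output file name are made up for illustration:

```python
def decode_ucs2_payload(hex_payload: str) -> str:
    """Decode a hex dump of a DCS 8 payload, treating it as UTF-16 big-endian."""
    raw = bytes.fromhex(hex_payload)  # spaces between byte pairs are ignored
    return raw.decode("utf-16-be")    # combines surrogate pairs into one character


def save_text(path: str, text: str) -> None:
    """Write the decoded text as UTF-8 so an editor can display it."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


text = decode_ucs2_payload("D83E DD13")
print(hex(ord(text)))  # code point of the single decoded character
```

Calling `save_text("decoded_sms.txt", text)` would then produce a file that a UTF-8 aware editor with a suitable font should be able to show.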

Related

Has anyone ever heard of asciihex encoding?

This type of encoding is used in SOAP messages...
I'm receiving a message encoded in ASCIIHEX and I have no idea how this encoding actually works, although I have a clear description of the encoding method:
"If this mode is used, every single original byte is encoded as a sequence of two characters representing it in hexadecimal. So, if the original byte was 0x0a, the transmitted bytes are 0x30 and 0x41 (‘0’ and ‘a’ in ASCII)."
The buffer received : "1f8b0800000000000000a58e4d0ac2400c85f78277e811f2e665329975bbae500f2022dd2978ff95715ae82cdcf9415efec823c6710247582d5965c32c65aab0f5fc0a5204c415855e7c190ef61b34710bcdc7486d2bab8a7a4910d022d5e107d211ed345f2f37a103da2ddb1f619ab8acefe7fdb1beb6394998c7dfbde3dcac3acf3f399f3eeae152012e010000"
The actual file contains this : "63CD13C1697540000000662534034000030000120011084173878R 00000001000018600050000000100460000009404872101367219 000000000000 DNSO_038114 000000002001160023Replacem000000333168625 N0000 00000000"
The provider sent me the file that contains the string above. I tried to start from the buffer string and reproduce the content sent by the provider, but without success. I also searched for this "asciihex" encoding and found nothing. If someone knows anything about this encoding or can give me any advice, I would really appreciate it. I have pretty much no experience with SOAP services.
Based on the comments above, it's possible the buffer is compressed. It starts with 1F 8B, which is the signature for GZIP compression. See the following list of signatures.
Write the bytes that correspond to the hex strings into a file. Name that file with a gz or tar.gz extension and try to extract it or open it with some file archiver tool.
Another thing you could try would be to not send the Compress element in your request, assuming it's an optional field and you can do that. If you can, check if the buffer changes and has the proper length and you can see similar patterns as the original content (for those zeros at the end, for example).
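The "write the bytes and extract" step can also be done in memory. This is only a sketch in Python 3, assuming the buffer really is hex-encoded and, when it starts with 1F 8B, a single complete gzip stream; `decode_asciihex` is a hypothetical helper name, not part of any SOAP library:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_asciihex(buffer_hex: str) -> bytes:
    """Turn the hex text back into bytes and gunzip it if it looks compressed."""
    raw = bytes.fromhex(buffer_hex)   # two hex characters -> one original byte
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)   # raises an error if the stream is damaged
    return raw                        # not compressed: return the bytes as they are
```

If `gzip.decompress` raises an error on the real buffer, the data may be truncated or use a different container, and falling back to writing the raw bytes to a .gz file and trying an archiver is still worthwhile.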

Understanding different character encodings

When I save a text document in UTF-8, that's basically saying: computer, use the code page for UTF-8 that's installed somewhere on your system to figure out how to turn the 1's and 0's into characters, right?
When I save this content:
激光
äüß
#§
in ISO-8859-1, it becomes this (on Linux, using the Kate editor):
æ¿å
äüÃ
#§
What is not displayed here is that in the first and second rows there are some weird squares displayed instead of characters (they can be seen in developer tools).
So my understanding is that the combination of 0's and 1's that represents 激 in UTF-8 is mapped to æ in ISO-8859-1, right? And the weird squares > < happen because there is no mapping for that binary number in the ISO-8859-1 character set, so the computer defaults to some other encoding.
Is that correct?
Yes, sort of correct.
If you store a file as UTF-8, it may get a special byte combination at the beginning of the file (a byte order mark) that indicates its encoding. I think Kate (I don't know this editor) doesn't recognize this and just displays the file as something else. So basically, your file is still correct, but it was displayed in the wrong way.
The weird squares are another indicator that Kate doesn't recognize those leading bytes, because editors usually hide them from the user and just use the information to display the file correctly.
You have it pretty much right. The character U+6FC0 (激) for example is encoded with 3 bytes in UTF-8: 0xE6 0xBF 0x80.
If you interpret these bytes as ISO-8859-1, you get the characters æ¿. Depending on the version of ISO-8859-1, 0x80 is either not mapped to a character at all or is mapped to a non-printable control character, which is why you see only two characters for the three bytes.
If you use Windows-1252 instead of ISO-8859-1 you'll see æ¿€.
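The byte-level explanation above can be checked with a few lines of Python 3. This is just an illustrative sketch; it relies on Python's `latin-1` codec, which maps 0x80 to the control character U+0080 rather than leaving it unmapped:

```python
raw = "激".encode("utf-8")                 # the three UTF-8 bytes of U+6FC0
print(" ".join(f"{b:02x}" for b in raw))   # show them as hex

as_latin1 = raw.decode("latin-1")          # same bytes read as ISO-8859-1
as_cp1252 = raw.decode("cp1252")           # same bytes read as Windows-1252

# repr() makes the invisible control character visible as an escape
print(repr(as_latin1))
print(repr(as_cp1252))
```

The bytes on disk never change; only the table used to turn them into characters does.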

Add UDH for concatenated Unicode SMS

This is the link where I learned to send multi-part SMS in PDU mode, a very good tutorial. But what if I want to send a Unicode SMS? From one of the developer's comments:
Yes, the DCS should be 0x08 and the UDL should be in octets (which ends up being 1 + UDHL + 2 * number of characters). Also you don't have to insert padding as in the GSM-7 case. I know you've already managed to send UCS-2 (not concatenated) messages, so it must be something small you're missing. If you wish you can post your PDUs so I can check…
Jeroen
So it seems I do not need to add the 1-bit padding for the message. But if I use the same UDH format as for a normal SMS, the phone just shows unknown characters.
Can anyone give me some hints?
This is a sample PDU with Chinese characters; it probably contains errors:
0041000B910661345542F60000A00500030302010008044F60597D
Thanks.
Your DCS is wrong.
0041000B910661345542F6000*0*A00500030302010008044F60597D
should be
0041000B910661345542F6000*8*A00500030302010008044F60597D
for a DCS of 0x08 = UCS-2 encoding.
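The UDL rule quoted in the question (1 + UDHL + 2 * number of characters, no padding) can be sketched in Python 3 as follows. This only builds the UDL, UDH and text part, not a full PDU; it assumes the common 8-bit-reference concatenation header (IEI 0x00), and `build_ucs2_user_data` is a hypothetical helper name:

```python
def build_ucs2_user_data(text: str, ref: int, total: int, seq: int) -> str:
    """Return UDL + UDH + UCS-2 text as an upper-case hex string (DCS 0x08)."""
    # UDHL=05, IEI=00 (concatenation, 8-bit reference), IEDL=03, ref, total, seq
    udh = bytes([0x05, 0x00, 0x03, ref, total, seq])
    body = text.encode("utf-16-be")       # 2 octets per 16-bit unit, no padding
    if len(udh) + len(body) > 140:        # the user data field holds 140 octets
        raise ValueError("text too long for one message part")
    udl = len(udh) + len(body)            # octets: 1 + UDHL + 2 * number of units
    return (bytes([udl]) + udh + body).hex().upper()


# First of two parts, reference number 3, carrying the two characters 4F60 597D
print(build_ucs2_user_data("\u4f60\u597d", ref=3, total=2, seq=1))
```

Note that characters outside the Basic Multilingual Plane take two 16-bit units each, so they count as 4 octets in the UDL.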

pySerial and reading binary data

When the device I am communicating with sends binary data, I can recover most of it. However, there always seem to be some bytes missing, replaced by non-standard characters. For instance, one individual output looks like this:
\xc4\xa5\x06\x00.\xb3\x01\x01\x02\x00\x00\x00=\xa9
The period and the equals sign should be ordinary bytes shown in hexadecimal form (I confirmed this in another application). Other times I get other weird characters such as ')' or 's'. These characters usually occur in the exact same spot (which varies with the command I passed to the device).
How can I fix this problem?
Are you displaying the output using something like this?
print output
If some of your bytes happen to correspond with printable characters, they'll show up as characters. Try this:
print output.encode('hex')
to see hex values for all your bytes.
At first I liked @RichieHindle's answer, but when I tried it the hex bytes were all bunched together.
To get friendlier output, I use
print ' '.join(map(lambda x:x.encode('hex'),output))
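The snippets above are Python 2; `str.encode('hex')` does not exist in Python 3. A rough Python 3 equivalent, assuming `output` is the `bytes` object returned by the serial read (the sample value is copied from the question):

```python
# Sample bytes from the question; in real code this would come from the serial port
output = b"\xc4\xa5\x06\x00.\xb3\x01\x01\x02\x00\x00\x00=\xa9"

# Iterating over bytes yields integers, so format each one as two hex digits
hex_dump = " ".join(f"{byte:02x}" for byte in output)
print(hex_dump)
```

The `.` and `=` in the original display show up here as `2e` and `3d`: they were never missing, just printed as their ASCII characters.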

IMAP message encoding problem

Some of the mail contents fetched from an IMAP server look like =C3=B6=C3=BC=C3=B6=C3=BC=C3=B6=C3=BC=. What kind of encoding is this? The mail header says the encoding is UTF-8, but when I decode it as UTF-8 I get a scrambled message.
Any help is much appreciated.
Quoted-Printable
It is used to transmit 8-bit data over a 7-bit medium.
Each 8-bit byte is converted to three 7-bit characters of the form =XX, where XX is the hexadecimal value of the byte; the = character itself becomes =3D.
The length of a line is restricted to 76 characters. Soft line breaks are added to comply with this rule: a line ending with = indicates that it continues on the next line.
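As a small illustration, Python 3's standard `quopri` module can undo the =XX escaping, after which the bytes decode as UTF-8. The sample is the string from the question without its trailing =, which is only a soft line break:

```python
import quopri

encoded = b"=C3=B6=C3=BC=C3=B6=C3=BC=C3=B6=C3=BC"

raw = quopri.decodestring(encoded)   # =XX escapes back to single bytes
text = raw.decode("utf-8")           # then interpret those bytes as UTF-8
print(text)
```

Decoding as UTF-8 only works after the quoted-printable layer has been removed, which is probably why decoding the raw mail body directly gave scrambled text.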
https://www.rfc-editor.org/rfc/rfc2045
http://en.wikipedia.org/wiki/Quoted-printable
Online Decoder