Add UDH for concatenated Unicode SMS - unicode

This is the link where I learned to send multi-part SMS in PDU mode; it is a very good tutorial. But what if I want to send a Unicode SMS? One of the comments from the developer says:
Yes, the DCS should be 0x08 and the UDL should be in octets (which ends up being 1 + UDHL + 2 * number of characters). Also you don’t have to insert padding as in the GSM-7 case. I know you’ve already managed to send UCS-2 (not concatenated) messages, so it must be something small you’re missing. If you wish you can post your PDUs so I can check…
Jeroen
So it seems I do not need to add the 1-bit padding for the message. But if I use the same UDH format as for a normal SMS, it just shows me unknown characters.
Can anyone give me some hints?
This is a sample PDU with Chinese characters, but it probably contains errors:
0041000B910661345542F60000A00500030302010008044F60597D
Thanks.

Your DCS is wrong.
0041000B910661345542F6000*0*A00500030302010008044F60597D
should be
0041000B910661345542F6000*8*A00500030302010008044F60597D
for a DCS of 0x08 = UCS-2 encoding.
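For reference, the rule quoted in the question (UDL counted in octets, no padding after the UDH) can be sketched in a few lines of Python. This is only an illustration: the function name build_ucs2_user_data and its parameters are made up here, it assumes the 8-bit-reference concatenation header (IEI 0x00) seen in the sample PDU, and it builds only the UDL + UDH + text portion, not a complete SMS-SUBMIT PDU.

```python
def build_ucs2_user_data(text, ref, total, seq):
    """Return UDL + UDH + UCS-2 text for one part of a concatenated SMS.

    Hypothetical helper for illustration; not taken from any library.
    """
    # UDH: UDHL=0x05, IEI=0x00 (concatenation, 8-bit reference), IEDL=0x03,
    # then reference number, total number of parts, and this part's index.
    udh = bytes([0x05, 0x00, 0x03, ref, total, seq])
    payload = text.encode("utf-16-be")  # 2 octets per UCS-2 character
    if len(udh) + len(payload) > 140:
        raise ValueError("part too long: at most 67 UCS-2 characters fit")
    # For DCS 0x08 the UDL is in octets: 1 + UDHL + 2 * number of characters.
    udl = len(udh) + len(payload)
    return bytes([udl]) + udh + payload


# The two characters 4F60 597D from the sample, as part 1 of 2, reference 3.
part = build_ucs2_user_data("\u4f60\u597d", ref=3, total=2, seq=1)
print(part.hex().upper())  # 0A0500030302014F60597D
```

Here the UDL octet is 0A, i.e. 10 octets: 6 for the header and 4 for the two characters.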

Related

UCS-2 unknown characters

In the link below I can see some unknown characters in UCS-2. What are they?
Why are they unknown? Does that mean we cannot decode them?
http://www.columbia.edu/kermit/ucs2.html
Basically, a user is sending a UCS-2 (DCS 8) message to our router, but when I decode it I get some junk characters. For example, D83E DD13 is printed as ? or some junk. How can I print and view such characters with their proper values in a text file?
Thanks & regards,
Ashwini
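One thing worth checking, sketched below in Python: D83E DD13 is a UTF-16 surrogate pair, so the payload may really be UTF-16 rather than strict UCS-2. Assuming that is the case, decoding the octets as big-endian UTF-16 yields a single code point outside the Basic Multilingual Plane. The helper name decode_sms_ucs2 is made up for this illustration.

```python
def decode_sms_ucs2(hex_octets):
    """Decode DCS 0x08 user data given as hex, treating it as UTF-16 BE.

    Hypothetical helper; assumes the sender used surrogate pairs.
    """
    raw = bytes.fromhex(hex_octets)  # spaces between octets are ignored
    return raw.decode("utf-16-be")


text = decode_sms_ucs2("D83E DD13")
print(len(text), hex(ord(text)))  # 1 0x1f913
```

To see the character rather than "?", the decoded string would need to be written to a file opened with encoding="utf-8" and viewed with a font that covers that code point.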

Why is there a '=' at the end of an SMTP message body?

I receive email messages over sockets and see that long lines in the message body are broken up, separated by the following expression
'=\r\n'
I cannot find any documentation on this and wonder if someone just happens to know where I can find information on this behavior.
Also, please ONLY feedback on my question, no comments regarding email and sockets!
Thanks
Alex
From Wikipedia, regarding Quoted-printable:
Lines of quoted-printable encoded data must not be longer than 76 characters. To satisfy this requirement without altering the encoded text, soft line breaks may be added as desired. A soft line break consists of an "=" at the end of an encoded line, and does not appear as a line break in the decoded text.
The \r\n that follows the "=" is the line ending of the encoded line. Together with the "=" it forms the soft line break, so a quoted-printable decoder removes both and the text continues on the same line.
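A small sketch of how such a body could be decoded, using the quopri module from Python's standard library. The sample body is made up; it only imitates the '=\r\n' pattern from the question.

```python
import quopri

# A made-up body as it might arrive on the socket: one long line that the
# sender split with a soft line break ("=" followed by CRLF).
encoded = b"This sentence was too long for one line, so the send=\r\ner wrapped it."

# Decoding removes the soft line break and rejoins the two pieces.
decoded = quopri.decodestring(encoded)
print(decoded.decode("ascii"))
```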

What does Zend_Mime::ENCODING_8BIT mean when sending mail with Zend_Mail?

In the example for Zend_Mail at http://framework.zend.com/manual/en/zend.mail.attachments.html they use ENCODING_8BIT, but searching for what that might be led me to http://msdn.microsoft.com/en-us/library/ms526992%28EXCHG.10%29.aspx where (and this sounds logical to me) it is explained that 8-bit encoding does not make sense for emails.
Edit:
When I use this encoding for a mail with an attachment, I receive the mail with a corrupted attachment in my mail software (Thunderbird)
In which cases does it make sense to use ENCODING_8BIT?
As everybody said, ENCODING_8BIT represents the Content Transfer Encoding.
Basically, 8BITMIME is used for internationalization. It allows 8-bit character sets and therefore lets you send any character supported in the UTF-8 charset.
In general, non-MIME mailers send 8-bit data but do not include any
MIME headers to mark the message as 8-bit data. MIME mailers should
cope with this without any problems. [source]
So basically there is not really a case where it makes sense to use ENCODING_8BIT over another encoding, since UTF-8 emails are standard today. Also, note that most MTAs (Message Transfer Agents, such as Postfix) automatically force the encoding to 8BITMIME (UTF-8).
Here is a good resource about the 8BITMIME encoding.
The 8BITMIME extension has two effects in practice:
The client will avoid Q-P conversion.
The client may add extra information at the end of a MAIL request: a space followed by either "BODY=7BIT" or "BODY=8BITMIME".
Zend_Mime::ENCODING_8BIT sets the Content-Transfer-Encoding.
The Content-Transfer-Encoding defines methods for representing binary data in ASCII text format.
The use of Zend_Mime::ENCODING_8BIT in the example is a bug.
For sending attachments you should always use Zend_Mime::ENCODING_BASE64.
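The Zend example is PHP, but the same idea can be illustrated with Python's standard email package (an analogy only, not Zend_Mail's API): binary attachment data is base64-encoded so that the serialized message contains nothing but 7-bit ASCII. The subject, file name and payload below are made up.

```python
from email.message import EmailMessage

payload = bytes(range(256))  # arbitrary binary data, including 8-bit octets

msg = EmailMessage()
msg["Subject"] = "Report"
msg.set_content("See the attached file.")
# For bytes input, add_attachment uses base64 as the transfer encoding
# unless another one is requested.
msg.add_attachment(payload, maintype="application",
                   subtype="octet-stream", filename="data.bin")

part = next(msg.iter_attachments())
print(part["Content-Transfer-Encoding"])      # base64
print(all(b < 0x80 for b in msg.as_bytes()))  # True
print(part.get_content() == payload)          # True
```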
Not for email, but for attachments. If you take a look at RFC 2045, page 7:
RFC2045
"Binary data" refers to data where any
sequence of octets whatsoever is
allowed.

RS-232C and Email in 7bit char set

The book "Designing Embedded Hardware" in the chapter "9.3. Old Faithful: RS-232C" mentions that emails are still sent in 7bit char set because of RS-232C:
It's also not unheard of to see
RS-232C systems still using 7-bit data
frames (another leftover from the
'60s), rather than the more common
8-bit. In fact, this is one of the reasons why you'll
still see email being sent on the
Internet limited to a 7-bit character
set, just in case the packets happen
to be routed via a serial connection
that supports only 7-bit
transmissions.
How can I confirm the observation?
Check out the spec. The original RFC 822, for ARPA Internet Text Messages, explicitly states:
A message consists of header fields
and, optionally, a body. The body is
simply a sequence of lines containing
ASCII characters.
Since ASCII is 7-bit, voila.
Note, however, that there are a whole bunch of additions to that original spec, all the MIME extensions, which allow message header extensions for non-ASCII text.
The Quoted-printable MIME encoding is specifically designed to encode 8-bit data in 7-bit characters. This encoding is widely used to encode email.
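As a rough illustration of that encoding, here is a sketch using Python's standard quopri module. The sample text is made up; the point is that every octet of the encoded form fits in 7 bits while the original contains 8-bit octets.

```python
import quopri

original = "caf\u00e9".encode("utf-8")  # ends with the 8-bit octets C3 A9
encoded = quopri.encodestring(original)

print(encoded)                                    # 8-bit octets become =XX escapes
print(all(b < 0x80 for b in encoded))             # True: safe for a 7-bit path
print(quopri.decodestring(encoded) == original)   # True: the encoding is reversible
```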
Note also that the text you quoted says "in case the packets happen to be routed via a serial connection" which is misleading, especially if they're talking in a context of IP packets. IP packets assume an 8-bit data path, and cannot be sent directly over a 7-bit RS-232 link without additional encoding (and then it's not a 7-bit data path anymore, it's 8-bit).
The systems that were restricted to 7 bits were already old when email first became popular. The chances that you will find one today approach zero.
Since certain characters have special meaning to email programs (most notably the end-of-line character), it still makes sense to limit the character set.

How can I figure out what code page I am looking at?

I have a device with some documentation on how to send it text. It uses 0x00-0x7F to send 'special' characters like accented characters, euro signs, ...
I am guessing they copied an existing code page and made some changes, but I have no idea how to figure out what code page is closest to the one in my documentation.
In theory, this should be easy to do. For example, they map Á to 0x41, so if I could find some way to go through all code pages and find the ones that have this character on that position, it would be a piece of cake.
However, all I can find on the internet are links to code page dumps just like the one I'm looking at, or software that uses heuristics to read text and guess the most likely code page. Surely someone out there has made it possible to look up what code page one is looking at?
If it uses 0x00 to 0x7F for the "special" characters, how does it encode the regular ASCII characters?
In most of the charsets that support the character Á, its codepoint is 193 (0xC1). If you subtract 128 from that, you get 65 (0x41). Maybe your "codepage" is just the upper half of one of the standard charsets like ISO-8859-1 or windows-1252, with the high-order bit set to zero instead of one (that is, subtracting 128 from each one).
If that's the case, I would expect to find a flag you can set to tell it whether the next bunch of codepoints should be converted using the "upper" or "lower" encoding. I don't know of any system that uses that scheme, but it's the most sensible explanation I can come with for the situation you describe.
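The "subtract 128" hypothesis can be tested mechanically. The sketch below, in Python, walks over the codecs that ship with the interpreter and keeps those that agree with the device's documented table once 0x80 is added back to each byte. The helper name candidate_codecs is made up, the only data point used is 0x41 = Á from the question, and Python's bundled codecs are just a subset of all code pages in existence.

```python
import encodings.aliases


def candidate_codecs(device_table, offset=0x80):
    """List Python codecs that agree with every (byte -> char) entry in
    device_table once `offset` is added to the device byte.

    Hypothetical helper; device_table comes from the device's documentation.
    """
    matches = []
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            ok = all(bytes([b + offset]).decode(name) == ch
                     for b, ch in device_table.items())
        except Exception:
            continue  # codec unavailable, not a text encoding, or byte undefined
        if ok:
            matches.append(name)
    return matches


# The single data point from the question: the device maps 0x41 to "Á".
print(candidate_codecs({0x41: "\u00c1"}))
```

With more entries from the documentation in device_table the list should shrink quickly; calling it with offset=0 checks the table against the code pages as they are.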
There is no way to auto-detect the codepage without additional information. Below the display layer it’s just bytes and all bytes are created equal. There’s no way to say “I’m a 0x41 from this and that codepage”, there’s only “I’m 0x41. Display me!”
What endian is the system? Perhaps you're flipping bit orders?
In most codepages, 0x41 is just the normal "A"; I don't think any standard codepage has "Á" in that position. There could be a control character somewhere before the A that adds the accent, or the device could use a non-standard codepage.
I don't see any use in knowing the "closest codepage", you just need to use the docs you got with the device.
Your last sentence is puzzling, what do you mean by "possible to look up what code page one is looking at"?
If you include your whole codepage, people here on SO could be more helpful and give you more insight about this issue, having one data point 0x41=Á doesn't help much.
Somewhat random idea, but if you can replicate a significant amount of the text off the device, you could try running it through something like the detect function in http://chardet.feedparser.org/.