Convert 7bit text to plain text perl


I am parsing an email message and found a part with the encoding 7bit.
How can I convert the text of this part to plain text?
I am using Perl.

Content-Transfer-Encoding: 7bit means that the text is already plain old ASCII text. No conversion is necessary. (Well, unless the Content-Type header declares a non-ASCII-based charset, but those are pretty rare, especially with 7bit text.)
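As a minimal sketch of why no decoding step is needed, the following parses a small message and reads its 7bit body. It is written in Python rather than Perl purely for illustration, and the sample message is a hypothetical one, not taken from the question.

```python
import email

# Hypothetical sample message; the body is declared as 7bit US-ASCII.
RAW = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "Hello, world.\n"
)

msg = email.message_from_string(RAW)

# decode=True undoes the transfer encoding; for 7bit there is nothing to undo,
# so the bytes come back exactly as they appear in the message.
body_bytes = msg.get_payload(decode=True)

# The only remaining step is applying the charset from Content-Type.
text = body_bytes.decode(msg.get_content_charset() or "ascii")
print(text)
```

The charset lookup is the part that matters in the rare case mentioned above: a 7bit body can still be in a non-ASCII-based charset, and then the bytes need that charset to become text.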

It sounds like you have UU-encoded data (an older method) or base64-encoded MIME data. To decode those, you can use the Convert::UU and MIME::Base64 CPAN modules respectively.
To use MIME::Base64 (or its pure-Perl implementation, MIME::Base64::Perl):
use MIME::Base64::Perl;                   # provides decode_base64, like MIME::Base64
my $decoded = decode_base64($encoded);    # $encoded holds the base64 text of the part
How do you tell the difference?
Modern MIME-encoded text looks like the sample below. Pay particular attention to the MIME-Version: header, which tells you the message is MIME-encoded, and to the Content-Transfer-Encoding header, which names the encoding. If it is not base64, you need a different CPAN module:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="frontier"
This is a message with multiple parts in MIME format.
--frontier
Content-Type: text/plain
This is the body of the message.
--frontier
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64
UU-encoded text would look something like:
begin 644 cat.txt
#0V%T
`
end
If the encoded data looks different from both of the above samples, please post the exact format so we can determine what it is.
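The distinction described above can be sketched in code. This is an illustrative Python version (not the Perl modules named in the answer): the helper name decode_part and its arguments are hypothetical, and it only handles the two formats discussed here.

```python
import base64
import binascii

def decode_part(body, transfer_encoding=None):
    """Decode a body that is uuencoded or base64 (hypothetical helper)."""
    lines = body.strip().splitlines()
    if lines and lines[0].startswith("begin "):
        # uuencoded data: skip the "begin <mode> <name>" line, stop at "end".
        out = bytearray()
        for line in lines[1:]:
            if line.strip() == "end":
                break
            out += binascii.a2b_uu(line + "\n")
        return bytes(out)
    if (transfer_encoding or "").lower() == "base64":
        return base64.b64decode(body)
    # 7bit, 8bit and similar: nothing to decode.
    return body.encode("ascii")

# The uuencoded sample from the answer above.
UU_SAMPLE = "begin 644 cat.txt\n#0V%T\n`\nend\n"
print(decode_part(UU_SAMPLE))           # b'Cat'
print(decode_part("Q2F0\n", "base64"))  # b'Cat'
```

A uuencoded block announces itself with its begin line, whereas base64 is only identifiable from the Content-Transfer-Encoding header, which is why the helper takes that header value as a separate argument.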

Related

What is the proper non ascii (UTF-8) email message text encoding in a multipart message

I would like to send an email message with attachments. So the message should be multipart.
The encoded message body looks like:
From: another@address
To: an@address
Subject: Unimportant message
Content-Type: multipart/mixed; boundary="----=_Part_0_1457006650.1670256299458"
...
------=_Part_0_1457006650.1670256299458
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Hello,
=E2=80=9ESomething here=E2=80=9C
------=_Part_0_1457006650.1670256299458
Content-Type: image/png; name=sample.png
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=sample.png
Content-Description: sample.png
iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAIAAAACUFjqAAABhGlDQ1BJQ0MgcHJvZmlsZQAAKJF9
...
FElEQVQY02P8z8WABzAxMIxKYwIATTQBHSBDi6AAAAAASUVORK5CYII=
------=_Part_0_1457006650.1670256299458--
The web version of MS Outlook and Gmail do not display the non-ASCII characters correctly; they show a placeholder for each UTF-8 byte in the message body.
I tried different Content-Transfer-Encoding values and different multipart layouts (multiple top-level sections, and parts nested inside the root multipart), but the result seems the same: Web Outlook does take the transfer encoding into account, but does not recognize the message text encoding as UTF-8.
Is this a problem with Outlook Web? Or should the message carry some additional meta information or use a different multipart layout?
I could probably use HTML, but the message is generated from a template, so all non-ASCII symbols would have to be converted to entities automatically, if I understand correctly. We do not need any formatting beyond a plain text message, so that option seems over-complicated.
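As a sanity check on the sample, the quoted-printable line can be decoded by hand. The Python sketch below is an illustration only and does not diagnose the Outlook behaviour; it merely suggests that the bytes of the text part, as shown, are valid UTF-8.

```python
import quopri

# The quoted-printable line from the sample message above.
encoded = b"=E2=80=9ESomething here=E2=80=9C"

# Undo the transfer encoding first, then apply the declared charset.
raw_bytes = quopri.decodestring(encoded)
text = raw_bytes.decode("utf-8")

# text now starts with U+201E and ends with U+201C (low and high double quotes).
print(len(raw_bytes), len(text))
```

Each quotation mark occupies three bytes in UTF-8 but is a single character once decoded, which matches the symptom of one placeholder per byte when a client skips the charset step.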

Can an email header have different character encoding than the body of the email?

Is an email with different character encodings for its header and body valid?
The use case: while processing an email, should I check the character encoding of its header separately, or will checking that of its body be sufficient?
Can someone guide me as to how to figure this out?
Thanks in advance!

Email headers should use the ASCII charset. If you want header fields to carry a different encoding, you need to use the encoded-word syntax: http://en.wikipedia.org/wiki/MIME#Encoded-Word
The email body can be sent directly in a different encoding only if the mail servers that transfer it support the 8BITMIME extension (nowadays every mail server should, but it's not guaranteed); otherwise you need to apply a transfer encoding to the body (quoted-printable or base64).
The charset can differ in each case: every encoded word can use a different charset, and every mail part can use a different charset and even a different transfer encoding.
For example you can have:
Subject: =?UTF-8?Q?Zg=C5=82oszenie?= //header value in UTF-8 encoded with quoted printable
and the body encoded:
Content-Type: text/plain; charset="iso-8859-2"
Content-Transfer-Encoding: base64
WmG/87PmIEfqtmyxIEphvPE=
Different charsets and different transfer encodings in the same email are no problem.
From experience I can tell you that such mails are very common. Even worse, you can get an email that states one charset in Content-Type header and another charset in html body meta tag:
Content-Type: text/html; charset="iso-8859-2"
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
It's up to you to guess the actual charset used. Probably it's the one in the meta tag.
Assume nothing. Expect everything. Take no prisoners. This is Sparta.
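To make the example concrete, here is a small Python sketch that decodes the Subject header and the body shown above, each with its own charset. The helper name decode_header_value is hypothetical; the two encoded strings are copied from the example.

```python
import base64
from email.header import decode_header

def decode_header_value(value):
    """Decode RFC 2047 encoded words, each with its own charset."""
    pieces = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            pieces.append(data.decode(charset or "ascii"))
        else:
            pieces.append(data)  # plain, unencoded text
    return "".join(pieces)

# Header: UTF-8 wrapped in the Q (quoted-printable-like) encoding.
subject = decode_header_value("=?UTF-8?Q?Zg=C5=82oszenie?=")

# Body: ISO-8859-2 text wrapped in base64.
body = base64.b64decode("WmG/87PmIEfqtmyxIEphvPE=").decode("iso-8859-2")

print(len(subject), len(body))
```

The charset of the header comes from the encoded word itself, while the charset of the body comes from its Content-Type header, so the two have to be looked up separately.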

Is Content-Transfer-Encoding needed for multipart/alternative Content-Type?

I have an app that sends e-mails, and for many months it worked fine. I recently had problems with UTF-8 encoded emails sent to an iPhone Exchange account (i.e. not IMAP).
All the receiver could see was a big bunch of meaningless characters like LS0tLS0tPV9QYXJ0XzE0N18[....].
Comparing my email headers with Gmail, I noticed that I had an extra Content-Transfer-Encoding associated with the Content-Type: multipart/alternative;.
My email would look like
Delivered-To: ...
Received: ...
...
MIME-Version: ...
Content-Type: multipart/alternative;
boundary="----=_boundary"
Content-Transfer-Encoding: Base64 # <= the extra setting
----=_boundary
Content-type: text/plain; charset=utf-8
Content-Transfer-Encoding: Base64
YmVu0Cg==
----=_boundary
Content-type: text/html; charset=utf-8
Content-Transfer-Encoding: Base64
PGh0bWwgeG1sbnM6bz0iIj48aGVhZD48dGl0bGU+PC90aXRsZT48L2hlYWQ+PGJvZHk+YmVub2l0
PC9ib2R5PjwvaHRtbD4NCjx9IjAiIC8+Cg==
----=_boundary
If I remove the extra setting, my email is received and displayed properly.
My questions:
Is the encoding setting actually needed with Content-Type: multipart/alternative;, even in specific cases?
Is it safe to remove it and just keep using my app as I used to?
Edit
I found on IETF:
Encoding considerations: Multipart content-types cannot have encodings.
But I also found on Wikipedia:
The content-transfer-encoding of a multipart type must always be
"7bit", "8bit" or "binary" to avoid the complications that would be
posed by multiple levels of decoding.
Isn't that contradictory?

The statements from the IETF and Wikipedia aren't really contradictory. 7bit, 8bit, and binary aren't really content encodings, in that they don't specify any transformation of the content. They simply state that the content hasn't been encoded. 7bit additionally asserts that the content doesn't need to be encoded even if the message is sent over a transport that isn't 8-bit clean.
Only the bottom-most layers of a message (the leaf parts) should carry an actual Content-Transfer-Encoding such as base64 or quoted-printable. In the message you quote, the outer multipart portion certainly isn't base64 encoded, so declaring that it is both violates the standard and is factually incorrect. That would certainly be expected to cause problems with the display of the message.

In practice, each part of a multipart has its own encoding, and it doesn't make sense to declare one for the multipart container. I cannot make sense of the Wikipedia quote anyway; in any event, it is hardly authoritative.
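The rule discussed in these answers can be checked mechanically. The following Python sketch walks a parsed message and reports multipart containers that declare anything other than 7bit, 8bit, or binary. The function name and the sample message are hypothetical, modelled loosely on the message in the question.

```python
import email

NO_OP_ENCODINGS = {"7bit", "8bit", "binary"}

def bad_multipart_encodings(raw_message):
    """List (content type, encoding) for multipart containers that
    declare a real transfer encoding such as base64."""
    msg = email.message_from_string(raw_message)
    problems = []
    for part in msg.walk():
        if not part.is_multipart():
            continue  # leaf parts may use base64 or quoted-printable
        cte = str(part.get("Content-Transfer-Encoding", "7bit")).strip().lower()
        if cte not in NO_OP_ENCODINGS:
            problems.append((part.get_content_type(), cte))
    return problems

# Hypothetical message; {outer_cte} is where the extra header goes.
TEMPLATE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "{outer_cte}"
    "\n"
    "--b1\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: Base64\n"
    "\n"
    "YmVub2l0\n"
    "--b1--\n"
)
BROKEN = TEMPLATE.format(outer_cte="Content-Transfer-Encoding: Base64\n")
FIXED = TEMPLATE.format(outer_cte="")

print(bad_multipart_encodings(BROKEN))  # [('multipart/alternative', 'base64')]
print(bad_multipart_encodings(FIXED))   # []
```

The base64 header on the text/plain leaf part is left alone in both cases; only the container is flagged.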

MS Entourage 2008 and quoted-printable encoding

I need to send an HTML email. All email clients (Outlook, Thunderbird, ...) except Entourage can receive and read this email without major problems. Entourage, though, breaks the content and displays just a few lines from the beginning.
My guess is that it has something to do with how Entourage handles quoted-printable encoding. The important headers of the email are set as follows:
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
The same behaviour occurs in Entourage when the email is sent as multipart/alternative with a plain-text alternative.
The content of the email is displayed up to the point where the sequence =00 occurs (an encoded NUL?).
Is this a bug in Entourage? Or am I doing something wrong?

The problem is indeed those =00 sequences. Before applying quoted-printable encoding to the email, remove all null characters from it:
$str = preg_replace('/\x00+/', '', $str);
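For comparison, the same idea can be sketched in Python. The helper name strip_nuls and the sample body are hypothetical; the point is only to show where the =00 sequence comes from and that it disappears once the NUL bytes are removed before encoding.

```python
import quopri

def strip_nuls(data):
    """Remove NUL bytes before quoted-printable encoding."""
    return data.replace(b"\x00", b"")

# Hypothetical HTML body containing a stray NUL byte.
html = b"<p>Hello</p>\x00<p>World</p>"

encoded_raw = quopri.encodestring(html)
encoded_clean = quopri.encodestring(strip_nuls(html))

print(b"=00" in encoded_raw)    # True
print(b"=00" in encoded_clean)  # False
```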

ATTnnnnn.txt attachments when e-mail is received in Outlook

I've written an SMTP client that sends e-mails with attachments. Everything's fine except that when an e-mail sent by my program is received by Outlook, it displays two attachments: the file actually sent, and a file named ATT?????.txt that contains just the two characters CR and LF.
I searched and found many reports of similar problems, and checked everything I could. I even compared two emails, one sent by my program and one sent by Opera, and I can't find the difference. Yet what Opera sends is interpreted correctly and what my program sends is not. Other mail clients interpret my program's messages correctly; only Outlook does not.
I telnetted to the SMTP server, retrieved the two emails into a text file (one from my program, another from Opera), and compared them side by side. I didn't see any difference that could affect interpretation by an email client.
Here's a sample message (addresses substituted, file contents cropped, blank lines exactly as they appear in real messages, lines never exceed 80 characters):
To: user1@host.com, user2@host.com
Subject: subject
Content-Type: multipart/mixed; boundary="------------boundary"
MIME-Version: 1.0
--------------boundary
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
here goes the Base64 encoded text part - it may be localized, so
it's better to UTF8 it and do Base64
--------------boundary
Content-Disposition: attachment; filename="file.jpg"
Content-Type: application/octet-stream; name="file.jpg"
Content-Transfer-Encoding: base64
here goes the Base64 encoded file data
--------------boundary
I tried varying the number of line breaks after the last boundary (none, one, two, three), but this doesn't improve the situation.
Is there some set of unusual constraints that a mail client must follow to produce messages that Outlook interprets correctly?

The closing boundary of a multipart message must be marked by appending two dashes to it:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------boundary"
--------------boundary
...
--------------boundary
...
--------------boundary--
Further reading: RFC 1341, section 7.2, The Multipart Content-Type (since superseded by RFC 2046, section 5.1).
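A sender can check for this before transmitting. The Python sketch below is a minimal illustration, assuming the whole message is available as a string; the function name and the sample messages are hypothetical, with the boundary string borrowed from the question.

```python
import email

def has_closing_boundary(raw_message):
    """Return True if the top-level multipart ends with --boundary--."""
    msg = email.message_from_string(raw_message)
    boundary = msg.get_boundary()
    if boundary is None:
        return False  # not a multipart message
    closing = "--" + boundary + "--"
    return any(line.rstrip() == closing for line in raw_message.splitlines())

BOUNDARY = "------------boundary"
HEAD = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="' + BOUNDARY + '"\n'
    "\n"
    "--" + BOUNDARY + "\n"
    "Content-Type: text/plain\n"
    "\n"
    "text\n"
)
# As in the question: the last delimiter lacks the trailing dashes.
UNTERMINATED = HEAD + "--" + BOUNDARY + "\n"
TERMINATED = HEAD + "--" + BOUNDARY + "--\n"

print(has_closing_boundary(UNTERMINATED))  # False
print(has_closing_boundary(TERMINATED))    # True
```

Without the closing form, a plain delimiter at the end opens one more (empty) part, which is consistent with the extra ATT?????.txt attachment described in the question.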
