The example for Zend_Mail at http://framework.zend.com/manual/en/zend.mail.attachments.html uses ENCODING_8BIT. Searching for what that might be led me to http://msdn.microsoft.com/en-us/library/ms526992%28EXCHG.10%29.aspx, where it is explained (and this sounds logical to me) that 8-bit encoding does not make sense for email.
Edit:
When I use this encoding for a mail with an attachment, the attachment arrives corrupted in my mail software (Thunderbird).
In which cases does it make sense to use ENCODING_8BIT?
As everybody said, ENCODING_8BIT represents the Content-Transfer-Encoding.
Basically, 8BITMIME is used for internationalization. It uses an 8-bit character set and therefore allows you to send any character supported by the UTF-8 charset.
In general, non-MIME mailers send 8-bit data but do not include any MIME headers to mark the message as 8-bit data. MIME mailers should cope with this without any problems. [source]
So basically there is not really a case where it makes sense to use ENCODING_8BIT over another encoding, since emails in UTF-8 are standard today. Also note that most MTAs (Message Transfer Agents, such as Postfix) automatically force the encoding to 8BITMIME (UTF-8).
Here is a good resource about the 8BITMIME encoding.
The 8BITMIME extension has two effects in practice:
The client will avoid Q-P conversion.
The client may add extra information at the end of a MAIL request: a space followed by either "BODY=7BIT" or "BODY=8BITMIME".
Zend_Mime::ENCODING_8BIT sets the Content-Transfer-Encoding.
The Content-Transfer-Encoding defines methods for representing binary data in ASCII text format.
The use of Zend_Mime::ENCODING_8BIT in the example is a bug.
For sending attachments you should always use Zend_Mime::ENCODING_BASE64.
Not for email but for attachments. If you take a look at RFC 2045, page 7:
RFC2045
"Binary data" refers to data where any
sequence of octets whatsoever is
allowed.
Related
I have seen many resources about the uses of Base64 in today's internet. As I understand it, all of those resources seem to spell out a single use case in different ways: encode binary data in Base64 to avoid it getting misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains the following:
Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to a client, any intermediate servers/systems/routers will simply forward the data to the next appropriate servers/systems/routers on the path to the client. Why would intermediate servers/systems/routers need to interpret something they receive? Are there any examples of such systems that may corrupt or misinterpret the data they receive in today's internet?
Why do we fear that only binary data will be corrupted? We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by the same logic, any text character that does not belong to the Base64 alphabet could be corrupted/misinterpreted. Why, then, is Base64 used only to encode binary data? Extending the same idea, when we use a browser, are JavaScript and HTML files transferred in Base64 form?
There are two reasons why Base64 is used:
systems that are not 8-bit clean. This stems from "the before time" where some systems took ASCII seriously and only ever considered (and transferred) 7 bits out of any 8-bit byte (since ASCII uses only 7 bits, that would be "fine", as long as all content was actually ASCII).
systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).
Both of these would have similar effects when transferring binary (i.e. non-text) data over them: they would try to interpret the binary data as textual data in a character encoding that obviously doesn't make sense (since there is no character encoding in binary data) and as a consequence modify the data in an unfixable way.
Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
As to your second question, the answer is two-fold:
textual data often comes with some kind of meta-data (either an HTTP header or a meta tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand and either tolerate or interpret those tags.
in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangled. This might be the use of quoted-printable encoding or sometimes even wrapping text data in Base64.
Last but not least, Base64 has a serious drawback: it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, thus increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
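To make the overhead concrete, here is a minimal sketch in Go (any language with a Base64 library behaves the same way); the sample bytes are arbitrary and only chosen to include values a text-based transport would mangle:

package main

import (
    "bytes"
    "encoding/base64"
    "fmt"
)

func main() {
    // Arbitrary binary data, including NUL, CR, LF and bytes with the 8th bit set,
    // all of which could be mangled by a transport that assumes text.
    raw := []byte{0x00, 0x0A, 0x0D, 0x7F, 0x80, 0xFF, 'f', 'o', 'o'}

    // The encoded form uses only A-Z, a-z, 0-9, '+', '/' and '=' padding.
    encoded := base64.StdEncoding.EncodeToString(raw)
    fmt.Println(encoded)

    // Every 3 input bytes become 4 output bytes, so ~33% larger.
    fmt.Printf("input: %d bytes, output: %d bytes\n", len(raw), len(encoded))

    // Decoding restores the original bytes exactly.
    decoded, err := base64.StdEncoding.DecodeString(encoded)
    if err != nil {
        panic(err)
    }
    fmt.Println(bytes.Equal(decoded, raw)) // true
}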
One of the uses of Base64 is sending email.
Mail servers used a terminal to transmit data. It was also common to have line-ending translation, e.g. \r\n into a single \n and vice versa. Note: there was also no guarantee that 8-bit data could be used (the email standard is old, and it also allowed non-"internet" email, with ! addressing instead of @). Also, systems might not be fully ASCII.
Also, a "." on a line by itself is treated as the end of the body, and mbox files use a line starting with "From " to mark the start of a new mail (hence the ">From " escaping), so even when the 8-bit flag was common in mail servers, the problems were not totally solved.
Base64 was a good way to remove all of these problems: the content is just sent as characters that all servers must know, and encoding/decoding requires only agreement between sender and receiver (and the right programs), without worrying about the many relay servers in between. Note: all \r, \n, etc. are simply ignored when decoding.
Note: you can also use Base64 to encode strings in URLs, without worrying about how web browsers interpret them. You may also see Base64 in configuration files (e.g. to embed icons): specially crafted images cannot then be misinterpreted as configuration. In short, Base64 is handy for encoding binary data into protocols that were not designed for binary data.
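As a small follow-up sketch in Go: the standard Base64 alphabet uses '+' and '/', which are special in URLs, so the URL case typically uses a URL-safe variant that substitutes '-' and '_':

package main

import (
    "encoding/base64"
    "fmt"
)

func main() {
    data := []byte{0xFB, 0xEF, 0xFE} // chosen so the standard alphabet emits '+' and '/'

    fmt.Println(base64.StdEncoding.EncodeToString(data)) // "++/+"
    fmt.Println(base64.URLEncoding.EncodeToString(data)) // "--_-" (safe to drop into a URL)
}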
So I don't really want this question to be language-specific; however, I suspect Go (my language of choice) is playing a part here.
I'm trying to find a string within the body of a raw email. To do so, I am getting the encoding, and the majority of cases are quoted-printable.
OK, so that's fine: I encode my search query as quoted-printable and then search for it. That works.
However, in one specific case, the raw email I see in Gmail looks fine, but when I retrieve the raw email from the Gmail API, although the encoding and everything else is identical, it encodes the " as =22.
Research shows me that's because the charset is utf-8.
I haven't quite got my head around whether that's encoded as UTF-8 then quoted-printable or the other way around, but that's not quite the question either...
If I look at the email where the " is =22, I see the charset is utf-8, and when I look at another where it's not encoded, the charset is UTF-8 (notice the case). I can't believe that the letter case is what's causing this to happen, but it doesn't seem like a robust enough way to work out whether =22 is actually a literal =22 or an encoded ".
My original thought was to always decode the quoted-printable and then re-encode it before doing the search, but I don't think this is going to be a robust approach going forward, and thought others might have a better suggestion.
Conclusion: I'm trying to find a string in a raw email, but the encoding is causing me problems getting my search string to match the encoding of the body.
The =22-type encoding actually has nothing to do with the charset (whether that is utf-8 lowercase or UTF-8 uppercase or any other charset).
It is the Content-Transfer-Encoding: quoted-printable encoding.
The quoted-printable encoding is just a way of hex-encoding octets, typically limited to octets that fall outside of the printable ASCII range. It may seem odd that the DQUOTE character would be encoded in this way, but it's perfectly legal to do so.
If you want to search for strings in the body of the message, you'll need to first decode the body of the message. Otherwise you will not be successful.
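Since the question mentions Go, here is a rough sketch of that decode-then-search step using the standard mime/quotedprintable package (the sample body is made up; extracting the body and its Content-Transfer-Encoding from the full message is assumed to happen elsewhere):

package main

import (
    "fmt"
    "io"
    "mime/quotedprintable"
    "strings"
)

func main() {
    // Body as it appears on the wire: '"' may legally be encoded as =22,
    // and a trailing '=' is a soft line break splitting a long line.
    rawBody := "He said =22hello=\r\n world=22 and left."

    // Decode the quoted-printable transfer encoding first...
    decoded, err := io.ReadAll(quotedprintable.NewReader(strings.NewReader(rawBody)))
    if err != nil {
        fmt.Println("decode error:", err)
        return
    }

    // ...then search the decoded text instead of trying to encode the query.
    fmt.Println(string(decoded))                                    // He said "hello world" and left.
    fmt.Println(strings.Contains(string(decoded), `"hello world"`)) // true
}

Searching the decoded text side-steps the question of whether a particular mailer chose to escape a given character, since all legal escapings decode to the same octets.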
I would recommend reading rfc2045 at a minimum.
You may also need to end up reading rfc2047 if you end up wanting to search headers at some point, but that gets... tricky due to various bugs that sending clients have.
Now that I've been "triggered" into a rant about MIME, let me explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.
Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:
First off, there are technically two variations of header encoding formats specified by rfc2047: one for phrases and one for unstructured text fields. They are very similar, but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, most MIME generators probably never even heard of the specifications to begin with, it seems.
This brings us to:
There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
a. encoded-word tokens illegally being embedded in other word tokens
b. encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
c. multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
d. the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in a different character encoding (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.
If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.
If you want to see how to deal with these issues, you can take a look at how I did it using C in my GMime library, here: https://github.com/jstedfast/gmime/blob/master/gmime/gmime-utils.c#L1894 (in case line offsets change in the future, look for _g_mime_utils_header_decode_text() and the various internal methods it uses in that source file - I have written comments explaining how it deals with the above issues).
Or you can see how I did it using C# in my MimeKit library, here: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/Utils/Rfc2047.cs
For more information about why and how dealing with email is hard, check out Joshua Cranmer's blog series: http://quetzalcoatal.blogspot.com/search/label/email-hard
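For contrast, here is a minimal Go sketch of the well-formed case using the standard library's mime.WordDecoder; it only copes with correctly formed encoded-words (and, out of the box, only the us-ascii, iso-8859-1 and utf-8 charsets unless you plug in a CharsetReader), none of the breakage catalogued above:

package main

import (
    "fmt"
    "mime"
)

func main() {
    dec := new(mime.WordDecoder)

    headers := []string{
        "Booking =?UTF-8?B?Q2Fmw6k=?= confirmed", // base64 payload
        "=?ISO-8859-1?Q?men=FC_of_the_day?=",     // quoted-printable payload, '_' means space
    }
    for _, h := range headers {
        out, err := dec.DecodeHeader(h)
        if err != nil {
            fmt.Println("decode error:", err)
            continue
        }
        fmt.Println(out)
    }
}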
I've got an application that sends an email notification. When the email is generated, it includes the following in the mime source:
Content-Type: text/plain;
charset="ISO8859-1"
Content-Transfer-Encoding: quoted-printable
I've noticed that other email programs and open-source conversion tools (like iconv) don't support that specific spelling and instead require "ISO-8859-1".
I don't see "ISO8859-1" specifically listed on the IANA character set list: https://www.iana.org/assignments/character-sets/character-sets.xhtml
So my question is:
Is ISO8859-1 an acceptable variant name of ISO-8859-1, and is there some sort of RFC or standard available to definitively "prove" it one way or the other?
The IANA registry mentioned in the question cites RFC 2978, which in turn cites several RFCs which define how character encodings are to be specified in the Internet. Thus, since ISO8859-1 is not listed there, it is not correct to use it.
Programs may still accept it, as part of their error recovery, but they are not required to do so. Programs may do better error recovery, upon encountering an undefined character encoding name, by inspecting the actual content of text data and trying to make a guess on the encoding. Or they may simply fall back to some default encoding they use, and this may well be ISO-8859-1 (or, in fact, more often windows-1252).
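As an illustration of that kind of error recovery, here is a hypothetical Go helper (the normalization rule, the tiny alias table and the windows-1252 fallback are my own assumptions for the sketch, not anything mandated by the RFCs or the IANA registry):

package main

import (
    "fmt"
    "strings"
)

// canonicalCharset lowercases a charset label, drops everything that is not a
// letter or a digit, and looks the result up in a small alias table.
// Real MIME libraries keep a much larger table or consult the IANA registry.
func canonicalCharset(label string) (string, bool) {
    var b strings.Builder
    for _, r := range strings.ToLower(label) {
        if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
            b.WriteRune(r)
        }
    }
    aliases := map[string]string{
        "iso88591":    "ISO-8859-1",
        "latin1":      "ISO-8859-1",
        "utf8":        "UTF-8",
        "windows1252": "windows-1252",
    }
    name, ok := aliases[b.String()]
    return name, ok
}

func main() {
    for _, label := range []string{"ISO8859-1", "iso_8859-1", "Latin1", "bogus"} {
        if name, ok := canonicalCharset(label); ok {
            fmt.Printf("%-12s -> %s\n", label, name)
        } else {
            fmt.Printf("%-12s -> unknown, fall back to windows-1252\n", label)
        }
    }
}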
Alright. I thought this problem had something to do with my Rails app, but it seems to have to do with the deeper workings of email attachments.
I have to send out a CSV file from my Rails app to a warehouse that fulfills orders placed in my store. The warehouse has a format for the CSV, and ironically the header line of the CSV file is super long (1000+ characters).
I was getting a line break in the header line of the CSV file when I received the test emails and couldn't figure out what put it there. However, some googling finally showed the reason: attached files have a line character limit of 1000. Why? I don't know. It seems ridiculous, but I still have to send this CSV file somehow.
I tried manually setting the MIME type of the attachment to text/csv, but that was no help. Does anybody know how to solve this problem?
Some relevant google results : http://www.google.com/search?client=safari&rls=en&q=csv+wrapped+990&ie=UTF-8&oe=UTF-8
update
I've tried encoding the attachment in base64 like so:
attachments['205.csv'] = {:data => ActiveSupport::Base64.encode64(@string), :encoding => 'base64', :mime_type => 'text/csv'}
That doesn't seem to have made a difference. I'm receiving the email with a me.com account via Sparrow for Mac. I'll try using gmail's web interface.
This seems to be because the SendGrid mail server is modifying the attachment content. If you send an attachment with a plain text MIME type (e.g. text/csv), it will wrap the content every 990 characters, as you observed. I think this is related to RFC 2045/821:
Content-Transfer-Encoding Header Field
Many media types which could be usefully transported via email are represented, in their "natural" format, as 8bit character or binary data. Such data cannot be transmitted over some transfer protocols. For example, RFC 821 (SMTP) restricts mail messages to 7bit US-ASCII data with lines no longer than 1000 characters including any trailing CRLF line separator.
It is necessary, therefore, to define a standard mechanism for encoding such data into a 7bit short line format. Proper labelling of unencoded material in less restrictive formats for direct use over less restrictive transports is also desireable. This document specifies that such encodings will be indicated by a new "Content-Transfer-Encoding" header field. This field has not been defined by any previous standard.
If you send the attachment using base64 encoding instead of the default 7bit, the attachment remains unchanged (no added line breaks):
attachments['file.csv'] = { :data => ActiveSupport::Base64.encode64(@string), :encoding => 'base64' }
Could you have newlines in your data that would cause this? Check and see if
csv_for_orders(orders).lines.count == orders.count
If so, a quick/hackish fix might be changing where you call values_for_line_item(item) to values_for_line_item(item).map{|c| c.gsub(/(\r|\n)/, '')} (same for the other line_item calls).
The book "Designing Embedded Hardware" in the chapter "9.3. Old Faithful: RS-232C" mentions that emails are still sent in 7bit char set because of RS-232C:
It's also not unheard of to see RS-232C systems still using 7-bit data frames (another leftover from the '60s), rather than the more common 8-bit. In fact, this is one of the reasons why you'll still see email being sent on the Internet limited to a 7-bit character set, just in case the packets happen to be routed via a serial connection that supports only 7-bit transmissions.
How can I confirm the observation?
Check out the spec. The original rfc822, for ARPA Internet Text Messages, explicitly states:
A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters.
Since ASCII is 7-bit, voila.
Note, however, that there are a whole bunch of additions to that original spec, all the MIME extensions, which allow message header extensions for non-ASCII text.
The Quoted-printable MIME encoding is specifically designed to encode 8-bit data in 7-bit characters. This encoding is widely used to encode email.
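As a small sketch of what that looks like in practice, Go's standard mime/quotedprintable package turns 8-bit UTF-8 text into 7-bit-safe output (the sample text is arbitrary):

package main

import (
    "fmt"
    "mime/quotedprintable"
    "strings"
)

func main() {
    // 8-bit UTF-8 text that cannot travel over a strictly 7-bit transport as-is.
    body := "Grüße aus München: 100 € per night"

    var sb strings.Builder
    w := quotedprintable.NewWriter(&sb)
    if _, err := w.Write([]byte(body)); err != nil {
        panic(err)
    }
    if err := w.Close(); err != nil {
        panic(err)
    }

    // Every byte with the 8th bit set has been replaced by an =XX escape,
    // so the result is plain 7-bit ASCII (and long lines get soft line breaks).
    fmt.Println(sb.String())
}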
Note also that the text you quoted says "in case the packets happen to be routed via a serial connection" which is misleading, especially if they're talking in a context of IP packets. IP packets assume an 8-bit data path, and cannot be sent directly over a 7-bit RS-232 link without additional encoding (and then it's not a 7-bit data path anymore, it's 8-bit).
The systems that were restricted to 7 bits were already old when email first became popular. The chances that you will find one today approach zero.
Since certain characters have special meaning to email programs (most notably the end-of-line character), it still makes sense to limit the character set.