parsing email message - email

I just want a basic understanding of what parts an email message may have.
I know there is a Message-ID, date, subject, from, cc, bcc, body, etc.
Specifically, I want to know how attachments and images may be embedded in the email.
At this point I think there are two kinds; please correct me if I am wrong:
attachments
embedded attachments/images
Is that correct?

The official answer to this question is contained in RFC 5322 and some related RFCs. The Wikipedia entry for email does a pretty good job of referencing the RFC numbers. To get started with MIME, see RFC 2045.

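Attachments are encoded as MIME multipart, similar to multipart file uploads. Basically, the top-level Content-Type header says the message is multipart and sets a boundary (a string of characters that does not occur in the content). Each part starts at a line made of two hyphens followed by the boundary and carries its own headers. The filename is set in the part's Content-Disposition header, not on the boundary line. An embedded image is the same kind of part, usually marked Content-Disposition: inline and given a Content-ID that the HTML body refers to. I am doing a bit of hand waving, but this is the basic idea.
So you get something like:
To: ...
From: ...
Content-Type: multipart/mixed; boundary="ewafoiuasfjasdfoashiafhj"

--ewafoiuasfjasdfoashiafhj
Content-Type: text/plain

message here
--ewafoiuasfjasdfoashiafhj
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="..."

attachment here
--ewafoiuasfjasdfoashiafhj--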
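As a rough illustration of the idea above, here is a minimal sketch using Python's standard email package. The addresses, the filename report.bin and the helper names build_sample and list_parts are made up for the example; it only shows how one might list the parts of a message and see which of them is an attachment.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Build a small multipart/mixed message (all values are placeholders)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Report"
    msg.set_content("message here")
    # Adding an attachment turns the message into multipart/mixed
    # and generates the boundary automatically.
    msg.add_attachment(
        b"attachment here",
        maintype="application",
        subtype="octet-stream",
        filename="report.bin",
    )
    return msg.as_bytes()


def list_parts(raw: bytes):
    """Return (content type, disposition, filename) for every leaf part."""
    msg = message_from_bytes(raw, policy=policy.default)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers only hold other parts
        rows.append(
            (
                part.get_content_type(),
                part.get_content_disposition(),
                part.get_filename(),
            )
        )
    return rows


if __name__ == "__main__":
    for row in list_parts(build_sample()):
        print(row)
```

The body part has no Content-Disposition at all, while the attachment part reports "attachment" together with its filename.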

Related

MIME multipart emails: how do I persuade Microsoft clients to show two HTML (or even plain) parts?

Executive summary: I can construct a multipart message which contains HTML in two separate parts, and it displays correctly in multiple MUAs. However, outlook.com insists on putting any additional HTML after the first HTML part in a downloadable attachment, instead of displaying it. It also does this for plaintext parts.
In detail: I need to add a signature to an email message, where the structure of the original message is, in general, unknown. I do this by wrapping the original message in a multipart/mixed, and then adding a new multipart/alternative which contains text and html versions of the required signature.
If the original message was itself a multipart/alternative, then the new message now looks like:
multipart/mixed
    multipart/alternative [the original message]
        text/plain
        text/html
    multipart/alternative [the appended signature]
        text/plain [plaintext signature]
        text/html [html signature]
This displays well in various clients (Thunderbird, and webmail from gmail/Yahoo/AOL/gmx), showing the original message with the appended signature. However, it doesn't work for MS clients (I've only tried outlook.com). The two alternative signatures are presented to the user as attachments, and not inline, so the user only sees download boxes.
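For anyone who wants to reproduce that structure, here is a minimal sketch using Python's standard email package. The helper names alternative and wrap_with_signature are hypothetical, and the sketch ignores the step of moving From/To/Subject headers from the original message to the new outer container.

```python
from email.message import EmailMessage


def alternative(plain: str, html: str) -> EmailMessage:
    """Build a multipart/alternative part with text and HTML versions."""
    part = EmailMessage()
    part.set_content(plain)
    part.add_alternative(html, subtype="html")
    return part


def wrap_with_signature(original: EmailMessage, sig_plain: str, sig_html: str) -> EmailMessage:
    """Wrap the original body and a signature in a new multipart/mixed."""
    outer = EmailMessage()
    outer.make_mixed()  # empty multipart/mixed container
    outer.attach(original)
    outer.attach(alternative(sig_plain, sig_html))
    return outer


if __name__ == "__main__":
    original = alternative("Hello", "<p>Hello</p>")
    wrapped = wrap_with_signature(original, "-- \nExample Corp", "<p>Example Corp</p>")
    for part in wrapped.walk():
        print(part.get_content_type())
```

Printing the content types while walking the result gives the same tree as above, which makes it easy to test variations against different clients.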
To get around this, I've historically done this for Microsoft:
multipart/mixed
    multipart/alternative [the original message]
        text/plain
        text/html
    text/html [html-only signature]
This worked for several years for Microsoft, but has now stopped working - the signature is again shown as an attachment.
I've spent some hours experimenting with this, and can't find any way to get outlook.com to show two different HTML (or even plain) text parts in the same message. The second one always appears as an attachment. Some of the things I've tried are:
Replacing the second multipart/alternative above with another multipart/mixed, which encloses the multipart/alternative signature
Trying to force Content-Disposition: inline for the signature: this never works, and MS appears to ignore Content-Disposition
Replacing the outer multipart/mixed with multipart/related, with type=multipart/alternative
Any ideas on how I can get MS clients to actually show the signature, short of parsing the internals of the original message and re-writing it?

How can I distinguish between attachment types in exchangelib?

I've just noticed that Microsoft OWA does not display some attachments. Some people use images in their footer (which are attachments). I'm not sure if the only difference between a "normal" attachment and this embedded attachment is that it is embedded in the email.
Is there another difference? How can I get only the attachments which OWA* displays as attachments?
* and probably most other email clients; I think I've seen similar behavior in Google Mail
Those attachments have a content_id. They are referenced within the mail.body as cid:[CONTENT-ID]. The content_id looks like this:
cid:image001.jpg#01D3151A.F9036A80
where image001.jpg is the filename.
Looking for cid:image_name inside the mail body fails for embedded images whose src refers to a link rather than a cid.
So the best solution is to use the attachment's is_inline property, which is built into exchangelib.
from exchangelib import FileAttachment

# msg is an exchangelib Message that has already been fetched
for attachment in msg.attachments:
    if isinstance(attachment, FileAttachment):
        if attachment.is_inline:
            print("Embedded image")
        else:
            print("Normal attachment")
reference: https://github.com/ecederstrand/exchangelib/issues/562
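If you are working with raw MIME rather than exchangelib, the cid matching described above can be sketched with Python's standard email package. This is only an illustration: the regular expression, the helper names referenced_cids and split_attachments, and the demo message are assumptions, and, as noted above, cid matching misses images that are linked by URL.

```python
import re
from email.message import EmailMessage

# Matches cid:... references such as <img src="cid:logo@example.com">
CID_REF = re.compile(r"""cid:([^"'\s>)]+)""", re.IGNORECASE)


def referenced_cids(html: str) -> set:
    """Return the content IDs that an HTML body refers to."""
    return set(CID_REF.findall(html))


def split_attachments(msg: EmailMessage):
    """Return (embedded, regular) filenames based on cid references."""
    refs = set()
    for part in msg.walk():
        if part.get_content_type() == "text/html" and part.get_content_disposition() != "attachment":
            refs |= referenced_cids(part.get_content())
    embedded, regular = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue
        cid = (part["Content-ID"] or "").strip().strip("<>")
        if cid and cid in refs:
            embedded.append(part.get_filename() or cid)
        elif part.get_content_disposition() == "attachment":
            regular.append(part.get_filename())
    return embedded, regular


def build_demo() -> EmailMessage:
    """Demo message with one embedded image and one normal attachment."""
    msg = EmailMessage()
    msg.set_content("See the logo.")
    msg.add_alternative('<p>Hi <img src="cid:logo@example.com"></p>', subtype="html")
    msg.get_payload()[1].add_related(
        b"\x89PNG", maintype="image", subtype="png",
        cid="<logo@example.com>", filename="logo.png",
    )
    msg.add_attachment(b"%PDF-1.4", maintype="application", subtype="pdf", filename="report.pdf")
    return msg


if __name__ == "__main__":
    print(split_attachments(build_demo()))
```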

Can inline images be referenced from a different message in a thread?

My reading of the RFC says that you can reference inline content from other mail parts using the cid: scheme. I also know that you can use the mid: scheme in a similar way for a Message-ID. When referencing a Message-ID, you can reference mail parts of another message with mid:messageId/contentId, contentId being a valid Content-ID in the target message.
I'm leaning towards no, inline images (or other inline content) can't be referenced and displayed in entirely different messages. But if that's true, I can't piece together what the purpose of using mid: is.
A simple visualization of what I'm imagining is this:
Given a multipart message with an html body and inline image... our cid reference would look like:
<img src="cid:abcd-i-am-a-content-id">
This assumes we do in fact have a multipart/related with a mail part that has some valid image payload with a matching content-ID.
What if we were replying to this original message, can I do something like:
<img src="mid:original-message-id/abcd-i-am-a-content-id"> to inline this resource that would presumably be accessible by the client's mail store that belongs to the recipient assuming all other normal threading rules are followed?
No, there's no way to do that and expect it to work.
Even "mid:" won't work in most clients.
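For reference, the two URL forms discussed in the question are defined in RFC 2392: the angle brackets are dropped from the Content-ID or Message-ID and reserved characters are percent-encoded. A minimal sketch of building them follows. The helper names cid_url and mid_url are made up, and producing a syntactically valid mid: URL does not make any client resolve it.

```python
from urllib.parse import quote


def _strip_brackets(identifier: str) -> str:
    """Drop surrounding whitespace and the <...> wrapper used in headers."""
    return identifier.strip().strip("<>")


def cid_url(content_id: str) -> str:
    """Build a cid: URL that refers to a part of the same message."""
    return "cid:" + quote(_strip_brackets(content_id), safe="@")


def mid_url(message_id: str, content_id: str = None) -> str:
    """Build a mid: URL, optionally pointing at one part of that message."""
    url = "mid:" + quote(_strip_brackets(message_id), safe="@")
    if content_id is not None:
        # "/" separates the message ID from the content ID, so any "/"
        # inside either identifier is percent-encoded by quote().
        url += "/" + quote(_strip_brackets(content_id), safe="@")
    return url


if __name__ == "__main__":
    print(cid_url("<abcd-i-am-a-content-id>"))
    print(mid_url("<960830.1639@XIson.com>", "<partA.960830.1639@XIson.com>"))
```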

sendmailR: Submit encoded message to local SMTP server

I need your help to send an email message that includes text in Greek from within R, using the function sendmail {sendmailR}.
I tried using the function iconv, like this, but it didn't work:
subject <- iconv("text in greek", to = "CP1253")
sendmail(from, to, subject, msg, control=list(smtpServer="blabla"))
The mail arrives immediately but the greek characters are unreadable. Any ideas?
EDIT
Another question that came up:
The second argument, to, accepts only one recipient. What if I want to send the message to more than one? (I think I'll try sapply-ing the sendmail function over a vector of recipients.) OK, that worked. However, I'm not completely satisfied, because each recipient has no way of knowing who else has received the message.
A mail client won't be able to work out the encoding without a Content-Type header that carries a charset parameter, so you must add it:
msg <- iconv("text in greek", to = "UTF-8")
sendmail(from, to, subject, msg,
         control = list(smtpServer = "blabla"),
         headers = list("Content-Type" = "text/plain; charset=UTF-8; format=flowed"))
That is for UTF-8 (which I believe should be used). For CP1253:
msg <- iconv("text in greek", to = "CP1253")
sendmail(from, to, subject, msg,
         control = list(smtpServer = "blabla"),
         headers = list("Content-Type" = "text/plain; charset=CP1253; format=flowed"))
Sending to several people via hidden copies can also be done with header magic. Still, I think the sapply loop is a better idea, because then each user will see that the mail was sent directly to her or him.
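To see what a correctly labelled message looks like on the wire, here is a small Python sketch (not R, and not sendmailR) using the standard email and smtplib modules. The addresses, the host name and the helper names build_message and send are placeholders. It shows two things: the body gets a charset in its Content-Type, and a non-ASCII subject is written as an RFC 2047 encoded word, because a body charset does not apply to header fields.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender: str, recipients: list, subject: str, body: str) -> EmailMessage:
    """Build a text message whose headers and body may contain Greek."""
    msg = EmailMessage()
    msg["From"] = sender
    # All recipients are listed in To, so each of them can see the others.
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    # Non-ASCII text is labelled with charset="utf-8" automatically.
    msg.set_content(body)
    return msg


def send(msg: EmailMessage, host: str = "localhost") -> None:
    """Hand the message to an SMTP server (host is a placeholder)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    demo = build_message(
        "me@example.com",
        ["a@example.com", "b@example.com"],
        "Καλημέρα",
        "Γειά σου κόσμε",
    )
    # Print the serialized headers and body instead of sending.
    print(demo.as_bytes().decode("utf-8"))
```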

How does the email header field 'thread-index' work?

I was wondering if anyone knew how the Thread-Index field in email headers works?
Here's a simple chain of emails thread indexes that I messaged myself with.
Email 1 Thread-Index: AcqvbpKt7QRrdlwaRBKmERImIT9IDg==
Email 2 Thread-Index: AcqvbpjOf+21hsPgR4qZeVu9O988Eg==
Email 3 Thread-Index: Acqvbp3C811djHLbQ9eTGDmyBL925w==
Email 4 Thread-Index: AcqvbqMuifoc5OztR7ei1BLNqFSVvw==
Email 5 Thread-Index: AcqvbqfdWWuz4UwLS7arQJX7/XeUvg==
I can't seem to say with certainty how I can link these emails together. Normally, I would use the In-Reply-To field or the References field, but I recently found that BlackBerrys do NOT include these fields. They only include the Thread-Index field.
They are base64-encoded Conversation Index values. There is no need to reverse engineer them, as they are documented by Microsoft at e.g. http://msdn.microsoft.com/en-us/library/ms528174(v=exchg.10).aspx and in more detail at http://msdn.microsoft.com/en-us/library/ee202481(v=exchg.80).aspx
The indexes in your example don't seem to represent the same conversation, which probably means that the software that sent the mails wasn't able to link them together.
EDIT: Unfortunately I don't have enough reputation to add a comment, but adamo is right that it contains a timestamp: a somewhat esoterically encoded partial FILETIME. But it also contains a GUID, so it is pretty much guaranteed to be unique for that mail (of course, the same mail can exist in multiple copies).
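A minimal Python sketch of decoding such a value is below. It rests on assumptions about the layout: a 22-byte header whose first six bytes are treated as the high-order bytes of a FILETIME, followed by a 16-byte GUID, plus one 5-byte block per reply. The function name parse_thread_index is made up, and the byte order used to display the GUID is a guess. Values produced by some senders may not decode to a sensible date this way.

```python
import base64
import uuid
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
HEADER_LEN = 22  # assumed: 6 timestamp bytes + 16 GUID bytes
CHILD_LEN = 5    # assumed: one block per reply in the thread


def parse_thread_index(value: str) -> dict:
    """Decode a Thread-Index header value under the layout assumed above."""
    raw = base64.b64decode(value)
    if len(raw) < HEADER_LEN or (len(raw) - HEADER_LEN) % CHILD_LEN:
        raise ValueError("unexpected Thread-Index length: %d" % len(raw))
    # Pad the six stored bytes with two zero bytes to get a 64-bit FILETIME.
    ticks = int.from_bytes(raw[:6] + b"\x00\x00", "big")
    created = FILETIME_EPOCH + timedelta(microseconds=ticks // 10)
    return {
        "created": created,
        "guid": uuid.UUID(bytes=raw[6:HEADER_LEN]),
        "replies": (len(raw) - HEADER_LEN) // CHILD_LEN,
    }


if __name__ == "__main__":
    samples = [
        "AcqvbpKt7QRrdlwaRBKmERImIT9IDg==",
        "AcqvbpjOf+21hsPgR4qZeVu9O988Eg==",
    ]
    for sample in samples:
        print(parse_thread_index(sample))
```

Running it on the values from the question gives a different GUID for each message and zero reply blocks, which is consistent with the observation that the mails were not linked into one conversation.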
There's a good analysis of how exactly this non-standard "Thread-Index" header appears to be used, in this post and links therefrom, including this pdf (a paper presented at the CEAS 2006 conference) and this follow-up, which includes a comment on the issue from the evolution source code (which seems to reflect substantial reverse-engineering of this undocumented header).
Executive summary: essentially, the author eventually gives up on using this header and recommends and shows a different approach, which is also implemented in the c-client library, part of the UW IMAP Toolkit open source package (which is not for IMAP only -- don't let the name fool you, it also works for POP, NNTP, local mailboxes, &c).
I wouldn't be surprised if there are mail clients out there which would not be able to link Blackberry's mails to their threads. The Thread-Index header appears to be a Microsoft extension.
Either way, Novell Evolution implements this. Take a look at this short description of how they do it, or this piece of code that finds the thread parent of a given message.
I assume that, because the lengths of the Thread-Index headers in your example are all the same, these messages were all thread starts? Strange that they're only 22 bytes, though I suppose you could try applying the 5-bytes-per-message rule to them and see if it works for you.
If you are interested in parsing the Thread-Index in C#, please take a look at this post:
http://forum.rebex.net/questions/3841/how-to-interprete-thread-index-header
The snippet you will find there lets you parse the Thread-Index and retrieve the thread GUID and the message DateTime. There is a problem, however: it does not work for all Thread-Index values out there. The open question is why some Thread-Index values produce an invalid DateTime, and what to do to support all of them.