Cleaning Emails for Custom Email System

Cleaning Emails for Custom Email System - email

I have noticed the various email clients prepend/append text to the text written by the user. For example, Gmail seems to prepend the following text to all email bodies:
"On Tue, Jul 14, 2015 at 11:41 AM, Jonny Bravo wrote: >"
The added text differs based on the client. I am not interested in this information. I would like to be able to extract the message body from the text with an approach that is relatively cross-platform. Does anything like this exist? Is the best solution to clean the text on a case-by-case basis?

We had the same problem at mailparser.io when we developed our "last reply" filter. We get very decent results by just checking against a set of regular expression.
The regular expressions we use are:
'/^(--)$/ms', // -- Signature break
'/^(-----(.+))$/ms', // ----- reply above
'/^(From:(.+))$/ms', // From:
'/^(On\s(.+)wrote:)$/ms', // On DATE, NAME <EMAIL> wrote:
'/^(Sent from(.+))$/ms', // Sent from (iPhone / iPad / Windows Mail ...)
With those you should actually catch most cases produced by e-mail clients which have their language set to English.

Related

Multiple To and Cc headers in MIME message sent through LotusScript

I'm building a LotusScript agent looping through a set of documents then - based on a given condition - create mail messages with formatted html text. The recipients will be mostly Non-Notes users (Outlook etc) that's why I want to make sure that subject and message body are formatted correctly. At least one copy is sent to a Domino mail-in database, though.
The code basically creates a MimeEntity, sets "To", "CC" and "Subject" headers then puts a pre-configured message into the mail body and sends it off.
In regards to the body I experimented both with a simple MimeEntity formatted as "text/html" as well as with a multipart message (Content-Type = "multipart/alternative") with 2 child entities (1: "text/plain" without any formatting, 2: "text/html" i.e. html-formatted); in my final code I plan to go for the latter method.
What is really weird is that the recipients (using Outlook as well as other mail clients like Thunderbird) see 3 "To:" and 3 "Cc:" items instead of just one. Looking at the doc in the receiving Domino mail-in database there is only one instance of each item (i.e. SendTo and CopyTo).
Here's the message's source code (taken from Thunderbird) showing those 3 instances of each item:
Return-Path: <sendername#myorg.de>
Received: (removed info here)
Subject: =?UTF-8?B?RWluIGdlbcO8dGxpY2hlcyBzaW1wbGVzIFRlc3RtYWlsIGF1cyBTT1A=?=
MIME-Version: 1.0
Auto-Submitted: auto-generated
To: user1#orgext1.de, user2#orgext2.de
CC: my-mail-in-db#myorg.de
To: user1#orgext1.de, user2#orgext2.de
CC: my-mail-in-db#myorg.de
To: user1#orgext1.de, user2#orgext2.de
CC: my-mail-in-db#myorg.de
Message-ID: <OFBCA50979.C1582837-ONC125856E.00548385-C125856E.0054838A#MYORG.DE>
From: Lothar Mueller <sendername#myorg.de>
This the basic code creating these mails (the simple non-multipart version):
Set docMemo = db.Createdocument()
Call docMemo.Replaceitemvalue("Form", "Memo")
Set nMimeBody = docMemo.Createmimeentity()
'SendTo
Set nMimeHead = nMimeBody.Createheader("To")
Call nMimeHead.Setheaderval("user1#otherorg.de,user2#3rdorg.de")
'CopyTo
Set nMimeHead = nMimeBody.Createheader("CC")
Call nMimeHead.Setheaderval("my-mail-in-db")
'Subject
Set nMimeHead = nMimeBody.Createheader("Subject")
Call nMimeHead.Addvaltext("Subject with ä-ö-ü-ß", "UTF-8")
'html version only for simple non-multipart MIME
Call nStream.Writetext({<p style="font-weight:bold;">Some simple formatted HTML content</p>})
Call nMimeBody.Setcontentfromtext(nStream, {text/html; charset="UTF-8"}, ENC_NONE)
Call nStream.Close()
'finally send
Call docMemo.Send(False)
Now, I can work around this behavior by simply setting the recipients as plain old Notes items, like:
Call docMemo.SendTo = recipientArray
Call docMemo.CopyTo = copyArray
instead of setting those values as MIME headers. In this case there are no more multiple instances of "To" and "CC" items at the recipients' mail clients.
I know that I did this already some years ago in a different project, and back then I didn't have those problems.
Anyone having an idea what could be the cause for this? Could it be due to the Domino version in use (now it's 10.0.1 FP4, back then it was some 9.0.1 version)?

Guess I found the cause for this, at least partially:
As I mentioned in an update to my post this behavior only can be observed when the agent is running in the client as opposed to running on the server:
examining the resulting mail through Ytria's scanEZ I find that there's a difference in regards to the fields that are created:
the run-on-server version just creates the expected fields "To:" and "Cc:" which turn up as "SendTo" and "CopyTo" in the resulting Notes document
If the code is running in the client some more fields are created in the Notes document: in addition to the standard fields there are also "INetSendTo", INetCopyTo, "AltSendTo" and "AltCopyTo". I assume that those extra fields are then rendered by the router to become addition "To:" and "Cc:" header items.
Thanks again to #DaveDelay for bringing up that idea regarding the router and mail.box

What does X-Sender-Id mean in email raw source (Found in phishing email)?

Somebody in my company is being subject to phishing. My first suggestion was just to change the password. However after awhile I received a fake mail from her address again.
Looking at the raw source of the email I found that there is another person's email in X-Sender-ID and I'm wondering who that might be. Is that the person who sent the email or can it be an account that has been hijacked? (I replaced the email with "somebody#host.com")
X-Virus-Scanned: OK
Received: by smtp5.relay.iad3a.emailsrvr.com (Authenticated sender: somebody-AT-host.com) with ESMTPA id DF2788019C;
Fri, 21 Nov 2014 07:54:42 -0500 (EST)
X-Sender-Id: somebody#host.com
Received: from smtp.emailsrvr.com ([UNAVAILABLE]. [2.133.148.211])
by 0.0.0.0:587 (trex/5.3.2);
Fri, 21 Nov 2014 12:54:46 GMT
What is X-Sender-ID? And what is the email it contains?

My deliberations are based on this RFC which describes the Privacy Enhancement for Emails which you are obviously using.
Basically it says about the X-Sender-ID:
[...] encapsulated header field, required for all
privacy-enhanced messages, identifies a message's sender and provides
the sender's IK identification component.
What does this mean?
First of all you have to check if the mail is properly signed. If thats the case you can be sure that somebody#host.com has a certificate. And you can be sure that the mail you received has been sent from this mail address.
I can't tell you the consequences which result out of this fact as I don't know how your company is deploying the certificates etc. ... the mail address/certificate could also have been hacked and thereby abused.
I hope this helps you for your further research.

While #LMF's answer is useful technical information, I'd like to offer a possible alternative explanation.
Spammers who are not familiar with e-mail (and PHP programmers with no other malicious intent) tend to succumb to cargo cult programming when it comes to email headers. In other words, if there is something they don't understand, they might think it does something useful, and include it in their message template.
Without knowledge about your email infrastructure, or other messages of yours to compare to, I would simply assume everything below the top-most Received: header is forged, and basically without meaning.
If you have a system which runs something called trex (maybe this one?) and it really manages to write a Received: header like that, I might be wrong. The format needlessly deviates from the de-facto standard Sendmail template in a few places, but it's not technically wrong (the format is basically free-form, but introducing ad-hoc syntax makes it harder to guess what the fields mean).
Again, more information about what your typical email (and your correspondent's typical mail) looks like, this is heavy on speculation.

The x-sender-id, along with the x-recipient-id are used to specify which interchange key was used in the broadcast of the message.
X-Sender-ID entity_id : issuing_authority : version
X-Recipient-ID entity_id : issuing_authority : version
The first field contains the identity of the sender or receiver. The first field is mandatory, must be unique, and must be formatted as user#host whereas the host is a fully qualified host address.
The second identifies the name of the authority which issued the interchange key.
The third field specifies the specific type of interchange key which was used. This is represented by an alphanumeric string defined by the issuing authority to label and organize the numerous interchange keys issued by that authority. It is recommmended that they use a timestamp but is not always the case.
If the field values of the x-sender-id second and third field are identical to that of the x-recipient-id they may be only listed in the field which is defined last.
Further Reading
"Distributed Computing & Cryptography: Proceedings of a DIMACS Workshop"

separate email from original email using perl

When people email each other, they generally include the original email in their reply to a sender, adding a little more information each time to the email. Each email client seems to have a different way of adding the original email to a reply.
I need to parse email arriving at our mail server and try and extract the new part of the message, and I'm wondering if there is a sensible way to strip this appended (or prepended) information (the "original message") and just get the new information in a mail body? I believe sadly, that there is no encoding, the original email is simply added to the new message, but I thought I'd check with the experts?
thanks.

No, there is no simple, straightforward algorithm to separate quoted or forwarded text from new content. Quoting and forwarding are poorly standardized and different conventions have existed at different times.
Having said that, e.g. Google's Gmail succeeds fairly well in practice. With enough samples, you can clearly come up with reasonable heuristics.
Good indicators for quoted material are forwarded (pseudo-) headers and indented text, perhaps with a quote indicator along the left margin before the quoted text. You occasionally see outdents as well.
Traditionally, on Usenet in the early 1990s, people would use different, unique quoting styles.
: ~ | This seems to be the original.
: ~ This is the first reply.
: This is the second reply.
This is the third reply, quoting the
previous three messages in sequence.
Around 1995, both clients and standardization initiatives by and large converged on "wedge" quotes;
> >> This seems to be the original.
> > This is the first reply.
> This is the second reply.
This is the third reply, quoting the
previous three messages in sequence.
Then along came Microsoft and ruined it all. I suppose that top quoting makes sense in some corporate settings where you quickly need to collect all the background from a thread to a new participant, but even for that purpose it's a horrible abomination.
This is the third reply, quoting the
previous three messages in sequence.
---- Begin forwarded message ----
From: Him [smtp:bogus]
To: His Friend
Subject: VS: Re: Same as on this message
Date: nothing machine-readable
This is the second reply.
---- Alkuperäinen viesti ----
Lähettäjä: His Friend [smtp:poppycock]
Saaja: Some Guy
Aihe: Re: Same as on this message
Päivämäärä: olisiko eilen ehkä
This is the first reply.
----- Original message ----
From: Somebody Else [smtp:mindless]
To: Some Guy
Subject: Same as on this message
Date: like, the day before
This seems to be the original.

How does the email header field 'thread-index' work?

I was wondering if anyone knew how the thread-index field in email headers work?
Here's a simple chain of emails thread indexes that I messaged myself with.
Email 1 Thread-Index: AcqvbpKt7QRrdlwaRBKmERImIT9IDg==
Email 2 Thread-Index: AcqvbpjOf+21hsPgR4qZeVu9O988Eg==
Email 3 Thread-Index: Acqvbp3C811djHLbQ9eTGDmyBL925w==
Email 4 Thread-Index: AcqvbqMuifoc5OztR7ei1BLNqFSVvw==
Email 5 Thread-Index: AcqvbqfdWWuz4UwLS7arQJX7/XeUvg==
I can't seem to say with certainty how I can link these emails together. Normally, I would use the in-reply-to field or references field, but I recently found that Blackberrys do NOT include these fields. The only include Thread-Index field.

They are base64 encoded Conversation Index values. No need to reverse engineer them as they are documented by Microsoft on e.g. http://msdn.microsoft.com/en-us/library/ms528174(v=exchg.10).aspx and more detailed on http://msdn.microsoft.com/en-us/library/ee202481(v=exchg.80).aspx
Seemingly the indexes in your example doesn't represent the same conversation, which probably means that the software that sent the mails wasn't able to link them together.
EDIT: Unfortunately I don't have enough reputation to add a comment, but adamo is right that it contains a timestamp - a somewhat esoteric encoded partial FILETIME. But it also contains a GUID, so it is pretty much guarenteed to be unique for that mail (of course the same mail can exist in multiple copies).

There's a good analysis of how exactly this non-standard "Thread-Index" header appears to be used, in this post and links therefrom, including this pdf (a paper presented at the CEAS 2006 conference) and this follow-up, which includes a comment on the issue from the evolution source code (which seems to reflect substantial reverse-engineering of this undocumented header).
Executive summary: essentially, the author eventually gives up on using this header and recommends and shows a different approach, which is also implemented in the c-client library, part of the UW IMAP Toolkit open source package (which is not for IMAP only -- don't let the name fool you, it also works for POP, NNTP, local mailboxes, &c).

I wouldn't be surprised if there are mail clients out there which would not be able to link Blackberry's mails to their threads. The Thread-Index header appears to be a Microsoft extension.
Either way, Novell Evolution implements this. Take a look at this short description of how they do it, or this piece of code that finds the thread parent of a given message.
I assume that, because the lengths of the Thread-Index headers in your example are all the same, these messages were all thread starts? Strange that they're only 22-bytes, though I suppose you could try applying the 5-bytes-per-message rule to them and see if it works for you.

If you are interested in parsing the Thread-Index in C# please take a look at this post
http://forum.rebex.net/questions/3841/how-to-interprete-thread-index-header
The snippet you will find there will let you parse the Thread-Index and retrieve the Thread GUID and message DateTime. There is a problem however, it does not work for all Thread-Indexes out there. Question is why do some Thread-Indexes generate invalid DateTime and what to do to support all of them???

Correct format of an Return-Path header

My application uses sendmail to send outbound email. I set the 'From:' address using the following format:
Fred Dibnah <fred#dibnah.com>
I'm also setting the Reply-To and Return-Path headers using the exact same format.
This seems to work in the vast majority of cases but I have seen at least one instance in which this fails, namely when the name part of the above string contains a period (full stop):
Fred Dibnah, Inc. <fred#dibnah.com>
This fails deep inside the TMail code (I'm using Ruby) but it seems like a perfectly valid thing to do.
My question is, should I actually be setting the Return-Path and Reply-To headers using only the email address as opposed to the above Name + Email format? E.g.
fred#dibnah.com
Thanks.

In a situation like this, it is best to turn to the RFCs.
Upon reading up on your question, it appears as if You shouldn't be setting the Return-Path value ever. The final destination SMTP server is supposed to be setting this value as it transitions the message to your mailbox (http://www.faqs.org/rfcs/rfc2821.html starting at 4.4).
According to http://www.faqs.org/rfcs/rfc2822.html the Reply-To field can have the following formats
local-part "#" domain (fred#dibnah.com for example)
display-name (Fred Dibna for example)
I would recommend using option 1 as it seems to be the most basic, and you will likely have less issues with that format. In choosing option 1, your Reply-To field should look like the following:
Reply-To: fred#dibna.com

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Cleaning Emails for Custom Email System - email

Related

Multiple To and Cc headers in MIME message sent through LotusScript

What does X-Sender-Id mean in email raw source (Found in phishing email)?

separate email from original email using perl

How does the email header field 'thread-index' work?

Correct format of an Return-Path header

Categories

Resources