Read only the latest replied email text.
----------
I am good
On Saturday 27 January 2018 10:32 AM, xyz.sms wrote:
----hello sai how are you ?
How to retrieve "i am good" ? Any direct way to retrieve this? Please help.
There's no standard for this. It's just a string. You have to guess what part of the string is an included or referenced message and which part is the reply. Different mailers use different techniques to include the original message in the body of a reply message.
Related
I am having a PST files which contains the email history of a user. The task is to read this PST file and reconstruct the email history to display it in a client. This includes the correctly displaying of conversations as you know it from Email clients:
Meeting at 8:00 07:34 am
AW: Meeting at 8:00 09:12 am
AW: AW: Meeting at 8:00 13:45 pm
[Jenkins Build] Success 11:54 am
[Jenkins Build] Failed 12:13 pm
[Jenkins Build] Success 01:12 pm
[Jenkins Build] Success 10:34 am
[Jenkins Build] Failed 12:12 pm
[Jenkins Build] Success 05:12 pm
However, I don't know how I could do this reliably.
I am using java-libpst (see Official Documentation) which provides a PSTMessage object. There is a method getConversationId() but that appears to be just a string of the original subject of that message which means that there might be duplicates (e.g. [Jenkins Build]*).
So, I am not sure how Outlook is able to reconstruct conversations and whether this is trivial but if there is actually a simple method to do this which I am just overlooking I'd be happy if somebody would let me know - otherwise this will end up in me parsing a ton of subject fields, parsing them and trying to match emails by their subject with the danger of missing different conversations which just have the same subject coincidentally.
I think you will need to construct the conversations yourself. You might find the source code referenced on this page about the Netscape Mail message threading algorithm helpful.
I copied the source code to Github. Here's the email Threader.java file.
Here is someone offering an explanation of how Gmail constructs conversations My gist is:
Emails coming after an email with an equivalent subject, from any of the participants in any previous email, are part of the same conversation.
The in-reply-to email field can create participants to an email conversation even if they weren't an explicit participant.
Where:
equivalent subject means either an identical subject, or a subject that would result replying or forwarding. I.e. "FW: X", "RE: X", "Fwd: X", etc.
explicit participants in an email: the sender or any email appearing in a TO: or CC: field. (Maybe a BCC: field too...)
participants in an email: explicit participants in an email or anyone who has sent a later email using the in-reply-to field.
participants in any previous email: the distinct emails that are participants in email with an earlier send date having equivalent subject to a current email.
Here's another exposition of email fields relevant to email threading. What I took from this is that the References header should also be consulted in addition to the in-reply-to header, and that it is more reliable. (Maybe, if present, it should supercede the in-reply-to header.
I have noticed the various email clients prepend/append text to the text written by the user. For example, Gmail seems to prepend the following text to all email bodies:
"On Tue, Jul 14, 2015 at 11:41 AM, Jonny Bravo wrote: >"
The added text differs based on the client. I am not interested in this information. I would like to be able to extract the message body from the text with an approach that is relatively cross-platform. Does anything like this exist? Is the best solution to clean the text on a case-by-case basis?
We had the same problem at mailparser.io when we developed our "last reply" filter. We get very decent results by just checking against a set of regular expression.
The regular expressions we use are:
'/^(--)$/ms', // -- Signature break
'/^(-----(.+))$/ms', // ----- reply above
'/^(From:(.+))$/ms', // From:
'/^(On\s(.+)wrote:)$/ms', // On DATE, NAME <EMAIL> wrote:
'/^(Sent from(.+))$/ms', // Sent from (iPhone / iPad / Windows Mail ...)
With those you should actually catch most cases produced by e-mail clients which have their language set to English.
Somebody in my company is being subject to phishing. My first suggestion was just to change the password. However after awhile I received a fake mail from her address again.
Looking at the raw source of the email I found that there is another person's email in X-Sender-ID and I'm wondering who that might be. Is that the person who sent the email or can it be an account that has been hijacked? (I replaced the email with "somebody#host.com")
X-Virus-Scanned: OK
Received: by smtp5.relay.iad3a.emailsrvr.com (Authenticated sender: somebody-AT-host.com) with ESMTPA id DF2788019C;
Fri, 21 Nov 2014 07:54:42 -0500 (EST)
X-Sender-Id: somebody#host.com
Received: from smtp.emailsrvr.com ([UNAVAILABLE]. [2.133.148.211])
by 0.0.0.0:587 (trex/5.3.2);
Fri, 21 Nov 2014 12:54:46 GMT
What is X-Sender-ID? And what is the email it contains?
My deliberations are based on this RFC which describes the Privacy Enhancement for Emails which you are obviously using.
Basically it says about the X-Sender-ID:
[...] encapsulated header field, required for all
privacy-enhanced messages, identifies a message's sender and provides
the sender's IK identification component.
What does this mean?
First of all you have to check if the mail is properly signed. If thats the case you can be sure that somebody#host.com has a certificate. And you can be sure that the mail you received has been sent from this mail address.
I can't tell you the consequences which result out of this fact as I don't know how your company is deploying the certificates etc. ... the mail address/certificate could also have been hacked and thereby abused.
I hope this helps you for your further research.
While #LMF's answer is useful technical information, I'd like to offer a possible alternative explanation.
Spammers who are not familiar with e-mail (and PHP programmers with no other malicious intent) tend to succumb to cargo cult programming when it comes to email headers. In other words, if there is something they don't understand, they might think it does something useful, and include it in their message template.
Without knowledge about your email infrastructure, or other messages of yours to compare to, I would simply assume everything below the top-most Received: header is forged, and basically without meaning.
If you have a system which runs something called trex (maybe this one?) and it really manages to write a Received: header like that, I might be wrong. The format needlessly deviates from the de-facto standard Sendmail template in a few places, but it's not technically wrong (the format is basically free-form, but introducing ad-hoc syntax makes it harder to guess what the fields mean).
Again, more information about what your typical email (and your correspondent's typical mail) looks like, this is heavy on speculation.
The x-sender-id, along with the x-recipient-id are used to specify which interchange key was used in the broadcast of the message.
X-Sender-ID entity_id : issuing_authority : version
X-Recipient-ID entity_id : issuing_authority : version
The first field contains the identity of the sender or receiver. The first field is mandatory, must be unique, and must be formatted as user#host whereas the host is a fully qualified host address.
The second identifies the name of the authority which issued the interchange key.
The third field specifies the specific type of interchange key which was used. This is represented by an alphanumeric string defined by the issuing authority to label and organize the numerous interchange keys issued by that authority. It is recommmended that they use a timestamp but is not always the case.
If the field values of the x-sender-id second and third field are identical to that of the x-recipient-id they may be only listed in the field which is defined last.
Further Reading
"Distributed Computing & Cryptography: Proceedings of a DIMACS Workshop"
When people email each other, they generally include the original email in their reply to a sender, adding a little more information each time to the email. Each email client seems to have a different way of adding the original email to a reply.
I need to parse email arriving at our mail server and try and extract the new part of the message, and I'm wondering if there is a sensible way to strip this appended (or prepended) information (the "original message") and just get the new information in a mail body? I believe sadly, that there is no encoding, the original email is simply added to the new message, but I thought I'd check with the experts?
thanks.
No, there is no simple, straightforward algorithm to separate quoted or forwarded text from new content. Quoting and forwarding are poorly standardized and different conventions have existed at different times.
Having said that, e.g. Google's Gmail succeeds fairly well in practice. With enough samples, you can clearly come up with reasonable heuristics.
Good indicators for quoted material are forwarded (pseudo-) headers and indented text, perhaps with a quote indicator along the left margin before the quoted text. You occasionally see outdents as well.
Traditionally, on Usenet in the early 1990s, people would use different, unique quoting styles.
: ~ | This seems to be the original.
: ~ This is the first reply.
: This is the second reply.
This is the third reply, quoting the
previous three messages in sequence.
Around 1995, both clients and standardization initiatives by and large converged on "wedge" quotes;
> >> This seems to be the original.
> > This is the first reply.
> This is the second reply.
This is the third reply, quoting the
previous three messages in sequence.
Then along came Microsoft and ruined it all. I suppose that top quoting makes sense in some corporate settings where you quickly need to collect all the background from a thread to a new participant, but even for that purpose it's a horrible abomination.
This is the third reply, quoting the
previous three messages in sequence.
---- Begin forwarded message ----
From: Him [smtp:bogus]
To: His Friend
Subject: VS: Re: Same as on this message
Date: nothing machine-readable
This is the second reply.
---- Alkuperäinen viesti ----
Lähettäjä: His Friend [smtp:poppycock]
Saaja: Some Guy
Aihe: Re: Same as on this message
Päivämäärä: olisiko eilen ehkä
This is the first reply.
----- Original message ----
From: Somebody Else [smtp:mindless]
To: Some Guy
Subject: Same as on this message
Date: like, the day before
This seems to be the original.
I was wondering if anyone knew how the thread-index field in email headers work?
Here's a simple chain of emails thread indexes that I messaged myself with.
Email 1 Thread-Index: AcqvbpKt7QRrdlwaRBKmERImIT9IDg==
Email 2 Thread-Index: AcqvbpjOf+21hsPgR4qZeVu9O988Eg==
Email 3 Thread-Index: Acqvbp3C811djHLbQ9eTGDmyBL925w==
Email 4 Thread-Index: AcqvbqMuifoc5OztR7ei1BLNqFSVvw==
Email 5 Thread-Index: AcqvbqfdWWuz4UwLS7arQJX7/XeUvg==
I can't seem to say with certainty how I can link these emails together. Normally, I would use the in-reply-to field or references field, but I recently found that Blackberrys do NOT include these fields. The only include Thread-Index field.
They are base64 encoded Conversation Index values. No need to reverse engineer them as they are documented by Microsoft on e.g. http://msdn.microsoft.com/en-us/library/ms528174(v=exchg.10).aspx and more detailed on http://msdn.microsoft.com/en-us/library/ee202481(v=exchg.80).aspx
Seemingly the indexes in your example doesn't represent the same conversation, which probably means that the software that sent the mails wasn't able to link them together.
EDIT: Unfortunately I don't have enough reputation to add a comment, but adamo is right that it contains a timestamp - a somewhat esoteric encoded partial FILETIME. But it also contains a GUID, so it is pretty much guarenteed to be unique for that mail (of course the same mail can exist in multiple copies).
There's a good analysis of how exactly this non-standard "Thread-Index" header appears to be used, in this post and links therefrom, including this pdf (a paper presented at the CEAS 2006 conference) and this follow-up, which includes a comment on the issue from the evolution source code (which seems to reflect substantial reverse-engineering of this undocumented header).
Executive summary: essentially, the author eventually gives up on using this header and recommends and shows a different approach, which is also implemented in the c-client library, part of the UW IMAP Toolkit open source package (which is not for IMAP only -- don't let the name fool you, it also works for POP, NNTP, local mailboxes, &c).
I wouldn't be surprised if there are mail clients out there which would not be able to link Blackberry's mails to their threads. The Thread-Index header appears to be a Microsoft extension.
Either way, Novell Evolution implements this. Take a look at this short description of how they do it, or this piece of code that finds the thread parent of a given message.
I assume that, because the lengths of the Thread-Index headers in your example are all the same, these messages were all thread starts? Strange that they're only 22-bytes, though I suppose you could try applying the 5-bytes-per-message rule to them and see if it works for you.
If you are interested in parsing the Thread-Index in C# please take a look at this post
http://forum.rebex.net/questions/3841/how-to-interprete-thread-index-header
The snippet you will find there will let you parse the Thread-Index and retrieve the Thread GUID and message DateTime. There is a problem however, it does not work for all Thread-Indexes out there. Question is why do some Thread-Indexes generate invalid DateTime and what to do to support all of them???