How to reconstruct conversations or group emails? - email

I am having a PST files which contains the email history of a user. The task is to read this PST file and reconstruct the email history to display it in a client. This includes the correctly displaying of conversations as you know it from Email clients:
Meeting at 8:00 07:34 am
AW: Meeting at 8:00 09:12 am
AW: AW: Meeting at 8:00 13:45 pm
[Jenkins Build] Success 11:54 am
[Jenkins Build] Failed 12:13 pm
[Jenkins Build] Success 01:12 pm
[Jenkins Build] Success 10:34 am
[Jenkins Build] Failed 12:12 pm
[Jenkins Build] Success 05:12 pm
However, I don't know how I could do this reliably.
I am using java-libpst (see Official Documentation) which provides a PSTMessage object. There is a method getConversationId() but that appears to be just a string of the original subject of that message which means that there might be duplicates (e.g. [Jenkins Build]*).
So, I am not sure how Outlook is able to reconstruct conversations and whether this is trivial but if there is actually a simple method to do this which I am just overlooking I'd be happy if somebody would let me know - otherwise this will end up in me parsing a ton of subject fields, parsing them and trying to match emails by their subject with the danger of missing different conversations which just have the same subject coincidentally.

I think you will need to construct the conversations yourself. You might find the source code referenced on this page about the Netscape Mail message threading algorithm helpful.
I copied the source code to Github. Here's the email Threader.java file.
Here is someone offering an explanation of how Gmail constructs conversations My gist is:
Emails coming after an email with an equivalent subject, from any of the participants in any previous email, are part of the same conversation.
The in-reply-to email field can create participants to an email conversation even if they weren't an explicit participant.
Where:
equivalent subject means either an identical subject, or a subject that would result replying or forwarding. I.e. "FW: X", "RE: X", "Fwd: X", etc.
explicit participants in an email: the sender or any email appearing in a TO: or CC: field. (Maybe a BCC: field too...)
participants in an email: explicit participants in an email or anyone who has sent a later email using the in-reply-to field.
participants in any previous email: the distinct emails that are participants in email with an earlier send date having equivalent subject to a current email.
Here's another exposition of email fields relevant to email threading. What I took from this is that the References header should also be consulted in addition to the in-reply-to header, and that it is more reliable. (Maybe, if present, it should supercede the in-reply-to header.

Related

I want to forward an Outlook email with all text after a particular string deleted

I am using Outlook 365. When I forward received emails, I want all text after a particular string ('regards' in the example below) to be deleted.
For example, I receive the following email which I want to forward:
From: Sender Gmail sender#gmail.com
Sent: Saturday, 18 February, 2023 02:58
To: Receiver receiver#gmail.com
Subject: DTP Query
Hello Receiver,
Hope you are doing well.
Please review the project and let me know the cost and turnaround time.
Best regards
ABC Enterprises
New Delhi, India
abc#gmail.com
I want the header removed:
From: Sender Gmail sender#gmail.com
Sent: Saturday, 18 February, 2023 02:58
To: Receiver receiver#gmail.com
Subject: DTP Query
I want to retain:
Hello Receiver,
Hope you are doing well.
Please review the project and let me know the cost and turnaround time.
Best regards
And all information after regards to be removed:
ABC Enterprises
New Delhi, India
abc#gmail.com
Is there a way to do this?
Dhiraj Aggarwal
New Delhi, India
I am not a technical guy, so didn't try anything.

o365 email account: Programatically accessing To Alias in Internet Headers

There is a problem accessing the To alias from a o365 account IF the from account is also o365. If the from account is say, gmail, it works.
If I send an email to alias#mycompany.com which is an alias to realAccount#mycompany.com, if I examine the To header in Outlook, it will always show me the original alias. If I view the header progrmatically, it will NOT show the alias if it was sent from an o365 account. Instead, it shows the real account. If I do this same test with a gmail instead of an o365 email it works -- shows the alias in the To: header as expected.
How does Outlook access this data? The number of headers are different too. Outlook contains more data. Has anyone experienced this? Any ideas on how to access the alias like Outlook does?
Header when accessing from Outlook:
From: o365Account#somecompany.com
To: ***************** alias#mycompany.com ****************
Subject: shdaKJSDHA
Thread-Topic: shdaKJSDHA
Thread-Index: AQHUSTkz1fQhzI5SG0ie26mNIvHmmQ==
Date: Mon, 10 Sep 2018 19:05:12 +0000
Message-ID: <---#-----.prod.outlook.com>
Header when accessed programatically:
From: o365Account#company.com
To: *****************realAccount#mycompany.com ****************
Subject: shdaKJSDHA
Thread-Topic: shdaKJSDHA
Thread-Index: AQHUSTkz1fQhzI5SG0ie26mNIvHmmQ==
X-MS-Exchange-MessageSentRepresentingType: 1
Date: Mon, 10 Sep 2018 19:05:12 +0000
Message-ID: <----#-----.prod.outlook.com>
Keep in mind that Exchange always resolves the sender and recipients to their primary addresses both when sending and when receiving the messages. This is just the way it works.
Are you sending through SMTP?
You know what I ended up doing? I made a Distro list with 1 recipient. The distro list takes the place of the alias. It always shows the To: as the distro list. That way it doesn't get lost. Thank you for helping me understand Exchange Server dmitry-streblechenko.

How to read only the latest replied email text in javamail

Read only the latest replied email text.
----------
I am good
On Saturday 27 January 2018 10:32 AM, xyz.sms wrote:
----hello sai how are you ?
How to retrieve "i am good" ? Any direct way to retrieve this? Please help.
There's no standard for this. It's just a string. You have to guess what part of the string is an included or referenced message and which part is the reply. Different mailers use different techniques to include the original message in the body of a reply message.

Cleaning Emails for Custom Email System

I have noticed the various email clients prepend/append text to the text written by the user. For example, Gmail seems to prepend the following text to all email bodies:
"On Tue, Jul 14, 2015 at 11:41 AM, Jonny Bravo wrote: >"
The added text differs based on the client. I am not interested in this information. I would like to be able to extract the message body from the text with an approach that is relatively cross-platform. Does anything like this exist? Is the best solution to clean the text on a case-by-case basis?
We had the same problem at mailparser.io when we developed our "last reply" filter. We get very decent results by just checking against a set of regular expression.
The regular expressions we use are:
'/^(--)$/ms', // -- Signature break
'/^(-----(.+))$/ms', // ----- reply above
'/^(From:(.+))$/ms', // From:
'/^(On\s(.+)wrote:)$/ms', // On DATE, NAME <EMAIL> wrote:
'/^(Sent from(.+))$/ms', // Sent from (iPhone / iPad / Windows Mail ...)
With those you should actually catch most cases produced by e-mail clients which have their language set to English.

What does X-Sender-Id mean in email raw source (Found in phishing email)?

Somebody in my company is being subject to phishing. My first suggestion was just to change the password. However after awhile I received a fake mail from her address again.
Looking at the raw source of the email I found that there is another person's email in X-Sender-ID and I'm wondering who that might be. Is that the person who sent the email or can it be an account that has been hijacked? (I replaced the email with "somebody#host.com")
X-Virus-Scanned: OK
Received: by smtp5.relay.iad3a.emailsrvr.com (Authenticated sender: somebody-AT-host.com) with ESMTPA id DF2788019C;
Fri, 21 Nov 2014 07:54:42 -0500 (EST)
X-Sender-Id: somebody#host.com
Received: from smtp.emailsrvr.com ([UNAVAILABLE]. [2.133.148.211])
by 0.0.0.0:587 (trex/5.3.2);
Fri, 21 Nov 2014 12:54:46 GMT
What is X-Sender-ID? And what is the email it contains?
My deliberations are based on this RFC which describes the Privacy Enhancement for Emails which you are obviously using.
Basically it says about the X-Sender-ID:
[...] encapsulated header field, required for all
privacy-enhanced messages, identifies a message's sender and provides
the sender's IK identification component.
What does this mean?
First of all you have to check if the mail is properly signed. If thats the case you can be sure that somebody#host.com has a certificate. And you can be sure that the mail you received has been sent from this mail address.
I can't tell you the consequences which result out of this fact as I don't know how your company is deploying the certificates etc. ... the mail address/certificate could also have been hacked and thereby abused.
I hope this helps you for your further research.
While #LMF's answer is useful technical information, I'd like to offer a possible alternative explanation.
Spammers who are not familiar with e-mail (and PHP programmers with no other malicious intent) tend to succumb to cargo cult programming when it comes to email headers. In other words, if there is something they don't understand, they might think it does something useful, and include it in their message template.
Without knowledge about your email infrastructure, or other messages of yours to compare to, I would simply assume everything below the top-most Received: header is forged, and basically without meaning.
If you have a system which runs something called trex (maybe this one?) and it really manages to write a Received: header like that, I might be wrong. The format needlessly deviates from the de-facto standard Sendmail template in a few places, but it's not technically wrong (the format is basically free-form, but introducing ad-hoc syntax makes it harder to guess what the fields mean).
Again, more information about what your typical email (and your correspondent's typical mail) looks like, this is heavy on speculation.
The x-sender-id, along with the x-recipient-id are used to specify which interchange key was used in the broadcast of the message.
X-Sender-ID entity_id : issuing_authority : version
X-Recipient-ID entity_id : issuing_authority : version
The first field contains the identity of the sender or receiver. The first field is mandatory, must be unique, and must be formatted as user#host whereas the host is a fully qualified host address.
The second identifies the name of the authority which issued the interchange key.
The third field specifies the specific type of interchange key which was used. This is represented by an alphanumeric string defined by the issuing authority to label and organize the numerous interchange keys issued by that authority. It is recommmended that they use a timestamp but is not always the case.
If the field values of the x-sender-id second and third field are identical to that of the x-recipient-id they may be only listed in the field which is defined last.
Further Reading
"Distributed Computing & Cryptography: Proceedings of a DIMACS Workshop"