Verifying/testing the output of mime4j parsed content - email

I am creating a tool that is required to parse incoming MIME streams and return the email body and email attachments as separate file streams.
I am using mime4j for this purpose.
Following are the problems that I am stuck on:
How can I test whether the email body file or email attachment file that I parsed out via mime4j from MIME stream is correct?
I have a large corpus of emails in raw MIME form to run my tests on, and I need an automated way to determine which ones break mime4j's parsing so that I can tweak the code accordingly.

You could decode the attachments and then re-encode them. If the re-encoded stream matches (byte-for-byte) the original, then that's a good sign that mime4j is properly handling them.
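The round-trip idea can be sketched as follows. This is a minimal illustration that uses Python's standard `email` package as a stand-in for mime4j; the function name `base64_round_trips` and the sample message are assumptions made for the example, not part of mime4j. It compares the re-encoded base64 text with the original after stripping line breaks, because encoders are free to wrap lines differently.

```python
import base64
from email import message_from_bytes, policy
from email.message import EmailMessage


def base64_round_trips(raw_message: bytes) -> dict:
    """Map each base64 part's name to True if decode + re-encode
    reproduces the original encoded text (line wrapping ignored)."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    results = {}
    for part in msg.walk():
        if part.is_multipart():
            continue
        cte = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if cte != "base64":
            continue
        # Encoded text exactly as it appeared in the stream, minus whitespace.
        original = "".join(part.get_payload(decode=False).split())
        decoded = part.get_payload(decode=True)
        re_encoded = base64.b64encode(decoded).decode("ascii")
        name = part.get_filename() or part.get_content_type()
        results[name] = (re_encoded == original)
    return results


# Hypothetical sample message, built only to exercise the check.
sample = EmailMessage()
sample["Subject"] = "round-trip demo"
sample.set_content("see attachment")
sample.add_attachment(bytes(range(256)), maintype="application",
                      subtype="octet-stream", filename="blob.bin")
report = base64_round_trips(sample.as_bytes())
```

A `False` entry does not prove the parser is wrong (the sender may have produced non-canonical base64), but it flags the message for a closer look.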

I initially parsed a sample corpus of *.eml files using mime4j and checked the results manually for parsing errors, since I had no better option.
Now I use those manually verified results as a test bed against which I check my parsed output on each iteration.
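A baseline comparison of this kind could look roughly like the sketch below. It again uses Python's standard `email` package rather than mime4j, and the helper names `part_digests` and `changed_parts` are hypothetical. The idea is to record a digest of every extracted part once the output has been verified by hand, then report any part whose digest differs on a later run.

```python
import hashlib
from email import message_from_bytes, policy


def part_digests(raw_message: bytes) -> dict:
    """Return {"<index>:<content-type>": sha256 hex digest} for each leaf part."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    digests = {}
    index = 0
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        key = f"{index}:{part.get_content_type()}"
        digests[key] = hashlib.sha256(payload).hexdigest()
        index += 1
    return digests


def changed_parts(baseline: dict, current: dict) -> list:
    """List the keys whose digest is missing or different between two runs."""
    keys = set(baseline) | set(current)
    return sorted(k for k in keys if baseline.get(k) != current.get(k))


# Hypothetical single-part message used as the verified baseline.
raw = (b"From: a@example.com\r\nSubject: t\r\n"
       b"Content-Type: text/plain\r\n\r\nhello\r\n")
baseline = part_digests(raw)
unchanged = changed_parts(baseline, part_digests(raw))
```

The baseline dictionary can be stored next to each *.eml file (for example as JSON), so that only messages with a non-empty difference need manual inspection.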

Related

Mailcow sieve script that removes attachments and adds a message to the body

I'm trying to find out how to remove non-whitelisted attachments by MIME type (e.g. zip, exe, ...)
and append a message about the removed attachments.
I found this: https://superuser.com/a/1502589
And it worked to add a message to the subject.
But I cannot find out how to add a message to the body.
My plan was to use a regex on the attachment MIME types and allow e.g.
text/* and application/json etc.
But I cannot find a single example of how to change the body.
I'm using mailcow and a sieve script (I'm new to both).
Or is there a better way to "sanitize" emails before they get put into the inbox?
EDIT (2023-02-07): I found this today:
The "foreverypart" extension.
Sieve Email Filtering: MIME Part Tests, Iteration, Extraction, Replacement, and Enclosure
https://www.rfc-editor.org/rfc/rfc5703.html
"The "replace" command is defined to allow a MIME part to be replaced with the text supplied in the command."
That is exactly what I'm trying to do.
Now I need to find out how to install the extension and try it out.
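Independent of Sieve, the whitelist logic described in the question (allow text/* and application/json, drop everything else, and note the removals in the body) can be illustrated with a short Python sketch. This is not a Sieve script and not something mailcow provides; the function names and the glob-style patterns are assumptions chosen for the example, and it only handles attachments that sit directly under the top-level multipart.

```python
from email.message import EmailMessage
from fnmatch import fnmatch

ALLOWED = ("text/*", "application/json")  # assumed whitelist from the question


def is_allowed(content_type: str, patterns=ALLOWED) -> bool:
    return any(fnmatch(content_type.lower(), p) for p in patterns)


def strip_attachments(msg: EmailMessage, patterns=ALLOWED) -> list:
    """Drop disallowed attachments in place and append a note to the
    plain-text body. Returns the names of the removed attachments."""
    removed, kept = [], []
    if not msg.is_multipart():
        return removed
    for part in msg.get_payload():
        if (part.get_content_disposition() == "attachment"
                and not is_allowed(part.get_content_type(), patterns)):
            removed.append(part.get_filename() or part.get_content_type())
        else:
            kept.append(part)
    if removed:
        msg.set_payload(kept)
        body = msg.get_body(preferencelist=("plain",))
        if body is not None:
            note = "\n[Removed attachments: " + ", ".join(removed) + "]\n"
            body.set_content(body.get_content() + note)
    return removed


# Hypothetical message with one disallowed and one allowed attachment.
mail = EmailMessage()
mail.set_content("hello\n")
mail.add_attachment(b"PK\x03\x04", maintype="application",
                    subtype="zip", filename="x.zip")
mail.add_attachment(b'{"a": 1}', maintype="application",
                    subtype="json", filename="d.json")
removed_names = strip_attachments(mail)
```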

PidTagInternetCodePage not present in msg file

Looking at the MS-OXPROPS, MS-OXCMSG and MS-OXCMAIL documentation, it is said that the user should include PidTagInternetCodePage to indicate the appropriate code page for the HTML content in order to parse it properly.
However, when I open the OLE streams of the MSG files, I cannot find the 0x3FDE stream that indicates the code page ID; I only found some semblance of a code page ID in the first line of the compressed RTF stream.
Am I looking at the streams incorrectly, or are the other properties hidden in other streams? If so, how do I look for them?
Thanks in advance.
The PidTagInternetCodePage property is not guaranteed to be present and is in no way required, especially if it is a Unicode MSG file. The HTML body can include the meta tag with encoding in the header, and even then, it won't be necessary if all Unicode characters in the HTML body are properly HTML-encoded (which is always a good idea).
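When the property is absent, falling back to the encoding declared in the HTML itself might look like the sketch below. It is a simplified illustration, not the algorithm Outlook or the MS-OXCMAIL specification uses; the function name `decode_html_body`, the 4096-byte search window, and the UTF-8 fallback are all assumptions.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-:.]+)""", re.IGNORECASE)


def decode_html_body(html_bytes: bytes, fallback: str = "utf-8") -> str:
    """Decode an HTML body using the charset named in its meta tag, if any."""
    match = _META_CHARSET.search(html_bytes[:4096])
    name = match.group(1).decode("ascii") if match else fallback
    try:
        return html_bytes.decode(name)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: degrade gracefully.
        return html_bytes.decode(fallback, errors="replace")


# Hypothetical bodies used to exercise both branches.
legacy = ('<html><head><meta http-equiv="Content-Type" '
          'content="text/html; charset=windows-1252"></head>'
          '<body>caf\xe9</body></html>').encode("latin-1")
legacy_text = decode_html_body(legacy)
plain_text = decode_html_body("<p>\u65e5\u672c</p>".encode("utf-8"))
```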

Sending an e-mail through MATLAB using Microsoft Outlook

I am using the function found at the following blog: https://uk.mathworks.com/matlabcentral/answers/94446-can-i-send-e-mail-through-matlab-using-microsoft-outlook to send e-mails from my Outlook account from the MATLAB editor. It indeed works, and I can also include attachments or pictures etc.
My question is whether it is possible to send the result of another MATLAB function that is saved and run in the same directory. I tried calling the function like this:
sendolmail('email_address','Subject included',result of a function);
When I run this, an error is returned. It seems as if only strings or attachments from my computer can be sent through the function. Any ideas on how the results of functions can be added and sent?
No, it's all in HTML format, so you have to save the results to a file or convert the output to an HTML formatted output (could just be a string).

Parse and display MIME multipart email on website

I have a raw email, (MIME multipart), and I want to display this on a website (e.g. in an iframe, with tabs for the HTML part and the plain text part, etc.). Are there any CPAN modules or Template::Toolkit plugins that I can use to help me achieve this?
At the moment, it's looking like I'll have to parse the message with Email::MIME, then iterate over all the parts, and write a handler for all the different mime types.
It's a long shot, but I'm wondering if anyone has done all this already? Writing the handlers myself is going to be a long and error-prone process.
Thanks for any help.
I dealt with this problem just a few months ago. I added an email feature, both sending and receiving, to the product I work on. The first part was sending reminders to users. Because we didn't want to manage the bounces on behalf of our customers' admins, we decided to provide a message inbox where the admins could see bounces and replies without us and adjust email addresses themselves if they needed to.
Because of this, we accept all email that is sent to an inbox we watch. We use VERP to associate an email with a user, and store the entire email as is in the database. Then, when the admin requests to see the email, we have to parse the email.
My first attempt was very similar to an earlier answer: if one of the parts is HTML, show it; if it's text, show it; otherwise, show the original, raw email. This broke down real fast with a few emails not generated by sendmail. Outlook, Exchange, and a few other email systems don't structure their mail that way; they use multiparts to send the email. After a lot of digging and cussing, I discovered that the problem doesn't appear to be well documented. With the help of looking through MHonArc and reading the RFCs (RFC2045 and RFC2046), I settled on the solution below. I decided against using MHonArc, since I couldn't easily reuse its parsing and display functionality. I wouldn't say this is perfect, but it's been good enough that we used it.
First, take the message and use Email::MIME to parse it. Then call a function called get_parts with the array of parts Email::MIME gives you with ->parts().
For each part it was passed, get_parts decodes the content type, looks it up in a hash, and, if it exists, calls the function associated with that content type. If the decoder was able to give us something, it is put on a result array.
The last piece of the puzzle is this decoder hash. Basically, it defines the content types I can deal with:
text/html
text/plain
message/delivery-status, which is actually also plain text
multipart/mixed
multipart/related
multipart/alternative
The non-multipart sections I return as is. With mixed, related and alternative, I merely call get_parts on that MIME node and return the results. Because alternative is special, it has some extra code after calling get_parts. It will return only the HTML part if it has one, or only the text part if it has a text part. If it has neither, it won't return anything valid.
The advantage of the hash of valid content types is that I can easily add logic for more parts as needed. And by the time get_parts is done, you should have an array of all the content you care about.
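The dispatch-table approach described above can be sketched in Python with the standard `email` package. This is an illustration of the idea, not the author's Perl code: the handler names are hypothetical, and message/delivery-status is left out to keep the sketch short.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def get_parts(parts):
    """Collect (content_type, text) pairs for every part we know how to show."""
    results = []
    for part in parts:
        handler = DECODERS.get(part.get_content_type())
        if handler is not None:  # unknown types are silently skipped
            results.extend(handler(part))
    return results


def _leaf(part):
    return [(part.get_content_type(), part.get_content())]


def _container(part):
    return get_parts(part.iter_parts())


def _alternative(part):
    # Prefer HTML; fall back to plain text; otherwise return nothing.
    found = get_parts(part.iter_parts())
    for wanted in ("text/html", "text/plain"):
        chosen = [item for item in found if item[0] == wanted]
        if chosen:
            return chosen
    return []


DECODERS = {
    "text/html": _leaf,
    "text/plain": _leaf,
    "multipart/mixed": _container,
    "multipart/related": _container,
    "multipart/alternative": _alternative,
}

# Hypothetical message: plain + HTML alternatives plus a zip attachment.
demo = EmailMessage()
demo.set_content("plain text")
demo.add_alternative("<p>html</p>", subtype="html")
demo.add_attachment(b"PK", maintype="application", subtype="zip",
                    filename="a.zip")
parsed = message_from_bytes(demo.as_bytes(), policy=policy.default)
shown = get_parts([parsed])
```

Passing the top-level message in a one-element list lets the same function handle both single-part and multipart mail.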
One more item I should mention. As a part of this, we created a separate domain that actually serves these messages. The main domain that an admin works on will refuse to serve the message and redirect the browser to our user content domain. This second domain will only serve user content. This is to help the browser properly sandbox the content away from our main domain. See same origin policy (http://en.wikipedia.org/wiki/Same_origin_policy)
It doesn't sound like a difficult job to me:
use Email::MIME;

# $message is assumed to hold the raw email text.
my $parsed = Email::MIME->new($message);
my @parts = $parsed->parts; # These will be Email::MIME objects, too.
print <<EOF;
<html><head><title>!</title></head><body>
EOF
for my $part (@parts) {
    # content_type returns the full header value, which may carry
    # parameters such as "; charset=...", so match on the prefix.
    my $content_type = $part->content_type;
    if ($content_type =~ m{^text/plain}i) {
        print "<pre>", $part->body, "</pre>\n";
    }
    elsif ($content_type =~ m{^text/html}i) {
        print $part->body;
    }
    # Handle some more cases here
}
print <<EOF;
</body></html>
EOF
Reuse existing complete software. The MHonArc mail-to-HTML converter has excellent MIME support.

parsing email message

I just want a basic understanding of what parts an email message may have.
I know there is a messageId, date, subject, from, cc, bcc, body, etc.
Specifically I want to know how attachments and images may be embedded in the email.
At this point I think there are 2, please correct me if I am wrong.
attachments
embedded attachments/images
is that correct?
The official answer to this question is contained in RFC5322 and some related RFCs. The Wikipedia entry for email does a pretty good job of referencing the RFC numbers. To get started with MIME, see RFC2045.
Attachments are encoded as multipart, similar to multipart file uploads. Basically, the message has a Content-Type header saying that it is multipart and setting a boundary (a random string of characters). The boundary marks where each part, including the attachment data, starts. The filename is not set on the boundary itself but in the headers of the attachment's part (Content-Disposition). I am doing a bit of hand waving, but this is the basic idea.
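So you get something like:
To: ...
From: ...
Content-Type: multipart/mixed; boundary="ewafoiuasfjasdfoashiafhj"

--ewafoiuasfjasdfoashiafhj
Content-Type: text/plain

message here
--ewafoiuasfjasdfoashiafhj
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="..."
Content-Transfer-Encoding: base64

attachment here
--ewafoiuasfjasdfoashiafhj--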
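To see the two kinds of parts from the question side by side, a message can be built with Python's standard `email` package and inspected. This is only a sketch: the addresses, filenames, and Content-ID are made up, and the boundary is set explicitly to mirror the example above (normally the library generates one).

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "Report"
msg.set_content("message here")

# A regular attachment: Content-Disposition "attachment" plus a filename.
msg.add_attachment(b"attachment bytes", maintype="application",
                   subtype="octet-stream", filename="report.bin")

# An embedded image: Content-Disposition "inline" plus a Content-ID that
# an HTML body could reference as src="cid:logo@example.com".
msg.add_attachment(b"\x89PNG...", maintype="image", subtype="png",
                   disposition="inline", cid="<logo@example.com>")

msg.set_boundary("ewafoiuasfjasdfoashiafhj")
wire_text = msg.as_string()

# Classify the non-body parts by their disposition.
dispositions = [part.get_content_disposition()
                for part in msg.iter_attachments()]
```

Printing `wire_text` shows the headers, the boundary lines, and the per-part headers in the order they travel on the wire.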