Is SpamAssassin sensitive to UTF-8 words? - email

Let me explain with a simple example:
An email with the word "Order" in its subject will trigger spam filters such as SpamAssassin.
I want to know: if an email subject contains that word's translation in Arabic (أطلب) or in Persian (سفارش), does it draw attention too?

Related

Email SPAM and mismatched URLs: are query strings allowed?

Many email clients will flag your email as SPAM if the anchor text in your links pretends to be a link to a different place. For example:
https://yourbank.com/verify_account
But do they do the same thing for query strings? For example:
https://yourbank.com/verify_account
The link is not pretending to be something else, but it does include some query strings which are not exactly the same as the anchor text.

Cleaning Emails for Custom Email System

I have noticed that various email clients prepend or append text to the text written by the user. For example, Gmail seems to prepend the following text to all email bodies:
"On Tue, Jul 14, 2015 at 11:41 AM, Jonny Bravo wrote: >"
The added text differs based on the client. I am not interested in this information. I would like to be able to extract the message body from the text with an approach that is relatively cross-platform. Does anything like this exist? Is the best solution to clean the text on a case-by-case basis?
We had the same problem at mailparser.io when we developed our "last reply" filter. We get very decent results by just checking against a set of regular expressions.
The regular expressions we use are:
'/^(--)$/ms', // -- Signature break
'/^(-----(.+))$/ms', // ----- reply above
'/^(From:(.+))$/ms', // From:
'/^(On\s(.+)wrote:)$/ms', // On DATE, NAME <EMAIL> wrote:
'/^(Sent from(.+))$/ms', // Sent from (iPhone / iPad / Windows Mail ...)
With those you should actually catch most cases produced by e-mail clients which have their language set to English.
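For illustration, here is a minimal sketch of how such patterns could be applied, written in Python rather than the PHP-style syntax above. The function name last_reply and the "cut at the earliest match" strategy are assumptions made for this example, not necessarily how mailparser.io does it.

```python
import re

# Python translations of the patterns listed above. PHP's /ms modifiers
# correspond to re.MULTILINE | re.DOTALL.
REPLY_MARKERS = [
    r"^(--)$",              # -- signature break
    r"^(-----(.+))$",       # ----- reply above
    r"^(From:(.+))$",       # From:
    r"^(On\s(.+)wrote:)$",  # On DATE, NAME <EMAIL> wrote:
    r"^(Sent from(.+))$",   # Sent from (iPhone / iPad / Windows Mail ...)
]
FLAGS = re.MULTILINE | re.DOTALL


def last_reply(body: str) -> str:
    """Return the text that precedes the earliest marker found in body."""
    cut = len(body)
    for pattern in REPLY_MARKERS:
        match = re.search(pattern, body, FLAGS)
        if match:
            cut = min(cut, match.start())
    return body[:cut].rstrip()
```

Note that any line in the new text that happens to start with "From:" or "On " followed later by "wrote:" will also trigger a cut, so expect some false positives.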

How can I be sure an email address is unique?

There's a pub in my town where, if you sign up to their newsletter using their website and provide a "unique" email address, you get a free drink. On a whim, I decided to sign up a second time using myemail+one@gmail.com. It let me. I'm now sitting on a nice comfy pile of free drink vouchers.
This got me thinking about a system we have here, where the email address is considered the unique identifier. Checking the code, sure enough, if we were offering vouchers in our business, someone else would be sitting pretty.
The basic, stab-in-the-dark fix is to check for the "+" character, ignore everything after it (up to the @), and compare using that. But I am unsure if this was the intent for the + character. Would that work?
Secondly, are there any other caveats that would allow a user to sign up multiple times with a seemingly different email address, but which actually would always end up in the same mailbox?
This question is language-agnostic.
While using a plus sign as an e-mail address alias is a known feature of Gmail, other mail providers either do not allow it or use a minus sign instead. '+' is a legitimate character in an email address according to the RFC.
The use of '.' is also a gray area. john.doe@gmail.com and johndoe@gmail.com look different, yet both deliver to the same mailbox.
In order to validate the uniqueness of an email address you will have to prepare a rule base for your application, keep it up to date and still expect surprises...
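A minimal sketch of what such a rule base might look like in Python. The ALIAS_RULES table and the normalize_email function are hypothetical names for this example, and the table only contains the Gmail behaviour described above; every other domain is left untouched apart from lowercasing the domain part.

```python
# Hypothetical rule base: which aliasing tricks each domain is known to allow.
# Only the Gmail behaviour discussed above is listed; extend as needed.
ALIAS_RULES = {
    "gmail.com": {"plus_suffix": True, "ignore_dots": True, "ignore_case": True},
}


def normalize_email(address: str) -> str:
    """Reduce an address to a canonical form for uniqueness checks."""
    local, at, domain = address.strip().rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    domain = domain.lower()  # domain names are case-insensitive
    rules = ALIAS_RULES.get(domain, {})
    if rules.get("plus_suffix"):
        local = local.split("+", 1)[0]  # drop "+anything"
    if rules.get("ignore_dots"):
        local = local.replace(".", "")
    if rules.get("ignore_case"):
        local = local.lower()
    return f"{local}@{domain}"
```

You would store and compare the normalized form for uniqueness, while still sending mail to the address exactly as the user typed it.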

separate email from original email using perl

When people email each other, they generally include the original email in their reply to a sender, adding a little more information each time to the email. Each email client seems to have a different way of adding the original email to a reply.
I need to parse email arriving at our mail server and extract the new part of each message. Is there a sensible way to strip the appended (or prepended) "original message" and get just the new information in the mail body? Sadly, I believe there is no encoding for this and the original email is simply added to the new message, but I thought I'd check with the experts.
thanks.
No, there is no simple, straightforward algorithm to separate quoted or forwarded text from new content. Quoting and forwarding are poorly standardized and different conventions have existed at different times.
Having said that, e.g. Google's Gmail succeeds fairly well in practice. With enough samples, you can clearly come up with reasonable heuristics.
Good indicators for quoted material are forwarded (pseudo-) headers and indented text, perhaps with a quote indicator along the left margin before the quoted text. You occasionally see outdents as well.
Traditionally, on Usenet in the early 1990s, people would use different, unique quoting styles.
: ~ | This seems to be the original.
: ~ This is the first reply.
: This is the second reply.
This is the third reply, quoting the
previous three messages in sequence.
Around 1995, both clients and standardization initiatives by and large converged on "wedge" quotes:
> >> This seems to be the original.
> > This is the first reply.
> This is the second reply.
This is the third reply, quoting the
previous three messages in sequence.
Then along came Microsoft and ruined it all. I suppose that top quoting makes sense in some corporate settings where you quickly need to collect all the background from a thread to a new participant, but even for that purpose it's a horrible abomination.
This is the third reply, quoting the
previous three messages in sequence.
---- Begin forwarded message ----
From: Him [smtp:bogus]
To: His Friend
Subject: VS: Re: Same as on this message
Date: nothing machine-readable
This is the second reply.
---- Alkuperäinen viesti ----
Lähettäjä: His Friend [smtp:poppycock]
Saaja: Some Guy
Aihe: Re: Same as on this message
Päivämäärä: olisiko eilen ehkä
This is the first reply.
----- Original message ----
From: Somebody Else [smtp:mindless]
To: Some Guy
Subject: Same as on this message
Date: like, the day before
This seems to be the original.
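To make the heuristics concrete, here is a rough sketch covering just two of the indicators above: "wedge" quoted lines and dashed separator lines that introduce forwarded pseudo-headers. It is written in Python rather than Perl, and the function name new_content and both regular expressions are assumptions for this example. It will not handle the older Usenet styles or indentation-only quoting.

```python
import re

# Dashed separator lines such as "---- Begin forwarded message ----"
# or "----- Original message ----", in any language.
SEPARATOR = re.compile(r"^-{2,}\s*[^\s-][^-]*-{2,}\s*$")
# "Wedge" quoted lines: optional leading whitespace, then ">".
QUOTED = re.compile(r"^\s*>")


def new_content(body: str) -> str:
    """Keep unquoted lines and stop at the first forwarded-message separator."""
    kept = []
    for line in body.splitlines():
        if SEPARATOR.match(line):
            break  # everything below is treated as top-quoted material
        if QUOTED.match(line):
            continue  # skip wedge-quoted lines
        kept.append(line)
    return "\n".join(kept).strip()
```

Applied to the two later examples above, this keeps only the "This is the third reply" lines. Real mail will need more patterns, gathered from your own samples.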

parsing email message

I just want a basic understanding of what parts an email message may have.
I know there is a messageId, date, subject, from, cc, bcc, body, etc.
Specifically I want to know how attachments and images may be embedded in the email.
At this point I think there are 2, please correct me if I am wrong.
attachments
embedded attachments/images
is that correct?
The official answer for this question is contained in RFC 5322 and some related RFCs. The Wikipedia entry for email does a pretty good job of referencing the RFC numbers. To get started with MIME, see RFC 2045.
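Attachments are encoded as MIME multipart, similar to multipart file uploads. Basically, the message has a Content-Type header saying that it is multipart and setting a boundary (a random string of characters that does not occur in the content). Each part starts on a line consisting of two hyphens followed by the boundary, and the last part is followed by the boundary with two more hyphens appended. The filename is not set on the boundary itself; it goes in the part's own headers (Content-Disposition), which come right after the boundary line. I am doing a bit of hand waving, but this is the basic idea.
So you get something like:
To: ...
From: ...
Content-Type: multipart/mixed; boundary="ewafoiuasfjasdfoashiafhj"

--ewafoiuasfjasdfoashiafhj
Content-Type: text/plain

message here
--ewafoiuasfjasdfoashiafhj
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="..."

attachment here
--ewafoiuasfjasdfoashiafhj--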
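If you want to see this structure for yourself, here is a small sketch using Python's standard email package. The helper names build_sample and describe_parts, the addresses, and the filename sample.bin are made up for this example. It builds a message with one text body and one attachment, then walks the parsed result and reports each leaf part.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Build a small multipart message with a text body and one attachment."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"       # placeholder address
    msg["To"] = "recipient@example.com"      # placeholder address
    msg["Subject"] = "Sample with attachment"
    msg.set_content("message here")
    msg.add_attachment(
        b"attachment here",
        maintype="application",
        subtype="octet-stream",
        filename="sample.bin",               # hypothetical filename
    )
    return msg.as_bytes()


def describe_parts(raw: bytes):
    """Return (content type, disposition, filename) for each leaf part."""
    msg = message_from_bytes(raw, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers only group other parts
        parts.append(
            (part.get_content_type(),
             part.get_content_disposition(),
             part.get_filename())
        )
    return parts
```

For embedded images the disposition is typically "inline" instead of "attachment", and the part usually carries a Content-ID header that the HTML body refers to, so the two kinds you list can be told apart from the part headers.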