I am working on a project where I need to identify emails sent by real humans, as opposed to bulk mail, notifications and newsletters. Is there any definitive way of doing that? Is there any information in the email headers that can help? I am working on top of Gmail IMAP, so I already have non-spam emails.
Any help in this regard is appreciated. Thanks!
There isn't a clear way to distinguish bulk mail from personalised mailings. Unlike with spam, most bulk mail is requested/expected, so the sender doesn't do odd things to get round spam filters, which means these emails often blend in fairly well.
However, there are some trends that you can look for. If you want to do it reliably, you will probably need to apply some scoring system, like spam-filters do.
You will also need to accept that you are bound to get a substantial proportion of false positives and false negatives.
Some things that are common to bulk mail that appear less often in personalised correspondence:
"To" and "Cc" addresses do not contain a local recipient. Sometimes the sender will send to "mailList@mydomain.com" instead of "recipientA@recipientAdomain.com", "recipientB@recipientBdomain.com", etc. In these cases, it is also likely that only one address appears in "To" and nothing appears in "Cc".
"From" address is "noreply@", "newsletter@", "do-not-reply@", "mailinglist@", or even less common terms like "support@" or "sales@" (but remember, these could cause false positives).
The presence of a "List-Unsubscribe:" header
The message contains an unsubscribe link. Run pattern matching to find common phrases in the final few lines of the email. Look for links, or words such as "unsubscribe", "opt out", etc.
Mailing lists tend to have rich content. Check for heavy use of CSS and lots of images, the entire message being contained within a <table></table> or <ul><li></li></ul> structure. i.e. the stuff that something like Dreamweaver would put in, rather than a mail client.
Headers or bold content at the top of the message. If the first bit of a message resembles a newsletter, it's probably a newsletter.
Lots of links or frequent linking to the same (or same few) websites. Newsletters will try to guide the user to the company's site(s), as much as they can. You may score this even more highly if the linked domain matches (or resembles) the sender domain.
Heavy references to social media. If it's a newsletter containing several articles, each story may have its own "Tweet this" or "Like this" link. Personal emails are likely to contain (at most) one reference to Twitter, Facebook, etc. (in the signature).
Notifications and other auto-generated messages will often follow the same basic format. If you have the capabilities, run some kind of diffing or other comparison against previous messages. A strong match would imply automation.
There is no greeting, or a generic greeting. However, personal emails will often skip the "Dear Fred" bit too, so this isn't a good enough detection by itself; but things like "Dear User" or "Dear Customer" are almost certainly generic.
Unlikely to end in "Regards, Ian" or "Yours Sincerely, John Doe"
The sender has scored highly before. Keep a record. If a sender triggers a high score several times, they are almost certainly bulk mailing.
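As a rough illustration of the scoring idea, here is a minimal Python sketch that checks a few of the signals listed above (the List-Unsubscribe header, a bulk-style "From" local part, unsubscribe wording in the last few lines, and a generic greeting). The function names, the weights and the example.com addresses are made up for illustration; real weights would need tuning against your own mail.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Illustrative lists and patterns; extend them for your own mail.
BULK_LOCAL_PARTS = ("noreply", "no-reply", "do-not-reply", "newsletter", "mailinglist")
UNSUBSCRIBE_RE = re.compile(r"unsubscribe|opt[ -]?out", re.IGNORECASE)
GENERIC_GREETING_RE = re.compile(r"^\s*dear (user|customer)\b", re.IGNORECASE)


def text_body(msg):
    """Return the first text/plain part of the message as a string."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is None:
                continue
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def bulk_score(raw_message):
    """Higher score means more likely bulk mail. Weights are arbitrary."""
    msg = message_from_string(raw_message)
    score = 0

    # Signal: a List-Unsubscribe header is present.
    if msg.get("List-Unsubscribe"):
        score += 3

    # Signal: the sender's local part looks automated.
    address = parseaddr(msg.get("From", ""))[1]
    local_part = address.partition("@")[0].lower()
    if local_part.startswith(BULK_LOCAL_PARTS):
        score += 2

    body = text_body(msg)

    # Signal: unsubscribe wording in the final few lines.
    tail = "\n".join(body.strip().splitlines()[-5:])
    if UNSUBSCRIBE_RE.search(tail):
        score += 2

    # Signal: a generic greeting at the very start.
    if GENERIC_GREETING_RE.search(body):
        score += 1

    return score
```

You would then compare the score against a threshold of your choosing and, as suggested above, keep a per-sender history so that repeat high scorers are flagged more readily.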
Over the past few months random email addresses, some of which are on known spam lists, have been added at the rate of 2 or 3 a day to my website.
I know they aren't real humans. For a start, the website serves a very narrow geographical area, and many of these emails are clearly from a different country; others are info@ addresses that appear to have been harvested from a website, rather than something a human would use to sign up to a site.
What I can't work out is, what are reasons for somebody doing this? I can't see any benefit to an external party beyond being vaguely destructive. (I don't want to link to the site here, it's just a textbox where you enter email and press join).
These emails are never verified - my question isn't about how to prevent this, but what are some valid reasons why somebody might do this. I think it's important to understand why malicious users do what they do.
This is probably a list bombing attack, which is definitely not valid. The only valid use I can think of is for security research, and that's a corner case.
List bomb
I suspect this is part of a list bombing attack, which is when somebody uses a tool or service to maliciously sign up a victim for as much junk email as possible. I work in anti-spam and have seen victims' perspectives on this: it's nearly all opt-in verifications, meaning the damage is only one message per service. It sounds like you're in the Confirmed Opt-In (COI) camp, so congratulations, it could be worse.
We don't have good solutions for list bombing. There are too many problems to entertain a global database of hashed emails that have recently opted into lists (so list maintainers could look up an address, conclude it's being bombed, and refuse to invite). A global database of hashed emails opting out of bulk mail (like the US Do Not Call list or the now-defunct Blue Frog's Do Not Intrude registry but without the controversial DDoS-the-spammers portion) could theoretically work in this capacity, though there'd still be a lot of hurdles to clear.
At the moment, the best thing you can do is to rate-limit (which this attacker is savvy enough to avoid) and use captchas. You can measure your success based on the click rate of the links in your COI emails; if it's still low, you still have a problem.
In your particular case, asking the user to identify a region via drop-down, with no default, may give you an easy way to reject subscriptions or trigger more complex captchas.
If you're interested in a more research-driven approach, you could try to fingerprint the subscription requests and see if you can identify the tool (if it's client-run, and I believe most are) or the service (if it's cloud-run, in which case you can hopefully just blacklist a few CIDR ranges instead). Pay attention to requesters' HTTP headers, especially the referer. Browser fingerprinting is its own arms race; take a gander at the EFF's Panopticlick or Brian Krebs's piece on AntiDetect.
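To make the fingerprinting idea concrete, here is a small Python sketch that reduces each subscription request's HTTP headers to a short hash and counts how often each hash occurs. The choice of headers and the function names are assumptions for illustration; a real tool may vary headers deliberately, so treat clusters as a hint rather than proof.

```python
import hashlib
from collections import Counter

# Headers assumed to be useful for grouping; adjust to what you actually log.
FINGERPRINT_HEADERS = ("User-Agent", "Accept", "Accept-Language", "Accept-Encoding", "Referer")


def fingerprint(headers):
    """Hash a fixed set of request headers into a short identifier."""
    normalized = {name.lower(): value for name, value in headers.items()}
    parts = [normalized.get(name.lower(), "") for name in FINGERPRINT_HEADERS]
    digest = hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]


def count_fingerprints(requests):
    """Count how many requests share each header fingerprint."""
    return Counter(fingerprint(headers) for headers in requests)
```

A fingerprint that accounts for most of the unconfirmed sign-ups, but almost none of the confirmed ones, is a good candidate for closer inspection.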
Security research
The only valid case I can consider, whose validity is debatable, is that of security research (which is my field). When I'm given a possible phishing link, I'm going to anonymize it. This means I'll enter fake data rather than reveal my source. I'd never intentionally go after a subscription mechanism (at least with an email I don't control), but I suppose automation could accidentally stumble into such a thing.
You can avoid that by requiring POST requests to subscribe. No (well-designed) subscription mechanism should accept GET requests or action links without parameters (though there are plenty that do). No (well-designed) web crawler, for search or archiving or security, should generate POST requests, at least without several controls to ensure it's acceptable (such as already concluding that it's a bad actor's site). I'm going to be generous and not call out any security vendors that I know do this.
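A framework-independent Python sketch of the POST-only rule is shown below. The function name, the "email" form field and the status messages are hypothetical; in practice this check would live in whatever web framework handles your form.

```python
from urllib.parse import parse_qs


def handle_subscribe(method, body=""):
    """Return (status_code, message) for a subscription request.

    Only POST is accepted, so crawlers that merely follow links (GET)
    cannot trigger a subscription by accident.
    """
    if method.upper() != "POST":
        return 405, "Method Not Allowed"

    # The form body is assumed to be URL-encoded with an "email" field.
    email = parse_qs(body).get("email", [""])[0].strip()
    if "@" not in email:
        return 400, "Bad Request"

    # A real handler would queue the confirmed opt-in email here.
    return 202, "Confirmation email queued"
```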
Lots of people seem to recommend hidden preheader text. For those who don't know, preheader text is a way to control the preview snippet that email clients show alongside the subject line.
An example of hiding it would be:
<div style="display:none;font-size:1px;color:#333333;line-height:1px;max-height:0px;max-width:0px;opacity:0;overflow:hidden;">
Wishing you a safe and merry holiday season!
</div>
I'm wondering if hiding this preheader text from humans reading the e-mail might increase spam score or impact deliverability? In the world of web crawlers, hiding content from users but not machines (so-called cloaking) is a big no-no, and it can really hurt you.
Does anyone know if spam checkers might employ similar logic? I've seen some conjecture online, but not much in the way of solid references. Any anecdotes, quotes, or links on this topic would be helpful.
Short answer: yes, it can.
More detailed answer: adding hidden text is exactly what spammers do to bypass spam filters. Every spam filter can detect this hidden, zero-height, zero-width text and, depending on its configuration, will take it into account when calculating the spam score. It's certainly not a decisive spam marker on its own, but combined with the rest of your email, it might bring you over the threshold.
SpamAssassin can be configured to detect that.
I understand that this is great for marketing purposes, but to get my stuff delivered, I would rather avoid it.
Carsten is right: hiding text in your email and playing with the email client's behaviour to get it displayed will increase the spamminess of your email.
I have been working as a developer on an antispam filter for several years, and hiding text is a very common spammer technique. It is used to make only a few words (or even letters) of a block of random text visible, so that they display the spammer's message, making it harder for antispam filters to identify a common pattern.
In the same way, hiding text between a subject header and the first MIME part (another kind of preheader) is a common spammer technique.
Alone, these criteria may not be enough to get your message blocked. But added to other spam criteria (e.g. if your message has already been identified as a mailing list), they may give you some bad surprises.
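To give a feel for how easy hidden text is to spot, here is a minimal Python sketch that flags text inside elements whose inline style hides them. This is not how SpamAssassin or any particular filter works; the style patterns and names are illustrative assumptions, and it only handles inline styles on properly nested HTML.

```python
import re
from html.parser import HTMLParser

# Inline style declarations that commonly hide content (illustrative list).
HIDDEN_STYLE_RE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|opacity\s*:\s*0(?![.\d])"
    r"|max-height\s*:\s*0"
    r"|font-size\s*:\s*[01]px",
    re.IGNORECASE,
)

# Void elements never get a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextFinder(HTMLParser):
    """Collect text that sits inside an element hidden by inline style."""

    def __init__(self):
        super().__init__()
        self.stack = []        # one flag per open element: hidden by its own style?
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        self.stack.append(bool(HIDDEN_STYLE_RE.search(style)))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text is hidden if any enclosing element is hidden.
        if any(self.stack) and data.strip():
            self.hidden_text.append(data.strip())


def find_hidden_text(html):
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_text
```

Running it over the preheader example from the question returns the hidden holiday greeting, which is the kind of signal a filter could fold into its score.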
I got a request from my client that they want to add stars (★) to their email subject (They send these mails through the application we made as a part of bigger CRM for them).
I tried to send a test mail, and the email title is displayed nicely in my Gmail account, and I must agree with my client that it is eye catching, but what came to my mind is that this may be a spam magnet, so I googled about it but I can't find the actual "don't do this".
Generally, my opinion would be not to use it, but now I have to explain to the client why. My best explanation would be that there is a probability their emails will be treated as spam, but I don't have the background to support this statement.
Do you have any suggestions about what I should do?
The only information I could find is on the SpamAssassin page of how to avoid false positives. The only relevant part I found was this part.
Do not use "cute" spellings, Don't S.P.A.C.E out your words, don't put
str@nge |etters 0r characters into your emails.
SpamAssassin is a very widely used spam filtering tool. However, breaking one of its rules (strange characters) wouldn't by itself get an email marked as spam; it is the combination with other problems that could lead to your email being considered spam. If your email is a completely legitimate business email, it's likely that few other rules are triggered, and using the special characters wouldn't create a huge problem. Even so, you should probably run a couple of test emails through SpamAssassin and a couple of other spam filtering tools in order to come to a better conclusion about the emails you plan to send out.
Simply explain to your client as you have explained to SO: you stated that the star made it eye catching: this doesn't directly mean that it will be treated as spam, but you could explain how that concept COULD be considered spam.
If the star is part of their branding, however, this could be quite a nice way in which your client expresses themselves.
Spam emails are becoming more and more like what one would consider 'normal', so I think they should trial it internally to test the concept.
Talk it over with your client - there is going to be no basis in hard fact with things like this, purely social perception.
More and more retailers have been using Unicode symbols in their subject lines over the past few months. Of course, it's in order to gain more attention in cluttered inboxes. Until now, there has been absolutely no evidence that such symbols increase the likelihood of failing spam filter tests. However, keep in mind that rare symbols might not render (correctly) across all mail user agents. Especially keep an eye on Android and Blackberry smartphones, but also on Outlook. In addition, due to a Hotmail bug, symbols will render much bigger in subject lines and in the email body within the web front end. In fact, they are being replaced by images. All in all, the star shouldn't cause any problems, at least if it's encoded correctly in the subject line. So, go for it.
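As for "encoded correctly": non-ASCII characters in a subject line are normally sent as RFC 2047 encoded words. A minimal Python sketch using the standard library's email package is shown below; the helper name, addresses and subject text are placeholders, and your own application may use a different mail library that handles this for you.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def build_message(subject):
    """Build a simple message; the library encodes the subject when serializing."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "sender@example.com"      # placeholder address
    msg["To"] = "recipient@example.com"     # placeholder address
    msg.set_content("Hello")
    return msg


raw = build_message("★ Spring offers").as_string()

# On the wire the star appears as an RFC 2047 encoded word, not a raw character.
print(raw.splitlines()[0])

# A compliant mail client decodes it back to the original subject.
decoded = message_from_string(raw, policy=policy.default)["Subject"]
print(decoded)
```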
I'm writing a small piece of newsletter software for my business, and I'm wondering what metrics I should collect. Obviously, bounces and clicks should be tracked, but should I also track email opens (via an image or a bgsound element)?
Do popular webmail services and ISPs check for tracking images and possibly increase my spam score? I guess if it increases the chances of ending up in a spam filter, I'd rather not collect that metric.
Thanks.
It's generally bad form to try to track your users in that way. Email opens are a private thing.
If you have an image with a unique URL per message, yes you can track it, but IMO, you really shouldn't. Including unnecessary images in an email is bad for a number of reasons:
Images can increase your spam score. There's a time and place for images. They can improve a message, but used inappropriately, they can look spammy.
It is obvious what you are doing. Sooner or later, one of your customers is likely to get wise to it. Some people won't care; others will feel violated.
It's REALLY unreliable. Most email clients and webmails feature an option to block images by default. You will get massively understated results.
Also remember, some people open an email immediately before they click the "delete" button. You are much better off tracking clicks.
There may be some merit in tracking the images you want to include anyway, but I'd not treat it as anything more than a very basic indicator.
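If you go with click tracking, one common approach is to route links through a redirect URL that records the message ID before forwarding the reader. The Python sketch below is only an assumption about how that might look: the redirect base URL, the parameter names and the secret are all hypothetical, and the signature is there so the redirect endpoint cannot be abused to forward to arbitrary sites.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"replace-with-a-real-secret"  # hypothetical signing key


def _sign(message_id, target):
    data = f"{message_id}|{target}".encode("utf-8")
    return hmac.new(SECRET, data, hashlib.sha256).hexdigest()[:16]


def tracked_link(target, message_id, base="https://example.com/r"):
    """Wrap a target URL in a signed redirect link for one message."""
    query = urlencode({"m": message_id, "u": target, "s": _sign(message_id, target)})
    return f"{base}?{query}"


def verify_click(url):
    """Return (message_id, target) if the signature is valid, else None."""
    params = parse_qs(urlparse(url).query)
    try:
        message_id, target, signature = params["m"][0], params["u"][0], params["s"][0]
    except KeyError:
        return None
    if hmac.compare_digest(signature, _sign(message_id, target)):
        return message_id, target
    return None
```

The redirect handler would call verify_click, log the click against the message ID, and then send the reader on to the target URL.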
As always, it depends on the individual ISP and webmail services. However, I can share some anecdotal evidence: I periodically use Mailchimp to send out mass email notifications, and email opens are tracked in Mailchimp using the same approach you mentioned (see the following link for reference: http://kb.mailchimp.com/reports/about-open-tracking). I have never experienced any issues with ending up in the spam folder; I have only had challenges with bounce-backs and ending up in the Gmail Promotions tab.
So based on the fact that some companies are already doing this, I have to say it probably won't increase your spam score.
Aside from the visual splendor of HTML emails - links are the only thing keeping me from sending plain text emails. They are much simpler for users at times and reduce bandwidth by over 50%. However, forcing my users to copy/paste or (* shiver *) type the URL from the plain text email is not acceptable.
However, it seems like many services such as gmail and hotmail are converting URLs into HTML links. If that's true, then for some lighter emails I could finally switch to plain text (in certain cases) without bothering anyone.
Anyone know what percentage (or what systems or clients) convert text URLs into clickable links?
Some users access via the web (Hotmail/Yahoo/Gmail) while others use clients (Outlook/Thunderbird).
All email programs I know make links clickable, both web-based and desktop ones.
You should consider putting the links at the end of the mail, and use "[number]" to refer to them:
You should really visit PEAR[1] and friends, PHP[2]!
[1] http://pear.php.net/
[2] http://www.php.net/
That frees you from problems with longer URLs within the text, and it keeps the text readable.
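If your newsletter text is generated, the numbered-reference layout can be produced automatically. Here is a minimal Python sketch that moves inline URLs into a numbered list at the end; the function name and the URL pattern are illustrative and deliberately simple.

```python
import re

# Simple URL pattern; good enough for a sketch, not a full URL grammar.
URL_RE = re.compile(r"https?://[^\s<>]+")


def move_links_to_footer(text):
    """Replace inline URLs with [n] markers and list the URLs at the end."""
    urls = []

    def replace(match):
        raw = match.group(0)
        url = raw.rstrip(".,;:!?)")   # keep trailing punctuation in the sentence
        tail = raw[len(url):]
        if url not in urls:
            urls.append(url)
        return f"[{urls.index(url) + 1}]{tail}"

    body = URL_RE.sub(replace, text)
    if not urls:
        return body
    footer = "\n".join(f"[{number}] {url}" for number, url in enumerate(urls, start=1))
    return f"{body}\n\n{footer}"
```

A URL that appears more than once gets the same number each time, so the list at the end stays short.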
You must not forget one simple fact: every mail client makes this configurable. I, for one, have the option of reading and sending HTML emails, but I don't use it. So it's irrelevant how many mail clients support this; the relevant question is how many users have it enabled.