How does the Gmail spam filter work?

I'm always surprised by the high quality of Gmail's spam filter. Over the last year, it filtered 99.95% of the spam and blocked only one legitimate mail by mistake. By comparison, every other mail service I have used makes at least one mistake for every 50 mails.
How does Gmail reach this level of quality internally? Is it based on customer feedback (i.e., if N customers mark a mail as spam, it is sorted as spam for every other customer)? Or is there some trick? Maybe a basic filter algorithm catches the most obvious spam, and the difficult cases are analyzed by real humans?

Briefly, it is based on community feedback. Here is a quote from the official explanation:
Gmail users play an important role in keeping spammy messages out of millions of inboxes. When the Gmail community votes with their clicks to report a particular email as spam, our system quickly learns to start blocking similar messages. The more spam the community marks, the smarter our system becomes.
You can read a bit more about it on their Spam Explained page.

This is the million-dollar question, and if it could be answered on Stack Overflow, then everyone's spam filter would be just as effective.

I don't know exactly how Google does spam filtering (I think it's a trade secret, after all). If you are interested in how spam filtering works, I would recommend looking at Bayesian spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering). It's a rather easy-to-understand method.
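To make the Bayesian idea concrete, here is a minimal sketch of a naive Bayes classifier over word counts. It is not how Gmail works (that is not public); the class name, the whitespace tokenizer, and the Laplace smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption); real filters use far richer features.
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical toy filter: word counts per class plus Laplace smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # label must be "spam" or "ham"
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        if 0 in self.message_counts.values():
            raise ValueError("train at least one spam and one ham message first")
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Start from the class prior, then add one log-likelihood per known word.
            score = math.log(self.message_counts[label] / total_messages)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in vocabulary:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][word] + 1  # Laplace smoothing
                score += math.log(count / (total_words + len(vocabulary)))
            log_scores[label] = score
        # Normalize the two log scores into a probability for "spam".
        highest = max(log_scores.values())
        weights = {k: math.exp(v - highest) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])
```

After training on a handful of labeled messages, `spam_probability("buy cheap pills")` returns a value between 0 and 1; anything above a threshold you choose (0.5, say) would be treated as spam.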

Google is most likely using a classifier system, such as logistic regression or neural networks. State-of-the-art spam detection frequently employs machine learning algorithms such as these.
The output classification is "Spam" or "Not Spam." The inputs are surely top secret at Google, but I expect that email text phrases such as "Buy Now," "On Sale," "Viagra," or "Male Enhancement" are all factors in their model.

There is no official statement on this, and most suggestions are just observations or expert opinions.
Based on my observations of the emails we deliver, here are my findings:
1. User engagement is the key: if users are not engaging with your emails, your emails are bound to be flagged as spam.
Here are some metrics:
- Whom you email, and how often you email them
- Which emails you open
- Which emails you reply to
- Keywords that are in emails you usually read
- Which emails you star, archive, or delete
2. Sender domain reputation: what is the past history of the sending domain? If user engagement was high in the past, the probability that a new email from the same domain lands in the inbox is high.
Google uses complex AI and machine learning algorithms to make this happen. You might get some success by changing the IP, domain, or return-path, but those are all very short-term hacks.

Related

spam issues with sending millions of emails

I am currently developing an email server in C, and the end goal is to be able to send millions of emails to millions of people every day. Many organizations have email lists with large numbers of users that they email every week/month/etc.
The big question: how can I prevent the server and the emails from being marked as spam? All of the spam-prevention material I've seen so far deals mostly with poor configurations, or at least does not involve large numbers of emails being sent every hour. I have yet to see anything that addresses the scale of millions of emails per hour.
Here are some assumptions you can make:
EVERY single email sent is legitimate
all SPF records and MX records are accurate, up-to-date, and valid
all other common SPAM-prevention tactics are being used (reverse DNS is good, DKIM is used, return-addresses are valid, etc etc etc)
emails are one-to-one (ie, I'm not CC'ing 1000 gmail addresses; I'm sending one email to each address)
Here are some questions to get us moving in the right direction:
should I limit the number of emails sent to X emails per minute per domain? If so, how do sites like GMail and MailChimp get around this? note: there are no ISP restrictions; this is only an issue for the receiving mail server...
should I limit the number of connections to a domain at a given time? (eg, will Google think I'm a spam agent if I open 10/100/1000 simultaneous connections to gmail servers?)
how many bounce-backs (5xx errors on an address) should I accept for automatically removing that email from a subscription list? does this affect a server's spam rating?
is there anything else I should or should not do?
Final note: please remember this is a programming question, NOT a library question - I don't want to use someone else's service; we are writing our own for a reason. I'm looking for practical programming advice.

This is not a programming question, but here goes:
I strongly recommend you join your local mail operators mailing list, as well as "Spam-L" mailing list. Read the archives, and see what issues others are having.
The short answer is that destination servers can, and do, use all sorts of methods to try to prevent spam. There are many things you will need to be aware of in order to have good deliverability, and those things change all the time.
First and most important, remember:
Free speech also includes free listening.
Nobody has to accept or transmit your mail.
Independent operators, businesses and individuals have a perfect right to refuse your mail for any reason or no reason. ISPs are limited only by their contracts with the customer and common-carrier laws, which generally give them broad discretion in what is considered spam and how they block it.
Their system, their rules. If you want your messages delivered, you must cooperate with receiving ISPs. This may mean jumping through hoops, or complying with requirements you think are stupid, or pointless.
Ensure you are not listed by SpamHaus. Most ISPs small and large use SpamHaus DNSBL service. Presence on one of SpamHaus' lists asserts their opinion that your mail meets their listing criteria. Because of SpamHaus' high reputation, most ISPs will simply block all mail you send based on their opinion.
Make sure you process unsubscribes.
Make sure you process non-delivery reports. You may not want to kill a subscription on the first NDR, as there can be intermittent network or server problems which can result in non-delivery, or even erroneous reports that an address is incorrect. But if you get several over the course of a month or two with no successful deliveries, you should kill the subscription.
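The non-delivery policy described above (don't drop an address on the first NDR, but do drop it after several bounces with no successful delivery in between) could be sketched like this. The class name, the threshold of 3 bounces, and the 60-day window are assumptions, not recommended values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class BounceTracker:
    """Hypothetical helper that decides when to drop a bouncing address."""

    max_bounces: int = 3                      # assumed threshold
    window: timedelta = timedelta(days=60)    # assumed "month or two"
    bounces: dict = field(default_factory=dict)

    def record_bounce(self, address, when):
        # Remember the time of every non-delivery report for this address.
        self.bounces.setdefault(address, []).append(when)

    def record_delivery(self, address):
        # A successful delivery proves the address works, so forget old bounces.
        self.bounces.pop(address, None)

    def should_unsubscribe(self, address, now):
        # Only bounces inside the window count toward the threshold.
        recent = [t for t in self.bounces.get(address, []) if now - t <= self.window]
        return len(recent) >= self.max_bounces
```

Your NDR parser would call `record_bounce` for each 5xx report, the sending code would call `record_delivery` on success, and a periodic job would remove every address for which `should_unsubscribe` returns true.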
Join a pay-for reputation service. These may require posting a bond which you may lose if you send Spam. SpamHaus offer one. There are others.
Get professional advice from someone like Return-Path. You will have to pay for this also.
Monitor. The hoops you have to jump through change all the time. Ensure you are aware of emerging deliverability problems.
Join feedback loops. Most large ISPs offer feedback programmes where you can get feedback on how users are perceiving your mail, whether they are reporting it as spam, etc.

Ben had some good practical advice, but for others with this problem, here is what I have discovered in the past month:
Email is all about REPUTATION. You will never be able to throw together a server, ip, and/or domain name and expect to be able to send out millions upon millions of emails.
On Stack Overflow, we have a rating system (up and downvotes) to estimate the value/trust that person has with the SO community. But it takes time and effort to get points. It's the same with email - you have to start sending out small amounts of email that people actually open up and read (and would never mark as spam), and then slowly send out more and more every month until you reach the goal of millions and millions of emails.
Every time someone "downvotes" - marks the email as spam, flags the domain, flags the ip address, deletes the email without reading it, etc - you get a hit against your reputation. You need to be continually monitoring and putting effort and best-practices into your reputation if you want to gain good standing with people.
So start small, expand in a stable and steady manner, and always keep a watchful eye out for abuse, misuses, good and bad feedback, or anything else that might affect your reputation.
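The "start small and expand steadily" advice can be turned into an explicit ramp-up plan. The sketch below produces a list of per-period send caps that grow geometrically until the target is reached. The function name, the growth factor, and the idea of a fixed multiplier are assumptions; in practice you would slow down or pause whenever complaints or bounces rise.

```python
def warmup_schedule(start_volume, target_volume, growth=1.5):
    """Return send caps per period (day, week, month...) growing toward the target.

    Hypothetical helper: the numbers are illustrative, not provider guidance.
    """
    if start_volume <= 0 or target_volume <= 0:
        raise ValueError("volumes must be positive")
    if growth <= 1:
        raise ValueError("growth must be greater than 1")
    caps = []
    cap = start_volume
    while cap < target_volume:
        caps.append(int(cap))
        cap *= growth  # expand steadily from one period to the next
    caps.append(target_volume)  # final period sends at the full target volume
    return caps
```

For example, `warmup_schedule(1000, 10000, growth=2.0)` returns `[1000, 2000, 4000, 8000, 10000]`.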
It's not only possible, but very practical; you just need to give it time and effort.

What's the best approach for writing an app that sends continuous mails

I'm writing an app that sends mails constantly, generally to different users, such as alerts or reminders. Do I need to be careful about how many mails I send, or check the time intervals between mails?
I currently have a domain, and I'm sending mails via SMTP. I don't want to end up on any blacklists or anything.

MailChimp is a startup that manages email newsletters; their livelihood depends on them not getting blacklisted or classified as spam.
You might learn some good approaches by reading the guide they wrote on How To Avoid Spam Filters. Here's the basic premise:
Unfortunately, there's not a quick fix. The only way to avoid spam filters is to understand what spam is and how the filters work.

Email deliverability - Influencing factors [closed]

[Our website] is very dependent on being able to successfully send email to its members. We are currently having trouble reaching all our members, especially Hotmail users.
What do you recommend we do to improve our email delivery?
We are sending heavily user-customized emails, so a third-party solution would need a good API to support this.
Possible solutions:
Would sending email through the app engine help with delivery rates?
Does Return Path help? http://www.returnpath.net/
Update:
Some good comments on how to improve and test our own email sending capabilities. Another option would be a third-party solution.
We're sending updates on your network's activities, registration emails, new comment emails, new follower emails, and these types of things. The network activity updates in particular are highly individual and problematic with most third-party emailing solutions. We would need a very flexible email solution.
Are there sufficiently capable solutions out there?

We had a similar issue a while back. You probably want to read up on Microsoft's Sender ID:
https://www.microsoft.com/mscorp/safety/technologies/senderid/default.mspx
and look at the link called "Sender ID SPF Record Submission Form".

Postmark and Sendgrid seem to offer a very decent API to use for sending email and improving deliverability. As a bonus, stats are also handled by them.

1 ) Using a shared IP
2 ) Sending more than 1000 emails per hour may get you flagged as spam
3 ) Sending from a root server without SMTP login may cause this problem
4 ) The email content has links to websites blocked by an RBL ( Real Time Blacklist )
5 ) ...

I'd in general be pessimistic about the IP reputation of platform-as-a-service type offerings. Testing Google AppEngine is on my to-do list, but there's been much talk about Amazon EC2 presenting a real problem -- these products are not very effective at preventing use by spammers, and their reputation is taking a hit.
As for the practical steps of setting up outbound email, Jeff Atwood has a very nice and nearly comprehensive article on his blog.
What I'd certainly suggest is:
Make sure your sending IP has a reverse DNS.
Check your IP reputation for example at senderscore.org (though that's heavily US centric)
Make sure bounces are handled on your side, and postmaster@your.domain is reachable
Set up SPF and DKIM. SenderID if you want to.
Sign up for all feedback loops at major mailbox providers / ISPs and act on spam complaints -- if your users complain, you're doing something wrong. Also, set a "friendly name" on your From: address, as some mailboxes will only display the local part -- " Update" is friendlier than only seeing "automatic" (Gmail does this).
Watch the volume you send. If it's high from the start (>1000s/day to each major ISP) you may get blocked outright.
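For the SPF item in the list above, it can help to sanity-check that your published record really covers your sending IP. The sketch below only evaluates the `ip4:` and `ip6:` mechanisms of a record string you paste in; it deliberately ignores `include:`, `a`, `mx`, `redirect` and DNS lookups, so treat it as a partial check and not a full SPF evaluation. The function name is hypothetical.

```python
import ipaddress


def spf_allows_ip(spf_record, sender_ip):
    """Check sender_ip against the ip4:/ip6: mechanisms of an SPF TXT record.

    Simplified sketch: other mechanisms (include, a, mx, ...) are skipped.
    """
    terms = spf_record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"  # the default qualifier means "pass"
        if term[0] in "+-~?":
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        if name.lower() in ("ip4", "ip6") and value:
            network = ipaddress.ip_network(value, strict=False)
            if ip.version == network.version and ip in network:
                # The first matching mechanism decides the result.
                return qualifier == "+"
    return False  # no ip4/ip6 mechanism matched
```

With the record `"v=spf1 ip4:192.0.2.0/24 -all"`, the address `192.0.2.55` is allowed and `203.0.113.9` is not.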
You'll find a lot of deliverability tips, most of the time from interested parties (email service providers). A relatively reputable resource is deliverability.com, backed in part by Return Path. Of course, going with a commercial email service provider might be a solution for you, but your use case is quite specific and you'll need real-time individual messaging, not marketing newsletters, if I understand you right.
I worked for a company that re-sold Return Path's tool -- so take this with a pinch of salt: It won't help you get delivered. It can, however, be a valuable tool for tracking down where your problems are. It is on the other hand expensive, and hiring a specialist that can go through your specific case might be more affordable. Or reading a lot and experimenting a lot yourself.

@chryss does a great job pointing out the important factors that need to be taken into consideration:
-- reverse DNS, sender reputation, list management (ie, cleaning lists of addresses who have marked your email spam, invalid addresses, etc and keeping track of hard and soft bounces and acting accordingly to those events), SPF records, DKIM signatures, ISP feedback loops, ISP rate limits. Also, email content is important to keep in mind.
Generally speaking, this is all pretty complicated and annoying stuff to deal with, especially as your email volume increases.
In terms of IP reputation with PaaS systems, the key thing to remember is this:
-- if you share an IP with someone who earns a poor reputation (say, a spammer on EC2), that reputation will negatively affect your deliverability. On the other hand, if you send from a dedicated IP, you have the opportunity to earn your own reputation - if you are a good sender, follow best practices, and your customers want the emails they expect to receive from you (which they should since it sounds like you are sending mostly transactional emails), you will maintain a great reputation and should enjoy good deliverability (granted all of the technical stuff mentioned above is taken care of).
We generally keep an eye on deliverability "chatter" online, and send out all the cool/useful stuff that we find on a daily basis through our twitter feed -- feel free to follow us: twitter.com/sendgrid. We are also beginning to ramp up our own blogging, so you can join the conversation if you like: blog.sendgrid.com.

If you want a comprehensive solution without having to do a lot of troubleshooting/fact-finding, just check out SendForensics.com. Disclaimer: I am affiliated with the company.
Regards,
Russ

Deliverability issues generally happen if there is something wrong with any or all of the following elements:
Email Content
Server Configuration
Email address and Domain reputation
IP address reputation
More information here

Risks in sending out high volume of emails over SMTP

What are the risks, if any, of sending out massive amounts of emails over SMTP? Specifically, this question is regarding the risks of being labelled/blacklisted as spammers or spoofers.
Our mails are legitimate, however. Our system needs to send out reminders to our corporate users on a daily basis, which may number into the thousands, say. Our worry is that with such a setup, our domain might end up being blacklisted by the receiving organisation, thus rendering our reminder service useless.
Does anyone have any information on what might be a "safe" volume of emails to send out to avoid being blacklisted? Or can we just churn out emails with abandon?

You may be able to contract a third-party organization to take care of this for you. I know there are a lot of "direct marketing" companies that will let you use their API to send mass email (newsletters, etc). They can do the work of negotiating to get off blacklists - that's what you pay them for.
I haven't used Sendloop and don't know if it has the functionality you want, but it's probably a good example.

See: How to conduct legitimate email campaigns
In your reminder service, just follow some basic spam guidelines. Identify where the message came from, why the user got it, the link to "opt-out" or discontinue the reminders, and you'll be fine. Any blacklists you do get on will certainly remove you if you have this information in your messages.
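As a rough illustration of those guidelines, the sketch below builds a reminder with Python's standard `email.message.EmailMessage`: it identifies the sender, tells the user why they got the message, and includes an opt-out link both in the body and in a `List-Unsubscribe` header. The function name, addresses, and URL are placeholders, and the `List-Unsubscribe` header is my addition (the answer only asks for an opt-out link).

```python
from email.message import EmailMessage


def build_reminder(sender, recipient, subject, body, reason, unsubscribe_url):
    """Hypothetical helper: compose a reminder that explains itself and offers opt-out."""
    msg = EmailMessage()
    msg["From"] = sender          # identify where the message came from
    msg["To"] = recipient
    msg["Subject"] = subject
    # Machine-readable opt-out link that many mail clients can surface.
    msg["List-Unsubscribe"] = f"<{unsubscribe_url}>"
    footer = (
        f"{reason}\n"
        f"To stop these reminders, visit: {unsubscribe_url}\n"
    )
    msg.set_content(f"{body}\n\n{footer}")
    return msg
```

The resulting message can be handed to `smtplib.SMTP.send_message` for delivery.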
Additionally, keep another server on a different network that you can use as a backup should your primary server get temporarily blacklisted for any reason.
Oh, and one final note - usually your entire "domain" (i.e. whatever.com) doesn't get blacklisted. Specific IP addresses or specific servers are usually what get blacklisted.

As long as you're mailing over clean IPs and domains you should be fine. You say your mailings are "legitimate" so there's no reason to worry about ISPs blocking you.
However, as you also mentioned, the volume can become a challenge. Broadly speaking, sending "thousands" of messages should be a non-issue. But... hundreds of thousands, say 250K messages a day on up, is when you start to qualify as a "high-volume" sender.
Once you start sending at this bulk level, you must run a tight ship. ISP filters will look for any clue that you're a black-hat mailer/spammer and will promptly block your deployment if anything looks off.
Make sure your lists are spick-and-span: all bounces, duplicates, typos, and honeypots have been scrubbed out, your IPs have been properly warmed up, your DNS and domains are clean and properly registered, and you remain responsive to your list recipients.
Basic common sense and following through on all the tiny, simple but crucial details goes a long way.

Guidelines for email newsletter service

I'm implementing an email newsletter sender service using .NET and Windows Server technologies. Are there comprehensive guidelines which could help avoid emails being trapped by spam filters and other mechanisms?
They should cover all aspects of (legal) bulk mail sending: SMTP configuration, DNS, HTML content, images, links within content etc. A simple example: is it better to embed images or load them from a server?
It would be great if you could provide some empirical data to show the efficiency of some measures taken.

Although I don't have a definitive answer, I think this is a very important question.
Here are a few tidbits I know about it:
Choose a clean hosting/smtp server. IP addresses of spamming SMTP servers are often black-listed by other ISPs.
Send a simple introductory email to every subscriber, asking them to add your sender address to their safe list.
Be very prudent in sending to only those people who are actually expecting it. You wouldn't want pattern recognizers of spam filters learning the smell of your content.
If you don't know your SMTP servers in advance, it's good practice to provide configuration options in your application for controlling batch sizes and the delay between batches. Some servers don't like large batches or continuous activity.
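The last point, configurable batch sizes and delays, might look like the sketch below. The function name and the default values are assumptions; `send` stands for whatever callable actually hands one message to your SMTP server, and `sleep` is injectable so the pauses can be tested without waiting.

```python
import time


def send_in_batches(messages, send, batch_size=100, delay_seconds=30.0, sleep=time.sleep):
    """Send messages in batches, pausing between batches. Returns the count sent.

    Hypothetical helper: batch_size and delay_seconds are meant to come from config.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    sent = 0
    for start in range(0, len(messages), batch_size):
        for message in messages[start:start + batch_size]:
            send(message)
            sent += 1
        # Pause only if another batch follows, so the last batch returns at once.
        if start + batch_size < len(messages):
            sleep(delay_seconds)
    return sent
```

Exposing `batch_size` and `delay_seconds` as settings lets each installation tune them to what its SMTP server tolerates.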

Unless you have a very specific reason to host the newsletter yourself, I think you'd be much better off using a third party service. There are lots out there, and some are very cheaply priced.
It'll save you on development work (no point in re-inventing the wheel).
Their system will handle all the unsubscribe link stuff that you need to include in email newsletters to comply with CAN-SPAM laws or whatever.
They handle the spam reports that you will inevitably get if you have a list of any non-trivial size.
They keep records of who signed up, how they signed up, and their IP address, and can present those on receipt of a spam report to prove that their service wasn't sending out spam.
You can use double opt-in (or confirmed opt-in), for extra evidence to prove that the people you're sending emails to actually signed up to receive them.
If you really do need to host it yourself I'd suggest you search the web for "email deliverability". Things that are known to help include properly set up SPF records, DomainKeys/DKIM, correct DNS settings (reverse DNS especially - best to just use an online service to check your DNS settings). You can test a lot of these things by sending an email to check-auth@verifier.port25.com.
It's best to avoid using spammy words in your email. This is always a bit of guesswork, but some words can trip filters.
But I'd guess that by far the most important thing is to be sending your email from a trusted server that maintains good relationships with ISPs (i.e. ensuring that ISPs don't think that the server is sending out spam). This is a big reason why it's much much easier to get a third party to handle everything for you.
