How does a distributed checksum work? - text-processing

I am looking for information on how distributed checksum applications like Vipul's Razor, Pyzor or DCC work. I have a similar requirement where I could use such a distributed checksum feature in my program.
So I am looking for some documentation explaining the algorithms behind distributed checksums.
with regards,
raj

Pyzor is an implementation of "hash sharing". Quote from SpamAssassin's Wiki:
One of the approaches used to identify spam goes like this; if I see a spam message at 8:30 in the morning, I send a checksum of that message to an online database of spam. When you get that message a little later on in the morning, your mail system asks that online database, "Has anyone reported this as spam?". The online database can report back "yes", allowing your mail system to raise the spam score for that message.
Depending on what you're trying to do with distributed hashes, DHTs (Distributed hash tables) might also be interesting for you.
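The report-then-check flow in the quote can be sketched roughly as below. This is only an illustration: the normalization step (lowercasing, dropping digits, collapsing whitespace) and the in-memory DigestStore class are assumptions made for the example, not the actual algorithms or protocols used by Pyzor, Razor or DCC.

```python
import hashlib
import re


def message_digest(body):
    """Return a checksum of a normalized message body.

    The normalization is an illustrative assumption: it hides trivial
    variations so near-identical copies of a message hash the same.
    """
    text = body.lower()
    text = re.sub(r"\d+", "", text)           # drop digits that vary per copy
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class DigestStore:
    """In-memory stand-in for the shared online database (hypothetical)."""

    def __init__(self):
        self._reports = {}

    def report(self, body):
        """Record one spam report for this message's digest."""
        digest = message_digest(body)
        self._reports[digest] = self._reports.get(digest, 0) + 1

    def check(self, body):
        """Return how many times this message's digest was reported."""
        return self._reports.get(message_digest(body), 0)


store = DigestStore()
store.report("Buy   CHEAP pills now! Offer 4821")       # first recipient reports it
print(store.check("buy cheap pills now! offer 9977"))  # later copy, same digest
print(store.check("Lunch at noon?"))                    # unrelated message
```

Only the digest needs to leave the mail system, so the shared database never sees the message text itself.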

Related

How to check the deliverability of my outgoing email (spam, dkim, dmarc, spf)

The project I'm working on is a newsletter builder, and I'm on its final steps. Now I need to verify SPF, DKIM and DMARC (I don't know what they mean or how they work). I also need to check whether my email is considered spam, and whether any individual news item contains spam. I tried to read the documentation of two great spam checkers (SpamAssassin and rspamd) and couldn't understand how they are supposed to be integrated into my project. I think all my problems are due to my lack of knowledge about email and email servers. I'd really appreciate it if someone could enlighten me about the steps I need to take, whether I really need to set up an email server to test this, and how to do it. I'm really in the dark here. I know the enterprise I'm doing this work for was already sending emails from their domain, but I don't think they gave me access to that.
The following link may be useful to you; it's a document from iRedMail (an open source mail server solution):
https://docs.iredmail.org/setup.dns.html
You don't need to know what iRedMail is; just read the introduction to each DNS record.
For me, these introductions are enough. If you want more detail, Wikipedia and the official websites may be more useful.
For checking spam status and DNS records such as SPF and DKIM, setting up SpamAssassin or rspamd yourself may be complicated, but there are many free services available.
I often use the following (I have my own mail server, so I sometimes use these services for testing):
https://www.mail-tester.com/
https://mxtoolbox.com/
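If you send a test message to a mailbox you control, the receiving server usually records its SPF, DKIM and DMARC verdicts in an Authentication-Results header. Here is a minimal sketch of reading that header with Python's standard library. The sample message, the host names and the auth_results helper are made up for illustration, and real headers can carry more detail than this regular expression extracts.

```python
import re
from email import message_from_string

# Made-up sample of a received message; real headers vary by provider.
RAW = """\
Authentication-Results: mx.example.net;
 spf=pass smtp.mailfrom=news.example.org;
 dkim=pass header.d=example.org;
 dmarc=fail header.from=example.org
From: Newsletter <news@example.org>
Subject: Test

Hello
"""


def auth_results(raw_message):
    """Map each mechanism (spf, dkim, dmarc) to the result the receiver reported."""
    msg = message_from_string(raw_message)
    results = {}
    for header in msg.get_all("Authentication-Results", []):
        for method, result in re.findall(r"\b(spf|dkim|dmarc)=(\w+)", header):
            results[method] = result
    return results


print(auth_results(RAW))
```

A "fail" or "none" next to one of the three names tells you which DNS record to look at first.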

Tool to inspect and parse an inbox

I have an inbox that receives lots of emails from my various systems, such as Nagios and Azure alerts regarding disk usage, exceptions and job failures.
Since I get so many of these alerts, I was wondering if there are any tools I could use to filter them so that I only receive the most important ones. A lot of them are noise that only affects my development environments, and I only want to be alerted when something goes wrong with my production environment.
Does anyone know of any such tools, or a better way of dealing with this sort of problem? There must be a better solution than manually reading through all my emails and checking the contents.
I've heard of a tool known as LogRhythm, but I'm under the impression that it is purely a data security tool, and I'm unsure whether it would be able to parse an inbox.
Thanks all in advance.
A good solution is IMAPfilter. This is a utility for Linux systems (I run it on Windows with the Linux Subsystem) that uses the IMAP protocol to manage one or more mailboxes.
You need to define all your filters and actions in Lua (it's pretty simple, and the GitHub repository has many examples) and keep the program running all the time. If the program is not running, all mail arrives unfiltered and is reorganized once the program is started again.
This is not pretty software (no GUI, an unusual language, and it needs an always-on server for real-time filtering), but since you can program your filters you can do pretty much anything, and it's also very light.
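To give an idea of what a programmable filter rule looks like, here is a rough sketch of the decision logic in Python. It is not IMAPfilter configuration (that is written in Lua), and the production markers and sample alerts are hypothetical; adjust them to however your alerts name their environments.

```python
from email import message_from_string

# Hypothetical markers; adjust to however your alerts name environments.
PRODUCTION_MARKERS = ("prod", "production")


def is_important(raw_message):
    """Keep an alert only if its subject mentions a production marker."""
    subject = message_from_string(raw_message)["Subject"] or ""
    # Split host names like "web01.prod" into separate words before matching.
    words = subject.lower().replace("-", " ").replace(".", " ").split()
    return any(marker in words for marker in PRODUCTION_MARKERS)


alerts = [
    "Subject: [Nagios] DISK CRITICAL on web01.prod\n\n/var is 95% full",
    "Subject: [Nagios] DISK WARNING on web01.dev\n\n/var is 80% full",
    "Subject: Azure alert: job failure in production\n\nJob nightly-export failed",
]
important = [a for a in alerts if is_important(a)]
print(len(important))  # the two production alerts are kept
```

In a real setup the same rule would move or flag messages on the IMAP server instead of building a list.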

Email deliverability - Influencing factors [closed]

[Our website] is very dependent on being able to successfully send email to its members. We are currently having trouble reaching all our members, especially hotmail users.
What do you recommend we do to improve our sending of email?
We are sending heavily user customized emails. So a third party solution would need a good api to support this.
Possible solutions:
Would sending email through the app engine help for delivery rates?
Does returnpath help? http://www.returnpath.net/
Update:
Some good comments on how to improve and test our own email sending capabilities. Another option would be a third party solution.
We're sending updates on your network's activity, registration emails, new comment emails, new follower emails, and that type of thing. The network activity emails in particular are highly individual, which is problematic with most third party emailing solutions. We would need a very flexible email solution.
Are there sufficiently capable solutions out there?
We had a similar issue a while back. You probably want to read up on Microsoft's Sender ID:
https://www.microsoft.com/mscorp/safety/technologies/senderid/default.mspx
and look at the link called "Sender ID SPF Record Submission Form".
Postmark and SendGrid seem to offer a very decent API for sending email and improving deliverability. As a bonus, they also handle stats.
1 ) Using a shared IP
2 ) Sending more than 1000 emails per hour may get you flagged as spam
3 ) Sending from a root server without SMTP login may cause this problem
4 ) The email content contains links to websites blocked by an RBL (real-time blacklist)
5 ) ...
I'd in general be pessimistic about the IP reputation of platform-as-a-service type offerings. Testing Google AppEngine is on my to-do list, but there's been much talk about Amazon EC2 presenting a real problem -- these products are not very effective at preventing use by spammers, and their reputation is taking a hit.
As for the practical steps of setting up outbound email, Jeff Atwood has a very nice and nearly comprehensive article on his blog.
What I'd certainly suggest is:
Make sure your sending IP has a reverse DNS.
Check your IP reputation for example at senderscore.org (though that's heavily US centric)
Make sure bounces are handled on your side, and postmaster@your.domain is reachable
Set up SPF and DKIM. SenderID if you want to.
Sign up for all feedback loops at major mailbox providers / ISPs and act on spam complaints -- if your users complain, you're doing something wrong. Also, set a "friendly name" on your From: address, as some mailboxes will only display the local part -- " Update" is friendlier than only seeing "automatic" (Gmail does this).
Watch the volume you send. If it's high from the start (>1000s/day to each major ISP) you may get blocked outright.
You'll find a lot of deliverability tips, most of the time from interested parties (email service providers). A relatively reputable resource is deliverability.com, backed in part by Return Path. Of course, going with a commercial email service provider might be a solution for you, but your use case is quite specific and you'll need real-time individual messaging, not marketing newsletters, if I understand you right.
I worked for a company that re-sold Return Path's tool -- so take this with a pinch of salt: it won't help you get delivered. It can, however, be a valuable tool for tracking down where your problems are. On the other hand, it is expensive, and hiring a specialist who can go through your specific case might be more affordable. Or reading a lot and experimenting a lot yourself.
@chryss does a great job pointing out the important factors that need to be taken into consideration:
-- reverse DNS, sender reputation, list management (ie, cleaning lists of addresses who have marked your email spam, invalid addresses, etc and keeping track of hard and soft bounces and acting accordingly to those events), SPF records, DKIM signatures, ISP feedback loops, ISP rate limits. Also, email content is important to keep in mind.
Generally speaking, this is all pretty complicated and annoying stuff to deal with, especially as your email volume increases.
In terms of IP reputation with PaaS systems, the key thing to remember is this:
-- if you share an IP with someone who earns a poor reputation (say, a spammer on EC2), that reputation will negatively affect your deliverability. On the other hand, if you send from a dedicated IP, you have the opportunity to earn your own reputation - if you are a good sender, follow best practices, and your customers want the emails they expect to receive from you (which they should since it sounds like you are sending mostly transactional emails), you will maintain a great reputation and should enjoy good deliverability (granted all of the technical stuff mentioned above is taken care of).
We generally keep an eye on deliverability "chatter" online, and send out all the cool/useful stuff that we find on a daily basis through our twitter feed -- feel free to follow us: twitter.com/sendgrid. We are also beginning to ramp up our own blogging, so you can join the conversation if you like: blog.sendgrid.com.
If you want a comprehensive solution without having to do a lot of troubleshooting/fact-finding, just check out SendForensics.com. Disclaimer: I am affiliated with the company.
Regards,
Russ
Deliverability issues generally happen if there is something wrong with any or all of the following elements:
Email Content
Server Configuration
Email address and Domain reputation
IP address reputation
More information here

How does the Gmail spam filter work?

I'm always surprised by the high quality of Gmail's spam filter. Over the last year, it filtered 99.95% of the spam and blocked only one mail by mistake. By comparison, every other mail service I have used makes at least one mistake for every 50 mails.
How does Gmail reach this level of quality internally? Is it based on customer feedback (i.e. if N customers mark a mail as spam, it is sorted as spam for every other customer)? Or is there some trick? Maybe a basic filter algorithm catches the most obvious spam, and some difficult cases are analyzed by real humans?
Briefly speaking, it is based on community feedback. Here is a quote from the official explanation:
Gmail users play an important role in keeping spammy messages out of millions of inboxes. When the Gmail community votes with their clicks to report a particular email as spam, our system quickly learns to start blocking similar messages. The more spam the community marks, the smarter our system becomes.
You can read a bit more about it on their Spam Explained page.
This is the million dollar question, and if it could be answered on Stack Overflow, everyone's spam filter would be as effective.
I don't really know how exactly Google does SPAM filtering (but I think it's a business secret after all). If you are interested in how SPAM filtering works, I would recommend looking at Bayesian SPAM filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering). It's a rather easy to understand method.
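To illustrate the Bayesian approach mentioned above, here is a minimal naive Bayes sketch. It shows the general technique only, not how Gmail or any particular product works. The whitespace tokenizer, the add-one smoothing, the class name and the four training messages are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy word-based classifier; assumes both classes have training data."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Count the words of one message labeled 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocabulary = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        total_messages = sum(self.message_counts.values())
        # Start from the prior: how common this class is overall.
        score = math.log(self.message_counts[label] / total_messages)
        for word in words:
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[word] + 1) / (total_words + vocabulary))
        return score

    def is_spam(self, text):
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("cheap offer buy today", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday at noon", "ham")

print(spam_filter.is_spam("buy cheap pills"))        # True
print(spam_filter.is_spam("monday meeting agenda"))  # False
```

Every "report spam" click from a user is effectively another call to train, which is how community feedback can feed a filter like this.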
Google is most likely using a classifier system, such as Logistic Regression or Neural Networks. State of the art spam detection frequently employs Machine Learning algorithms such as these.
The output classification is "Spam" or "Not Spam," and the inputs, I'm sure, are top secret at Google, but I'm sure certain email text phrases such as "Buy Now," "On Sale," "Viagra," or "Male Enhancement" are all factors in their model.
There is no official statement on this, and most of the suggestions are just observations or expert opinions.
Based on my observations on emails we deliver, here are my findings:
1. User engagement is the key: if users are not engaging with your emails, then your emails are bound to be flagged as spam.
Here are some metrics:
- Whom you email, and how often you email them
- Which emails you open
- Which emails you reply to
- Keywords that are in emails you usually read
- Which emails you star, archive, or delete
2. Sender domain reputation: what is the past history of the sending domain? If user engagement was high in the past, then the probability of a new email from the same domain landing in the inbox is high.
Google is using complex AI and machine learning algorithms to make this happen. You might get some success by changing the IP, domain or return-path, but those are all very short-term hacks.

Guidelines for email newsletter service

I'm implementing an email newsletter sender service using .NET and Windows Server technologies. Are there comprehensive guidelines that could help avoid emails being trapped by spam filters and other mechanisms?
They should cover all aspects of (legal) bulk mail sending: SMTP configuration, DNS, HTML content, images, links within content etc. A simple example: is it better to embed images or load them from a server?
It would be great if you could provide some empirical data to show the efficiency of some measures taken.
Although I don't have a definitive answer, I think this is a very important question.
Here are a few tidbits I know about it:
Choose a clean hosting/smtp server. IP addresses of spamming SMTP servers are often black-listed by other ISPs.
Send a simple introductory email to every subscriber, asking them to add your sender address to their safe list.
Be very prudent in sending to only those people who are actually expecting it. You wouldn't want pattern recognizers of spam filters learning the smell of your content.
If you don't know your SMTP servers in advance, it's good practice to provide configuration options in your application for controlling batch sizes and the delay between batches. Some servers don't like large batches or continuous activity.
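The batch size and delay options from the last tip could look something like the sketch below. It is written in Python rather than .NET to keep it short. The send_in_batches function, its parameters and their defaults are hypothetical, and the send callable stands in for whatever actually hands a message to your SMTP server.

```python
import time


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def send_in_batches(recipients, send, batch_size=50, delay_seconds=30.0, sleep=time.sleep):
    """Send to every recipient, pausing between batches; return the number sent."""
    sent = 0
    for index, batch in enumerate(batches(recipients, batch_size)):
        if index > 0:
            sleep(delay_seconds)  # pause before every batch except the first
        for address in batch:
            send(address)
            sent += 1
    return sent


outbox = []
pauses = []
count = send_in_batches(
    [f"user{i}@example.com" for i in range(5)],
    send=outbox.append,   # stand-in for the real SMTP call
    batch_size=2,
    delay_seconds=1.5,
    sleep=pauses.append,  # record pauses instead of actually waiting
)
print(count, pauses)  # 5 [1.5, 1.5]
```

Passing sleep in as a parameter is only there so the example runs instantly; in production you would leave the default.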
Unless you have a very specific reason to host the newsletter yourself, I think you'd be much better off using a third party service. There are lots out there, and some are very cheaply priced.
It'll save you on development work (no point in re-inventing the wheel).
Their system will handle all the unsubscribe link stuff that you need to include in email newsletters to comply with CAN SPAM laws or whatever.
They handle the spam reports that you will inevitably get if you have a list of any non-trivial size.
They keep records of who signed up, how they signed up, and their IP address, and can present those on receipt of a spam report to prove that their service wasn't sending out spam.
You can use double-opt in (or confirmed opt in), for extra evidence to prove that the people you're sending emails to actually signed up to receive them.
If you really do need to host it yourself, I'd suggest you search the web for "email deliverability". Things that are known to help include properly set up SPF records, DomainKeys/DKIM, and correct DNS settings (reverse DNS especially; it's best to just use an online service to check your DNS settings). You can test a lot of these things by sending an email to check-auth@verifier.port25.com.
It's best to avoid using spammy words in your email. This is always a bit of guesswork, but some words can trip filters.
But I'd guess that by far the most important thing is to be sending your email from a trusted server that maintains good relationships with ISPs (i.e. ensuring that ISPs don't think that the server is sending out spam). This is a big reason why it's much much easier to get a third party to handle everything for you.