Where can I get a blacklist dataset of spam email domains?

I want to create an email classifier.
The classifier will be split into sender address, subject, and content classifiers.
For the sender address classifier, I need a list of blacklisted domains such as @blablabla.com, @cacacaca.com, etc.,
like this set here,
but I need an up-to-date list of domains. Where can I get one? Thanks.

I wonder if a good way might be to go to mxtoolbox, do a blacklist test, then get a list of blacklist sites and see if you can contact them to get a list?
I suspect that such companies may consider those datasets their intellectual property and probably won't publish these - it may not be possible.
Good luck!
Also Akismet may have such a dataset?
Additionally, the more powerful email classification software works with patterns that you define yourself. Check out MailMarshall88 for example. You could use this to build your own dataset, but remember that just because someone is on a blacklist today doesn't mean that they're always bad. For example, a virus outbreak in your company might send spam and get your IP blacklisted. You then remove the virus but remain blacklisted, now incorrectly. In this scenario a pattern would work much better.
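For the domain check itself, here is a minimal sketch in Python. It assumes you have already obtained a blacklist from somewhere and loaded it into a set; the two domains and the function names below are placeholders for illustration, not a real dataset or an existing library API.

```python
from email.utils import parseaddr

# Placeholder blacklist; in practice, load this from whatever source you obtain.
BLACKLISTED_DOMAINS = {"blablabla.com", "cacacaca.com"}


def sender_domain(from_header):
    """Return the lower-cased domain of a From header value, or None."""
    _, address = parseaddr(from_header)
    if "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()


def is_blacklisted(from_header, blacklist=BLACKLISTED_DOMAINS):
    """True if the sender's domain, or a parent domain of it, is listed."""
    domain = sender_domain(from_header)
    if domain is None:
        return False
    labels = domain.split(".")
    # Check the full domain and each parent, so mail.blablabla.com
    # matches an entry for blablabla.com. The bare TLD is never checked.
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels) - 1))
```

Because entries go stale for the reasons given above, it is probably safer to treat a hit as one signal among several than as a final verdict.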

Related

How to identify newsletters programmatically

Are there any headers in email that help to identify newsletters?
I want to categorize mails as personal, newsletters, spam, and promotions.
Is there any code that can do it?
I want a non-machine learning approach to this question.
Lightweight content analysis will do.
There are various headers which can be used to identify mailing lists, but the problem as a whole is somewhat of a heuristic field. Here are some things to try:
Common mailing list software packages have their own headers. Even if they are not explicit, you can rather quickly gather a collection of Majordomo, Listserv, Mailman, Yahoo Groups (bletch) etc. lists and find header patterns which are typical, if not standardized.
Common and uncommon mailing lists increasingly support the various List-Xxx: headers. See further http://www.list-unsubscribe.com/
Back in the day, many mailing lists would set Precedence: list. Tangentially, see also http://cr.yp.to/immhf.html
Do note that many spammers have adopted some or all of these practices -- pesky mainsleaze spammers tend to use well-established email software just like anybody else in the business; it's just that they are less discriminating about who they add to the mailing list in the first place.
All things considered, I would not dismiss a machine learning approach, if only to help you build a decision tree (not all machine learning is Bayesian filtering, you know!)
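The header checks listed above can be sketched in a few lines of Python with the standard email package. The function name is made up for illustration, and the rule set is deliberately tiny: any List-Xxx: header, or Precedence: list. Software-specific headers (Mailman, Listserv and so on) would have to be added from your own collection of samples.

```python
from email import message_from_string


def looks_like_mailing_list(raw_message):
    """Heuristic: does this raw message carry typical mailing list headers?"""
    msg = message_from_string(raw_message)
    # Any List-* header (List-Id, List-Unsubscribe, ...) is a strong hint.
    if any(name.lower().startswith("list-") for name in msg.keys()):
        return True
    # Older list software often set "Precedence: list".
    precedence = (msg.get("Precedence") or "").strip().lower()
    return precedence == "list"
```

This only separates list mail from personal mail. Since spammers set the same headers, it says nothing on its own about newsletters versus spam.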

IM (Instant Message) Address Standard

Dear stack overflowers,
I am not sure if this is the best place for this question, but I figured I'd give it a shot.
I am currently working on an API that will allow consumers to read/write data about users. i.e. name, emails, phoneNumbers, etc. And, as you could guess by the title, I am also storing ims.
Since users may contain multiple im addresses that belong to different services (e.g. skype, google talk, AIM, etc.), there is a type attribute associated with each im address.
I am at the point where I am attempting to validate the user attributes, and when I arrived at ims I was unable to find a formal specification or normative document that dictates how these should be formatted/validated.
My question is the following:
Is there a general format that im URIs follow?
Note: I have stumbled upon RFC 3861, which touches on im addresses. But it seems like this isn't a standard. Additionally, there is only one example here that has the following format:
im:fred@example.com
Since emails are effectively unique identifiers, it seems reasonable that they could be represented in this way.
Could anyone shed light on this?
After looking at several sites, I was unable to find a standard that applies to all IM providers. I even looked in some API documentation (Yahoo and Jabber) without any luck. If anyone else finds anything that leads them to think differently, please share the knowledge. But as for right now, it appears I am out of luck...
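In the absence of a universal standard, one pragmatic fallback is a loose check for the im:user@domain shape shown in the question. The sketch below is only that: the pattern is my own approximation, not the full URI grammar (which, as far as I can tell, is given in RFC 3860), and it says nothing about what any individual provider accepts.

```python
import re

# Assumed, deliberately loose pattern: "im:" + local part + "@" + dotted domain.
IM_URI = re.compile(
    r"im:(?P<user>[^@\s]+)@(?P<domain>[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)


def parse_im_uri(value):
    """Return (user, domain) for an im: URI of the simple form, else None."""
    match = IM_URI.fullmatch(value)
    if match is None:
        return None
    return match.group("user"), match.group("domain").lower()
```

Provider-specific rules would still have to be layered on top, keyed by the type attribute.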

Less is more - auto ZIP code?

You have an international website with a form where people fill in their address.
Wouldn't it be great if people need to fill out one field less? Example:
100 visitors use the form each day
They spend 5 seconds on the ZIP code field
So 5 * 100 * 365 = 182500 seconds, or about 50 hours a year. And that's just for one form on one website. Multiply that by all websites that ask for such information and you can see the time we can save by redesigning this.
You can get someone's ZIP code via geolocation + geocoding. But since a person's current position can easily differ from the city a person lives in, this isn't really usable.
A solution would be to get the ZIP code based on a geolocated (but changeable) country, input city and input street.
The API we could use: http://code.google.com/intl/nl/apis/maps/documentation/geocoding/ or http://developer.yahoo.com/geo/placefinder/.
Now the real question is, which problems would arise (internationalization, localization, accuracy, etc.)?
No-one else has answered this, so I'll have a go.
No, it wouldn't be great if the website filled in the zip code field based on other information. It might work for some people. It would certainly fail for enough people that you'd have to offer a zip code field as an override. Now you have a site with a higher complexity and development cost than one with a conventional zip code field, because you have to test both the automatic zip code guesser and the conventional field.
You'll have a usability hit which comes from people being confused by the two alternatives and not knowing which to choose.
You'll pay an opportunity cost, by spending design and development resources on the zip code guesser, instead of on some other feature which yields a larger usability benefit.
Here are some problems I foresee arising:
Inaccuracy: whatever mechanism you use collects correct hints (IP address location, street address and city) but generates the wrong zip code, due to errors
Remote use: Users entering a different address than their current location, e.g. using a computer at a hotel in a different country to fill out a form related to their home address, so location of IP address of computer is different from location of address in form
Localisation failure: whatever mechanism you use doesn't work with the hints of the user's address, e.g. different address conventions in a foreign country
Provider business terms: you want to use a geocoding service like Google's or Yahoo's APIs, but the license agreement for that service isn't compatible with the business model of your site. For example, they want you to pay if you are geocoding for commercial purposes, or for a site behind a firewall, or more than a certain number of transactions a day
Change in provider situation: you use an external geocoding service, and it goes out of business
etc.
Before taking on a feature like this, I'd take two steps:
User research. Can you identify users for whom the time taken to enter a zip code is a pain point? Is it one of their top three pain points? I'll bet this issue isn't even on your users' radar.
Test on existing data. For whatever method you are thinking about using to guess zip code, try it on existing customer data, and see if you can accurately reproduce the zip code the customer entered. This will give you an idea of your error rate. Can you live with this error rate?
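The "test on existing data" step could look roughly like the sketch below. The record layout (dicts with country, city, street and zip keys) and the guess_zip callable are assumptions for illustration; guess_zip stands in for whatever geocoding lookup you are evaluating.

```python
def normalize(zip_code):
    """Ignore case and whitespace so '1012 ab' and '1012AB' compare equal."""
    return "".join(zip_code.split()).upper()


def zip_guess_error_rate(records, guess_zip):
    """Fraction of records where the guessed zip differs from the entered one.

    records:   iterable of dicts with 'country', 'city', 'street', 'zip' keys
    guess_zip: callable (country, city, street) -> zip string, or None
    """
    total = 0
    wrong = 0
    for rec in records:
        total += 1
        guess = guess_zip(rec["country"], rec["city"], rec["street"])
        # A missing guess counts as an error, same as a wrong one.
        if guess is None or normalize(guess) != normalize(rec["zip"]):
            wrong += 1
    return wrong / total if total else 0.0
```

Running this over a sample of past orders gives you a single number to weigh against the few seconds saved per user.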
If your real question is, could someone please validate my feeling that this is a charming feature, then I probably haven't given you the answer you seek. But you asked, "what problems would arise?"

Whitelist for player to player chat in kids game

We're developing an educational multiplayer game for kids and want to allow players to chat with each other using a whitelist system. When using whitelist chat, players will be able to type only words which appear in the whitelist.
We're aware of the limitations of whitelists in general, but we think a whitelist chat system is something that would allow our players to express themselves better in the game, while allowing a higher level of security than moderated or blacklist chat.
While the system is easy enough to implement, we haven't been able to find a sample whitelist of "safe" words online. Does anyone know of where we can find such a list, preferably with a license that allows us to use it in a commercial project?
Thanks.
I do not believe that a simple whitelist of words will cut it. There are quite a few euphemisms for a lot of stuff out there, that a whitelist would never block (e.g. "he is growing like a weed" is fine, "he is growing weed" is NOT). And let's not mention the basic "would you like to meet?" which would be fine if the meeting were to happen in-game, but very dangerous if it were to happen out of it. Then there is also the issue of blocking rare, foreign or mistyped words, that might make your chat system frustrating enough that it would not be used.
In my opinion, there is absolutely no way you could ever match the security offered by an active and competent human moderator. Of course, depending on the volume of chat traffic and any real-time requirements there are quite a few practical issues with using humans for this. Considering that your application is targeted at children, however, human moderation might be quite acceptable, despite its much higher cost.
A second choice, but one very far from the abilities of human moderation, is to use some statistical filter such as Bogofilter, which will happily sort arbitrary text if you train it well. A blacklist would also help to immediately cut down messages with words that little kids should not (but usually do) know. You would also need a bunch of filters that would cut down messages with stuff like telephone numbers, email and street addresses and web links.
Perhaps the method with the best effectiveness/cost ratio would be to use human moderators assisted by multiple statistical filters to better make use of their time. Keep in mind, however, that if there are malicious users (i.e. anything else than same-age kids in a classroom) there is no way to make sure that nothing questionable or dangerous ever goes through.
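As a rough illustration of the pattern filters mentioned above, here is a small Python sketch that flags messages containing something that looks like an email address, a web link or a phone number. The regular expressions are simplistic assumptions of mine, not a vetted rule set, and street addresses are left out because they are much harder to match.

```python
import re

# Deliberately simple patterns; expect both false positives and misses.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
    # Seven or more digits, optionally separated by spaces, dots, dashes or parentheses.
    "phone": re.compile(r"(?:\+?\d[\s().-]?){7,}"),
}


def contact_info_found(message):
    """Return the sorted names of the patterns that match the message."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(message))
```

A determined user can still spell a number out in words, which is one more reason to put a human moderator behind filters like these.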
You can try the standard unix dictionary. /usr/share/dict/words. But you'll have to modify it to remove the naughty words.
http://en.wikipedia.org/wiki/Words_%28Unix%29
http://www.openwall.com/wordlists/
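Starting from a dictionary file like the ones above, a whitelist check might be sketched as follows. The function names and the punctuation handling are my own assumptions, and the list of words to remove is something you would have to compile yourself.

```python
PUNCTUATION = ".,!?;:\"'()"


def build_whitelist(words, blocked):
    """Lower-case the dictionary words and drop the ones you do not want."""
    blocked = {word.lower() for word in blocked}
    return {word.strip().lower() for word in words if word.strip()} - blocked


def rejected_words(message, whitelist):
    """Return the tokens of a chat message that are not on the whitelist."""
    rejected = []
    for raw in message.lower().split():
        token = raw.strip(PUNCTUATION)
        # Anything not explicitly allowed is rejected, including numbers.
        if token and token not in whitelist:
            rejected.append(token)
    return rejected
```

An open file object can be passed directly as words, for example the result of open("/usr/share/dict/words"), since iterating over it yields one word per line. A message would only be sent when rejected_words returns an empty list.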
While this doesn't exactly answer your question, Runescape uses a white list of phrases, rather than words.
The implementation in Runescape is awkward, because there are so many phrases to choose from. You have to go through 3 or 4 menus sometimes to get to the phrase you want.
If you can come up with a better organization of phrases, then this might work for you.

Split testing transactional emails

I'm trying to figure out a solution to manage our transactional emails (such as the welcome email, "you've got a bid", etc.).
We would like to be able to allow marketing to manage the content of the emails, and create split tests to test content / subject lines / etc...
Ideally we could invent our own success metrics to report back to the email management system (such as user completed registration, accepted bid, etc...).
Right now we have our emails in templates using stringtemplate. The code replaces tokens with the correct content for that email.
Strongmail is a potential solution, but it is pricey - anybody have experience with alternatives?
I'm looking for the same kind of service, and https://www.sendwithus.com/ seems to do the job.
Have you taken a look at PostageApp?
Currently, it's a layer between your web app and your SMTP server which has additional features for your transactional emails.
With PostageApp, you are able to create two different templates and have them triggered alternately with different content and subject lines. However, the metrics that you would want to use for A/B testing aren't built into the system yet, so I'm not sure if it would be a good fit for you.
Full Disclosure: I work for The Working Group, the company that created PostageApp.
But if you do have questions about what we can help you with and what we can't, definitely let me know and I can answer plenty of questions for you!
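Independent of any particular provider, splitting recipients between two templates and reporting a custom success metric can be sketched as below. Everything here is hypothetical: the variant names, subjects and template identifiers are placeholders, and hashing the user id is just one way to keep each user on the same variant across sends.

```python
import hashlib
from collections import Counter

# Placeholder variants; subjects and template names are invented.
VARIANTS = {
    "A": {"subject": "Welcome aboard", "template": "welcome_a"},
    "B": {"subject": "Your account is ready", "template": "welcome_b"},
}


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]


def conversion_rates(sent, converted):
    """Success rate per variant, from counts of emails sent and goals reached."""
    return {variant: converted[variant] / sent[variant] for variant in sent if sent[variant]}
```

In use, you would increment a sent counter for the assigned variant when the email goes out, and a converted counter when your own success event (completed registration, accepted bid) fires for that user.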
Try http://www.cakemail.com/
It is a third-party service: you design your workflows and give them your contacts.
I work for a website company with 6 million a year in revenue, and we direct all our clients to them. So far so good, everyone is happy.
You have to contact them to get a price, but you can get a free account for testing.