Are there any headers in email that help to identify newsletters?
I want to categorize mails as personal, newsletters, spam, and promotions.
Is there any code that can do it?
I want a non-machine learning approach to this question.
Lightweight content analysis will do.
There are various headers which can be used to identify mailing lists, but the problem as a whole is somewhat of a heuristic field. Here are some things to try:
Common mailing list software packages have their own headers. Even if they are not explicit, you can rather quickly gather a collection of Majordomo, Listserv, Mailman, Yahoo Groups (bletch) etc. lists and find header patterns which are typical, if not standardized.
Common and uncommon mailing lists increasingly support the various List-Xxx: headers. See further http://www.list-unsubscribe.com/
Back in the day, many mailing lists would set Precedence: list. Tangentially, see also http://cr.yp.to/immhf.html
Do note that many spammers have adopted some or all of these practices -- pesky mainsleaze spammers tend to use well-established email software just like anybody else in the business; it's just that they are less discriminating about who they add to the mailing list in the first place.
All things counted, I would not dismiss a machine learning approach, if only to help you build a decision tree (not all machine learning is Bayesian filtering, you know!)
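As a rough illustration of the header checks described above, here is a minimal sketch using Python's standard email package. The function name looks_like_list_mail and the sample message are made up for this example, and treating "Precedence: bulk" like "Precedence: list" is my own assumption rather than something stated above. Since spammers set these headers too, treat the result as one signal among several, not a verdict.

```python
from email import message_from_string
from email.message import Message

# List-* headers commonly added by mailing list software
# (List-Id is defined in RFC 2919, the others in RFC 2369).
LIST_HEADERS = ("List-Id", "List-Unsubscribe", "List-Post", "List-Help")


def looks_like_list_mail(msg: Message) -> bool:
    """Return True if the message carries typical mailing-list headers."""
    if any(msg.get(name) is not None for name in LIST_HEADERS):
        return True
    # Older convention: "Precedence: list".
    # Assumption: "bulk" is treated the same way here.
    precedence = (msg.get("Precedence") or "").strip().lower()
    return precedence in {"list", "bulk"}


# Hypothetical sample message, for illustration only.
raw = (
    "From: news@example.com\n"
    "To: you@example.org\n"
    "Subject: Weekly digest\n"
    "List-Unsubscribe: <mailto:unsubscribe@example.com>\n"
    "\n"
    "Hello subscribers!\n"
)
print(looks_like_list_mail(message_from_string(raw)))
```

Header lookups on a Message are case-insensitive, so "list-unsubscribe" in the raw mail is matched as well.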
I want to create an email classifier.
The classifier will be split into three parts: an email (sender address) classifier, a subject classifier, and a content classifier.
For the email classifier, I need a list of blacklisted domains such as #blablabla.com, #cacacaca.com, etc.,
like this set here,
but I need an up-to-date list of domains, so where can I get one? Thanks.
I wonder if a good way might be to go to mxtoolbox, do a blacklist test, then get a list of blacklist sites and see if you can contact them to get a list?
I suspect that such companies may consider those datasets their intellectual property and probably won't publish these - it may not be possible.
Good luck!
Also Akismet may have such a dataset?
Additionally, the more powerful email classifying software works by using patterns that you can make. Check out MailMarshall88 for example. You could use this to build your own dataset, but remember that just because someone is on a blacklist today, doesn't mean that they're always bad. For example, you might get a virus outbreak in your company which spams people and gets your IP blacklisted. You then fix the virus and are now incorrectly blacklisted. In this scenario a pattern would work much better.
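To make the difference between a domain list and a pattern concrete, here is a small sketch that combines both. Everything in it is hypothetical: the two domains are the placeholders from the question, the subject pattern is invented, and the scoring is just a count of matched rules. It is not how any particular product works.

```python
import re
from email.utils import parseaddr

# Hypothetical data: a real deployment would load and refresh these.
BLOCKED_DOMAINS = {"blablabla.com", "cacacaca.com"}
SUSPICIOUS_SUBJECT = re.compile(
    r"(free money|act now|100% guaranteed)", re.IGNORECASE
)


def sender_domain(from_header: str) -> str:
    """Extract the lower-cased domain from a From: header value."""
    _, address = parseaddr(from_header)
    return address.rpartition("@")[2].lower()


def score(from_header: str, subject: str) -> int:
    """Count how many of the two rules the message trips (0, 1 or 2)."""
    points = 0
    if sender_domain(from_header) in BLOCKED_DOMAINS:
        points += 1
    if SUSPICIOUS_SUBJECT.search(subject):
        points += 1
    return points


print(score("Bob <bob@blablabla.com>", "Act now!"))
print(score("alice@example.org", "Lunch?"))
```

The pattern rule keeps working when a sender moves to a new domain, and it does not penalise a domain that was listed by mistake.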
I have built a recommender system which has tens of thousands of items and their feature descriptions, but no user profiles as of now. I am looking for pointers to approaches that can help me bootstrap the system, so I can do some evaluation. I would appreciate any pointers to papers/applications that have addressed this problem.
How to deal with the cold-start problem depends a lot on your specific application.
An easy way of dealing with the user cold-start problem is to present the new user with random items, or the most popular items, or hand-selected items, and start learning from them.
Another way is to present users with a questionnaire, and then present items to them according to the results. Or you directly show them items/products and let them rate/select the ones they like.
Also note that in a web-based system you usually know some things about your users: which operating system/browser they use, where they (roughly) come from, which language they speak.
All this information can be used.
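Since the question mentions item feature descriptions, one way to use the "let them select the ones they like" idea is a purely content-based ranking. The following is only a sketch under assumptions: the catalogue, the three-dimensional feature vectors, and the function names are invented, and cosine similarity to the average of the liked items is just one simple choice among many.

```python
import math

# Hypothetical catalogue: item id -> feature vector.
ITEMS = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def recommend(liked, items, k=2):
    """Rank unseen items by similarity to the mean of the liked items."""
    if not liked:
        return []  # nothing to go on yet: fall back to popular/random items
    dims = len(next(iter(items.values())))
    profile = [sum(items[i][d] for i in liked) / len(liked) for d in range(dims)]
    candidates = [i for i in items if i not in liked]
    candidates.sort(key=lambda i: cosine(profile, items[i]), reverse=True)
    return candidates[:k]


print(recommend(["a"], ITEMS, k=1))
```

A new user who picks item "a" from the initial selection is shown "b" next, because its features are closest to what they liked.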
Papers:
see the Wikipedia article on the topic
My answer to another question on StackOverflow lists some papers for dealing with new items - most of the methods would also be applicable to new users.
Another approach is to select products/items that will help you most for learning about the user. Off the top of my head, you can find papers on this by querying Google Scholar for "recommendation" and the terms "decision trees", "active learning", "user cold-start", and so on.
We're developing an educational multiplayer game for kids and want to allow players to chat with each other using a whitelist system. When using whitelist chat, players will be able to type only words which appear in the whitelist.
We're aware of the limitations of whitelists in general, but we think a whitelist chat system is something that would allow our players to express themselves better in the game, while allowing a higher level of security than moderated or blacklist chat.
While the system is easy enough to implement, we haven't been able to find a sample whitelist of "safe" words online. Does anyone know of where we can find such a list, preferably with a license that allows us to use it in a commercial project?
Thanks.
I do not believe that a simple whitelist of words will cut it. There are quite a few euphemisms for a lot of stuff out there, that a whitelist would never block (e.g. "he is growing like a weed" is fine, "he is growing weed" is NOT). And let's not mention the basic "would you like to meet?" which would be fine if the meeting were to happen in-game, but very dangerous if it were to happen out of it. Then there is also the issue of blocking rare, foreign or mistyped words, that might make your chat system frustrating enough that it would not be used.
In my opinion, there is absolutely no way you could ever match the security offered by an active and competent human moderator. Of course, depending on the volume of chat traffic and any real-time requirements there are quite a few practical issues with using humans for this. Considering that your application is targeted at children, however, human moderation might be quite acceptable, despite its much higher cost.
A second choice, but one very far from the abilities of human moderation, is to use some statistical filter such as Bogofilter, which will happily sort arbitrary text if you train it well. A blacklist would also help to immediately cut down messages with words that little kids should not (but usually do) know. You would also need a bunch of filters that would cut down messages with stuff like telephone numbers, email and street addresses and web links.
Perhaps the method with the best effectiveness/cost ratio would be to use human moderators assisted by multiple statistical filters to better make use of their time. Keep in mind, however, that if there are malicious users (i.e. anything other than same-age kids in a classroom) there is no way to make sure that nothing questionable or dangerous ever goes through.
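For the contact-information filters mentioned above, a few regular expressions go a fair way. This is a deliberately crude sketch: the patterns and the function name are my own, they will produce both false positives and false negatives, and street addresses are not covered at all because they do not reduce to a simple pattern.

```python
import re

# Crude, illustrative patterns; tune them against real chat logs.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(\.[\w-]+)+\b")
URL = re.compile(r"\b(https?://|www\.)\S+", re.IGNORECASE)
# Seven or more digits, optionally separated by spaces, dots, dashes, parens.
PHONE = re.compile(r"(\d[\s().-]*){7,}")


def contains_contact_info(message: str) -> bool:
    """Return True if the message seems to contain an email, link or phone number."""
    return any(p.search(message) for p in (EMAIL, URL, PHONE))


for text in ("mail me at kid@example.com", "call 555-123-4567", "I have 3 cats"):
    print(text, "->", contains_contact_info(text))
```

Messages that trip one of these checks could be held for a human moderator rather than silently dropped.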
You can try the standard unix dictionary. /usr/share/dict/words. But you'll have to modify it to remove the naughty words.
http://en.wikipedia.org/wiki/Words_%28Unix%29
http://www.openwall.com/wordlists/
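A minimal sketch of building a whitelist from such a word file and checking a chat message against it. The function names and the tiny sample list are made up for illustration; the blocked-word list is something you would have to curate yourself.

```python
import re


def build_whitelist(lines, blocked):
    """Lower-case the words, drop blanks, and remove the blocked ones."""
    blocked = {w.lower() for w in blocked}
    words = {line.strip().lower() for line in lines}
    words.discard("")
    return words - blocked


def rejected_words(message, whitelist):
    """Return the tokens of the message that are not on the whitelist."""
    tokens = re.findall(r"[\w']+", message.lower())
    return [t for t in tokens if t not in whitelist]


# With the Unix dictionary you would pass the open file instead, e.g.:
#   with open("/usr/share/dict/words") as fh:
#       whitelist = build_whitelist(fh, blocked=my_blocked_words)
sample = ["Hello\n", "world\n", "darn\n"]  # hypothetical stand-in for the file
whitelist = build_whitelist(sample, blocked=["darn"])
print(rejected_words("Hello darn world!", whitelist))
```

Numbers count as tokens here, so digits are rejected unless you add them to the list on purpose.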
While this doesn't exactly answer your question, Runescape uses a white list of phrases, rather than words.
The implementation in Runescape is awkward, because there are so many phrases to choose from. You have to go through 3 or 4 menus sometimes to get to the phrase you want.
If you can come up with a better organization of phrases, then this might work for you.
I've written some simple software which helps me manage and disseminate engineering data on a company intranet. It's pretty flexible about adapting to new content and I wonder if it justifies the description 'Content Management System'.
A previous question: how to define content management did a pretty good job of defining a CMS, but I've a feeling my approach fails to reach the bar.
What is the minimum set of features considered essential in a Content Management System, and are there names for subsets of these features?
For example, I've seen some software described as a 'dashboard'. Is this a subset of a CMS?
I'm not really interested in testimonials for other CMS solutions.
It's a bit like Jazz: if you have to ask, it ain't ...
To my mind discussions about such terminology tend to be in the Marketing space. If your software is doing something useful, who cares what it is, or more to the point what label you put on the tin?
Came across a simple definition in a text from what you could possibly consider an 'other CMS solution', but we web-frameworkers tend to have bizarre views on CMSs.
Content management systems (CMS) let users create and edit pages on a site dynamically through a web-based interface. Sometimes called brochureware site because they tend to be used in the same fashion as traditional printed brochures handed out by businesses.
Practical Django Projects, 1st ed., James Bennett
http://www.apress.com/book/preview/9781590599969
Not the final answer, but one definition.
There are two ways to look at it. One is the name itself: "Content Management System". You could argue that if it is a system to manage content, it's a content management system (small letters). The other way to look at it is user expectation. What does a test group of representative users or developers in your target audience expect when they hear CMS? Editing the textual content of a website comes to mind in this case.
If you want to provide a description useful to a broader audience, you have to understand their expectations. If your own interpretation is that those expectations would be unfulfilled, you might come up with a more specific label. Perhaps Engineering Data Management System, or something more specific to your purpose. I think you will be much happier with this.
Lastly, if you need to categorize it on some form of public resource website, you might have to go up or laterally from an existing CMS category. Or, use the category, but a more specific label for the product itself.
I am interested in choosing a good structure for an online message board-type application. I will use SO as an example, as I think it's an example that we are all familiar with, but my question is more general; it is about how to achieve the right balance between organization and flexibility in online message boards.
The questions page is a load of random stuff. It moves quickly (some might say, too quickly) and contains a huge number of questions that I'm not interested in.
The idea, I imagine, is that we can use tags to find questions that we're interested in. However, I'm not sure that this works: you can't use tags negatively. I'm not interested in PHP or perl or web development. I want to exclude such posts. But with the tags, I can't.
Although discrete subforums are in a sense less flexible, as they generally force you to pick a category even if a question might fit into two (if SO had, say, areas for "Web Development", "Games development", "Computer Science", "Systems Programming", "Databases", etc. then sure, some people might want to post about developing of web-based games, for example) is it worth sacrificing some of that flexibility in order to make it easier to find the content that you are interested in, and hide the content that you are not interested in?
Is there any way with a pure tagging system to achieve the greater ease of use that subforums provide?
The real problem with subforums comes when you guess wrong about which topics have enough interest to get their own subforums. While some topics end up with their own vibrant subcommunities others end up as empty ghettos, with little activity or feeling of community. Topics that might flourish as occasional subjects in a larger forum end up fragmented among many subforums, none of which has the critical mass of people necessary to have an active, vibrant community.
Though I think that tagging is superior to grouping, people tend to think hierarchically.
In general it depends on the target group for the forum.
Maybe you can go with a mixture: use tagging and later use tag groups to order the posts. Delicious uses this, for example, and I find it rather helpful.
If you're worried about the divide between specific forums and open tag-based systems, like Stack Overflow, consider making a query system that allows you to do a bit more complex queries than just the AND operator, like here on Stack Overflow.
I cannot make a query here that will give me all questions in .NET, SQL or C#, combined, and that is the biggest irritation I have with the tags. With such a query system, you can create virtual forums at least.
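A sketch of what such a query could look like, using plain Python sets. The function name, the three keyword arguments, and the sample questions are all invented for illustration; this is not how Stack Overflow's search works.

```python
def matches(tags, any_of=(), all_of=(), none_of=()):
    """True if the tag set satisfies the OR, AND and NOT parts of the query."""
    tags = set(tags)
    if any_of and not tags & set(any_of):
        return False  # OR part: at least one of these must be present
    if not set(all_of) <= tags:
        return False  # AND part: all of these must be present
    return not tags & set(none_of)  # NOT part: none of these may be present


# Hypothetical questions: id -> tags.
questions = {
    1: {".net", "winforms"},
    2: {"sql", "php"},
    3: {"c#", "linq"},
    4: {"perl"},
}

# A "virtual forum": anything in .NET, SQL or C#, but never PHP.
forum = [
    qid
    for qid, tags in questions.items()
    if matches(tags, any_of={".net", "sql", "c#"}, none_of={"php"})
]
print(forum)
```

Saving such a query under a name would give each user their own set of virtual subforums on top of the tags.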
Other than that, I don't really have a good opinion. I like both, and I haven't yet decided which one is best.
The idea, I imagine, is that we can use tags to find questions that we're interested in. However, I'm not sure that this works: you can't use tags negatively. I'm not interested in PHP or perl or web development. I want to exclude such posts. But with the tags, I can't.
While it's currently the case that you can't use tags to hide content, it shouldn't be impossible. Using SO as an example again, there's no reason that a system similar to the ignore function on a forum couldn't be made for the tag system. By adding a right-click context menu or a small "X" link somewhere in the tag display, tags could be marked as ignored. This would also allow the current tag feature to keep working: seeing everything (minus your ignore list), or clicking a tag to see only questions with that tag.
Ignored tags could be managed in your profile if you should later develop an interest in PHP or INTERCAL that you lacked before.
The real question is that of performance. In my head it's as simple as replacing a SELECT [stuff] WHERE Tag = 'buffer-overflow' with SELECT [stuff] WHERE Tag NOT IN ('php','offtopic','funny-hat-friday') but I've not put together any DB backed sites that get absolutely pounded on by thousands of people.
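One wrinkle with the NOT IN idea: if tags live in a separate table with one row per (question, tag) pair, filtering rows with NOT IN still lets a question tagged both 'php' and 'sql' through via its 'sql' row. Below is a sketch with SQLite that excludes whole questions instead. The table layout, the sample data, and the function name are assumptions made for the example, and it says nothing about how this performs under real load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (question, tag) pair.
conn.executescript("""
    CREATE TABLE question_tags (question_id INTEGER, tag TEXT);
    INSERT INTO question_tags VALUES
        (1, 'c#'), (1, 'sql'),
        (2, 'php'), (2, 'sql'),
        (3, 'buffer-overflow');
""")


def visible_questions(conn, ignored):
    """Return ids of questions that carry none of the ignored tags."""
    placeholders = ",".join("?" * len(ignored))
    sql = f"""
        SELECT DISTINCT question_id FROM question_tags AS q
        WHERE NOT EXISTS (
            SELECT 1 FROM question_tags AS t
            WHERE t.question_id = q.question_id
              AND t.tag IN ({placeholders})
        )
        ORDER BY question_id
    """
    return [row[0] for row in conn.execute(sql, list(ignored))]


print(visible_questions(conn, ["php"]))
```

Only the placeholders are interpolated into the SQL string; the tag values themselves are passed as parameters.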