Classifying items in more than one category - classification

I am developing a news classification system in which each news item is assigned to an organization or company name. For instance, a news item titled "Apple to launch new iPhone in September 2012" is categorized as "Apple" news.
So far, training the classifier on a set of topics such as Apple news, Google news, Microsoft news, Samsung news, Bank of America news etc. has worked well: I was getting almost 99% correctly classified instances from a single trained model.
Now the problem is to classify a news item such as "Samsung and Google prep attack against Apple" into three topics: "Apple", "Samsung" and "Google".
My question is how I can use Mahout's classification to classify a single item into multiple classes. I saw a similar question in this thread: http://mail-archives.apache.org/mod_mbox/mahout-user/201206.mbox/%3C20120607223156.GA26283#opus.istwok.net%3E.
Ted Dunning gave an interesting answer, suggesting a separate category for each combination of topics, but in my case there are too many combinations. I have to classify news into almost 15,000 companies and, realistically speaking, any news item can involve any mixture of those 15,000 companies. So making each combination a separate category is ruled out.
A second suggestion was to arrange topics in a hierarchy, which does not apply here either, as the company names don't converge to any base category.
Having 15,000 models for 15,000 topics would do it, but that does not sound very practical either!
So what is the correct way to classify multi-topic news?
Thanks!

If you are confronted with a multi-label problem, it is better to use a tool that is meant specifically for it. Currently Mahout doesn't support multi-label classification (there are some ways to do it, but they are workarounds). Here are a few tools for multi-label classification:
http://mulan.sourceforge.net/
http://meka.sourceforge.net/
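
To make the idea of multi-label classification concrete, here is a minimal sketch of "binary relevance", one of the simpler strategies that multi-label tools offer: train one independent yes/no classifier per label and assign every label whose classifier fires. This is plain Python rather than Mahout, Mulan or Meka code; the class and function names, the tiny naive Bayes base learner and the toy headlines are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer; a stand-in for real feature extraction."""
    return text.lower().split()


class BinaryNaiveBayes:
    """Multinomial naive Bayes answering one question: does this label apply?"""

    def fit(self, samples):
        # samples: list of (tokens, applies) pairs, where applies is True or False.
        self.token_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        for tokens, applies in samples:
            self.token_counts[applies].update(tokens)
            self.doc_counts[applies] += 1
        self.vocabulary = set(self.token_counts[True]) | set(self.token_counts[False])
        return self

    def log_odds(self, tokens):
        # Prior log-odds, with add-one smoothing on the document counts.
        score = math.log((self.doc_counts[True] + 1) / (self.doc_counts[False] + 1))
        size = len(self.vocabulary)
        totals = {c: sum(self.token_counts[c].values()) for c in (True, False)}
        for token in tokens:
            if token not in self.vocabulary:
                continue  # unseen words carry no evidence either way
            positive = (self.token_counts[True][token] + 1) / (totals[True] + size)
            negative = (self.token_counts[False][token] + 1) / (totals[False] + size)
            score += math.log(positive / negative)
        return score


def train_binary_relevance(documents):
    """documents: list of (text, set_of_labels). Returns {label: model}."""
    labels = set().union(*(doc_labels for _, doc_labels in documents))
    tokenized = [(tokenize(text), doc_labels) for text, doc_labels in documents]
    return {
        label: BinaryNaiveBayes().fit(
            [(tokens, label in doc_labels) for tokens, doc_labels in tokenized]
        )
        for label in labels
    }


def predict_labels(models, text):
    """Return every label whose binary model votes 'yes' for this text."""
    tokens = tokenize(text)
    return {label for label, model in models.items() if model.log_odds(tokens) > 0}


# Toy training set (hypothetical headlines); note that some items carry two labels.
TRAINING = [
    ("apple launches new iphone", {"Apple"}),
    ("google updates android search", {"Google"}),
    ("samsung ships galaxy phone", {"Samsung"}),
    ("apple sues samsung over patents", {"Apple", "Samsung"}),
    ("google and apple compete on maps", {"Google", "Apple"}),
    ("samsung and google partner on android", {"Samsung", "Google"}),
]

models = train_binary_relevance(TRAINING)
print(sorted(predict_labels(models, "Samsung and Google prep attack against Apple")))
```

Each label's model is trained and applied independently, so the number of models grows with the number of labels but not with the number of label combinations.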

aws-personalize: can I get recommendations on items not seen in training based on item features?

I am considering using AWS Personalize, or a similar managed recommendation service.
My question is whether it is possible to get recommendations/rankings for items that were not seen in the training data, based on item features. I see that AWS Personalize does have an item feature dataset, but the documentation for the ranking recipe specifically says that items not in the training data are added at the end of any ranking. Of course, new items have no interaction data, so any recipe/algorithm that relies solely on interaction data is not relevant for my case.
So: can I use AWS Personalize for this use case, and if so how? If not, do you know of any recommender service that can handle it?
Yes. There are specific Amazon Personalize recipes designed to support cold starting items where a cold item is one without behavioral data in the interactions dataset but with item metadata in the items dataset.
The User-Personalization recipe supports cold starting items through a feature called exploration. You control how much exploration (i.e., recommending cold items) is done with the explorationWeight inference hyperparameter when creating a Personalize campaign or batch inference job. See this blog post for details.
Exploration also applies to domain recommenders for the Top picks for you VOD recommender and Recommended for you e-commerce recommender. You specify the explorationWeight when creating a recommender.
The Similar-Items recipe supports the related items use case and looks to balance recommending similar items based on behavioral data and thematic similarity between items. You currently cannot control the weighting with this recipe, though. See this blog post for details. The More like X VOD recommender provides similar functionality.
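
As a rough illustration of where explorationWeight goes when creating a campaign, the sketch below builds the request for the boto3 `create_campaign` call. The field names `campaignConfig` and `itemExplorationConfig` are as I recall them from the boto3 Personalize client and should be checked against the current API reference; the helper function, the campaign name and the ARN are hypothetical placeholders.

```python
def build_campaign_request(name, solution_version_arn, exploration_weight=0.3):
    """Build keyword arguments for a Personalize create_campaign call.

    exploration_weight: 0 means no exploration of cold items; values closer
    to 1 mean more exploration.
    """
    if not 0.0 <= exploration_weight <= 1.0:
        raise ValueError("exploration_weight must be between 0 and 1")
    return {
        "name": name,
        "solutionVersionArn": solution_version_arn,
        "minProvisionedTPS": 1,
        "campaignConfig": {
            "itemExplorationConfig": {
                # Values in this map are passed as strings.
                "explorationWeight": str(exploration_weight),
            }
        },
    }


# Placeholder ARN for illustration only; use the ARN of your own solution version.
request = build_campaign_request(
    "cold-start-demo",
    "arn:aws:personalize:us-east-1:123456789012:solution/demo/version-id",
    exploration_weight=0.5,
)

# With boto3 installed and AWS credentials configured, the request would be sent as:
#   import boto3
#   boto3.client("personalize").create_campaign(**request)
print(request["campaignConfig"])
```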

Difference between MusicGroup and Person in schema.org

My website has a directory of musicians, music groups and institutions (luthiers, concert halls, etc.).
According to the official schema.org documentation for MusicGroup, the type can also be used for a solo musician.
My questions are:
1. Should I assign "Person" to solo musicians and "MusicGroup" to music groups?
2. If I assign "MusicGroup" to solo musicians, will their thumbnail be displayed in the Google search results as a rich snippet? (I guess that if I assign "Person" to solo musicians it will be displayed.)
3. The same as question 2, but for music groups.
4. The same as question 2, but assigning "Organization" to institutions.
I am very interested in showing the thumbnail in the search results, but I also want the markup to be logical and semantically correct.
1. As the Schema.org documentation says, "MusicGroup" can refer to a solo musician, so you should use it for solo musicians too. It has the benefit of the "album" and "track" properties, which let you describe the solo musician's recorded works (these properties are not available on "Person").
2. Probably, but that's entirely up to how Google decides to display search results for your site.
3. As per the answer to (2).
4. As per the answer to (2).
To expand on my answers to your questions 2, 3 and 4:
Google only displays thumbnails for search results in a few rare cases. My suspicion (Google's decision-making process is not public) is that thumbnails in search results are more dependent on having a very high-ranking website and therefore - in Google's eyes - a highly trustworthy and reliable source of information, rather than on the particular schema.org "type" you use.
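
For illustration, here is a minimal sketch of JSON-LD markup that describes a solo musician as a "MusicGroup" with an "album" entry. The musician's name, the image URL and the album title are made up, and the helper function is hypothetical; only the type and property names come from schema.org.

```python
import json


def solo_musician_jsonld(name, image_url, albums):
    """Describe a solo musician as a schema.org MusicGroup with their albums."""
    return {
        "@context": "https://schema.org",
        "@type": "MusicGroup",
        "name": name,
        "image": image_url,
        "album": [{"@type": "MusicAlbum", "name": title} for title in albums],
    }


# Hypothetical directory entry.
markup = solo_musician_jsonld(
    "Jane Example",
    "https://example.com/images/jane-example.jpg",
    ["First Recordings"],
)

# Emit the script element that would be embedded in the musician's page.
print('<script type="application/ld+json">')
print(json.dumps(markup, indent=2))
print("</script>")
```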

New system cold start: Recommender systems

I have built a recommender system that has tens of thousands of items and their feature descriptions, but no user profiles as of now. I am looking for pointers to approaches that can help me bootstrap the system, so I can do some evaluation. I would appreciate any pointers to papers/applications that have addressed this problem.
How to deal with the cold-start problem depends a lot on your specific application.
An easy way of dealing with the user cold-start problem is to present the new user with random items, or the most popular items, or hand-selected items, and start learning from them.
Another way is to present users with a questionnaire, and then present items to them according to the results. Or you directly show them items/products and let them rate/select the ones they like.
Also note that in a web-based system you usually know some things about your users: which operating system/browser they use, where they (roughly) come from, which language they speak.
All this information can be used.
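
The bootstrapping ideas above can be sketched as follows: a new user with no history gets a hand-selected starter list, and once they have liked a few items, the remaining items are ranked by cosine similarity between their features and the features of the liked items. This is a minimal content-based sketch, not a full recommender; the item names, feature weights and function names are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse feature dictionaries."""
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(item_features, liked_ids, starter_ids, top_n=2):
    """Fall back to starter items until the user has liked something."""
    if not liked_ids:
        return starter_ids[:top_n]
    # The user profile is the sum of the feature vectors of the liked items.
    profile = {}
    for item_id in liked_ids:
        for feature, weight in item_features[item_id].items():
            profile[feature] = profile.get(feature, 0.0) + weight
    candidates = [item_id for item_id in item_features if item_id not in liked_ids]
    candidates.sort(key=lambda item_id: cosine(profile, item_features[item_id]), reverse=True)
    return candidates[:top_n]


# Hypothetical catalogue with hand-assigned feature weights.
ITEM_FEATURES = {
    "thriller_novel": {"book": 1.0, "thriller": 1.0},
    "crime_novel": {"book": 1.0, "crime": 1.0, "thriller": 0.5},
    "cookbook": {"book": 1.0, "cooking": 1.0},
    "frying_pan": {"kitchen": 1.0, "cooking": 1.0},
}
STARTERS = ["cookbook", "thriller_novel"]  # hand-selected items for brand-new users

print(recommend(ITEM_FEATURES, [], STARTERS))
print(recommend(ITEM_FEATURES, ["thriller_novel"], STARTERS))
```

The first call returns the starter list unchanged; the second ranks the catalogue against the one liked item, so the other thriller comes first.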
Papers:
see the Wikipedia article on the topic
My answer to another question on StackOverflow lists some papers for dealing with new items - most of the methods would also be applicable to new users.
Another approach is to select the products/items that will help you most in learning about the user. Off the top of my head, you can find relevant papers by querying Google Scholar for "recommendation" together with terms such as "decision trees", "active learning", "user cold-start", and so on.

Recommendation engine using google-prediction-api?

Google's Prediction API page says it can be used for recommending web pages / products...
Can someone please show me how, for example:
I have the purchase history of 500,000 members
I have 2,000,000 products in 200 different categories
User X has just signed up, and I have asked them 15 'like' / 'dislike' product questions (to learn the user's taste)
Now I want to recommend to user X a list (e.g. 500) of products which they are most likely to purchase
Thanks a lot
If you are not specifically tied to the Google API for whatever reason, explore using Mahout. This is a basic use case for Mahout's recommendation mining.
https://cwiki.apache.org/MAHOUT/itembased-collaborative-filtering.html
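
To show the general shape of item-based collaborative filtering, here is a minimal sketch that counts how often two products are bought by the same member and then scores candidates by their co-occurrence with the products a new user liked. It is plain Python, not Mahout's API, and plain co-occurrence counts are only one of several possible similarity measures; the data and function names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(purchases):
    """purchases: {member: set of product ids}. Returns {item: {other_item: count}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in purchases.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(cooccurrence, liked, top_n=3):
    """Rank products by how often they were bought together with the liked ones."""
    scores = defaultdict(int)
    for item in liked:
        for other, count in cooccurrence.get(item, {}).items():
            if other not in liked:
                scores[other] += count
    # Highest score first; ties are broken alphabetically to keep the output stable.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase histories.
PURCHASES = {
    "u1": {"tent", "sleeping_bag", "stove"},
    "u2": {"tent", "sleeping_bag"},
    "u3": {"tent", "stove", "lantern"},
    "u4": {"novel", "bookmark"},
}

cooccurrence = build_cooccurrence(PURCHASES)
# The new user's 'like' answers act as the seed set.
print(recommend(cooccurrence, {"tent"}))
```

The 'dislike' answers are not used here; one option would be to filter out candidates that co-occur strongly with disliked products.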
The Google Prediction API, as currently implemented, is great for classifying data into a discrete set of categories. However, as noted in the documentation:
Avoid having a high ratio of categories to training data in categorical models.
Try to have at least a few dozen examples for each category, minimum.
For really good predictions, a few hundred examples per category is recommended.
The Prediction API's classification doesn't work well when the ratio of categories to examples is high. In the example you sketched the ratio is one-to-one, because you are trying to find the user whose list of liked products is most similar to that of the user of interest (in order to find a set of promising products to recommend). In this model, each user is a unique category.

Is there a well known classifier library?

I'm crawling data from the internet, without classifying it.
Is there a classification library you would recommend?
EDIT
I'm crawling jobs from other websites, and I need to group them into different industries.
To sort unlabelled data into groups, you want clustering, not classification. The most complete machine learning library is the Java-based Weka. You'll probably want to start by extracting text from the web pages (remove script and style elements completely, strip other tags), and then running the text through the StringToWordVector filter before performing clustering.
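
The preprocessing step described above (drop script and style elements, strip the remaining tags, turn the text into word counts) can be sketched with the Python standard library. This is only a stand-in for Weka's StringToWordVector filter, which offers far more options; the class name, function name and sample page are made up for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text content, skipping everything inside script and style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_vector(html):
    """Return a bag-of-words Counter for the visible text of an HTML page."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return Counter(re.findall(r"[a-z]+", text))


# Hypothetical crawled job posting.
PAGE = """<html><head><style>p { color: red; }</style>
<script>var tracker = "engineer";</script></head>
<body><h1>Software Engineer</h1><p>Java engineer wanted.</p></body></html>"""

vector = word_vector(PAGE)
print(vector)
```

The resulting word counts are the kind of feature vector a clustering algorithm can then group by similarity.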
My current employer developed a system to categorize web pages. There were not any useful libraries that we could find so we had to do our own. We do not license ours out.
I can give you some hints. Spam analyzers classify email into Junk or Not Junk. You can use the same tools, such as Bayesian filters, CRM-114, etc., to do your own classification of any text, including web pages.
You will have to watch the results of these very carefully and give them a lot of human feedback. You can often find keyword sets that will score very well for you. Finding those keyword sets will take time and effort, and they will change somewhat over time.
You will have to write code to divide web pages into topic sections because most pages are not all one thing. There are ad frames, navigation and other things.