What's a good way to "weight" the importance of item/user features in LightFM? - lightfm

I have been using LightFM in an e-commerce application with a decent amount of success. Thanks for this really cool package!
I'm in the process of optimizing the recommendation outputs. Because of specific business-domain considerations, certain features are "more important" and have less leeway in terms of variability than others, such as "price" (i.e. certain recommended items need to fall within a soft price range).
Is there a way for LightFM to weight this feature MORE than the others?
Thanks in advance!
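For what it's worth, one rough lever is the per-feature weight you pass when building the item-feature matrix: LightFM's Dataset.build_item_features accepts (item id, {feature name: feature weight}) pairs, so giving the price-related feature a larger weight makes it contribute more to each item's representation. This is only a minimal sketch, not an official "feature importance" mechanism; the feature names, the 3.0 weight and the toy data below are made up for illustration.

```python
# Minimal sketch: emphasize a "price" feature by giving it a larger weight
# in the item-feature matrix. All ids, feature names and weights are hypothetical.
from lightfm import LightFM
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(
    users=["u1", "u2"],
    items=["i1", "i2"],
    item_features=["price_low", "price_high", "color_red", "color_blue"],
)

# Per-item feature weights: the price bucket gets weight 3.0, other features 1.0,
# so the price feature's embedding contributes more to each item's representation.
item_features = dataset.build_item_features(
    [
        ("i1", {"price_low": 3.0, "color_red": 1.0}),
        ("i2", {"price_high": 3.0, "color_blue": 1.0}),
    ],
    normalize=False,  # keep the raw weights instead of normalizing each row to sum to 1
)

(interactions, _) = dataset.build_interactions([("u1", "i1"), ("u2", "i2")])

model = LightFM(loss="warp")
model.fit(interactions, item_features=item_features, epochs=5)
```

The weight only scales how strongly that feature's (learned) embedding enters the item representation, so it typically needs to be tuned against offline metrics rather than set from business rules alone.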

Related

Does OptaPlanner have a "built-in" way to perform multi-unit score normalization?

At the moment, my problem has four metrics. Each of these measures something entirely different (each has different units, a different range, etc.) and each is weighted externally. I am using Drools for scoring.
I have only one score level (SimpleLongScore), and I have to find a way to appropriately combine the individual scores of these metrics into one long value.
The most significant problem at the moment is that the range of values for the metrics can be wildly different.
So if, for example, a move improves the score of a metric with a small possible range by, say, 10%, that improvement could be completely dwarfed by an alternate move that improves a metric with a much larger range by only 1%, because (to my knowledge) OptaPlanner only considers the actual score value rather than the possible range of values and how changes affect them proportionally.
So, is there a clean way to handle this that is already part of OptaPlanner and that I simply cannot find?
Is the only feasible solution to implement Pareto scoring? Because that seems like a hack-y nightmare.
So far I have code/math to compute the best-possible and worst-possible scores for a metric, which I access from within the Drools rules, and then I can compute where in that range a move puts us, but this also feels quite hack-y and will cause issues with incremental scoring if we want to scale non-linearly within that range.
I keep coming back to thinking I should just bite the bullet and implement Pareto scoring.
Thanks!
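(For reference, here is a plain-Python sketch, outside of OptaPlanner/Drools, of the range-based normalization described above, just to make the arithmetic concrete. The metric names, ranges and weights are hypothetical.)

```python
# Plain-Python sketch of range-based normalization of differently scaled metrics
# into one integer score (SimpleLongScore-style). Names, ranges and weights are made up.

SCALE = 1_000_000  # resolution used to pack the [0, 1] normalized total into one long


def normalized(value, worst, best):
    """Map a raw metric value onto [0, 1], where 0 = worst possible, 1 = best possible."""
    return (value - worst) / (best - worst)


def combined_score(metrics, weights):
    """Weighted sum of normalized metrics, scaled to an integer."""
    total = sum(
        weights[name] * normalized(value, worst, best)
        for name, (value, worst, best) in metrics.items()
    )
    return int(total * SCALE)


# Example: a 10% improvement on a small-range metric is no longer dwarfed by a
# 1% improvement on a large-range metric, because both are compared on [0, 1].
metrics = {
    "metricA": (45, 0, 50),              # small possible range
    "metricB": (400_000, 0, 1_000_000),  # large possible range
}
weights = {"metricA": 2.0, "metricB": 1.0}
print(combined_score(metrics, weights))
```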
Take a look at @ConstraintConfiguration and @ConstraintWeight in the docs.
Also take a look at the chapter "explaining the score", which can tell you exactly which constraint had which score impact on the best solution found.
If, however, you need Pareto optimization (that is, multiple best solutions that don't dominate each other), then know that OptaPlanner doesn't support that yet, but I know of 2 cases that implemented it in OptaPlanner by hacking BestSolutionRecaller.
That being said, 99% of the cases that think of Pareto optimization are 100% happy with @ConstraintWeight instead, because users don't want multiple best solutions (except during simulations); they just want one in production.

What does the impact search annotation do in MiniZinc?

In MiniZinc it is possible to use the search annotation impact, which is explained as follows on the official website:
annotation impact
Choose the variable with the highest impact so far during the search
What does this mean in practice? What is the highest impact? How is this calculated?
To understand impact-based variable selection, you first have to understand first_fail. In constraint programming we generally want to solve the hardest sub-problem first, failing quickly if no solution can be found. The problem with first_fail is that it doesn't take into account the number of constraints a variable is involved in (more constraints would suggest that a decision for that variable is "harder"), or the effect that choices on the variable had in other parts of the search tree.
As a side note, dom_w_deg can be seen as a compromise between first_fail and impact, where the constraints are taken into account but past decisions are not.
impact variable selection is supposed to be an improvement on first_fail where not just domain sizes are considered, but also the constraints a variable is involved in and how much "impact" historical choices on that variable had. The variable with the highest impact is the one expected to be the hardest to assign the right value to, taking all of this information into account.
As you've seen, MiniZinc does not provide an exact specification of how the variable choice has to be made. It is up to the solver implementer to select a heuristic that fits the solver. Note that it would be hard to provide an exact heuristic guideline, as it would heavily depend on how the solver tracks its variables and constraints.
For ideas on possible implementations of impact based heuristics, I would suggest reading the paper "On the Efficiency of Impact Based Heuristics" by Marco Correia and Pedro Barahona. You can also check your specific MiniZinc/FlatZinc solver for their implementation of the heuristic.
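To make "impact" a bit more concrete: a common definition (in the style of Refalo's impact-based search) measures the relative reduction of the remaining search space caused by an assignment, estimated from the product of all domain sizes before and after propagation. Below is a toy sketch of that bookkeeping, not tied to MiniZinc or to any particular solver's implementation; the domains and the "result of propagation" are invented for illustration.

```python
# Toy sketch of a Refalo-style impact measure. Domains are plain Python sets and
# the effect of propagation is hard-coded; a real solver maintains this incrementally.
from math import prod


def search_space_size(domains):
    """Product of the domain sizes of all variables."""
    return prod(len(d) for d in domains.values())


def impact(domains_before, domains_after):
    """1 - P_after / P_before: 0 means no pruning, values near 1 mean heavy pruning."""
    return 1.0 - search_space_size(domains_after) / search_space_size(domains_before)


# During search, a solver would average the observed impacts per variable and
# branch on the variable with the highest average impact so far.
domains_before = {"x": {1, 2, 3, 4}, "y": {1, 2, 3, 4}, "z": {1, 2, 3, 4}}
domains_after = {"x": {2}, "y": {1, 2}, "z": {1, 2, 3}}  # pretend propagation of x = 2
print(impact(domains_before, domains_after))  # 0.90625
```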

Clustering Category Purchases in Customer Data

I am attempting to cluster a group of customers based on spend, order frequency, order breadth and what % of purchases they make in each category (there are around 20).
It will probably be a simple answer, but I cannot figure out whether I should standardize (subtract the mean and divide by the standard deviation) the % category-buy columns or not. When I don't standardize, I can get around 90% of the variance explained in 4-5 principal components (using SVD), but when I standardize each column I only get around 40% for the same number of principal components. My worry is that because each column is related, I am removing the relationship by standardizing. At the same time I am worried that not standardizing will cause issues with the other variables in the data that I have standardized.
I would assume that if others tried clustering in this way they would face a similar issue, but I can't seem to find one, so it might be that I just don't understand the situation. Thanks for any clarification in advance!
Chris,
A percentage scale has a well-defined range and nice properties.
By heuristically scaling these features you usually make things worse.
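A minimal sketch of that middle ground (standardize only the unbounded behavioural features, leave the percentage shares on their natural scale) using scikit-learn; the column names and the synthetic data below are made up for illustration.

```python
# Sketch: scale only the unbounded columns, pass the % category-share columns through.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "spend": rng.gamma(2.0, 100.0, n),
    "order_frequency": rng.poisson(5, n),
    "order_breadth": rng.integers(1, 20, n),
})
# Fake category-share columns that sum to 1 per customer (already on a comparable scale).
share_cols = [f"cat_{i}_pct" for i in range(20)]
df[share_cols] = rng.dirichlet(np.ones(20), n)

pre = ColumnTransformer([
    ("scaled", StandardScaler(), ["spend", "order_frequency", "order_breadth"]),
    ("shares", "passthrough", share_cols),
])
X = pre.fit_transform(df)

pca = PCA(n_components=5).fit(X)
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance explained
```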

How can I set a goal for a recommendation system? (mean average precision, baselineRmse)

I am starting to develop an offline recommendation system using the ALS algorithm, and I need to set a performance goal for it.
So I want to know what criteria are used to evaluate recommendation systems.
I already know about MAP (mean average precision) and improvement over the baselineRmse, and I would like to know what performance modern recommendation systems achieve on these criteria so I can set my goal.
Back in the early days of recommenders, people thought predicting ratings was a good idea. This has since proven to be nearly useless in itself. If you only have enough space in a UI to show a few recommendations, are you going to pick the ones you think the user would rate highest? That will always result in bad performance. Rating prediction is what RMSE was designed to measure.
MAP@k, on the other hand, is meant to measure the predictiveness of a recommender. It measures how well the training data predicts what is in the test data. It also accounts for the ordering of recommendations. Ranking/ordering of recommendations has more recently been found to have a much greater effect on the effectiveness of recommendations, because if you can only show a limited number they had better be the ones most likely to cause a user to take action.
MAP@k also takes ranking into account in the sense that if you measure MAP@1 and MAP@10, you will see decreasing MAP scores if your first recommendation was more likely to be in the test data than the 10th. This means you are ordering recommendations roughly correctly.
For these reasons we use MAP@k. Split the "gold standard" dataset you will use in later tests and keep the split static; something like 80%-20% will work, split either by random choice or by time, with the most recent 20% used as the test split. Build your model on the 80%, then for each interaction in the 20% get recommendations and see if they contain the item actually interacted with in the test set. The aggregate of all of these goes into the MAP@k calculation, where k is based on how many recommendations you ask for.
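For reference, a minimal MAP@k sketch along the lines of the apk/mapk functions referenced below; the item IDs in the example are made up.

```python
# Minimal MAP@k sketch (same shape as the widely used apk/mapk functions).

def apk(actual, predicted, k=10):
    """Average precision at k for one user: actual = held-out items, predicted = ranked recs."""
    predicted = predicted[:k]
    score, hits = 0.0, 0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)
    if not actual:
        return 0.0
    return score / min(len(actual), k)


def mapk(actual_lists, predicted_lists, k=10):
    """Mean of apk over all users."""
    return sum(apk(a, p, k) for a, p in zip(actual_lists, predicted_lists)) / len(actual_lists)


# Example: held-out test items for two users vs. the top-3 recommendations
# produced by a model trained on the other 80% of the data.
print(mapk([["itemA", "itemC"], ["itemB"]],
           [["itemA", "itemB", "itemC"], ["itemD", "itemB", "itemE"]], k=3))
```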
See these references and some tools we have to do this:
Kaggle blog post that references the Python code they and we at ActionML use: https://www.kaggle.com/wiki/MeanAveragePrecision
ActionML analysis Python code to split data sets and run MAP@k, where we use the Kaggle function: https://github.com/actionml/analysis-tools

Clustering or classification?

I am stuck between a decision to apply classification or clustering to the data set I have. The more I think about it, the more confused I get. Here's what I am confronted with.
I have news documents (around 3000 and continuously increasing) containing news about companies, investment, stocks, the economy, quarterly income, etc. My goal is to have the news sorted in such a way that I know which news items correspond to which company. For example, for the news item "Apple launches new iPhone", I need to associate the company Apple with it. A particular news item/document only contains a 'title' and a 'description', so I have to analyze the text in order to find out which company the news refers to. It could be multiple companies too.
To solve this, I turned to Mahout.
I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel', etc. as top terms in my clusters, and from there I would know that the news in a cluster corresponds to its cluster label, but things were a bit different. I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal', 'shares', 'street', 'olympics' and lots of other terms as the top ones (which makes sense, as clustering algorithms look for common terms). Although there were some 'Apple' clusters, the news items associated with them were very few. I thought maybe clustering is not for this kind of problem, as many of the company news items go into more general clusters (investment, profit) instead of the specific company cluster (Apple).
I started reading about classification, which requires training data. The name was convincing too, as I actually want to 'classify' my news items into 'company names'. As I read on, I got the impression that the name 'classification' is a bit deceiving and that the technique is used more for prediction purposes than for classification as such. The other confusion I had was how to prepare training data for news documents. Let's assume I have a list of companies that I am interested in. I write a program to produce training data for the classifier: the program checks whether the news title or description contains the company name 'Apple', and if so, it's a news story about Apple. Is this how I can prepare training data? (Of course, I read that training data is actually a set of predictors and target variables.) If so, then why should I use Mahout classification in the first place? I could ditch Mahout and instead use this little program that I wrote to generate the training data (which actually does the classification itself).
You can see how confused I am about how to address this issue. Another thing I wonder about is whether it is possible to make a system intelligent enough that, if the news says 'iPhone sales at a record high' without using the word 'Apple', the system can still classify it as news related to Apple.
Thank you in advance for pointing me in the right direction.
Copying my reply from the mailing list:
Classifiers are supervised learning algorithms, so you need to provide a bunch of examples of positive and negative classes. In your example, it would be fine to label a bunch of articles as "about Apple" or not, then use feature vectors derived from TF-IDF as input, with these labels, to train a classifier that can tell when an article is "about Apple".
I don't think it will quite work to automatically generate the training set by labeling according to the simple rule that an article is about Apple if 'Apple' is in the title. If you do that, then there is no point in training a classifier: you can make a trivial classifier that achieves 100% accuracy on your test set by just checking whether 'Apple' is in the title! Yes, you are right, this gains you nothing.
Clearly you want to learn something subtler from the classifier, so that an article titled "Apple juice shown to reduce risk of dementia" isn't classified as about the company. You'd really need to feed it hand-classified documents.
That's the bad news, but you can certainly train N classifiers for N topics this way.
Classifiers put items into a class or not. They are not the same as regression techniques, which predict a continuous value for an input. They're related but distinct.
Clustering has the advantage of being unsupervised. You don't need labels. However, the resulting clusters are not guaranteed to match up with your notion of article topics. You may see a cluster that has a lot of Apple articles, some about the iPod, but also some about Samsung and laptops in general. I don't think this is the best tool for your problem.
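To show the shape of that workflow, here is a minimal sketch using scikit-learn rather than Mahout (which the thread is actually about); the tiny hand-labeled examples are invented, and in practice you would need far more labeled documents.

```python
# Minimal TF-IDF + classifier sketch for "about the company Apple" vs. not.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "Apple launches new iPhone with record preorders",
    "Apple reports quarterly income above expectations",
    "Apple juice shown to reduce risk of dementia",
    "Ford announces new electric truck for 2025",
]
labels = [1, 1, 0, 0]  # 1 = about the company Apple, hand-labeled

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(docs, labels)

# With enough labeled examples, shared vocabulary (iPhone, quarterly income, shares)
# is what lets a classifier pick up company news that never mentions the word "Apple".
print(clf.predict(["iPhone sales at a record high"]))
```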
First of all, you don't need Mahout. 3000 documents is close to nothing; revisit Mahout when you hit a million. I've been processing 100,000 images on a single computer, so you really can skip the overhead of Mahout for now.
What you are trying to do sounds like classification to me, because you have predefined classes.
A clustering algorithm is unsupervised. It will (unless you overfit the parameters) likely break Apple into "iPad/iPhone" and "Macbook". Or on the other hand, it may merge Apple and Google, as they are closely related (much more than, say, Apple and Ford).
Yes, you need training data that reflects the structure you want to measure. There is other structure, too (e.g. iPhones not being the same as MacBooks, and Google, Facebook and Apple being more similar to each other than Kellogg's, Ford and Apple are). If you want a company-level structure, you need training data at this level of detail.