Mahout Log Likelihood similarity metric behaviour - distance

The problem I'm trying to solve is finding the right similarity metric, rescorer heuristic and filtration level for my data. (I'm using 'filtration level' to mean the number of ratings that a user or item must have associated with it to make it into the production database.)
Setup
I'm using Mahout's Taste collaborative filtering framework. My data comes in the form of triplets, where each rating is in the set {1,2,3,4,5}. I'm using an item-based recommender atop a log-likelihood similarity metric. I filter out users who rate fewer than 20 items from the production dataset. RMSE looks good (1.17ish) and there is no data capping going on, but there is an odd behavior that is undesirable and borders on error-like.
Question
First Call -- Generate a 'top items' list with no info from the user. To do this I use what I call a Centered Sum:
for i in items
    sum[i] = 0
    for r in i's ratings
        sum[i] += r - center
where center = (5+1)/2 = 3 for ratings on a scale of 1 to 5, for example
I use a centered sum instead of average ratings to generate a top items list mainly because I want the number of ratings that an item has received to factor into the ranking.
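A minimal Python sketch of the centered sum described above. It assumes the triplets are (user, item, rating); the function names and the sample data are made up for illustration and are not part of Mahout.

```python
from collections import defaultdict

def centered_sums(ratings, low=1, high=5):
    """Sum (rating - center) per item, with center the midpoint of the scale."""
    center = (low + high) / 2
    totals = defaultdict(float)
    for _user, item, rating in ratings:
        totals[item] += rating - center
    return dict(totals)

def top_items(ratings, n, low=1, high=5):
    """Return the n items with the largest centered sum."""
    sums = centered_sums(ratings, low, high)
    return sorted(sums, key=sums.get, reverse=True)[:n]

# Hypothetical sample data: (user, item, rating)
sample = [
    ("u1", "a", 5), ("u2", "a", 5), ("u3", "a", 4),
    ("u1", "b", 5),
    ("u2", "c", 1),
]
print(top_items(sample, 2))
```

Item "a" outranks item "b" here even though "b" has the higher average, because three above-center ratings add up to more than one.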
Second Call -- I ask for 9 similar items for each of the top items returned by the first call. For each top item, 7 out of the 9 similar items returned are the same as those returned for the other top items!
Is it about time to try some rescoring? Maybe multiplying the similarity of two games by (number of co-rated items)/x, where x is tuned (around 50 or something to begin with).
Thanks in advance fellas

You are asking for 50 items similar to some item X. Then you look for 9 similar items for each of those 50. And most of them are the same. Why is that surprising? Similar items ought to be similar to the same other items.
What's a "centered" sum? Ranking by sum rather than average still gives you a relatively similar output if the number of ratings in the sum is roughly the same for each item.
What problem are you trying to solve? None of this seems to have a bearing on the recommender system you describe, which you say works. Log-likelihood similarity is not even based on ratings.

Related

Qliksense: Compute median of grouped data

I'm facing an issue in QlikSense, trying to compute some statistical indicators (Percentiles, Quartiles, StdDev, Median etc.) on a dataset which is already grouped by the source.
I mean that my dataset is something similar to the following, in which I have for each combination of Week and Customer Age the total number of purchases:
I want to show the median of Customer Age, and due to the structure of the dataset I can't use fractile or median built-in functions, since they would come out with something different.
Let's suppose I want to calculate the median age of people across all 3 weeks, that is, I want to know the age below which 50% of my purchases fall.
To let you better understand the question, I show you the histogram:
In this case, the median I want to get is 24-26 years, since 50% of the total population falls under that range.
I found a useful reference here, but I am having trouble writing this formula in QlikSense
https://mba-lectures.com/statistics/descriptive-statistics/603/relationship-between-quartiles-decile...
Thanks a lot in advance.
[EDIT]: This is my Data Model View:
[EDIT 2]: Here is my qvf with a dataset more similar to the original one I'm using. As you can see, I can't get the correct result using your formula. In addition, I would like to use it in order to plot the trend of the median through weeks, but it doesn't seem to be possible (Even if I use the modified version of the formula I pointed out in the comments).
If you want to calculate the median in such a scenario, you need a weighted median: basically, check which dimension value sits in the middle of the cumulative distribution:
Aggr(
    If(
        (Rangesum(
            Above([# Purchases],0,RowNo())
        )
        /Sum(TOTAL [# Purchases]))>=0.5
        and
        (Rangesum(
            Above([# Purchases],1,RowNo()-1))
        /Sum(TOTAL [# Purchases]))<0.5
    ,[Customer Age])
,[Customer Age])
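
The logic of the expression above can be sketched outside Qlik. This is a minimal Python illustration, not Qlik code: it walks the age groups in ascending order and returns the first one whose cumulative share of purchases reaches 50% while the share before it is still below 50%. The function name and the sample figures are made up for illustration.

```python
def weighted_median(groups):
    """groups: iterable of (value, weight) pairs, e.g. (age, purchases)."""
    ordered = sorted(groups)
    total = sum(weight for _, weight in ordered)
    if total == 0:
        return None
    running = 0
    for value, weight in ordered:
        share_before = running / total   # cumulative share excluding this row
        running += weight
        share_upto = running / total     # cumulative share including this row
        if share_before < 0.5 <= share_upto:
            return value
    return None

# Hypothetical data: (customer age, number of purchases)
purchases_by_age = [(20, 10), (22, 15), (24, 30), (26, 25), (28, 20)]
print(weighted_median(purchases_by_age))
```

The sketch sorts by value explicitly, whereas the Qlik expression depends on the row order of the aggregated table.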

Tableau Dual Axis with different filters

I am trying to create a graph with two lines, with two filters from the same dimension.
I have a dimension which has 20+ values. I'd like one line to show data based on just one of the selected values and the other line to show a line excluding that same value.
I've tried the following:
-Creating a duplicate/copy dimension and filtering the original one with the first, and the copy with the 2nd. When I do this, the graphic disappears.
-Creating a calculated field that tries to split the measures up. This isn't letting me track the count.
I want this on the same axis; the best I've been able to do is create two sheets, one with the first filter and one with the 2nd, and stack them in a dashboard.
My end user wants the lines in the same visual, otherwise I'd be happy with the dashboard approach. Right now, though, I'd also like to know how to do this.
It is a little hard to tell exactly what you want to achieve, but the problem with filtering is common.
The principle that is important is that Tableau will filter the whole dataset by row. So duplicating the dimension you want to filter won't help as the filter on the original dimension will also filter the corresponding rows in the second dimension. Any solution has to be clever enough to work around this issue.
One solution is to build two new dimensions that use a calculation rather than a filter to create the new result. Let's say you have a dimension, [size] that has a range of numbers from 1 to 10 and you want to compare the total number of rows including and excluding the number 5. You could create a new field using a formula like if [size] <> 5 then 1 else 0 end
Summing the new field will give a count of the number of rows that don't contain a 5 and this can be compared directly to a rowcount of the original [size] field which will give the number including the value 5.
This basic principle can be extended to much more complex logic. The essential point is to realise that filters act on every row in your data and can't, by themselves, show comparisons with alternative filter choices on a single visualisation.
Depending on the nature of your problem there may be other solutions worth looking at including sets and groups but you would need to provide more specific details for users here to tell you whether they would be useful.
We can make a set out of the values of the dimension and then place it on the required shelf. Your dimension will then plot as usual, and the set will carry the data you need. A filter alone can't give you that independence, because it applies to everything in the view.

Recall, Recall rate#k and precision in top-k recommendation

According to authors in 1, 2, and 3, Recall is the percentage of relevant items selected out of all the relevant items in the repository, while Precision is the percentage of relevant items out of those items selected by the query.
Therefore, assuming user U gets a top-k recommended list of items, they would be something like:
Recall= (Relevant_Items_Recommended in top-k) / (Relevant_Items)
Precision= (Relevant_Items_Recommended in top-k) / (k_Items_Recommended)
Up to this point everything is clear, but I do not understand the difference between these and recall rate#k. What is the formula for computing recall rate#k?
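The two formulas above can be written as a short Python sketch. The function name and the sample items are made up for illustration; precision divides by k, as in the formula, even if fewer than k items are recommended.

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of items; relevant: collection of relevant items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 2 of the 5 recommended items are among 4 relevant ones
recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "x", "y"}
print(precision_recall_at_k(recommended, relevant, 5))
```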
Finally, I received an explanation from Prof. Yuri Malheiros (paper 1). Although recall rate#k, as used in the papers cited in the question, looks like the normal recall metric applied to a top-k list, the two are not the same. This metric is also used in paper 2 and paper 3.
The recall rate#k is a percentage computed over a set of test recommendations, where each recommendation is a list of items, some correct and some not. Suppose we make R = 50 different recommendations (regardless of the number of items in each). To calculate the recall rate, look at each of the 50 recommendations: whenever at least one recommended item in a recommendation is correct, increment a counter N. The recall rate is then N/R.
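A minimal Python sketch of that counting procedure, assuming each recommendation list has already been cut to k items and is paired with the set of items considered correct for that test case. The function name and the sample data are made up for illustration.

```python
def recall_rate(recommendations, correct_items):
    """Fraction of recommendation lists containing at least one correct item.

    recommendations: list of item lists (one per test case)
    correct_items:   list of sets of correct items, aligned with recommendations
    """
    if not recommendations:
        return 0.0
    n = sum(
        1
        for recs, correct in zip(recommendations, correct_items)
        if set(recs) & set(correct)
    )
    return n / len(recommendations)

# Hypothetical test run: R = 4 recommendations, 3 of them contain a correct item
recs = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
truth = [{"b"}, {"x"}, {"e", "f"}, {"h"}]
print(recall_rate(recs, truth))
```

Unlike ordinary recall, a list with two correct items counts the same as a list with one.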

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select groups of players whose stats meet certain criteria.
Once I have the subset of players I am interested in looking at further, I would like to find the mean of a column, e.g. Batting Average or RBIs. From there I would like to break all the players into percentile groups based on their average performance compared to all players: the top 10%, bottom 10%, 40-50%, and so on.
I've been able to use the DataFrame.describe() function to return a summary of a desired column (mean, stddev, count, min, and max), but all the values come back as strings. Is there a better way to get just the mean and stddev as Doubles, and what is the best way of breaking the players into groups of 10 percentiles?
So far my thoughts are to find the values that bookend the percentile ranges and writing a function that groups players via comparators, but that feels like it is bordering on reinventing the wheel.
I was able to get the percentiles by using window functions, applying ntile() and cumeDist() over the window. ntile() splits the ordered rows into the number of groups you pass in. If you want things grouped by 10%, use ntile(10); if by 5%, use ntile(20). For a more fine-tuned result, cumeDist() applied over the window will output a new column with the cumulative distribution, which can then be filtered through select(), where(), or a SQL query.
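To make clear what the two window functions compute, here is a plain-Python sketch of their usual SQL semantics over a single ascending-ordered window. It is an illustration only, not Spark code; the function names mirror the window functions, and the batting averages are made up.

```python
def ntile(values, n):
    """Map each index to a bucket 1..n over values sorted ascending.

    Bucket sizes differ by at most one; the earlier buckets get the extras.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    base, extra = divmod(len(values), n)
    buckets = {}
    pos = 0
    for b in range(1, n + 1):
        size = base + (1 if b <= extra else 0)
        for i in order[pos:pos + size]:
            buckets[i] = b
        pos += size
    return buckets

def cume_dist(values):
    """Fraction of rows with a value less than or equal to each row's value."""
    total = len(values)
    return [sum(1 for v in values if v <= x) / total for x in values]

# Hypothetical batting averages for ten players
averages = [0.250, 0.310, 0.275, 0.290, 0.260, 0.300, 0.240, 0.280, 0.270, 0.320]
groups = ntile(averages, 5)      # five groups of 20% each
dist = cume_dist(averages)
top_10_percent = [i for i, d in enumerate(dist) if d > 0.9]
print(groups, top_10_percent)
```

With ascending order the best players land in the highest bucket; ordering the window in descending order would put them in bucket 1 instead.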

tf-idf - accessing a large sparse scipy matrix & getting the highest values

For the tfidf result matrix, I want to get the top tfidf values. I saw how one can set a maximum number of features for the tfidf vectorizer, but that keeps the words with the top tf count. I still want the high tfidf values, which could include words with low tf. One idea I looked up is doing something like tf_idf_matrix.sum(axis=0), which sums up the columns. This works in my code, but because there are 113k columns, print won't show them all. If I could use something like argsort() to access the top K column sum values, that would be helpful.
This question stems off my original question which is here.
The reason is that I want to know which words I should look at more closely, and not necessarily the ones that have the highest frequency. I would also like to know about the "anomalies", that is, words that might not appear in all or many documents/posts but could have a high tfidf in one or a few documents. I wanted to explain this in case there are other approaches I should consider.
Thanks
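
A minimal sketch of the sum-then-argsort idea from the question, assuming tf_idf_matrix is a SciPy sparse matrix with one column per term. The helper name and the tiny matrix are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_k_columns(matrix, k):
    """Return (column index, column sum) pairs for the k largest column sums."""
    # sum(axis=0) on a sparse matrix gives a 1 x n_columns result; flatten it
    col_sums = np.asarray(matrix.sum(axis=0)).ravel()
    top = np.argsort(col_sums)[::-1][:k]   # indices sorted by descending sum
    return [(int(j), float(col_sums[j])) for j in top]

# Hypothetical 3-document, 3-term tf-idf matrix
tf_idf_matrix = csr_matrix(np.array([
    [0.0, 0.5, 0.1],
    [0.9, 0.0, 0.1],
    [0.0, 0.2, 0.0],
]))
print(top_k_columns(tf_idf_matrix, 2))
```

The returned column indices can be mapped back to words through the vectorizer's vocabulary. Ranking by a per-column maximum instead of the sum would favour words that score highly in only one document.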