Why am I getting a lot of 0 predictions in collaborative filtering using Alternating Least Squares? - pyspark

Here the userCol is sellers and the itemCol is the products sold by them. The ratingCol is the sales amount for a given seller-product combination. The sales amounts range from a few hundred to millions.
Questions
Should I treat sales as implicit feedback (implicitPrefs = True)? Currently I am treating it as explicit feedback and setting it to False.
When I look at the predictions, around 70 percent of the observations get a value of 0. Any reason why this is happening? The records with non-zero values are fairly accurate, though.
When I get the predicted values for my test set, I see that far fewer records are returned. Is it because the model is encountering users or items not present in the training set? I had set coldStartStrategy = "drop".
predictions = model.transform(test)
The predictions DataFrame has significantly fewer rows than the test DataFrame.
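For reference, a minimal sketch of this kind of setup, assuming hypothetical column names seller_id, product_id, and sales_amount (ALS requires integer-typed ids); log-scaling the wide-ranging sales amounts is one common option when switching to implicit feedback:

from pyspark.ml.recommendation import ALS
from pyspark.sql import functions as F

# Sales amounts range from a few hundred to millions, so compress the
# range before using them as implicit-feedback confidence values.
train = train.withColumn("rating", F.log1p(F.col("sales_amount")))
test = test.withColumn("rating", F.log1p(F.col("sales_amount")))

als = ALS(
    userCol="seller_id",        # sellers play the role of "users"
    itemCol="product_id",       # products play the role of "items"
    ratingCol="rating",
    implicitPrefs=True,         # treat sales as implicit feedback
    coldStartStrategy="drop",   # drop test rows with unseen sellers/products
)
model = als.fit(train)
predictions = model.transform(test)

Because coldStartStrategy="drop" removes every test row whose seller or product never appeared in training, predictions.count() can be well below test.count(), which is consistent with the shorter DataFrame described above.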

Related

Tableau Weighted Average of Last Value in Date Group over Running Sum Across Extra Level of Detail not in Report

I am an absolute Tableau beginner, so forgive my lack of proper terminology.
Context
To give some context to the problem, think of the dataset as the balances and current interest rates of two different loans for which we are trying to calculate a weighted average cost of funds at any point in time, while retaining the ability to filter on Program (specific loan).
I have a single dataset that looks like:
The Balance field is used as a running sum, i.e. to get the actual balance as of 4/30/2022, you would sum the column across all Date values on or before 4/30/2022.
The Rate field is the opposite: it represents the discrete interest rate as of the Date. Thus, it cannot be summed.
Each data point is tied to a specific loan, or Program.
So to get the interest rate of Program A as of 4/30/2022, you would simply grab the Rate value of the row where Date = 4/30/2022 and Program = A, or 5.30%. Sums are fine here, since the value of Rate is never repeated for a single Program and Date combo, but we cannot use a running sum.
On the other hand, to get the balance of Program A as of 4/30/2022, you would need to add (running sum) the Balance values for all rows where Date <= 4/30/2022 and Program = A, or 10,000 + -2500 + -2500 + -2500 = 2500.
Problem / Need
I need a report (or whatever it's called in Tableau) with the following:
Date as a column
Measures as rows
This report would NOT include Program as a row or column, but would include it as a filter.
In this report, I need a Weighted Average Cost of Funds measure.
This is effectively the weighted average Rate, weighted by the running sum of Balance, across the Programs included in the filter, for any given Date in the columns.
In other words, by Date: the latest Rate for each Program, times that Program's running sum of Balance, divided by the running sum of Balance across all Programs included in the filter (see the sketch after the Excel examples below).
Here's an example in Excel:
Here's an example if we were to exclude Program A:
And here's an example if we were to exclude Program B:
Finally, here's the formulas underneath everything in the Excel example:
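The Excel screenshots aside, the arithmetic is straightforward to sketch. Here is a hypothetical pandas version, assuming columns Date, Program, Balance, and Rate, and using the Program A figures quoted above (the rates other than 5.30% are made up):

import pandas as pd

# Hypothetical rows mirroring the Program A figures quoted above.
df = pd.DataFrame({
    "Date":    pd.to_datetime(["2022-01-31", "2022-02-28", "2022-03-31", "2022-04-30"]),
    "Program": ["A", "A", "A", "A"],
    "Balance": [10000, -2500, -2500, -2500],  # deltas; a running sum gives the balance
    "Rate":    [0.0510, 0.0520, 0.0525, 0.0530],
})

as_of = pd.Timestamp("2022-04-30")
included = df[df["Date"] <= as_of]            # the Program filter would also go here

# Per program: running-sum Balance, and the latest (never summed) Rate.
per_program = included.sort_values("Date").groupby("Program").agg(
    balance=("Balance", "sum"),
    rate=("Rate", "last"),
)

# Weighted average cost of funds = sum(rate * balance) / sum(balance).
wacf = (per_program["rate"] * per_program["balance"]).sum() / per_program["balance"].sum()
print(wacf)  # 0.053 for this single-program example

The Tableau challenge is expressing the same two-step aggregation (running sum per Program, then a ratio across Programs) without Program in the view; the sketch at least pins down the target numbers.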

What do you do if the sample size for an A/B test is larger than the population?

I have a list of 7,337 customers (selected because they had only one booking from March to August 2018). We are going to contact them and want to test the impact of this outreach on their sales. The idea is that contacting them will cause them to book more, increasing the sales of this largely inactive group.
I have to set up an A/B test and am currently stuck on the sample size calculation.
Here's my sample data:
Data
The first column is their IDs and the second column is the total sales for this group for two weeks in January (I took two weeks because the customers in this group purchase very infrequently).
The metric I settled on was revenue per customer (RPC = total revenue / total customers), so I can take into account both the number of orders and the average order value of the group.
The RPC for this group is $149,482.70 / 7,337 = $20.40.
I'd like to be able to detect at least a 5% increase in this metric at 80% power and 5% significance level. First I calculated the effect size.
Standard Deviation of the data set = 153.9
Effect Size = (1.05*20.4-20.4)/153.9 = 0.0066
I then used the pwr package in R to calculate the sample size.
pwr.t.test(d=0.0066, sig.level=.05, power = .80, type = 'two.sample')
     Two-sample t test power calculation

              n = 360371.048
              d = 0.0066
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
The sample size I am getting, however, is 360,371 per group. This is far larger than my population (7,337).
Does this mean I cannot run my test at sufficient power? The only way I can see to lower the sample size without compromising significance or power is to increase the effect size, i.e., aim to detect a minimum increase of 50%, which would give me n = 3,582.
That sounds like a pretty high impact and I'm not sure that high of an impact is reasonable to expect.
Does this mean I can't run an A/B test here to measure impact?
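As a cross-check of the arithmetic, the same calculation in Python with statsmodels reproduces the R result, and solving in the other direction shows what effect is detectable with the customers available; the standard deviation of 153.9, the RPC of $20.40, and the even 3,668-per-group split are taken from above:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Same question as pwr.t.test: n per group needed to detect d = 0.0066.
n_needed = analysis.solve_power(effect_size=0.0066, alpha=0.05, power=0.80)
print(round(n_needed))        # ~360,000 per group

# Flipped: with 7,337 customers split evenly (~3,668 per group),
# what standardized effect is detectable at 80% power?
d = analysis.solve_power(nobs1=3668, alpha=0.05, power=0.80)
print(d * 153.9 / 20.4)       # ~0.49, i.e. roughly the 50% lift noted above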

filtering using Rank() and Index() not changing the total

I am calculating efficiency for mechanics as the sum of hours worked divided by the sum of hours we charged the customer per work order. Using Tableau's total from the Analytics pane gives me the weighted average of their efficiency (whereas the average function is skewed, since it only takes into account the final efficiency rating).
When I use index() or rank() to create a filter to remove individual work orders, the total doesn't change.
How can I remove work orders and change the total without having to use a filter that selects individual work orders?
You could try using an LOD expression with a specific condition in the IF statement before you take the average or do any other calculation.
A FIXED calculation takes its data directly from the table, so the number will only change with a filter when you put the parameter in the first part of the LOD expression.
A quick example:
{FIXED [parameter]: AVG(IF [work orders] = condition THEN [weighted average] END)}
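The underlying reason for the symptom: filtering on INDEX() or RANK() only hides marks after Tableau computes the totals, so the total does not move; the exclusion has to happen at the data level (a regular filter or an LOD condition like the one above). For comparison, the arithmetic the total should follow, sketched with hypothetical pandas data (all names are assumptions):

import pandas as pd

# Hypothetical work orders: hours worked vs. hours charged to the customer.
orders = pd.DataFrame({
    "work_order":    [101, 102, 103, 104],
    "mechanic":      ["Ann", "Ann", "Bob", "Bob"],
    "hours_worked":  [4.0, 6.0, 3.0, 5.0],
    "hours_charged": [5.0, 5.0, 4.0, 4.0],
})

# Removing a work order must remove it from BOTH sums;
# otherwise the weighted total stays unchanged.
kept = orders[orders["work_order"] != 102]
efficiency = kept["hours_worked"].sum() / kept["hours_charged"].sum()
print(efficiency)  # 12.0 / 13.0 ≈ 0.923, vs. 18.0 / 18.0 = 1.0 with all orders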

Power BI: Finding average of averages and STDEV.P of averages

All,
My overall objective is to find outliers within an aggregated data set vs. the underlying detail for different date ranges. The issue I am having is that Power BI is averaging SalesPerDay and finding the STDEV.P at the daily level, which is the grain of the raw data. I need to first find the average Sales per user, then find the average of those averages for that "rolled up" data set. Same with STDEV.P: I need to find the STDEV of the "rolled up" averages. The screenshot below depicts how I need the tool to aggregate.
I have brought the Sales column into my dashboard, dimensionalized by user, and set it to AVERAGE to get average SalesPerDay.
Then I created the new measure
newavg = CALCULATE(AVERAGE(SalesPerDay[Sales]),ALLSELECTED())
Which is finding the overall average, but at the daily level vs the aggregated level.
I also tried
newSTDV = CALCULATE(STDEV.P(AVERAGE(SalesPerDay[Sales])),ALLSELECTED())
But you cannot find the STDEV.P of a calculation.
Thank you.
What you are looking for are the iterator functions, which take a table or column of data as a grouping and then apply a calculation to each group.
One example is SUMX. In the expression below, it groups by Product; within each product it takes the total of Qty, multiplies it by the total of x, and then sums the results of that per-product calculation into a grand total.
SUMX( VALUES( table1[Product] ), [Qty] * [x] )
There are also AVERAGEX, MINX, and MAXX, and for the statistical functions there are STDEVX.P and STDEVX.S.
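To pin down the grain, here is the target aggregation sketched in pandas (hypothetical data; the columns User and Sales are assumptions): average per user first, then the average and population standard deviation of those per-user averages, which is the grouping an iterator like STDEVX.P over the distinct users expresses in DAX.

import pandas as pd

# Hypothetical daily-grain rows, one per user per day.
daily = pd.DataFrame({
    "User":  ["A", "A", "B", "B", "C", "C"],
    "Sales": [100, 200, 300, 500, 250, 350],
})

# Step 1: roll up to one average SalesPerDay per user.
per_user = daily.groupby("User")["Sales"].mean()   # A=150, B=400, C=300

# Step 2: aggregate the rolled-up values, not the daily rows.
print(per_user.mean())        # average of averages: ~283.33
print(per_user.std(ddof=0))   # STDEV.P of the averages (ddof=0 = population)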

Mahout Log Likelihood similarity metric behaviour

The problem I'm trying to solve is finding the right similarity metric, rescorer heuristic, and filtration level for my data. (I'm using 'filtration level' to mean the number of ratings that a user or item must have associated with it to make it into the production database.)
Setup
I'm using Mahout's Taste collaborative filtering framework. My data comes in the form of triplets whose ratings are contained in the set {1,2,3,4,5}. I'm using an item-based recommender atop a log-likelihood similarity metric. I filter users who rate fewer than 20 items out of the production dataset. RMSE looks good (around 1.17) and there is no data capping going on, but there is an odd behavior that is undesirable and borders on error-like.
Question
First Call -- Generate a 'top items' list with no info from the user. To do this I use what I call a centered sum:
center = (5 + 1) / 2  # midpoint, if you allow ratings on a 1-to-5 scale
score = {}
for i in items:
    score[i] = sum(r - center for r in ratings[i])  # ratings[i]: the list of i's ratings
I use a centered sum instead of average ratings to generate a top items list mainly because I want the number of ratings that an item has received to factor into the ranking.
Second Call -- I ask for 9 similar items to each of the top items returned in the first call. For each top item, 7 out of the 9 similar items returned are the same as those returned for the other top items!
Is it about time to try some rescoring? Maybe multiplying the similarity of two games by (number of co-rated items)/x, where x is tuned (around 50 or something to begin with).
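A sketch of that heuristic (the cap at 1.0 is an optional variation so well-supported pairs are not inflated; x is the tuning constant suggested above):

def rescore(similarity, co_rated, x=50):
    # Damp the similarity of item pairs with few co-raters:
    # pairs with >= x co-rated users keep their raw similarity,
    # sparsely co-rated pairs get pushed down the similar-items list.
    return similarity * min(1.0, co_rated / x)

print(rescore(0.9, co_rated=10))  # 0.18
print(rescore(0.9, co_rated=80))  # 0.9

On the Mahout side, a Rescorer passed to mostSimilarItems would be the natural place to plug this in.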
Thanks in advance fellas
You are asking for 50 items similar to some item X. Then you look for 9 similar items for each of those 50. And most of them are the same. Why is that surprising? Similar items ought to be similar to the same other items.
What's a "centered" sum? Ranking by sum rather than average still gives you relatively similar output if the number of ratings in each sum is roughly similar.
What problem are you trying to solve? None of this seems to have a bearing on the recommender system you describe, which you say is working. Log-likelihood similarity is not even based on rating values.