When to use user-user collaborative filtering and when to use item-item collaborative filtering?

I am confused about when to use user-user collaborative filtering and when to use item-item collaborative filtering.
Please help!

If you have more users than items in your dataset, which is generally the case, item-item collaborative filtering tends to be more effective. For example, Amazon has a huge base of customers compared to its catalog of products.
Moreover, user preferences and tastes change over time, which is difficult to handle with user-user collaborative filtering, whereas an item's ratings generally don't change much over time.

Item-item: Look at the items user X has already rated and recommend the most similar ones. Here, similarity means how people treat two items in terms of ratings: if two items receive the same kind of ratings from the same users, they are similar. For example:
        Per1  Per2  Per3
Item1     5     3     1
Item2     2     3     3
Item vector_1 = 5P1 + 3P2 + 1P3
Item vector_2 = 2P1 + 3P2 + 3P3
If we calculate the cosine similarity of the two vectors:
Cos_sim = (5*2 + 3*3 + 1*3) / sqrt((25+9+1) * (4+9+9)) = 22 / sqrt(770)
Cos_sim ≈ 0.793
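If you want to check the arithmetic, here is a minimal sketch in plain Python:

# Cosine similarity of the two item vectors above, computed by hand.
import math

item1 = [5, 3, 1]  # ratings of Item1 by Per1, Per2, Per3
item2 = [2, 3, 3]  # ratings of Item2 by the same three people

dot = sum(a * b for a, b in zip(item1, item2))   # 5*2 + 3*3 + 1*3 = 22
norm1 = math.sqrt(sum(a * a for a in item1))     # sqrt(35)
norm2 = math.sqrt(sum(b * b for b in item2))     # sqrt(22)

print(dot / (norm1 * norm2))  # ~0.793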
User-user: Find the similarity between users by comparing the rating patterns of two users.
For example:
        Item1  Item2  Item3  Item4
Per_x     5      2      5      2
Per_y     5      2      5      2
Here the two users rate every item identically, so they are maximally similar. This might be you and your friend.
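The same cosine formula applies here; a minimal sketch in plain Python:

# Cosine similarity of the two user vectors above; identical patterns give 1.0.
import math

per_x = [5, 2, 5, 2]
per_y = [5, 2, 5, 2]

dot = sum(a * b for a, b in zip(per_x, per_y))
norm = math.sqrt(sum(a * a for a in per_x)) * math.sqrt(sum(b * b for b in per_y))
print(dot / norm)  # 1.0 -- maximally similar users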
Hope that helps!

Related

Using a theta sketch to count ad impressions and unique users

We're currently serving 250 billion ad impressions per day across our 6 data centers. Of these, about 180 billion are served in the US alone.
Each ad impression can have hundreds of attributes (dimensions), e.g. Country, City, Browser, OS, custom parameters from the web page, ad-size, ad-id, site-id, etc.
Currently we don't have a data warehouse, and ad-hoc OLAP support is pretty much non-existent in our organization. This severely limits our ability to run ad-hoc queries and get a quick grasp of the data.
We want to answer the following 2 queries to begin with:
Q1) Find the total count of ad impressions served from "beginDate" to "endDate" where Dimension1 = d1 and Dimension2 = d2 ... Dimensionk = d_k
Q2) Find the total count of unique users who saw our ads from "beginDate" to "endDate" where Dimension1 = d1 and/or Dimension2 = d2 ... Dimensionk = d_k
As I said, each impression can have hundreds of dimensions (listed above), and the cardinality of each dimension ranges from a few hundred (say, for Country) to billions (e.g. for User-id).
We want approximate answers, the lowest possible infrastructure cost, and query response times under 5 minutes. I am thinking about using Druid and Apache DataSketches (the Theta sketch, to be precise) to answer Q2, with the following data model:
Date        Dimension Name  Dimension Value  Unique-User-ID (Theta sketch)
2021/09/12  "Country"       "US"             37873-3udif-83748-2973483
2021/09/12  "Browser"       "Chrome"         37873-3aeuf-83748-2973483
...
<Other records>
So after roll-up, I would end up with one theta sketch per dimension value per day (assuming day-level granularity), and I can take unions and intersections of these sketches to answer Q2.
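Outside Druid, you can sanity-check this model with the Apache DataSketches Python bindings. A minimal sketch, assuming the datasketches package (lg_k = 17 gives 2^17 = 131072 nominal entries, close to your 10^5):

# One theta sketch per (day, dimension, value); union/intersect them at query time.
from datasketches import theta_intersection, theta_union, update_theta_sketch

LG_K = 17  # 2**17 = 131072 nominal entries

# Build per-dimension-value sketches for one day (toy data).
us_users = update_theta_sketch(LG_K)
chrome_users = update_theta_sketch(LG_K)
for uid in range(1_000_000):
    us_users.update(f"user-{uid}")
    if uid % 2 == 0:
        chrome_users.update(f"user-{uid}")

# Q2 with AND: unique users in the US *and* on Chrome.
inter = theta_intersection()
inter.update(us_users)
inter.update(chrome_users)
print(inter.get_result().get_estimate())  # ~500,000

# Q2 with OR: unique users in the US *or* on Chrome.
union = theta_union(LG_K)
union.update(us_users)
union.update(chrome_users)
print(union.get_result().get_estimate())  # ~1,000,000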
I am planning to set k (nominal entries) to 10^5. (Please comment on what k would be suitable for this use case and how much storage it would require.)
I've also read about the accuracy of theta sketch set operations here.
I would like to know if there is a better approach to solving Q2 (with or without Druid).
I would also like to know how I can solve Q1.
If I replace Unique-User-ID with an "Impression-ID", can I use the same data model to answer Q1? I believe the accuracy of the total impression count would be far worse than that of Q2, because each ad impression is assigned a unique ID and we are currently serving 250 billion per day.
Please share your thoughts about solving Q1 and Q2.
Regards
kartik

Tableau - Related Data Source Filter

I have data split between two different tables, at different levels of detail. The first table has transaction data in the format:
category  item  spend
a         1     10
a         2     5
a         3     10
b         1     15
b         2     10
The second table is a budget by category, in the format:
category  limit
a         40
b         30
I want to show three BANs: Total Spend, Total Limit, and Total Limit - Spend, and be able to filter by category across the related data sources (the transaction table is related to the budget table by category). However, I can't seem to get the filter/relationship right. That is, if I use category as a filter from the transaction table and set it to filter all using related data sources, it doesn't filter the Total Limit amount. Using 2018.1, FYI.
Although your data is split across 2 tables, they can be joined on the category field and made available as a single data source. You would then be able to use category as a quick filter.
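If it helps to see the shape Tableau ends up with, here is the equivalent join in pandas (illustrative only; Tableau performs this join for you once the two tables are joined on category):

# Illustrative: the single joined source Tableau builds from the two tables.
import pandas as pd

transactions = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b"],
    "item": [1, 2, 3, 1, 2],
    "spend": [10, 5, 10, 15, 10],
})
budget = pd.DataFrame({"category": ["a", "b"], "limit": [40, 30]})

joined = transactions.merge(budget, on="category")

# The three BANs, filterable by category on the single joined source.
totals = joined.groupby("category").agg(total_spend=("spend", "sum"),
                                        total_limit=("limit", "first"))
totals["remaining"] = totals["total_limit"] - totals["total_spend"]
print(totals)

Note that after the join the limit repeats on every transaction row, so Total Limit should be aggregated with MIN or AVG per category rather than SUM.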

Is this an approach to user-item recommendations that could work?

I am designing an application that incorporates a recommendation system based on user interactions (collaborative filtering). On the homepage, the user is presented with a set of 6 items to interact with. There will be between 50 and 300 items in total. The following actions are possible:
click on an item (strong interest)
refresh an item (some interest)
open a read-more dialog (some interest)
don't do anything and move on (no interest)
This data is collected and stored. The system should recommend items of interest to the user. I'm thinking about turning this data into a rating system.
Option A) If the user clicks on an item, this is translated into an implicit lifetime rating of 5; refreshing an item is a 4, and so on. So my user-item matrix would look like this:
item 1 | item 2 | item 3
john 5 4
jane 4
In this example, john has clicked on item 1 and refreshed item 3. The rating can only go up: if a user has previously refreshed an item, I write a 4 and update it to a 5 only if the item is clicked later.
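As a sketch, Option A is then just (the 5/4 values are the ones proposed above; mapping read-more to 3 is an assumption):

# Option A: map the strongest action seen so far to an implicit lifetime rating.
ACTION_RATING = {"click": 5, "refresh": 4, "read_more": 3}  # read_more = 3 is assumed

def update_rating(current, action):
    new = ACTION_RATING.get(action, 0)
    return max(current or 0, new)  # the rating only ever goes up

print(update_rating(None, "refresh"))  # 4
print(update_rating(4, "click"))       # 5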
Option B) Each time the user performs one of the above actions, I increment a scalar value for the item, which means it can grow unbounded.
item 1 | item 2 | item 3
john 55 1 30
jane 41 9
Maybe this is a problem, since these numbers are harder to translate into a rating scale from 1 to 10.
Option C) I count every interaction separately:
item 1 click | item 1 refresh | item 1 read
john 3 1
jane 1 1
Here the problem is that "reading about" an item is probably only done once.
Independent of whichever option I choose, my idea is to first find similar users using something like cosine similarity or Pearson correlation, then pick the top 10 to 30 users from that list and compile a top list of their favorite items. From that list, I would then recommend items that the current user has had little interaction with in the past.
Is this something that could work? I am worried that finding similar users will eliminate the chance of finding interesting (new) items for the current user.
What you suggest sounds reasonable. Your concern about not surfacing new items is inherent to collaborative filtering, which relies purely on interaction data. To find new items you would undoubtedly have to do some content analysis, which would be a separate stage. For example, if your items are news articles, you might try to identify important keywords for each user.
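For concreteness, here is a minimal sketch of the pipeline you describe, in plain NumPy (all names are illustrative):

# User-based CF: cosine similarity between users, take the nearest neighbours,
# and recommend their favourites that the target user has barely touched.
import numpy as np

def recommend(ratings, user, n_neighbours=10, n_items=6):
    """ratings: users x items matrix, 0 = no interaction."""
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                    # avoid division by zero
    unit = ratings / norms
    sims = unit @ unit[user]                   # cosine similarity to every user
    sims[user] = -1.0                          # exclude the user themselves
    neighbours = np.argsort(sims)[::-1][:n_neighbours]

    # Similarity-weighted sum of the neighbours' ratings = their "toplist".
    scores = sims[neighbours] @ ratings[neighbours]
    scores[ratings[user] > 0] = -np.inf        # drop items the user already knows
    return np.argsort(scores)[::-1][:n_items]

ratings = np.array([[5.0, 0.0, 4.0],    # john
                    [0.0, 4.0, 0.0],    # jane
                    [5.0, 1.0, 4.0]])   # a user similar to john
print(recommend(ratings, user=0, n_neighbours=2, n_items=1))  # -> [1]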

ATG - Can an Order Level Discount be applied to Items?

Say I have two items in my cart at 50 USD each, and I also have a coupon for '20 USD off 100'. When I apply it, my cart looks like this (for simplicity and focus I am omitting tax and shipping):
Item 1    50 x 1 = 50
Item 2    50 x 1 = 50
Subtotal         100
Discount         -20
Total             80
Now I have multiple cases where I have to split this 20 USD across the items, so that returns at a third party are easy, and also situations where the two items will be fulfilled by two independent vendors.
I understand that ATG's ReturnManager class provides a wealth of methods to calculate returns and considers all item, shipping, and order discounts and taxes.
But is there a way to split the order discount across items out of the box, based on a weighted-average algorithm?
Thanks
The simplest way to do this would be to run the splitting algorithm in an order pipeline processor (a custom processor) and store the split item-level discount shares in an ApportionmentInfo repository item. Whenever a return occurs, you access this repository item through the order and display it to the user.
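For reference, the splitting step itself can be a simple weighted-average apportionment. A minimal sketch in plain Python (names are illustrative, not ATG API; working in cents sidesteps floating-point rounding):

# Split an order-level discount across items in proportion to their prices.
def apportion(discount_cents, item_prices_cents):
    total = sum(item_prices_cents)
    shares = [discount_cents * p // total for p in item_prices_cents]
    shares[-1] += discount_cents - sum(shares)  # rounding remainder goes to the last item
    return shares

# 20 USD off an order of two 50 USD items -> 10 USD off each.
print(apportion(2000, [5000, 5000]))  # [1000, 1000]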
Regards,
Gaurav E
You can't.
Basically, OrderPricingEngine sets the adjustment against OrderPriceInfo; you can't apportion that down to the item level because the trigger is on the order.
Best practice is to expose reprice-order as a service to the third party, re-run the calculation, and identify the return value. If you customize per promotion level, it will open up a Pandora's box of reprice-order issues.

How to pick items from a warehouse to minimise travel in TSQL?

I am looking at this problem from a TSQL point of view; however, any advice would be appreciated.
Scenario
I have 2 sets of criteria which identify items in a warehouse to be selected.
Query 1 returns 100 items
Query 2 returns 100 items
I need to pick any 25 of the 100 items returned in query 1.
I need to pick any 25 of the 100 items returned in query 2.
The items returned by queries 1 and 2 will never be the same.
Each item is stored in a segment of the warehouse.
A segment of the warehouse may contain numerous items.
I wish to select the 50 items (25 from each query) in a way that reduces the number of segments I must visit to pick them.
Suggested Approach
My initial idea is to combine the 2 result sets and produce a list of:
Segment ID, NumberOfItemsRequiredInSegment
I would then select 25 items from each query, giving preference to those in segments with the highest NumberOfItemsRequiredInSegment. I know this would not be optimal, but it would be an easy-to-implement heuristic.
Questions
1) I suspect this is a standard combinatorial problem, but I don't recognise it. Perhaps multiple knapsack? Does anyone recognise it?
2) Is there a better (easy-ish to implement) heuristic or solution, ideally in TSQL?
Many thanks.
This might also not be optimal, but I think it would at least perform fairly well.
Calculate this set for query 1:
Segment ID, NumberOfItemsRequiredInSegment
Take the top 25 just by sorting by NumberOfItemsRequiredInSegment; call this subset A.
Then take the top 25 from query 2, by joining to A and sorting by "CASE WHEN A.SegmentID IS NOT NULL THEN 1 ELSE 0 END, NumberOfItemsRequiredInSegmentFromQuery2".
Repeat this, but take the top 25 from query 2 first, and return the better-performing of the 2 sets.
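In plain Python the heuristic looks roughly like this (illustrative; in TSQL it would be two TOP (25) queries with the ORDER BY clauses described above):

# Greedy pick: fill query 1 from its fullest segments, then fill query 2,
# preferring segments already being visited for query 1.
from collections import Counter

def greedy_pick(q1_items, q2_items, need=25):
    """q1_items / q2_items: lists of (item_id, segment_id)."""
    per_seg1 = Counter(seg for _, seg in q1_items)
    picked1, chosen_segs = [], set()
    for item, seg in sorted(q1_items, key=lambda x: -per_seg1[x[1]]):
        if len(picked1) < need:
            picked1.append(item)
            chosen_segs.add(seg)
    per_seg2 = Counter(seg for _, seg in q2_items)
    picked2 = []
    for item, seg in sorted(q2_items, key=lambda x: (x[1] not in chosen_segs, -per_seg2[x[1]])):
        if len(picked2) < need:
            picked2.append(item)
    return picked1, picked2

Run it once in each order (query 1 first, then query 2 first) and keep whichever result visits fewer segments.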
The one scenario where I think this fails is if you get something like this:
Segment  Count Query 1  Count Query 2
A             10              1
B              5              1
C              5              1
D              5              4
E              5              4
F              4              4
G              4              5
H              1              5
J              1              5
K              1             10
You need to make sure you choose A, D, and E when choosing the best segments from query 1. To deal with this, you'd almost still need to join to query 2, so you can use its counts as a tie-breaker.