User-based collaborative filtering issue

In user-based collaborative filtering, the picture shows the formula for predicting the rating of an item. NSa is the nearest-neighbor set of user a, j is the item to be predicted, and rij is the rating of item j by a user i in NSa. So my question is: what if user i has never rated item j? How should rij be handled? Thanks!

The sum is really over all the users in NSa that have also rated j; restricting it this way is the usual answer. You could also substitute some dummy value when rij doesn't exist, like the average rating of user i, but I don't recommend this as it slows things down without adding information.
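Since the picture with the exact formula isn't reproduced here, the following is only a minimal sketch (illustrative Java, hypothetical names, a plain weighted average with no mean-centering) of that restricted sum: neighbors who never rated j simply don't appear in it.

import java.util.Map;

public class UserBasedCf {
    /**
     * @param neighborRatings ratings of item j keyed by neighbor id; a neighbor in NSa
     *                        that never rated j is simply absent from this map
     * @param similarities    similarity of user a to each neighbor in NSa
     * @return weighted-average prediction, or NaN if no neighbor rated j at all
     */
    static double predict(Map<Long, Double> neighborRatings, Map<Long, Double> similarities) {
        double num = 0.0, den = 0.0;
        for (Map.Entry<Long, Double> e : neighborRatings.entrySet()) {
            Double sim = similarities.get(e.getKey());
            if (sim == null) continue;          // this rater is not in NSa, skip it
            num += sim * e.getValue();          // sim(a,i) * r_ij
            den += Math.abs(sim);
        }
        return den == 0.0 ? Double.NaN : num / den;
    }
}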

Related

Sphinx: Show all results order by previous searches

I use SphinxQL for searching and filtering in a product database, and I store the last x search phrases of each user. I wonder if it is possible to show all products (all rows) to every user, but with relevance based on their previous searches.
Let's say one user searched for mobile phones (iphone, galaxy s7...), i.e. the electronics category. I want to show him all products randomly, but products from the electronics category more often, and products matching those searched keywords even more often.
Is it even possible with Sphinx?
Thanks, and sorry for my English.
An alternative would perhaps be to attach random numbers to each result: a high and a low number, with an overlapping range.
sql_query = SELECT id, RAND()*100 AS rand_low, (RAND()*100)+50 AS rand_high, ...
sql_attr_uint = rand_low
sql_attr_uint = rand_high
You can then arrange the ranking expression to pick one of these numbers depending on whether the document matches, and sort by the result.
SELECT id FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
OPTION ranker=expr('IF(doc_word_count>1,rand_high,rand_low)');
The results will be mixed up, but results that match one of the words have a greater chance of showing up first (because they use the weighted random number). It's still only a chance, because a rand_high value CAN still be smaller than a rand_low value.
... you can change the size of the 'overlap' to tweak the mix of matching/non-matching results.
(Added as a new answer as it's quite a different idea, although it uses the same 'all' keyword.)
Sphinx doesn't have a 'mode' to do just that, but you can get very close...
You can use the MAYBE operator:
MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
The complication is that you need a way to match all products. Depending on your data you may already have a word you can use (e.g. a word like 'the' that occurs in every single product), or you can add such a word to every document during indexing.
... using MAYBE allows the matching results to have a higher weight.
But you don't want to sort strictly by weight, so you need a different algorithm, something to shuffle the results a bit (as you don't really want 'random'!).
SELECT id, IDIV(id/10000) AS int,WEIGHT() AS w
FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
ORDER BY int DESC, w DESC;
This creates banding by ID; since results are in theory spread over the whole id space, it will mix them up a bit. But the category results will still tend to be shown first within each band.
A different attribute other than ID might be better if you have one, something more spread out. (Or you can add a deliberate random attribute to the results.)
... there are all sorts of variations; your imagination is the only limitation. This basic technique can be used to mix things up quite a bit.
(There are other possibilities. Sphinx's little-known GROUP N BY function can be used to produce a sampled search result. This isn't random, but it might give a similar enough result, i.e. just mixing up the results.)

Mahout: Recommending Items for a user in particular product category

What do we have as of now? - We are using Mahout's GenericItemBasedRecommender to get a list of recommended products for a user using TanimotoCoefficientSimilarity as ItemSimilarity.
Where do we want to go from here? - The above works fine when we don't care about product category, but what we want are product-category-specific recommendations. Say a user has been buying, browsing, liking etc. mostly in the Men's and Gadgets categories; I would then want to show this user recommendations in those specific categories, saying "Recommended for you in [X]", where X would be Men's or Gadgets in this case. We are thinking about a couple of options (below) to achieve this, and we need some leads/opinions/feedback to make sure we are going in the right direction. Options:
Firstly, we'll have to move to a non-Tanimoto way of calculating item similarity, so that we account for users buying, liking, etc., and not only view/browsing data.
Figuring out the product categories for a particular user (this is where we need direction) - our product category hierarchy is basically a tree, and we need to know which top 4 nodes (with the best recommendations) in the tree we should show to the user. Also, if node X is a category we are showing to the user and node Y is a parent of node X, we don't want to show the user products in category Y, or in any ancestor for that matter. A couple of ways of achieving this:
For every user, calculate the SUM of the similarity scores of recommended items for each node at the leaf level, and recursively calculate it for the parent nodes up to the root. Now at each node we have A = SUM of similarity scores and B = number of items recommended, so we also have A/B = Value (V) at each node. We then pick the top 4 V values from the tree and recommend those categories to the user. The challenge here is that if we try to calculate this online during the request, it would be tough to keep the entire request under 150 ms. An example (a small aggregation sketch follows it):
Root Level - Category12 (A=11, B=4)   (= category1 + category2)
                   /              \
Leaf Level - category1 (A=6, B=2)   category2 (A=5, B=2)
Recommended products in Category 1: Item1 (score = 2), Item2 (score = 4)
Recommended products in Category 2: Item3 (score = 1), Item4 (score = 4)
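A minimal sketch of that bottom-up aggregation, assuming a hypothetical CategoryNode tree whose leaf nodes have already been filled with (A, B) from the recommender output; it ranks every node by V = A/B, but does not yet enforce the "don't also show an ancestor of a chosen node" rule:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CategoryRanker {

    static class CategoryNode {
        String name;
        List<CategoryNode> children = new ArrayList<>();
        double a;   // sum of similarity scores of recommended items in this subtree
        int b;      // number of recommended items in this subtree

        CategoryNode(String name) { this.name = name; }
        double v() { return b == 0 ? 0.0 : a / b; }
    }

    // bottom-up aggregation: each node's (A, B) includes all of its descendants
    static void aggregate(CategoryNode node) {
        for (CategoryNode child : node.children) {
            aggregate(child);
            node.a += child.a;
            node.b += child.b;
        }
    }

    // collect every node, sort by V descending, return the top-k category names
    static List<String> topCategories(CategoryNode root, int k) {
        List<CategoryNode> all = new ArrayList<>();
        flatten(root, all);
        all.sort(Comparator.comparingDouble(CategoryNode::v).reversed());
        List<String> names = new ArrayList<>();
        for (int i = 0; i < Math.min(k, all.size()); i++) {
            names.add(all.get(i).name);
        }
        return names;
    }

    private static void flatten(CategoryNode node, List<CategoryNode> out) {
        out.add(node);
        for (CategoryNode child : node.children) flatten(child, out);
    }
}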
Second option: For every category create a cluster of users based on their behaviour (likes, buying, viewing etc.) and then figure out the top 4 categories to which the user belongs. Not sure if we can achieve this using clustering in Mahout but I think we can do this offline.
Please provide your feedback/suggestions/leads/thoughts.
Thanks in advance!
If you want to model more than one kind of action in your data, I would suggest using the SVD recommender instead, with the ALSWR factorizer set to implicit feedback. With that done, your data can contain user, item, preference triples, where the preference value is how strongly the user is associated with the item. You can play with the numbers; for example, a purchase is a 20 and a view is just a 2. I'm just throwing out numbers here, I don't know what will work best for your data, because you can also model things proportionally: if a purchase is 30 times less likely to happen than a view, then a purchase should count 30 times more than a view.
Mahout provides a way to influence the recommendations through the IDRescorer. You implement your own logic there and decide how to affect the recommendations. For example, the IDRescorer could check whether a recommendation candidate belongs to the target category and, if it does, boost its score by X. There's an example here (link) from the Mahout in Action book (which you should definitely read) showing a rescorer.
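To make the two suggestions concrete, here is a rough sketch against the Mahout 0.9 Taste API (not the poster's actual code; the file name, the factorizer parameters, the 1.5 boost and the categoryOf() lookup are all assumptions):

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.IDRescorer;

import java.io.File;

public class CategoryAwareRecommender {

    public static void main(String[] args) throws Exception {
        // user,item,preference triples; e.g. purchase=20, like=5, view=2
        DataModel model = new FileDataModel(new File("prefs.csv"));

        // 20 features, lambda=0.065, 15 iterations, implicit feedback, alpha=40
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 20, 0.065, 15, true, 40);
        SVDRecommender recommender = new SVDRecommender(model, factorizer);

        final String targetCategory = "gadgets";   // the user's most active category
        IDRescorer rescorer = new IDRescorer() {
            @Override
            public double rescore(long itemId, double originalScore) {
                // boost candidates that belong to the target category
                return targetCategory.equals(categoryOf(itemId))
                        ? originalScore * 1.5 : originalScore;
            }
            @Override
            public boolean isFiltered(long itemId) {
                return false;   // or return true to exclude other categories entirely
            }
        };

        System.out.println(recommender.recommend(123L, 10, rescorer));
    }

    // hypothetical lookup into your own product catalog
    static String categoryOf(long itemId) { return "gadgets"; }
}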
Hope this helps

ColumnSorting with AsyncDataProvider - how to find out which column the user wants to sort by?

I am implementing a GWT CellTable with paging and sorting by multiple columns dynamically.
The basics can be found in the CellTable Developer's Guide.
However, the dynamic example does not tell how to find out by which column the user wants to sort (it simply sorts by the 'name' column). That's not enough in my case, as I want to allow the user to sort by different columns.
The only solution I could think of, which is not very elegant, is to keep track of whether each column is sorted in ascending order or not (using table.getColumnSortList(indexOfColumn).isAscending()) and then figure out which one has been clicked by comparing the values for each column (the one that changed is probably the one the user clicked).
This involves keeping information in my classes that should be available somewhere in the CellTable! But I can't find that information!
Thanks for any help.
I found the answer. As explained in the javadocs for com.google.gwt.user.cellview.client.ColumnSortList:
An ordered list containing the sort history of Columns in a table. The 0th item is the ColumnSortInfo of the most recently sorted column.
So, to know which column was last sorted by, you simply do:
ColumnSortInfo info = table.getColumnSortList().get(0);
Column<?, ?> sortByColumn = info.getColumn();
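For completeness, here is a sketch (the row type, the "name"/"date" field mapping and the fetchFromServer hook are assumptions, not from the original answer) of how this plugs into an AsyncDataProvider: the ColumnSortEvent.AsyncHandler makes a header click re-fire onRangeChanged instead of sorting client-side, and the provider reads the most recently sorted column from getColumnSortList().get(0).

import com.google.gwt.user.cellview.client.CellTable;
import com.google.gwt.user.cellview.client.Column;
import com.google.gwt.user.cellview.client.ColumnSortEvent;
import com.google.gwt.user.cellview.client.ColumnSortList;
import com.google.gwt.view.client.AsyncDataProvider;
import com.google.gwt.view.client.HasData;
import com.google.gwt.view.client.Range;

abstract class SortableTableWiring<T> {

    // hypothetical hook: do the RPC here and push rows back via updateRowData(...)
    protected abstract void fetchFromServer(int start, int length,
                                            String sortField, boolean ascending);

    void wire(final CellTable<T> table, final Column<T, String> nameColumn) {
        AsyncDataProvider<T> provider = new AsyncDataProvider<T>() {
            @Override
            protected void onRangeChanged(HasData<T> display) {
                Range range = display.getVisibleRange();
                String sortField = "name";      // default sort
                boolean ascending = true;
                ColumnSortList sortList = table.getColumnSortList();
                if (sortList.size() > 0) {
                    ColumnSortList.ColumnSortInfo info = sortList.get(0);
                    // map the clicked Column object to a field name your backend understands
                    sortField = (info.getColumn() == nameColumn) ? "name" : "date";
                    ascending = info.isAscending();
                }
                fetchFromServer(range.getStart(), range.getLength(), sortField, ascending);
            }
        };
        provider.addDataDisplay(table);
        // header clicks now trigger onRangeChanged rather than local sorting
        table.addColumnSortHandler(new ColumnSortEvent.AsyncHandler(table));
    }
}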

How do you store and display if a user has voted or not on something?

I'm working on a voting site and I'm wondering how I should handle votes.
For example, on SO when you vote for a question (or answer) your vote is stored, and each time I go back to the page I can see that I already voted for this question because the up/down buttons are colored.
How do you do that? I mean, I have several ideas, but I'm wondering whether it would be a heavy load for the database.
Here are my ideas:
Write a helper which will check, for every question, whether a vote has been cast.
That means the number of queries will depend on the number of items displayed on the page (usually ~20).
Loop over my items, get the ids, and for each page write one query which returns, for each item, whether a vote has been cast or NULL.
Looks OK because it is only one query no matter how many items are on the page, but it may break some MVC/Domain Model design, I don't know.
When a user logs in (or a guest, for whom an anonymous user is created), retrieve all their votes and store them in the session; if a new vote is cast, just add it to the session.
Looks nice because no queries are needed at all except the first one. However, depending on the number of votes cast (maybe a bunch for each user), this can increase the size of the session for each user and potentially make authentication slow.
How do you do it? Any other ideas?
For example: let's assume you have a table to store votes and the user who cast them.
Let's assume you keep votes in a user_votes table, populated when a vote is cast, with a structure something like the one below:
id, of type int, autoincrement
user_id, of type int, a foreign key referencing the users table
question_id, of type int, a foreign key referencing the questions table
Now, as the user will be logged in, when you fetch the questions do a LEFT JOIN against the user_votes table on the user_id.
Something like
SELECT q.id, q.question, uv.id AS vote_id
FROM questions AS q
LEFT JOIN user_votes AS uv ON
uv.question_id = q.id AND
uv.user_id = <logged_in_user_id>
WHERE <Your criteria>
From the view you can check whether vote_id is present (i.e. not NULL). If so, mark it as voted; otherwise not.
You may need to adapt this to the fields of your questions table and so on. I am assuming you store questions in a questions table and users in a users table, each having a primary key id.
Thanks
You could use a combination of your suggested strategies.
Retrieve all the votes made by the logged in user for recent/active questions only and store them in the session.
You then have the ones that are more likely to be needed while still reducing the amount you need to store in the session.
In the less likely event that you need other results, query for just those as and when you need to.
This strategy will reduce the amount you need to store in the session and also reduce the number of calls you make to your database.
Just based on the information that you've given so far, I would take the second approach: get the IDs of all the items on the page, and then do a single query to get all the user's votes for that list of item IDs. Then pass the collection of the user's item votes to your view, so it can render items differently when the user has voted for that item.
The other two approaches seem like they would tend to be less efficient, if I understood you correctly. Using a view helper to initiate an individual query for each item to check if the user has voted on it could lead to a lot of unnecessary queries. And preloading all the user's voting history at login seems to add unnecessary overhead, getting data that isn't always needed and adding the burden of keeping it up to date for the duration of the session.
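A small sketch of that second approach, assuming JDBC and the user_votes table described in the earlier answer (class and method names are illustrative, not from the original posts): collect the ids of the ~20 items on the page and fetch the user's votes for just those ids in a single IN (...) query.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PageVotes {
    static Set<Long> votedQuestionIds(Connection conn, long userId, List<Long> pageIds)
            throws Exception {
        Set<Long> voted = new HashSet<>();
        if (pageIds.isEmpty()) {
            return voted;
        }
        // one placeholder per item on the page
        String placeholders = pageIds.stream().map(id -> "?").collect(Collectors.joining(","));
        String sql = "SELECT question_id FROM user_votes"
                   + " WHERE user_id = ? AND question_id IN (" + placeholders + ")";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId);
            for (int i = 0; i < pageIds.size(); i++) {
                ps.setLong(i + 2, pageIds.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    voted.add(rs.getLong(1));
                }
            }
        }
        return voted;   // the view marks an item as "voted" if its id is in this set
    }
}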

Searches (and general querying) with HBase and/or Cassandra (best practices?)

I have a User model object with quite a few fields (properties, if you wish) in it. Say "firstname", "lastname", "city" and "year-of-birth". Each user also gets a "unique id".
I want to be able to search by them. How do I do that properly? How to do that at all?
My understanding (this will work for pretty much any key-value storage -- first goes the key, then the value):
u:123456789 = serialized_json_object
("u" as a simple prefix for user's keys, 123456789 is "unique id").
Now, given that I want to be able to search by firstname and lastname, I can also save:
f:Steve = u:384734807,u:2398248764,u:23276263
f:Alex = u:12324355,u:121324334
so key is "f" - which is prefix for firstnames, and "Steve" is actual firstname.
For "u:Steve" we save as value all user id's who are "Steve's".
That makes every search very-very easy. Querying by few fields (properties) -- say by firstname (i.e. "Steve") and lastname (i.e. "l:Anything") is still easy - first get list of user ids from "f:Steve", then list from "l:Anything", find crossing user ids, an here you go.
Problems (and there are quite a few):
Saving, updating and deleting a user is a pain. It has to be an atomic and consistent operation. Also, if the size of a value is limited to some maximum, then we are in (potential) trouble, and I really don't have an answer here. Only zipping the list of user ids? Not too cool, though.
What if we want to add a new field to search by, eventually? Say by "city". We can certainly do it the same way ("c:Los Angeles" = ..., "c:Chicago" = ...), but if we didn't foresee all those "search choices" from the very beginning, then we will have to create some nightly job or something to go over all existing User records and fill in those "c:CITY" keys for them... Quite a big job!
Problems with locking. User "u:123" updates his name to "Alex", and user "u:456" updates his name to "Alex". They both have to update "f:Alex" with their ids. That means we either get an overwriting problem, or one update has to wait for another (and imagine if there are many of them?!).
What's the best way of doing that? Keeping in mind that I want to search by many fields?
P.S. Please, the question is about HBase/Cassandra/NoSQL/key-value storages. Please, please - no advice to use MySQL and "read about" SELECTs, and to worry about scaling problems "later". There is a reason why I asked MY question exactly the way I did. :-)
Being able to query properties directly is one of the features you lose when moving away from SQL, so you need a way to maintain your own index to let you find records.
If your datastore does not have built-in indexing or atomic list operations, you will need to deal with the locking issues you mention. However, indexing doesn't necessarily need to be synchronous - maintain a queue of updated records to be reindexed and you have a solution for problem 3 (locking) that can be reused to solve problem 2 (adding new indexed fields) as well.
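A tiny illustration of that queue idea (pure sketch, no particular datastore API assumed; UserIndexer is a hypothetical hook): request threads only enqueue the changed record's id, and a single background worker rewrites the index rows, so concurrent writers never race on the same index entry.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ReindexQueue {
    private final BlockingQueue<Long> dirtyUsers = new LinkedBlockingQueue<>();

    // called from the write path; cheap and non-blocking
    public void markDirty(long userId) {
        dirtyUsers.offer(userId);
    }

    // single consumer thread: no two updates to the same index row race each other
    public void startWorker(UserIndexer indexer) {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    indexer.reindex(dirtyUsers.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** hypothetical hook that re-reads the user record and rewrites its index entries */
    public interface UserIndexer {
        void reindex(long userId);
    }
}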
If the index list for a particular value becomes too large for the system to handle in a single list, you can replace the list of users with a list of lists. However, if you have that many records with the same value it probably isn't a particularly useful search criteria anyway.
Another option that is useful in some cases is to use a separate system for the indexing - for example you could set up Lucene to index the records in your main datastore.
I guess I would have implemented this as a MapReduce job, which would run on a schedule.
Each search word would be a row key, with a lookup to the UID.
Rowkey: uid1
profile:firstName: Joe
profile:lastName: Doe
profile:nick: DoeMaster
Rowkey: uid2
profile:firstName: Jane
profile:lastName: Doe
profile:nick: SuperBabe
MapReduce indexes all searchable properties and adds them, with the search word as the row key:
Rowkey: Jane
lookup:uid: uid2
Rowkey: Doe
lookup:uid: uid2, uid1
Rowkey: DoeMaster
lookup:uid: uid1
..etc
Now, if you need to update the index on the fly as a user changes, you would write the change directly to the index table, removing the uid value from the old keyword's row and adding it to another row key. In case this happens at the same time, temporary locking could be implemented.
For users being removed, an additional attribute telling the state of the user could be used to filter them out of search results.
Adding additional search words isn't very hard, since it's just a question of which name:value pairs you want to index. You could also filter the search further by adding a type attribute to your row key/keyword, e.g. boston - lookup:type: city.
The idea is to maintain your own row key based search index inside hbase.
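A rough sketch of maintaining such an index row with the HBase 1.x Java client (the table name, column family and values are assumptions, not from the answer above): each searchable word is a row key and each uid is a column qualifier in a 'lookup' family.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class KeywordIndex {
    private static final byte[] LOOKUP = Bytes.toBytes("lookup");

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("search_index"))) {

            // Index: user uid2 has lastName "Doe" -> add uid2 under row key "doe"
            Put put = new Put(Bytes.toBytes("doe"));
            put.addColumn(LOOKUP, Bytes.toBytes("uid2"), Bytes.toBytes(""));
            index.put(put);

            // User changed their name: remove them from the old keyword's row ...
            Delete del = new Delete(Bytes.toBytes("doe"));
            del.addColumns(LOOKUP, Bytes.toBytes("uid2"));
            index.delete(del);
            // ... and add them under the new keyword with another Put, as above.

            // Lookup: all uids matching the word "doe"
            Result result = index.get(new Get(Bytes.toBytes("doe")));
            if (!result.isEmpty()) {
                for (byte[] uid : result.getFamilyMap(LOOKUP).keySet()) {
                    System.out.println(Bytes.toString(uid));
                }
            }
        }
    }
}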