Content-based filtering recommendation system for a group of items

Is there any example of content-based filtering recommendation for a group of items? For example, user can choose a group of 10 movies they like the most, and the recommendation engine will return a list of movies that are most similar to the 10 movies they have chosen.
For a single observation I know we will calculate the distance between the single observation and the rest of the data, but for this group scenario, what would be the best way to implement it?
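One common way to handle the group case is to collapse the group into a single query point - for example, the mean (centroid) of the chosen items' feature vectors - and then rank every candidate by its similarity to that centroid. A minimal sketch in plain JavaScript (the feature vectors and helper names are hypothetical; any numeric encoding such as genre/keyword/TF-IDF features would do):
// Rank candidate movies by cosine similarity to the centroid of the
// user's chosen group of movies.
function mean(vectors) {
  const dim = vectors[0].length;
  const centroid = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) centroid[i] += v[i] / vectors.length;
  }
  return centroid;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na * nb) || 1);
}

function recommend(chosenMovies, candidateMovies, topN = 10) {
  const centroid = mean(chosenMovies.map(m => m.vector));
  return candidateMovies
    .map(m => ({ ...m, score: cosine(m.vector, centroid) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}
Averaging is only the simplest aggregation: an alternative is to score each candidate against each of the 10 chosen movies individually and take the mean or maximum similarity, which tends to behave better when the chosen movies span very different tastes.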

Related

How can I aggregate search results in Algolia by three different criteria and sort them in a specific way?

Apologies in advance, as I'm not a native English speaker! I'll try to be as clear as possible with what I'm trying to do:
I'm using Algolia InstantSearch in Angular in a marketplace website to provide my users with a search widget. I've been tasked with having results displayed following this logic:
Top result: Best reviewed product
Second result: Most purchased product
Third result: Most recently published product
This "block" should repeat as long as there's results, so the fourth result would need to be the second best reviewed product, fifth would be the second most purchased, third the second most recently published product, and so on. This has the intention of allowing new sellers in the marketplace to get exposure, while rewarding those that have sold the most and had better reviews for their products simultaneously.
Is this possible in some way using Algolia? I've read the documentation on custom ranking (https://www.algolia.com/doc/guides/managing-results/must-do/custom-ranking/) and exhaustive sorting (https://www.algolia.com/doc/guides/managing-results/refine-results/sorting/in-depth/exhaustive-sort/) and I've only found how to set different ranking criteria which are applied one after the other with tie-breakers, but no information at all about how I might achieve this.
As you've discovered with Algolia, sorting occurs within the index itself. To sort by different criteria you create a replica of your index with different sorting criteria.
So you'd have a primary index + 3 replicas sorted by reviews, purchases, and date.
my_index
my_index-reviews-descending
my_index-purchase-descending
my_index-added-descending
You'll use the multi-index query to get results from all three replicas simultaneously:
https://www.algolia.com/doc/guides/building-search-ui/ui-and-ux-patterns/multi-index-search/angular/
Also, don't forget that to sort by date you'll want to store your dates as Unix timestamps. More info here:
https://www.algolia.com/doc/guides/managing-results/refine-results/sorting/how-to/sort-an-index-by-date/
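To get the repeating best-reviewed / most-purchased / newest pattern, you then interleave the three result lists on the client, de-duplicating as you go. A rough sketch with the algoliasearch JavaScript client (index names as above; the query text and page size are placeholders):
const algoliasearch = require('algoliasearch');

const client = algoliasearch('YOUR_APP_ID', 'YOUR_SEARCH_KEY');
const indices = [
  'my_index-reviews-descending',
  'my_index-purchase-descending',
  'my_index-added-descending',
];

async function interleavedSearch(queryText, perIndex = 10) {
  // One round trip fetching from all three replicas at once.
  const { results } = await client.multipleQueries(
    indices.map(indexName => ({
      indexName,
      query: queryText,
      params: { hitsPerPage: perIndex },
    }))
  );

  // Round-robin merge: reviews, purchases, newest, reviews, ...
  const seen = new Set();
  const merged = [];
  for (let i = 0; i < perIndex; i++) {
    for (const { hits } of results) {
      const hit = hits[i];
      if (hit && !seen.has(hit.objectID)) {
        seen.add(hit.objectID);
        merged.push(hit);
      }
    }
  }
  return merged;
}
Note that the de-duplication means a product ranking highly on more than one criterion only appears once, so the blocks will occasionally be uneven; whether that's acceptable is a product decision.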

How do I design a data warehouse model that allows me to dynamically query for total action count, unique user count, and a count of total users

I'm currently facing a problem where I'm trying to create a login utilization report for a web application. To describe the report a bit: users in our system are tagged with different metadata. For example, I could be tagged with "New York City" and "Software Engineer", while other users may be tagged with different locations and job titles. The utilization report is essentially the following:
Time period (quarterly)
Total number of logins
Unique logins
Total users
"Engagement percentage" (Unique logins / Total users)
The catch is, the report needs to be a bit dynamic. I need to be able to apply any combination of job titles and locations and have each of the numbers reflect the applied metadata. The time period also needs to be easily adjustable to support weekly, monthly, and yearly as well. Ideally, I can create a view in Redshift that allows our BI software users to run this report whenever they see fit.
My question is, what is an ideal strategy to design a data model to support this report? I currently have an atomic fact table that contains all logins with this schema:
User ID
Login ID
Login Timestamp
Job Title Group ID (MD5 hash of job titles to support multi valued)
Location Group ID (MD5 hash of locations to support multi valued)
The fact table allows me to easily write a query to aggregate on total (count of login id) and unique (distinct count of user id).
How can I supplement the data I have to include a count of total users? Is what I currently have the best approach?
Hierarchical, fixed-depth many-to-one (M:1) relationships between attributes are typically denormalized or collapsed into a flattened dimension table. If you’ve spent most of your career designing entity-relationship models for transaction processing systems, you’ll need to resist your instinctive tendency to normalize or snowflake a M:1 relationship into smaller subdimensions; dimension denormalization is the name of the game in dimensional modeling.
It is relatively common to have multiple M:1 relationships represented in a single dimension table. One-to-one relationships, like a unique product description associated with a product code, are also handled in a dimension table. Occasionally many-to-one relationships are resolved in the fact table, such as the case when the detailed dimension table has millions of rows and its roll-up attributes are frequently changing. However, using the fact table to resolve M:1 relationships should be done sparingly.
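To make the flattened approach concrete, here is a toy sketch of the report's measures computed in plain JavaScript over fact rows joined to a denormalized user dimension (field names are hypothetical; in Redshift the equivalent would be a view doing this filtering and GROUP BY in SQL):
// rows: login fact rows joined to denormalized user attributes, e.g.
//   { userId, loginId, jobTitles: [...], locations: [...] }
// allUsers: the user dimension, carrying the same attributes.
function utilization(rows, allUsers, { jobTitles = [], locations = [] }) {
  const matches = u =>
    (!jobTitles.length || jobTitles.some(t => u.jobTitles.includes(t))) &&
    (!locations.length || locations.some(l => u.locations.includes(l)));

  const filtered = rows.filter(matches);
  const totalLogins = filtered.length;
  const uniqueLogins = new Set(filtered.map(r => r.userId)).size;
  const totalUsers = allUsers.filter(matches).length;

  return {
    totalLogins,
    uniqueLogins,
    totalUsers,
    engagement: totalUsers ? uniqueLogins / totalUsers : 0,
  };
}
The point this illustrates is that "total users" has to come from the user dimension, not from the login fact table: users with zero logins in the period never appear in the facts but still belong in the engagement denominator.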
In your case, I would recommend the following design as a solution:

Firebase database Retrieve high score rank of a user [duplicate]

I have a project in which I need to display a leaderboard of the top 20 users, and if the current user is not in the leaderboard they should appear in 21st place with their current ranking.
Is there an efficient way to do this?
I am using Cloud Firestore as the database. I believe it was a mistake to choose it instead of MongoDB, but I am in the middle of the project so I must do it with Cloud Firestore.
The app will be used by 30K users. Is there any way to do it without fetching all 30K users?
this.authProvider.afs.collection('profiles', ref => ref
  .where('status', '==', 1)
  .where('point', '>', 0)
  .orderBy('point', 'desc')
  .limit(20))
This is the code I used to get the top 20, but what would be the best practice for getting the currently logged-in user's rank if they are not in the top 20?
Finding an arbitrary player's rank in a leaderboard, in a manner that scales, is a common hard problem with databases.
There are a few factors that will drive the solution you'll need to pick, such as:
Total number of players
Rate at which individual players add scores
Rate at which new scores are added (concurrent players × the above)
Score range: bounded or unbounded
Score distribution (uniform, or are there 'hot scores'?)
Simplistic approach
The typical simplistic approach is to count all players with a higher score, e.g. SELECT count(id) FROM players WHERE score > {playerScore}.
This method works at low scale, but as your player base grows, it quickly becomes both slow and resource expensive (both in MongoDB and Cloud Firestore).
Cloud Firestore doesn't natively support count as it's a non-scalable operation. You'll need to implement it on the client-side by simply counting the returned documents. Alternatively, you could use Cloud Functions for Firebase to do the aggregation on the server-side to avoid the extra bandwidth of returning documents.
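As an illustration of the simplistic approach with the Firestore web SDK, counting the returned documents on the client (collection and field names follow the question's code; newer SDK versions have since added a server-side count aggregation, which removes the bandwidth cost but not the fundamental scaling concern):
import { collection, query, where, getDocs } from "firebase/firestore";

// Rank = number of players with a higher score, plus one.
// Expensive because every matching document is read and returned.
async function simplisticRank(db, playerScore) {
  const snapshot = await getDocs(
    query(collection(db, "profiles"), where("point", ">", playerScore))
  );
  return snapshot.size + 1;
}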
Periodic Update
Rather than giving players a live ranking, change it to only update every so often, such as every hour. For example, if you look at Stack Overflow's rankings, they are only updated daily.
For this approach, you could schedule a function, or schedule an App Engine task if it takes longer than 540 seconds to run. The function would write out the player list into a ladder collection with a new rank field populated with each player's rank. When a player views the ladder now, you can easily get the top X plus the player's own rank in O(X) time.
Better yet, you could optimize further and explicitly write out the top X as a single document as well, so that to retrieve the ladder you only need to read 2 documents, top-X and player, saving money and making it faster.
This approach would work for any number of players and any write rate, since it's done out of band. You might need to adjust the frequency as you grow, depending on your willingness to pay. 30K players each hour would be $0.072 per hour ($1.73 per day) unless you did optimizations (e.g., ignore all 0-score players, since you know they are tied for last).
Inverted Index
In this method, we'll create somewhat of an inverted index. This method works if there is a bounded score range that is significantly smaller than the number of players (e.g., scores of 0-999 vs 30K players). It could also work for an unbounded score range where the number of unique scores is still significantly smaller than the number of players.
Using a separate collection called 'scores', you have a document for each individual score (non-existent if no one has that score) with a field called player_count.
When a player gets a new total score, you'll do 1-2 writes in the scores collection: one write to +1 the player_count for their new score and, if it isn't their first time, one to -1 their old score. This approach works for both "your latest score is your current score" and "your highest score is your current score" style ladders.
Finding out a player's exact rank is as easy as something like SELECT sum(player_count)+1 FROM scores WHERE score > {playerScore}.
Since Cloud Firestore doesn't support sum(), you'd do the above but sum on the client side. The +1 is because the sum is the number of players above you, so adding 1 gives you that player's rank.
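A sketch of both halves with the Firestore web SDK (assuming each score document also stores its numeric value in a score field, since document IDs are strings and can't be range-compared numerically):
import {
  collection, doc, getDocs, increment, query, where, writeBatch,
} from "firebase/firestore";

// Called when a player's total changes from oldScore to newScore
// (pass oldScore = null for a first-ever score).
async function recordScore(db, oldScore, newScore) {
  const batch = writeBatch(db);
  batch.set(doc(db, "scores", String(newScore)),
    { score: newScore, player_count: increment(1) }, { merge: true });
  if (oldScore !== null) {
    batch.set(doc(db, "scores", String(oldScore)),
      { player_count: increment(-1) }, { merge: true });
  }
  await batch.commit();
}

// Rank = 1 + sum of player_count over all scores above the player's.
async function rank(db, playerScore) {
  const snapshot = await getDocs(
    query(collection(db, "scores"), where("score", ">", playerScore))
  );
  let playersAbove = 0;
  snapshot.forEach(d => { playersAbove += d.data().player_count; });
  return playersAbove + 1;
}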
Using this approach, you'll need to read a maximum of 999 documents, averaging around 500, to get a player's rank, although in practice this will be less if you delete scores that have zero players.
The write rate of new scores is important to understand, as you'll only be able to update an individual score once every 2 seconds* on average, which for a perfectly distributed score range from 0-999 would mean 500 new scores/second**. You can increase this by using distributed counters for each score.
* Only 1 new score per 2 seconds, since each score generates 2 writes
** Assuming an average game time of 2 minutes, 500 new scores/second could support 60,000 concurrent players without distributed counters. If you're using a "highest score is your current score" ladder, this will be much higher in practice.
Sharded N-ary Tree
This is by far the hardest approach, but could give you both faster and real-time ranking positions for all players. It can be thought of as a read-optimized version of the Inverted Index approach above, whereas the Inverted Index approach is a write-optimized version of this.
You can follow the related article 'Fast and Reliable Ranking in Datastore' for a general approach that is applicable. For this approach, you'll want a bounded score (it's possible with an unbounded one, but it will require changes to the below).
I wouldn't recommend this approach as you'll need to do distributed counters for the top level nodes for any ladder with semi-frequent updates, which would likely negate the read-time benefits.
Final thoughts
Depending on how often you display the leaderboard for players, you could combine approaches to optimize this a lot more.
Combining 'Inverted Index' with 'Periodic Update' at a shorter time frame can give you O(1) ranking access for all players.
As long as the leaderboard is viewed more than 4 times (across all players) over the duration of the 'Periodic Update', you'll save money and have a faster leaderboard.
Essentially, each period, say every 5-15 minutes, you read all documents from scores in descending order. As you go, keep a running total of player_count. Re-write each score into a new collection called scores_ranking with a new field, players_above. This new field contains the running total excluding the current score's player_count.
To get a player's rank, all you need to do now is read the document for the player's score from scores_ranking: their rank is players_above + 1.
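A sketch of that periodic rebuild, runnable from a scheduled Cloud Function (collection and field names follow the text above; Firestore batches cap at 500 writes, so a real ladder would chunk the batch):
import {
  collection, doc, getDocs, orderBy, query, writeBatch,
} from "firebase/firestore";

// Rebuild scores_ranking from scores. players_above for a score is
// the running total of player_count over all strictly higher scores.
async function rebuildRanking(db) {
  const snapshot = await getDocs(
    query(collection(db, "scores"), orderBy("score", "desc"))
  );
  const batch = writeBatch(db);
  let runningTotal = 0;
  snapshot.forEach(d => {
    const { score, player_count } = d.data();
    batch.set(doc(db, "scores_ranking", String(score)), {
      score,
      players_above: runningTotal,
    });
    runningTotal += player_count;
  });
  await batch.commit();
}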
One solution not mentioned here, which I'm about to implement in my online game and which may be usable in your use case, is to estimate the user's rank if they're not in any visible leaderboard, because frankly the user isn't going to know (or care?) whether they're ranked 22,882nd or 22,838th.
If 20th place has a score of 250 points and there are 32,000 players in total, then each point below 250 is worth on average 127 places, though you may want to apply some sort of curve so that as a player moves up a point toward the bottom of the visible leaderboard they don't jump exactly 127 places each time - most of the jumps in rank should happen closer to zero points.
It's up to you whether you want to identify this estimated ranking as an estimate or not, and you could add some random salt to the number so it looks authentic:
// Real rank: 22,838
// Display to user:
player rank: ~22.8k // rounded
player rank: 22,882nd // rounded with random salt of 44
I'll be doing the latter.
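A sketch of the estimation itself, using the numbers from the example above (the salt range is arbitrary; tune it to taste):
// Estimate the rank of a player sitting below the visible leaderboard.
// cutoffScore: score of the last visible place (250 for 20th, above).
function estimateRank(playerScore, cutoffScore, totalPlayers, visiblePlaces = 20) {
  const placesPerPoint = (totalPlayers - visiblePlaces) / cutoffScore; // ~127 here
  const estimate = visiblePlaces + (cutoffScore - playerScore) * placesPerPoint;
  const salt = Math.floor(Math.random() * 100) - 50; // +/- 50 places
  return Math.max(visiblePlaces + 1, Math.round(estimate + salt));
}

// estimateRank(70, 250, 32000) -> roughly 23,000, give or take the salt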
Alternative perspective - NoSQL and document stores make this type of task overly complex. If you used Postgres this is pretty simple using a count function. Firebase is tempting because it's easy to get going with but use cases like this are when relational databases shine. Supabase is worth a look https://supabase.io/ similar to firebase so you can get going quickly with a backend but its opensource and built on Postgres so you get a relational database.
A solution that hasn't been mentioned by Dan is the use of security rules combined with Google Cloud Functions.
Create the highscore's map. Example:
highScores (top20)
Then:
Give users write/read access to highScores.
Give the highScores document/map a property holding the smallest score.
Let a user write to highScores only if their score > the smallest score.
Create a write trigger in Google Cloud Functions that activates when a new high score is written. In that function, delete the smallest score.
This looks to me like the easiest option. It is realtime as well.
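A sketch of that trigger with the first-generation Cloud Functions API, assuming highScores is a single document whose fields map player names to scores (the document path and the 20-entry cap are assumptions to match the setup above):
const functions = require("firebase-functions");
const admin = require("firebase-admin");
admin.initializeApp();

const MAX_ENTRIES = 20;

// Whenever highScores is written, drop the smallest score once the
// map has grown past 20 entries.
exports.trimHighScores = functions.firestore
  .document("leaderboards/highScores")
  .onWrite(async change => {
    const scores = change.after.data() || {};
    const entries = Object.entries(scores);
    if (entries.length <= MAX_ENTRIES) return null;

    const [smallestKey] = entries.reduce((min, e) => (e[1] < min[1] ? e : min));
    return change.after.ref.update({
      [smallestKey]: admin.firestore.FieldValue.delete(),
    });
  });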
You could do something with Cloud Storage: manually maintain a file that holds all the users' scores (in order), and then just read that file and find the position of the score within it.
Then, to write to the file, you could set up a cron job that periodically takes all documents with an isWrittenToFile flag of false, adds them to the file, and marks them as true. That way you won't eat up your writes. And reading a file every time the user wants to view their position is probably not that intensive. It could be done from a cloud function.
2022 Updated and Working Answer
To solve the problem of having a leaderboard with users and points, and knowing your position in that leaderboard in a less problematic way, I have this solution:
1) Create your Firestore document like this
In my case, I have a document perMission that has one field per user, with the userId as the property name and that user's leaderboard points as the value.
This makes it easy to update the values from my JavaScript code.
For example, whenever a user completes a mission (updating their points):
import { doc, setDoc, increment } from "firebase/firestore";
const docRef = doc(db, 'leaderboards', 'perMission');
setDoc(docRef, { [userId]: increment(1) }, { merge: true });
The increment value can be whatever you want. In my case, I run this code every time the user completes a mission, increasing the value.
2) To get the position inside the leaderboard
On the client side, to get your position, you order the values and then loop through them to find your position in the leaderboard.
You could also use the object to get all the users and their respective points, in order. But I'm not doing that here; I'm only interested in my position.
The code is commented explaining each block.
// Values coming from the database.
const leaderboards = {
  userId1: 1,
  userId2: 20,
  userId3: 30,
  userId4: 12,
  userId5: 40,
  userId6: 2
};

// Values coming from your user.
const myUser = "userId4";
const myPoints = leaderboards[myUser];

// Sort the values in descending order.
const sortedLeaderboardsPoints = Object.values(leaderboards).sort(
  (a, b) => b - a
);

// Your position is the index of the first occurrence of your score,
// plus one; indexOf returns the first match, so tied scores all get
// the same rank.
const myPosition = sortedLeaderboardsPoints.indexOf(myPoints) + 1;

// Will print: [40, 30, 20, 12, 2, 1]
console.log(sortedLeaderboardsPoints);

// Will print: 4
console.log(myPosition);
You can now use your position. Even if the array is super big, the logic runs on the client side, so be careful with that. You can also improve the client-side code to reduce the array, limit it, etc.
But be aware that you should do the rest of the work on the client side, not on the Firebase side.
This answer is mainly to show you how to store and use the database in a "good way".

Database schema for a tinder like app

I have a database of millions of objects (simply put: a lot of objects). Every day I will present 3 selected objects to each of my users, and as with Tinder they can swipe left to say they don't like an object or swipe right to say they like it.
I select each object based on its location (those closest to the user are selected first) and also based on a few user settings.
I'm using MongoDB.
Now the problem: how do I design the database so that it can quickly provide a daily selection of objects to show to the end user, skipping all the objects they have already swiped?
Well, considering you have made your choice of using MongoDB, you will have to maintain multiple collections. One is your main collection, and you will have to maintain user-specific collections which hold user data, say the document ids the user has swiped. Then, when you want to fetch data, you might want to do a $setDifference aggregation. $setDifference does this:
Takes two sets and returns an array containing the elements that only exist in the first set; i.e. performs a relative complement of the second set relative to the first.
Now how performant this is would depend on the size of your sets and the overall scale.
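In practice, the same "everything the user hasn't swiped" query is often written with a plain $nin filter against the swiped ids (shown below with the Node.js MongoDB driver instead of $setDifference, purely to keep the sketch short; it assumes a swipedIds array on the user document and a 2dsphere index on objects.location):
// db: a connected Db instance from the MongoDB Node.js driver.
async function dailySelection(db, userId, userLocation) {
  const user = await db.collection("users").findOne({ _id: userId });
  const swiped = (user && user.swipedIds) || [];

  // Closest unswiped objects first, 3 per day.
  return db.collection("objects")
    .find({
      _id: { $nin: swiped },
      location: {
        $near: { $geometry: { type: "Point", coordinates: userLocation } },
      },
    })
    .limit(3)
    .toArray();
}
This shares the scaling caveat already raised: swipedIds grows without bound, which is exactly the set-membership pressure the later solutions try to relieve.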
EDIT
I agree with your comment that this is not a scalable solution.
Solution 2:
One solution I could think of is to use a graph-based solution, like Neo4j. You could represent all your 1M objects and all your user objects as nodes and have relationships between users and the objects they have swiped. Your query would be to return a list of all objects the user is not connected to.
You cannot shard a graph, which brings up scaling challenges. Graph-based solutions require that the entire graph be in memory, so the feasibility of this solution depends on you.
Solution 3:
Use MySQL. Have 2 tables, one being the objects table and the other a (uid, viewed_object) mapping. A join would solve your problem. Joins work well for the longest time, until you hit scale. So I don't think it's a bad starting point.
Solution 4:
Use Bloom filters. Your problem eventually boils down to a set-membership problem: given a set of ids, check whether an id is part of another set. A Bloom filter is a probabilistic data structure that answers set membership. They are super small and super efficient. But yes, it's probabilistic: false negatives will never happen, but false positives can, so that's a trade-off. Check out this post for how it's used: http://blog.vawter.com/2016/03/17/Using-Bloomfilters-to-Avoid-Repetition/
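A toy Bloom filter to make the trade-off concrete (fixed bit-array size and hash count here; real implementations derive both from the expected number of items and the target false-positive rate):
class BloomFilter {
  constructor(bits = 1024, hashes = 3) {
    this.bits = bits;
    this.hashes = hashes;
    this.array = new Uint8Array(bits);
  }

  // Simple seeded FNV-style string hash; fine for a demo.
  hash(value, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < value.length; i++) {
      h = Math.imul(h ^ value.charCodeAt(i), 16777619);
    }
    return (h >>> 0) % this.bits;
  }

  add(value) {
    for (let s = 0; s < this.hashes; s++) this.array[this.hash(value, s)] = 1;
  }

  // false => definitely never added; true => probably added.
  mightContain(value) {
    for (let s = 0; s < this.hashes; s++) {
      if (!this.array[this.hash(value, s)]) return false;
    }
    return true;
  }
}

// Usage: remember which object ids a user has already been shown.
const seen = new BloomFilter();
seen.add("objectId123");
console.log(seen.mightContain("objectId123")); // true
console.log(seen.mightContain("objectId999")); // false (almost certainly)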
I'll update the answer if I can think of something else.

REST - should endpoints include summary data?

Simple model:
Products, which have weights (can be mixed - ounces, grams, kilograms etc)
Categories, which have many products in them
REST endpoints:
/products - get all products and post to add a new one
/products/id - delete,update,patch a single product
/categories/id - delete,update,patch a single category
/categories - get all categories and post to add a new one
The question is that the frontend client wants to display a chart of the total weight of products in ALL categories. Imagine a bar chart or a Pie chart showing all categories, and the total weight of products in each.
Obviously the product model has a 'weight_value' and a 'weight_unit', so you know the weight of a product and its measure (oz, grams, etc.).
I see 2 ways of solving this:
In the category model, add a computed field that totals the weight of all the products in a category. The total is calculated on the fly for that category (not stored) and so is always up to date. Note the client always needs all categories (e.g. to populate drop-downs when creating a product, you have to choose the category it belongs to), so it will now automatically always have the total weight of each category. So constructing the chart is easy - no need to get the data for the chart from the server, it's already on the client. But the first load of that data might be slow. Only when a product is added will the totals be stale (an insignificant change to the overall number, and stale totals are fine anyway).
Create a separate endpoint, say categories/totals, that returns for all categories: a category id, name and total weight. This endpoint would loop through all the categories, calculating the weight for each and returning a category dataset with weights for each.
The problem I see with (1) is that it is not that performant. I know it's not very scalable, especially when a category ends up with a lot of products (millions!) but this is a hobby project so not worried about that.
The advantage of (1) is that you always have the data on the client and don't have to make a separate request to get the data for the chart.
However, the advantage of (2) is that every request to /categories/id does not incur a potentially expensive totalling operation (because now it is inside its own dedicated endpoint). Of course that endpoint will have to do quite a complex database query to calculate the weights of all products for all categories (although handing that off to the database should be the way to go, as that is what databases are good for!)
I'm really stumped on which is the better way to do this. Which way is more true to the RESTful way (HATEOAS, basically)?
I would go with 2, favouring scalability and best practices. It does not really make sense to perform the weight calculations every time the category is requested, and even though you may not anticipate this being a problem since it is a 'hobby' project, it's always best to avoid shortcuts where the benefits are minimal (or so experience has taught me!).
Choosing 1, the only advantage would be that you have one less endpoint to set up and one less call to make to get the weight totals - the extra call shouldn't add too much overhead, and setting up the extra endpoint shouldn't take too much effort.
Regardless of whether you choose 1 or 2, I would consider caching the weight totals (for a reasonable amount of time, depending on the accuracy required) to increase performance even further.
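As a sketch of option 2 plus that cache, in Express-style JavaScript (the data-access helpers and unit table are placeholders; in production you would push the summing into a single GROUP BY in the database, as discussed above):
const express = require("express");
const app = express();

// Placeholder data-access helpers; swap in your real queries or ORM.
const { fetchCategories, fetchProductsByCategory } = require("./db");

const UNIT_TO_GRAMS = { g: 1, kg: 1000, oz: 28.3495, lb: 453.592 };

let cache = { expires: 0, data: null };
const TTL_MS = 60 * 1000; // how stale the chart is allowed to be

app.get("/categories/totals", async (req, res) => {
  if (Date.now() < cache.expires) return res.json(cache.data);

  const categories = await fetchCategories();
  const totals = [];
  for (const category of categories) {
    const products = await fetchProductsByCategory(category.id);
    const totalGrams = products.reduce(
      (sum, p) => sum + p.weight_value * UNIT_TO_GRAMS[p.weight_unit], 0);
    totals.push({ id: category.id, name: category.name, totalGrams });
  }

  cache = { expires: Date.now() + TTL_MS, data: totals };
  res.json(totals);
});

app.listen(3000);
Normalizing every weight to a single unit server-side also spares each client from re-implementing the oz/grams/kilograms conversion.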