a database to store a non stationary distribution - nosql

I have many categorical distributions. A categorical distribution is where one describes the probability of an event drawn from a set of k events. I need to be able to access the probability of an event very quickly.
One way to store a categorical distribution is in Redis using a sorted set. Each key indexes a separate distribution, each member of the sorted set is a specific event and each score is the number of times you've seen that event. For each key(distribution) you would also store the sum of counts for each event in that distribution, so you can normalise properly.
The question I'd like to ask is: what is a good way to store this data if the probabilities are changing over time? I'd essentially like to be able to forget old observations - i.e. to decrement the score and normalisation constant for each key at some regular interval.
With the redis approach above, I could run a cron job every d minutes, iterate over each distribution and decrement each count in the zscore and the normalisation constant. However, this seems bit wrong (I'm sure my mum told me to never iterate over KEYS *), and so I'm wondering if anyone else has solved this problem a bit more comprehensively?

I'm guessing that what feels wrong to you is some combination of:
The need to visit every distribution, every member of each ZSET, and the normalisation constant whenever the cron job runs
The way that the unconditional decrement operation will, over time, skew distributions in favor of events that happen multiple times per cron cycle
I haven't done anything like this before, but one solution comes to mind if you're able to spare more storage.
The idea is to store, at a regular interval, a timestamped queue of snapshots. Each snapshot represents the event counts in your distributions for that interval of time. When you want to expire the old probabilities in your distribution, you pop the expired snapshots off the list and decrement the ZSETs accordingly.
More concretely, you'll need to:
Keep track in memory of the events that occur during the interval [tk - tk-1) and how many times each occurred -- a set of (event, count) pairs. This is in addition to the (presumably) real-time updating of the ZSET scores and normalisation factors that you currently do.
At each tick tk, store the snapshot:
Create a unique key Sk to represent the snapshot at tk -- like a UUID or similar
For each event E in the snapshot, create a unique hash key q(E). Choose a key encoding that will allow you to recover the distribution (ZSET) key and event (member) key for that event.
Call HSET Sk with the event key q(E) and event count |E| to store the event data. Repeat for all events in the snapshot.
RPUSH SNAPSHOTS <timestamp>:Sk
At each expiry tick tm, expire old snapshots:
LPOP the SNAPSHOTS list, decoding the timestamp and verifying whether expired.
If not expired, LPUSH it back onto the SNAPSHOTS list and you're done until the next expiry tick. Otherwise...
Decode the snapshot key Sk
Using the results of HKEYS Sk, decode each event key q(E), get the corresponding count, and then decrement the appropriate ZSET and normalisation factor by that amount.
Repeat while expired snapshots still exist in the SNAPSHOTS list.
The amount of extra storage required will depend on the length of the snapshot and expiry intervals and the number of distinct events that occur within each snapshot interval.
In the worst case, every distribution and event will be represented in each snapshot, so this will not help with wrongness factor #1. Optimistically, a suitably small percentage of distributions and/or events will be represented in any snapshot, and the efficiency of the expiration process will improve. But this will address wrongness factor #2 even in the worst case, since events that happened recently will not be unconditionally decremented in your distributions each time the expiration cron job runs.

Related

Finding a hash function that calculates average of an item in a stream of tuples

am new to this topic and I am trying to learn so my question might have been a bit confusing. What my question actually is: I have a stream of data coming in the form of tuple [Country, state, person]. now on this stream of data, I want to perform the operation of calculating the average number of people in the state. I though of doing it by taking the Key as [Country, State]. for every unique tuple, a hash function updates a bucket which contains the count.
For Eg: If I have a tuple [USA, Ohio, person1], then when this comes in bucket 2 is updated, and every time the tuple with USA and Ohio comes in, this count keeps increasing. this would give me the total number of people who are from USA-Ohio, but I am confused on how to find the average of it i.e the average number of people who belong to [USA,Ohio]. I hope this cleared up things a bit.

Firebase database Retrieve high score rank of a user [duplicate]

I have project that I need to display a leaderboard of the top 20, and if the user not in the leaderboard they will appear in the 21st place with their current ranking.
Is there efficient way to this?
I am using Cloud Firestore as a database. I believe it was mistake to choose it instead of MongoDB but I am in the middle of the project so I must do it with Cloud Firestore.
The app will be use by 30K users. Is there any way to do it without getting all the 30k users?
this.authProvider.afs.collection('profiles', ref => ref.where('status', '==', 1)
.where('point', '>', 0)
.orderBy('point', 'desc').limit(20))
This is code I did to get the top 20 but what will be the best practice for getting current logged in user rank if they are not in the top 20?
Finding an arbitrary player's rank in leaderboard, in a manner that scales is a common hard problem with databases.
There are a few factors that will drive the solution you'll need to pick, such as:
Total Number players
Rate that individual players add scores
Rate that new scores are added (concurrent players * above)
Score range: Bounded or Unbounded
Score distribution (uniform, or are their 'hot scores')
Simplistic approach
The typical simplistic approach is to count all players with a higher score, eg SELECT count(id) FROM players WHERE score > {playerScore}.
This method works at low scale, but as your player base grows, it quickly becomes both slow and resource expensive (both in MongoDB and Cloud Firestore).
Cloud Firestore doesn't natively support count as it's a non-scalable operation. You'll need to implement it on the client-side by simply counting the returned documents. Alternatively, you could use Cloud Functions for Firebase to do the aggregation on the server-side to avoid the extra bandwidth of returning documents.
Periodic Update
Rather than giving them a live ranking, change it to only updating every so often, such as every hour. For example, if you look at Stack Overflow's rankings, they are only updated daily.
For this approach, you could schedule a function, or schedule App Engine if it takes longer than 540 seconds to run. The function would write out the player list as in a ladder collection with a new rank field populated with the players rank. When a player views the ladder now, you can easily get the top X + the players own rank in O(X) time.
Better yet, you could further optimize and explicitly write out the top X as a single document as well, so to retrieve the ladder you only need to read 2 documents, top-X & player, saving on money and making it faster.
This approach would really work for any number of players and any write rate since it's done out of band. You might need to adjust the frequency though as you grow depending on your willingness to pay. 30K players each hour would be $0.072 per hour($1.73 per day) unless you did optimizations (e.g, ignore all 0 score players since you know they are tied last).
Inverted Index
In this method, we'll create somewhat of an inverted index. This method works if there is a bounded score range that is significantly smaller want the number of players (e.g, 0-999 scores vs 30K players). It could also work for an unbounded score range where the number of unique scores was still significantly smaller than the number of players.
Using a separate collection called 'scores', you have a document for each individual score (non-existent if no-one has that score) with a field called player_count.
When a player gets a new total score, you'll do 1-2 writes in the scores collection. One write is to +1 to player_count for their new score and if it isn't their first time -1 to their old score. This approach works for both "Your latest score is your current score" and "Your highest score is your current score" style ladders.
Finding out a player's exact rank is as easy as something like SELECT sum(player_count)+1 FROM scores WHERE score > {playerScore}.
Since Cloud Firestore doesn't support sum(), you'd do the above but sum on the client side. The +1 is because the sum is the number of players above you, so adding 1 gives you that player's rank.
Using this approach, you'll need to read a maximum of 999 documents, averaging 500ish to get a players rank, although in practice this will be less if you delete scores that have zero players.
Write rate of new scores is important to understand as you'll only be able to update an individual score once every 2 seconds* on average, which for a perfectly distributed score range from 0-999 would mean 500 new scores/second**. You can increase this by using distributed counters for each score.
* Only 1 new score per 2 seconds since each score generates 2 writes
** Assuming average game time of 2 minute, 500 new scores/second could support 60000 concurrent players without distributed counters. If you're using a "Highest score is your current score" this will be much higher in practice.
Sharded N-ary Tree
This is by far the hardest approach, but could allow you to have both faster and real-time ranking positions for all players. It can be thought of as a read-optimized version of of the Inverted Index approach above, whereas the Inverted Index approach above is a write optimized version of this.
You can follow this related article for 'Fast and Reliable Ranking in Datastore' on a general approach that is applicable. For this approach, you'll want to have a bounded score (it's possible with unbounded, but will require changes from the below).
I wouldn't recommend this approach as you'll need to do distributed counters for the top level nodes for any ladder with semi-frequent updates, which would likely negate the read-time benefits.
Final thoughts
Depending on how often you display the leaderboard for players, you could combine approaches to optimize this a lot more.
Combining 'Inverted Index' with 'Periodic Update' at a shorter time frame can give you O(1) ranking access for all players.
As long as over all players the leaderboard is viewed > 4 times over the duration of the 'Periodic Update' you'll save money and have a faster leaderboard.
Essentially each period, say 5-15 minutes you read all documents from scores in descending order. Using this, keep a running total of players_count. Re-write each score into a new collection called scores_ranking with a new field players_above. This new field contains the running total excluding the current scores player_count.
To get a player's rank, all you need to do now is read the document of the player's score from score_ranking -> Their rank is players_above + 1.
One solution not mentioned here which I'm about to implement in my online game and may be usable in your use case, is to estimate the user's rank if they're not in any visible leaderboard because frankly the user isn't going to know (or care?) whether they're ranked 22,882nd or 22,838th.
If 20th place has a score of 250 points and there are 32,000 players total, then each point below 250 is worth on average 127 places, though you may want to use some sort of curve so as they move up a point toward bottom of the visible leaderboard they don't jump exactly 127 places each time - most of the jumps in rank should be closer to zero points.
It's up to you whether you want to identify this estimated ranking as an estimation or not, and you could add some a random salt to the number so it looks authentic:
// Real rank: 22,838
// Display to user:
player rank: ~22.8k // rounded
player rank: 22,882nd // rounded with random salt of 44
I'll be doing the latter.
Alternative perspective - NoSQL and document stores make this type of task overly complex. If you used Postgres this is pretty simple using a count function. Firebase is tempting because it's easy to get going with but use cases like this are when relational databases shine. Supabase is worth a look https://supabase.io/ similar to firebase so you can get going quickly with a backend but its opensource and built on Postgres so you get a relational database.
A solution that hasn't been mentioned by Dan is the use of security rules combined with Google Cloud Functions.
Create the highscore's map. Example:
highScores (top20)
Then:
Give the users write/read access to highScores.
Give the document/map highScores the smallest score in a property.
Let the users only write to highScores if his score > smallest score.
Create a write trigger in Google Cloud Functions that will activate when a new highScore is written. In that function, delete the smallest score.
This looks to me the easiest option. It is realtime as well.
You could do something with cloud storage. So manually have a file that has all the users' scores (in order), and then you just read that file and find the position of the score in that file.
Then to write to the file, you could set up a CRON job to periodically add all documents with a flag isWrittenToFile false, add them all to the file (and mark them as true). That way you won't eat up your writes. And reading a file every time the user wants to view their position is probably not that intensive. It could be done from a cloud function.
2022 Updated and Working Answer
To solve the problem of having a leaderboards with user and points, and to know your position in this leaderboards in an less problematic way, I have this solution:
1) You should create your Firestorage Document like this
In my case, I have a document perMission that has for each user a field, with the userId as property and the respective leaderboard points as value.
It will be easier to update the values inside my Javascript code.
For example, whenever an user completed a mission (update it's points):
import { doc, setDoc, increment } from "firebase/firestore";
const docRef = doc(db, 'leaderboards', 'perMission');
setDoc(docRef, { [userId]: increment(1) }, { merge: true });
The increment value can be as you want. In my case I run this code every time the user completes a mission, increasing the value.
2) To get the position inside the leaderboards
So here in your client side, to get your position, you have to order the values and then loop through them to get your position inside the leaderboards.
Here you can also use the object to get all the users and its respective points, ordered. But here I am not doing this, I am only interested in my position.
The code is commented explaining each block.
// Values coming from the database.
const leaderboards = {
userId1: 1,
userId2: 20,
userId3: 30,
userId4: 12,
userId5: 40,
userId6: 2
};
// Values coming from your user.
const myUser = "userId4";
const myPoints = leaderboards[myUser];
// Sort the values in decrescent mode.
const sortedLeaderboardsPoints = Object.values(leaderboards).sort(
(a, b) => b - a
);
// To get your specific position
const myPosition = sortedLeaderboardsPoints.reduce(
(previous, current, index) => {
if (myPoints === current) {
return index + 1 + previous;
}
return previous;
},
0
);
// Will print: [40, 30, 20, 12, 2, 1]
console.log(sortedLeaderboardsPoints);
// Will print: 4
console.log(myPosition);
You can now use your position, even if the array is super big, the logic is running in the client side. So be careful with that. You can also improve the client side code, to reduce the array, limit it, etc.
But be aware that you should do the rest of the code in your client side, and not Firebase side.
This answer is mainly to show you how to store and use the database in a "good way".

How can I keep record (or see) of what value was given by the distribution each time the process took place?

I have beeen constructing a model in Anylogic in the last weeks, and I am currently simulating the time a truck takes to deliver its products, so I used a delay for this in which different parameters are multiplied to several distributions. Is there any was I can keep track of the value of the distribution each time the process takes place. The following is in example:
normal(2, 8, 4.67, 1.96)*DropSize
The DropSize is my parameter, but I wish to know what value was generated for the normal distribution, and keep track of this.
Sure, several ways (as usual with AnyLogic :-) ). Here is one:
create a collection of type ArrayList:
Then, create a function that draws the random value, stores it in the collection and returns it as below:
Last, replace your code creating the random value with calling that function. Now, whenever you pull a value from the distribution, it is also stored in the collection.
cheers

Why is my identifier collision rate increasing?

I'm using a hash of IP + User Agent as a unique identifier for every user that visits a website. This is a simple scheme with a pretty clear pitfall: identifier collisions. Multiple individuals browse the internet with the same IP + user agent combination. Unique users identified by the same hash will be recognized as a single user. I want to know how frequently this identifier error will be made.
To calculate the frequency, I've created a two-step funnel that should theoretically convert at zero percent: publish.click > signup.complete. (Users have to signup before they publish.) Running this funnel for 1 day gives me a conversion rate of 0.37%. That figure is, I figured, my unique identifier collision probability for that funnel. Looking at the raw data (a table about 10,000 rows long), I confirmed this hypothesis. 37 signups were completed by new users identified by the same hash as old users who completed publish.click during the funnel period (1 day). (I know this because hashes matched up across the funnel, while UIDs, which are assigned at signup, did not.)
I thought I had it all figured out...
But then I ran the funnel for 1 week, and the conversion rate increased to 0.78%. For 5 months, the conversion rate jumped to 1.71%.
What could be at play here? Why is my conversion (collision) rate increasing with widening experiment period?
I think it may have something to do with the fact that unique users typically only fire signup.complete once, while they may fire publish.click multiple times over the course of a period. I'm struggling however to put this hypothesis into words.
Any help would be appreciated.
Possible explanations starting with the simplest:
The collision rate is relatively stable, but your initial measurement isn't significant because of the low volume of positives that you got. 37 isn't very many. In this case, you've got two decent data points.
The collision rate isn't very stable and changes over time as usage changes (at work, at home, using mobile, etc.). The fact that you got three data points that show an upward trend is just a coincidence. This wouldn't surprise me, as funnel conversion rates change significantly over time, especially on a weekly basis. Also bots that we haven't caught.
If you really get multiple publishes, and sign-ups are absolutely a one-time thing, then your collision rate would increase as users who only signed up and didn't publish eventually publish. That won't increase their funnel conversion, but it will provide an extra publish for somebody else to convert on. Essentially, every additional publish raises the probability that I, as a new user, am going to get confused with a previous publish event.
Note from OP. Hypothesis 3 turned out to be the correct hypothesis.

How to approximate processing time?

It's common to see messages like "Installation will take 10 min aprox." , etc in desktop applications. So, I wonder how can I calculate an approximate of how much time a certain process will take. Off course I won't install anything but I want to update some internal data and depending on the user usage this might take some time.
Is this possible in a iPhone app? How Cocoa guys do this, would it be the same way in iPhone apps?
Thanks in advance.
UPDATE: I want to rewrite/edit some files on disk, most of the time these files are not the same size so I cannot use timers for the first iteration and calculate the rest from that.
Is there any API that helps on calculating this?
If you have some list of things to process, each "thing" - usually better to measure a group of 10 or so "things" - is a unit of work. Your goal is to see how long it takes to process a single group and report the estimated time to completion.
One way is to create an NSDate at the start of each group and a new one at the end (the top and bottom of your for loop) for each group. Multiply the difference in seconds by however many groups you have left (minus the one you just processed) and that should be a reasonable estimate of the time remaining.
Of course this gets more complicated if one "thing" takes a lot longer to process than another "thing" - the above approach assumes all things take the same amount of time. In this case, however, you may need to keep track of an average window (across the last n "things" or groups thereof).
A more detailed response would require more details about your model and what work you're performing.