Get top 50 records for a certain value w/ mongo and meteor - mongodb

In my meteor project, I have a leaderboard of sorts, where it shows players of every level on a chart, spread across every level in the game. For simplicity's sake, let's say there are levels 1-100. Currently, to avoid overloading meteor, I just tell the server to send me every record newer than two weeks old, but that's not sufficient for an accurate leaderboard.
What I'm trying to do is show 50 records representing each level. So, if there are 100 records at level 1, 85 at level 2, 65 at level 3, and 45 at level 4, I want to show the latest 50 records from each level, making it so I would have [50, 50, 50, 45] records, respectively.
The data looks something like this:
{
    snapshotDate: new Date(),
    level: 1,
    hp: 5000,
    str: 100
}
I think this requires some mongodb aggregation, but I couldn't quite figure out how to do this in one query. It would be trivial to do it in two, though - select all records, group by level, sort each level by date, then take the last 50 records from each level. However, I would prefer to do it in one operation, if I could. Is it currently possible to do something like this?

Currently there is no way to pick the top n records of a group in the aggregation pipeline. There is an open, unresolved ticket regarding this: https://jira.mongodb.org/browse/SERVER-9377.
There are two ways around this:
1. Keep your document structure as it is now and aggregate, but grab the top n records and slice off the remaining records for each group on the client side.
Code:
var top_records = [];
db.collection.aggregate([
    // The sort needs to come before the $group, because once the
    // records are grouped by level there is only one document per group.
    {$sort: {"snapshotDate": -1}},
    // Collect all the records of each level in an array, in sorted order.
    {$group: {"_id": "$level", "recs": {$push: "$$ROOT"}}},
], {allowDiskUse: true}).forEach(function(level) {
    level.recs.splice(50); // Keep only the top 50 records.
    top_records.push(level);
});
Remember that this loads all the documents for each level and removes the unwanted records on the client side.
2. Alter your document structure to accomplish what you really need. If you only ever need the top n records, keep them in sorted order in the root document. This is accomplished using a sorted capped array.
Your document would look like this:
{
    level: 1,
    records: [{snapshotDate: 2, hp: 5000, str: 100},
              {snapshotDate: 1, hp: 5001, str: 101}]
}
where records is a capped array of size n whose sub-documents are always sorted in descending order of their snapshotDate.
To make the records array behave that way, we always perform an update operation when we need to insert documents into it for any level:
db.collection.update({"level":1},
{$push:{
recs:{
$each:[{snapshotDate:1,hp:5000,str:100},
{snapshotDate:2,hp:5001,str:101}],
$sort:{"snapshotDate":-1},
$slice:50 //Always trim the array size to 50.
}
}},{upsert:true})
This keeps the records array at a maximum of 50 entries and re-sorts it whenever new sub-documents are inserted for a level.
A simple find, db.collection.find({"level":{$in:[1,2,..]}}), would give you the top 50 records in order, for each selected level.
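For what it's worth, if you are on MongoDB 5.2 or newer, the $topN accumulator (added in a later server release) can do the whole thing in a single pipeline; a sketch, assuming that server version:
db.collection.aggregate([
    {$group: {
        _id: "$level",
        // Top 50 records per level, sorted by snapshotDate descending.
        recs: {$topN: {n: 50, sortBy: {snapshotDate: -1}, output: "$$ROOT"}}
    }}
])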

Related

MongoDB Collection Structure Performance

I have a MongoDB database of semi-complex records, and my reporting queries are struggling as the collection size increases. I want to make some reporting views that are optimized for quick searching and aggregating. Here is a sample format:
var record = {
    fieldOne: "",
    fieldTwo: "",
    fieldThree: "", // There are approx. 30 fields at this level
    ArrayOne: [
        {subItem1: ""},
        {subItem2: ""} // There are usually about 10-15 items in this array
    ],
    ArrayTwo: [
        {subItem1: ""}, // ArrayTwo items reference ArrayOne item ids for ref
        {subItem2: ""}  // There are usually about 20-30 items in this array
    ],
    ArrayThree: [
        {subItem1: ""}, // ArrayThree items reference both ArrayOne and ArrayTwo items for ref
        {subItem2: ""}, // There are usually about 200-300 items in this array
        {subArray: [
            {subItem1: ""},
            {subItem2: ""} // There are usually about 5 items in this array
        ]}
    ]
};
I used to have this data arranged so that ArrayTwo was inside ArrayOne items and ArrayThree was inside ArrayTwo items, so that referencing a parent was implied, but reporting became a nightmare with multiple nested levels of arrays.
I have a field called 'fieldName' at every level, which is how we target objects in the arrays.
I will often need to aggregate values from any of the 3 arrays across thousands of records in a query.
I see two ways of doing it.
A). Flatten and go vertical, making a single smaller record in the database for every item in ArrayThree, essentially adding 200 records per complex record. I tried this, and I already have 200K records from 5 days of new data coming in. The benefit is that I have fieldNames I can index.
B). Flatten horizontal, making every array flat within a single collection record, using the fieldName located in each array object as the key. This would make a single record with 200-300 fields in it. It would mean far fewer records in the collection, but the fields would be dynamic, so adding indexes would not be possible (that I know of).
At this time, I have approx 300K existing records that I would be building this View off of. If I go vertical, that would place 60 Million simple records in the db and if I go Horizontal, it would be 300K records with 200 fields flattened in each with no indexing ability.
What's the right way to approach this?
I'd be inclined to stick with the mongo philosophy and do individual entries for each distinct set/piece of information, rather than relying on references within a weird composite object.
60 million records is "a lot" (but it really isn't "a ton"), and mongodb loves to have lots of little things tossed at it. Going the other way, you'd end up with fewer, bigger objects that take up just as much space.
(Using the WiredTiger storage engine with compression will make your disk go further, too.)
Edit: I'd also add that you really, really want indexes at the end of the day, so that's another vote for the vertical approach.
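To illustrate the vertical shape (every name below is hypothetical, not from the question): each ArrayThree item becomes its own small document with a back-reference to its parent, and the reporting fields get an index:
// One small document per ArrayThree item; all names are illustrative.
db.reportItems.insert({
    parentId: ObjectId(),     // in practice, the _id of the original complex record
    source: "ArrayThree",     // which array the item came from
    fieldName: "someField",   // the targeting field mentioned in the question
    arrayOneRef: null,        // parent references kept flat on the document
    arrayTwoRef: null,
    value: ""
});
// Index the fields the reporting queries filter on.
db.reportItems.ensureIndex({fieldName: 1, parentId: 1});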

Remove given number of records in Mongodb

I have too many records in my collection. Can I keep only a desired number of records and remove the others, without any condition?
I have a collection called Products with around 100,000 records, and it's slowing down my local application. I am thinking of shrinking this huge number of records to around 1000. How can I do it?
OR
How can I copy a collection with a limited number of records?
If you want to copy a collection with a limited number of records and no filter condition, a forEach loop can be used. The following copies 1000 documents from originalCollection to copyCollection.
db.originalCollection.find().limit(1000).forEach(function(doc) {
    db.copyCollection.insert(doc);
});
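If the goal is instead to shrink the Products collection in place (keep 1000, delete the rest), a minimal sketch, assuming _id order is an acceptable way to choose the survivors:
// _id of the 1000th document in _id order.
var cutoff = db.Products.find({}, {_id: 1}).sort({_id: 1}).skip(999).limit(1).next()._id;
// Remove everything after it.
db.Products.remove({_id: {$gt: cutoff}});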

Maintaining order of mongodb collection

I have a collection that will have many documents (maybe millions). When a user inserts a new document, I would like to maintain a field that records the "order" of the data, which I can index. For example, if one field is time, in the format "1352392957.46516", and I have three documents, the first with time 1352392957.46516, the second with time 1352392957.48516 (20 ms later), and the third with time 1352392957.49516 (10 ms later), I would like another field where the first document has 0, the second 1, the third 2, and so on.
The reason I want this is so that I can index that field, then when I do a find I can do an efficient $mod operation to downsample the data. So for example, if I have a million docs, and I only want 1000 of them evenly spaced, I could do a $mod [1000, 0] on the integer field.
The reason I could not do that on the Time field is because they may not be perfectly spaced, or might be all even or odd so the mod would not work. So the separate integer field would keep the order in a linearly increasing fashion.
Also, you should be able to insert documents anywhere in the collection, so all subsequent fields would need to be updated.
Is there a way to do this automatically? Or would I have to implement this? Or is there a more efficient way of doing what I am describing?
It is well beyond "slower inserts" if you are updating several million documents for a single insert - this approach makes your entire collection the active working set. Similarly, in order to do the $mod comparison with a key value, you will have to compare every key value in the index.
Given your requirement for a sorted sampling order, I'm not sure there is a more efficient preaggregation approach you can take.
I would use skip() and limit() to fetch the sample documents. The skip() command will scan from the beginning of the index to skip over unwanted documents each time, but if you have enough RAM to keep the index in memory the performance should be acceptable:
// Add an index on the time field
db.data.ensureIndex({'time': 1})
// Count the number of documents
var dc = db.data.count()
// Iterate and sample every 1000th doc
var i = 0; var sampleSize = 1000; var results = [];
while (i < dc) {
    results.push(db.data.find().sort({time: 1}).skip(i).limit(1)[0]);
    i += sampleSize;
}
// Resulting array of sampled docs
printjson(results);
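(A side note: indexing straight into the cursor with [0], as above, works in the legacy mongo shell; in mongosh you would use .toArray()[0] or .next() instead.)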

Is there a way to limit the number of records in certain collection

Assuming I insert the following records (e.g. foo1, foo2, foo3, foo4, .... foo10)
I would like the collection to retain only 5 records at any point in time (e.g. it could be foo1, ... foo5 OR foo2, ... foo6 or foo6, ... foo10)
How should I achieve this?
Sounds like you're looking for a capped collection:
Capped collections are fixed sized collections...
[...]
Once the space is fully utilized, newly added objects will replace the oldest objects in the collection.
You can achieve this using a command similar to
db.createCollection("collectionName",{capped:true,size:10000,max:5})
where 10000 is the size limit in bytes and 5 is the maximum number of documents the collection will hold.
Using the command
db.createCollection("collectionName",{capped:true,size:10000,max:1000})
you can cap the number of records in the collection at 1000.
Capped collections sound great, but they limit remove operations.
In a capped collection you can only insert new elements; old elements are removed automatically when a certain size is reached.
However, if you want to insert and delete any document in any order and still limit the maximum number of documents in a collection, MongoDB does not offer a direct solution.
When I encounter this problem, I use an extra count variable in another collection.
That collection has a validation rule that prevents the count from going negative; the count should always be a non-negative integer:
"count": {
"$gte": 0
}
The algorithm is simple:
Decrement the count by one.
If that succeeds, insert the document.
If it fails, there is no space left.
Deletion works the other way around.
You can also use transactions to guard against partial failures (e.g. the count is decremented but the service dies just before the insert).
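A minimal sketch of that scheme (collection and field names are illustrative; here a query guard stands in for the validation rule):
// Seed the counter with the free capacity (5 slots in this example).
db.counters.insert({_id: "myCollection", count: 5});

// Insert: take a slot first; the guard keeps the count from going negative.
var res = db.counters.update(
    {_id: "myCollection", count: {$gt: 0}},
    {$inc: {count: -1}}
);
if (res.nModified === 1) {
    db.myCollection.insert({name: "foo6"});
} else {
    print("no space left");
}

// Delete: remove the document, then hand the slot back.
db.myCollection.remove({name: "foo1"});
db.counters.update({_id: "myCollection"}, {$inc: {count: 1}});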

How to find the row number of a row in a sorted MongoDB collection to calculate its percentile?

I have a large MongoDB collection that contains a userID and a counter representing total hits for that user over time. I'd like to be able to calculate a given user's percentile.
Conceptually, what I'd like to do is sort the collection and then get the row number for that given user's record and divide that number by the total count for the collection:
percentile = row_index / total_rows;
How would this be accomplished in MongoDB?
Get the total count with db.yourCollection.count().
Then count the records with a larger or equal value (the query needs a field name; hits is assumed from the question):
db.yourCollection.find({hits: {$gte: value}}).count()
If the total count is 1000 and the larger-or-equal count is 950, you are in the top 95% (950/1000).
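Putting that together in the shell (collection, field, and user names are assumed, not from the original):
var total = db.yourCollection.count();
// Hypothetical lookup of the user's own hit count.
var userHits = db.yourCollection.findOne({userID: "someUser"}).hits;
// Users whose hit count is greater than or equal to this user's.
var atOrAbove = db.yourCollection.find({hits: {$gte: userHits}}).count();
print("top " + (100 * atOrAbove / total) + "%"); // e.g. 950/1000 -> top 95%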
But if you use your collection often in read mode and rarely in write mode, I'd suggest building a new temp collection with MapReduce, holding records like {_id:..., percent:...}.
The trivial solution here is to sort by total hits descending. You then cursor through the results until you find your UserID.
Clearly, this solution does not provide great performance if you have to run it a lot. It's easy to get a "top 20", but it's far more computation to get a "bottom 25%".
If this query is really important, or you're running it a lot, there are a couple of workarounds.
I think the easiest one is simply to run a job that builds the percentiles for you on a regular basis. Basically you build a collection that looks like this:
{ percent : 95, score : 888888 }
{ percent : 90, score : 777777 }
...
To get a user's percentile, you just look up their score in that relatively small collection. To update those scores, simply run a job on a regular basis that loops through all of the users.
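A sketch of such a job (all names illustrative): walk the users in score order and record the score at each percentile boundary:
var total = db.users.count();
db.percentiles.remove({}); // rebuild from scratch on each run
for (var p = 5; p <= 95; p += 5) {
    // Score of the user who out-scores p% of all users.
    var doc = db.users.find({}, {score: 1})
                      .sort({score: 1})
                      .skip(Math.floor(total * p / 100))
                      .limit(1)
                      .next();
    db.percentiles.insert({percent: p, score: doc.score});
}
// Lookup: highest boundary at or below a user's score, e.g.
// db.percentiles.find({score: {$lte: userScore}}).sort({percent: -1}).limit(1)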