I have an application that stores users and their behavior in the form of events (in two collections). An important feature of the application is to query users based on behavior and properties, for example: "All users with property X who triggered event Y at least 10 times within the last Z days/weeks/months".
Until now I've been doing this on the raw data using aggregate() on the contact collection with a $lookup on events. However, as the event collection grows, this becomes slow.
My idea is to store some pre-aggregated version of the event data in a separate collection, with one document for every unique event/user combination. These documents would store:
- the total number of times this event was triggered by this user
- the last time they triggered it
- how often they triggered it within pre-defined time intervals over the last days/weeks/months
The documents in that collection would look like this:
{
    event: 'my_event',
    user: '593aaa84c685604066a6a0cf',
    total: 79,
    last: '2016-11-01T04:39:52.667Z',
    days: { 0: 4, 1: 8, 2: 4, ... },  // 7 values here (0 = today)
    weeks: { ... },                   // 3 values here
    months: { ... }                   // 12 values here
}
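For keeping total and last up to date I can picture a simple upsert on every incoming event, something like the sketch below (event_stats is just a placeholder name I made up); it's the interval buckets that I can't work out:
db.event_stats.update(
    { event: 'my_event', user: '593aaa84c685604066a6a0cf' },
    {
        $inc: { total: 1, 'days.0': 1 },  // bump the total and today's bucket
        $set: { last: new Date() }        // remember the last occurrence
    },
    { upsert: true }
)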
However, I'm struggling to figure out a good strategy for computing the values for the time intervals and for keeping this data up to date.
Do you have any suggestions or alternative approaches? Every idea is welcome :)
I'm working with a Cloudant NoSQL database. The data is received every 15 seconds, so every 5 minutes there will be 20 documents in the database.
I need to get documents whose timestamps differ by five minutes, for example a document with timestamp "2017-03-14T18:21:58" and a document with timestamp "2017-03-14T18:26:58", and so on...
Sample document:
Make a view keyed on the timestamp. I interpret your question as "get all 20 documents sharing a particular timestamp". If that's the case you can get away with not parsing the timestamp:
function (doc) {
  if (doc && doc.timestamp) {
    emit(doc.timestamp, 1);
  }
}
You can now query the view:
% curl 'https://USER:PASS@ACCOUNT.cloudant.com/DATABASE/_design/DDOC/_view/VIEWNAME?key="TIMESTAMP"&include_docs=true&limit=20'
Substitute the uppercase parts with the relevant names and parameters for your system.
If you need finer granularity for your query (time series style), there is an old-but-useful blog from one of the Cloudant founders on the topic here: https://cloudant.com/blog/mapreduce-from-the-basics-to-the-actually-useful/
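If you go that route, a sketch of a time-series style map function (assuming your timestamps parse with new Date(); pair it with the built-in _count reduce):
function (doc) {
  if (doc && doc.timestamp) {
    // Complex key [year, month, day, hour, minute]: callers can then pick
    // the granularity at query time with the group_level parameter.
    var d = new Date(doc.timestamp);
    emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(),
          d.getUTCHours(), d.getUTCMinutes()], 1);
  }
}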
In my Meteor project, I have a leaderboard of sorts, where it shows players of every level on a chart, spread across every level in the game. For simplicity's sake, let's say there are levels 1-100. Currently, to avoid overloading Meteor, I just tell the server to send me every record newer than two weeks old, but that's not sufficient for an accurate leaderboard.
What I'm trying to do is show 50 records representing each level. So, if there are 100 records at level 1, 85 at level 2, 65 at level 3, and 45 at level 4, I want to show the latest 50 records from each level, making it so I would have [50, 50, 50, 45] records, respectively.
The data looks something like this:
{
    snapshotDate: new Date(),
    level: 1,
    hp: 5000,
    str: 100
}
I think this requires some mongodb aggregation, but I couldn't quite figure out how to do this in one query. It would be trivial to do it in two, though - select all records, group by level, sort each level by date, then take the last 50 records from each level. However, I would prefer to do it in one operation, if I could. Is it currently possible to do something like this?
Currently there is no way to pick the top n records of a group in the aggregation pipeline. There is an unresolved open ticket regarding this: https://jira.mongodb.org/browse/SERVER-9377.
There are two solutions to this:
1. Keep your document structure as it is now and aggregate, but grab the top n records and slice off the remaining records for each group on the client side.
Code:
var top_records = [];
db.collection.aggregate([
    // The sort operation needs to come before the $group,
    // because once the records are grouped by level,
    // there exists only one document per group.
    { $sort: { "snapshotDate": -1 } },
    // Maintain all the records in an array in sorted order.
    { $group: { "_id": "$level", "recs": { $push: "$$ROOT" } } },
], { allowDiskUse: true }).forEach(function (level) {
    level.recs.splice(50); // Keep only the top 50 records.
    top_records.push(level);
});
Remember that this loads all the documents for each level and removes the unwanted records on the client side.
2. Alter your document structure to accomplish what you really need. If you only ever need the top n records, keep them in sorted order in the root document. This is accomplished using a sorted, capped array.
Your document would look like this:
{
    level: 1,
    records: [ { snapshotDate: 2, hp: 5000, str: 100 },
               { snapshotDate: 1, hp: 5001, str: 101 } ]
}
where records is a capped array of size n whose subdocuments are always kept sorted in descending order of snapshotDate.
To make the records array work that way, we always perform an update operation when we need to insert documents into it for any level.
db.collection.update({ "level": 1 },
    { $push: {
        records: {
            $each: [ { snapshotDate: 1, hp: 5000, str: 100 },
                     { snapshotDate: 2, hp: 5001, str: 101 } ],
            $sort: { "snapshotDate": -1 },
            $slice: 50 // Always trim the array size to 50.
        }
    } }, { upsert: true })
This always keeps the size of the records array at 50, and re-sorts the array whenever new subdocuments are inserted for a level.
A simple find, db.collection.find({"level":{$in:[1,2,..]}}), would give you the top 50 records in order, for each selected level.
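And if you want to display fewer than the stored 50, a $slice projection can trim the array on the server instead of in your app; a quick sketch:
// Return only the newest 10 of the stored records per level.
db.collection.find(
    { "level": { $in: [1, 2] } },
    { "records": { $slice: 10 } }
)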
My document structure:
{
    _id: ObjectId,
    month: '2014-01',
    daily: {
        '01': {},
        '02': {},
        '03': {},
        ...
        '31': {}
    }
}
Now I want to query objects in daily that fall in a range, say 08 to 13, meaning only objects with keys greater than 08 and less than 13. The keys (01, 02, ..., 31) in the daily object are generated dynamically. I don't want to retrieve the whole daily object and then process it in the backend. Please help.
You can't query for slices out of an embedded document like that. Since daily is embedded in the month document, you can't treat its individual entries as individual objects.
If your query looks for individual days, you should consider modeling your data appropriately, by creating a single document for each day. e.g.:
{
    _id: { month: '2014-01', day: 1 },
    /* rest of daily data here */
}
This will allow you to query for particular days with or without a specific month.
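As a sketch, a range query over such per-day documents could look like this (assuming day is stored as a number and the collection is called days, both my assumptions):
// Days 8 through 12 of January 2014, without fetching the whole month.
db.days.find({
    "_id.month": "2014-01",
    "_id.day": { $gte: 8, $lt: 13 }
})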
I am trying to implement a fairly simple queue using MongoDB. I have a collection which a number of dumb workers need to process. Each worker should search the collection for unprocessed work and then execute it.
The way I decide which work is unprocessed is based on a simple calculation.
Basically I have a collection of jobs which need to be performed at specific intervals, where the interval is stored in each document as interval. A worker will scan the collection for documents which have not been updated for at least the interval time.
An example of a document (_id field omitted) is:
{
    updated: 360,
    interval: 60,
    work: "an object representing the work"
}
What I want is an atomic/blocking query (there are multiple workers) which returns a batch of documents where updated + interval < currentTime, where currentTime is the time on the database server, as well as sets the updated field to currentTime.
In other words:
find: updated + interval < currentTime
return a batch of these, say 30
set: updated = currentTime
Any help is greatly appreciated!
Since MongoDB does not support transactions, you can't safely put a pessimistic lock on a batch of items, unless you have a separate document for that -- more on that at the end.
Let's start with the query: you can't query for something like 'where x + y < z' in MongoDB. Instead, you'll have to use a field for the next due date, e.g. nextDue:
{
    "nextDue": 420,
    "work": { ... }
}
Now each worker can fetch a couple of items (NOTE: this is all pseudo-code, not a specific programming language):
var result = db.queue.find({ "nextDue": { $gt: 0, $lte: startTime } }).limit(50);
// hint: you can do a random skip here to decrease the chances of collisions
// between workers.
foreach (rover in result)
{
    // pessimistic locking: '-1' indicates this is in progress.
    // I'd recommend a flag instead, however...
    // The $gt: 0 / $lte: startTime condition both selects due items and
    // excludes anything already locked with -1.
    var currentItem = db.queue.findAndModify(
        { query: { "_id": rover._id, "nextDue": { $gt: 0, $lte: startTime } },
          update: { $set: { "nextDue": -1 } } });
    if (currentItem == null)
        continue; // hit a lock: another worker is processing this already
    // ... process job ...
    db.queue.findAndModify(
        { query: { "_id": rover._id, "nextDue": -1 },
          update: { $set: { "nextDue": yourNextDue } } });
}
There are essentially two methods I see for pessimistic locking of multiple documents. One is to create a bucket for the documents you're trying to lock, put the job descriptors in the bucket, and process those buckets. Since the bucket is now a single object, you can rely on the atomic modifiers.
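A minimal sketch of that bucket idea (collection and field names are mine, not an established API):
// Each 'buckets' document holds a batch of job descriptors. Claiming the
// whole bucket is atomic because it is a single document.
var bucket = db.buckets.findAndModify({
    query:  { locked: false },
    update: { $set: { locked: true, lockedAt: new Date() } }
});
if (bucket != null) {
    bucket.jobs.forEach(function (job) { /* ... process job ... */ });
    db.buckets.update({ _id: bucket._id }, { $set: { locked: false } });
}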
The other one is to use two-phase commit, which also creates another object for the transaction, but does not require you to move your documents into a different document. However, this is a somewhat elaborate pattern.
The pseudocode I presented above worked very well in two applications, but in both applications the individual jobs took quite some time to execute (half a second to several hours).
I am moving our messaging system to MongoDB and am curious what approach to take with respect to various stats, like the number of messages per user etc. In the MS SQL database I have a table with different counts per user, and they get updated by triggers on the corresponding tables, so I can for example know how many unread messages UserA has without calling an expensive SELECT Count(*) operation.
Is the count function in MongoDB also expensive?
I started reading about map/reduce, but my site is high-load, so statistics have to update in real time, and my understanding is that map/reduce is a time-consuming operation.
What would be the best (performance-wise) approach on gathering various aggregate counts in MongoDB?
If you've got a lot of data, then I'd stick with the same approach and increment an aggregate counter whenever a new message is added for a user, using a collection something like this:
counts
{
    userid: 123,
    messages: 10
}
Unfortunately (or fortunately?) there are no triggers in MongoDB, so you'd increment the counter from your application logic:
db.counts.update( { userid: 123 }, { $inc: { messages: 1 } } )
This'll give you the best performance, and you'd probably also put an index on the userid field for fast lookups:
db.counts.ensureIndex( { userid: 1 } )
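If you track unread counts specifically (the unread field below is my assumption, not part of your schema), the same $inc pattern works in reverse when UserA reads a message:
// Hypothetical 'unread' counter: decrement when a message is marked read.
db.counts.update( { userid: 123 }, { $inc: { unread: -1 } } )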
MongoDB is a good fit for data denormalization. And if your site is high-load, then you need to precalculate almost everything, so use $inc for incrementing message counts, no doubt.