Fairly simple time-based queue in MongoDB

I am trying to implement a fairly simple queue using MongoDB. I have a collection which a number of dumb workers need to process. Each worker should search the collection for unprocessed work and then execute it.
The way I decide which work is unprocessed is based on a simple calculation.
Basically, I have a collection of jobs which need to be performed at specific intervals, where the interval is stored in each document as interval. A worker scans the collection for documents which have not been updated for at least that interval.
An example of a document (_id field omitted) is:
{
    updated: 360,
    interval: 60,
    work: "an object representing the work"
}
What I want is an atomic/blocking query (there are multiple workers) which returns a batch of documents where updated + interval < currentTime, where currentTime is the time on the database server, and which also sets the updated field to currentTime.
In other words:
find: updated + interval < currentTime
return a batch of these, say 30
set: updated = currentTime
Any help is greatly appreciated!

Since MongoDB does not support transactions, you can't safely put a pessimistic lock on a batch of items, unless you have a separate document for that -- more on that at the end.
Let's start with the query: you can't query for something like 'where x + y < z' in MongoDB. Instead, you'll have to store a field for the next due date, e.g. nextDue:
{
    "nextDue": 420,
    "work": { ... }
}
Now each worker can fetch a couple of items (NOTE: this is all pseudo-code, not a specific programming language):
var result = db.queue.find({ "nextDue": { $gt: 0, $lt: startTime } }).limit(50);
// hint: you can do a random skip here to decrease the chances of collisions
// between workers.
foreach(rover in result)
{
    // pessimistic locking: '-1' indicates this item is in progress; the
    // $gt: 0 clause above keeps locked items out of the candidate set.
    // I'd recommend a dedicated flag instead, however...
    var currentItem = db.queue.findAndModify(
        { "_id": rover._id, "nextDue": { $gt: 0, $lt: startTime } },
        { $set: { "nextDue": -1 } });
    if (currentItem == null)
        continue; // hit a lock: another worker is processing this already
    // ... process job ...
    db.queue.findAndModify(
        { "_id": rover._id, "nextDue": -1 },
        { $set: { "nextDue": yourNextDue } });
}
There are essentially two methods I see for the pessimistic locking of multiple documents. One is to create a bucket for the documents you're trying to lock, put the job descriptors in the bucket, and process those buckets. Since the bucket is a single document, you can rely on the atomic modifiers.
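A minimal sketch of the bucket idea (the buckets collection and its field names are hypothetical): because a bucket is a single document, one findAndModify claims its entire batch of jobs atomically.
// claim one due, unlocked bucket; everything inside it now belongs
// to this worker exclusively
var bucket = db.buckets.findAndModify({
    query:  { locked: false, nextDue: { $lt: startTime } },
    update: { $set: { locked: true, owner: workerId } }
});
if (bucket != null) {
    // bucket.jobs is this worker's batch of job descriptors
}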
The other one is to use two-phase commit, which also creates another object for the transaction, but does not require you to move your documents into a different document. However, this is a somewhat elaborate pattern.
The pseudocode I presented above worked very well in two applications, but in both applications the individual jobs took quite some time to execute (half a second to several hours).

Related

MongoDB - how to get fields fill-rates as quickly as possible?

We have a very big MongoDB collection of documents with some pre-defined fields that can either have a value or not.
We need to gather the fill-rates of those fields. We wrote a script that goes over all documents and counts the fill-rate for each field, but the problem is that it takes a long time to process all documents.
Is there a way to use db.collection.aggregate or db.collection.mapReduce to run such a script server-side?
Should it have significant performance improvements?
Will it slow down other usages of that collection (e.g. holding a major lock)?
Answering my own question: I was able to migrate my script from a cursor that scans the whole collection to a map-reduce query, and running on a sample of the collection, the map-reduce version seems to be at least twice as fast.
Here's how the old script worked (in node.js):
var cursor = collection.find(query, projection).sort({_id: 1}).limit(limit);
var next = function() {
    cursor.nextObject(function(err, doc) {
        if (err || !doc) return; // stop when the cursor is exhausted (or errors)
        processDoc(doc, next);   // process one document, then pull the next
    });
};
next();
and this is the new script:
collection.mapReduce(
    function () {            // map: runs server-side on each matched document
        var processDoc = function(doc) {
            ...
        };
        processDoc(this);
    },
    function (key, values) { // reduce: sum the per-field counts
        return Array.sum(values);
    },
    {
        query : query,       // only scan the matching documents
        out: {inline: 1}     // return results inline rather than to a collection
    },
    function (error, results) {
        // print results
    }
);
processDoc stayed basically the same, but instead of incrementing a counter on a global stats object, I do:
emit(field_name, 1);
running old and new on a sample of 100k, old took 20 seconds, new took 8.
some notes:
map-reduce's limit option doesn't work on sharded collections; I had to query for _id : { $gte, $lte } ranges to create the sample size needed.
map-reduce's performance-boost option jsMode : true doesn't work on sharded collections either (it might have improved performance even more); it may be possible to run the job manually on each shard to gain that feature.
As I understand it, what you want to achieve is to compute something over your documents and end up with a new "document" that can be queried; you don't need to store the computed values.
If you don't need to write those "new values" back into the documents, you can use the Aggregation Framework.
Aggregations operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
https://docs.mongodb.com/manual/aggregation/
Since the Aggregation Framework has a lot of features, I can't give you more specific information about how to resolve your issue.
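For the fill-rate case in this question, a minimal sketch of such a pipeline (fieldA and fieldB are hypothetical field names; the aggregation $type operator, which returns "missing" for absent fields, requires MongoDB 3.4+):
db.collection.aggregate([
    { $match: query },  // same filter the original script used
    { $group: {
        _id: null,
        total:  { $sum: 1 },
        // count documents where the field is present at all
        fieldA: { $sum: { $cond: [{ $ne: [{ $type: "$fieldA" }, "missing"] }, 1, 0] } },
        fieldB: { $sum: { $cond: [{ $ne: [{ $type: "$fieldB" }, "missing"] }, 1, 0] } }
    } },
    // divide each count by the total to get a fill-rate per field
    { $project: {
        fieldA: { $divide: ["$fieldA", "$total"] },
        fieldB: { $divide: ["$fieldB", "$total"] }
    } }
])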

Huge query times when sort applied

I have a collection in MongoDB of about 1.1 million records. The average object size is 7.4 KB, so the database is around 8 GB. I have an application which parses through the collection, but this must be done sequentially, ordered by the endedAt date in each record. It is also important that these are not live games (isLive: false), because otherwise the endedAt date won't exist. Once a record has been parsed, to ensure it isn't pulled in again, I set isComplete: true on the record.
Now because the data must be returned to me the earliest first according to the endedAt date, I run the sort() function on the set. This seems to be a huge bottleneck for me right now.
My query for getting the next X rows to parse (remember, these need to be synchronous) is as follows:
db.matches.find({ isComplete: { $exists: false }, isLive: false }).limit(n)
When n is simply 5, the speed of the query is:
0.22s
However, when I add the necessary sort to the same query, because I absolutely must only return the next n rows by the earliest endedAt date (if they haven't already been parsed), the query time increases substantially to:
46.5s
The strange thing is, I've managed to parse a few hundred thousand games without problem, and the queries have gotten slower and slower until now where they effectively time-out. To most people this would immediately sound like an index problem, however I have indexes on the following fields:
idx_startedAt (1)
idx_endedAt (1)
idx_isComplete (1)
idx_isLive (1)
I'm not sure what else I should be indexing to increase the speed of this query, and I'm becoming pretty lost as to how best to approach this problem. Any help, as always, is much appreciated.
You need to index all of the filter criteria using a compound index, including the sort.
Filtering only a single field will still require scanning a large number of documents from disk and then sorting the results in memory. Indexing all of the fields, including the sort, will minimize the number of documents read from disk and prevent the need to sort the results in memory.
The ideal index for this query would be the following:
db.matches.createIndex({ "isLive" : 1, "isComplete" : 1, "endedAt" : 1 }, { "background" : true } )
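With that index in place, the paged query can satisfy both the filter and the sort straight from the index (a sketch using the n from the question; the field order mirrors the index, equality conditions before the sort key):
// both filters and the sort are served by the compound index,
// so no in-memory sort of ~1.1M documents is needed
db.matches.find({ isLive: false, isComplete: { $exists: false } })
          .sort({ endedAt: 1 })  // earliest endedAt first
          .limit(n)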

MongoDB index within a document or only across a document collection

I am collecting data from a streaming API and I want to create a real-time analytics dashboard. Every time a new record appears at the end of the stream I update a counter in the below document.
From a design perspective. Am I correct to use only one document, like in the below example?
{
    "_id" : ObjectId("5238beb4d4bed9e444c99978"),
    "counts" : {
        "hours" : {
            "1" : 835,
            "2" : 1007,
            "3" : 174,
            ...
        }
    }
}
The benefit of this approach is that only one document needs to be sent to the real-time analytics dashboard. Also, after a year this document would have only 365 * 24 = 8760 fields, one for each hour of that year?
What about indexing? Can I create an index on counts.hours if I only have one document? Or do indexes only work across the documents of a collection in MongoDB? Do indexes help with finding documents faster, or also with finding fields inside documents?
If I could create an index on counts.hours, then the counter increment process could find the correct hour to increment (per new document at the end of the stream) much more efficiently.
You can create indexes in fields embedded in a document. In the case above:
yourCollection.ensureIndex({ 'counts.hours':1 });
The index will help you optimize queries to return documents based on 'counts.hours' field.
yourCollection.find({ 'counts.hours':1 });
Your data structure design should depend on the kind of queries and updates you are planning to do. In the case you described, I imagine you will be adding members to the 'hours' object; updates like that can be expensive, since MongoDB pads each collection record, optimizing for the case where the record size is stable across updates.
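For the counter increment itself, a minimal sketch (the stats collection name and the hour key "15" are hypothetical; the _id comes from the example above):
// $inc on a dotted path updates a single hour counter in place;
// if the hour key doesn't exist yet, $inc creates it with value 1
db.stats.update(
    { "_id": ObjectId("5238beb4d4bed9e444c99978") },
    { $inc: { "counts.hours.15": 1 } }
)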

Slow pagination over tons of records in mongodb

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How to make it better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need the 40000th page? Also see this article.
I found it performant to combine the two concepts (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is that you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 of 16 records (in javascript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id; // note: _id, not id
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still has a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined two answers.
The problem is that when you use skip and limit without a sort, you are just paginating in the natural order of the collection, i.e. the sequence in which the data was written, so the engine first has to establish a temporary ordering. It is better to use the ready-made _id index: sort by _id, and it stays very quick even with large tables, like:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it would be:
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
    'sort'  => ['_id' => 1],
    'limit' => $limit,
    'skip'  => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options);
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach: combine skip/limit (as an edge case, really) with sorted, range-based buckets, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and sub-pages within that range if you need skip/limit, but I suspect the buckets can be made small enough to not need skip/limit at all. By using the sort index, this avoids the cursor traversing the entire inventory to reach the final page.
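A minimal sketch of that idea, assuming a created_date sort key with an index on it and day-sized buckets (all hypothetical):
// top-level page = one day of data; the index serves the range directly
db.myCollection.find({
    created_date: { $gte: ISODate("2021-06-01"), $lt: ISODate("2021-06-02") }
}).sort({ created_date: 1 })
  .skip(subPage * pageSize)  // small in-bucket skip, if needed at all
  .limit(pageSize)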
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to the auto-increment value in SQL, instead of a time-based value.
The problem is with skip; there is no other way around it: if you use skip, you are bound to hit the issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
On the other hand, by assigning a counting number to each document, the write performance takes a hit, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward to an arbitrary page number, not just one page at a time.)
Plus, it is also hard if you are dealing with deletes, but that is still possible because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
Just writing this down as a note to my future self: it is probably too much hassle to fix this issue in the current application I am dealing with, but next time I'll build a better one if I encounter a similar situation.
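A minimal sketch of the counting-integer idea described above (the seq field and the collection name are hypothetical):
// an indexed, dense sequence number assigned at insert time
db.myCollection.createIndex({ seq: 1 })
// page p (0-based) of size n becomes a pure range query, no skip involved
db.myCollection.find({ seq: { $gte: p * n, $lt: (p + 1) * n } }).sort({ seq: 1 })
// after deleting the document with seq = d, close the gap in one batch update
db.myCollection.updateMany({ seq: { $gt: d } }, { $inc: { seq: -1 } })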
If you have MongoDB's default _id, which is an ObjectId, use it instead. This is probably the most viable option for most projects anyway.
As stated from the official mongo docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
    let endValue = null;
    db.students.find({ _id: { $lt: startValue } })
        .sort({ _id: -1 })
        .limit(nPerPage)
        .forEach(student => {
            print(student.name);
            endValue = student._id;
        });
    return endValue;
}
The ascending order example is symmetric; a minimal sketch (mirroring the function above, with $gt and an ascending sort):
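function printStudentsAscending(startValue, nPerPage) {
    let endValue = null;
    db.students.find({ _id: { $gt: startValue } })
        .sort({ _id: 1 })
        .limit(nPerPage)
        .forEach(student => {
            print(student.name);
            endValue = student._id;
        });
    return endValue;
}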
If you know the _id of the element from which you want to continue:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a neat solution which works like a charm.
For faster pagination don't use the skip() function. Use limit() and find() where you query over the last _id of the previous page.
Here is an example where I'm paging over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // the last id of the previous page

for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
        Aggregation.sort(Sort.Direction.ASC, "_id"),
        new CustomAggregationOperation(queryOffersByProduct),
        Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS =
        mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}

MongoDB: Calling Count() vs tracking counts in a collection

I am moving our messaging system to MongoDB and am curious what approach to take with respect to various stats, like the number of messages per user etc. In our MS SQL database I have a table with different counts per user which get updated by triggers on the corresponding tables, so I can, for example, know how many unread messages UserA has without an expensive SELECT Count(*) operation.
Is the count function in MongoDB also expensive?
I started reading about map/reduce, but my site is under high load, so statistics have to update in real time, and my understanding is that map/reduce is a time-consuming operation.
What would be the best (performance-wise) approach on gathering various aggregate counts in MongoDB?
If you've got a lot of data, then I'd stick with the same approach and increment an aggregate counter whenever a new message is added for a user, using a collection something like this:
counts
{
userid: 123,
messages: 10
}
Unfortunately (or fortunately?) there are no triggers in MongoDB, so you'd increment the counter from your application logic:
db.counts.update( { userid: 123 }, { $inc: { messages: 1 } } )
This'll give you the best performance, and you'd probably also put an index on the userid field for fast lookups:
db.counts.ensureIndex( { userid: 1 } )
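Reading the count back is then a cheap indexed point query, and when UserA reads a message the same $inc works in the other direction (a sketch reusing the counts collection above):
// decrement on read, so the counter keeps tracking unread messages
db.counts.update( { userid: 123 }, { $inc: { messages: -1 } } )
// fetching the current count is a single indexed lookup
db.counts.findOne( { userid: 123 } )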
MongoDB is a good fit for data denormalization. And if your site is under high load then you need to precalculate almost everything, so use $inc for incrementing message counts, no doubt.