MongoDB aggregate works slowly with a 200 GB collection

I have a sharded MongoDB collection with 10M documents (200 GB).
Document structure:
{
_id
updateDate
cleanDate
events1: [{...}, {...}, ...]
events2: [{...}, {...}, ...]
events3: [{...}, {...}, ...]
}
There are no indexes except _id.
The 1st Java application creates, reads, and updates documents in the collection.
The 2nd Java application has a scheduled task that finds documents with updateDate > cleanDate and removes old objects from the eventX arrays. When the task cleans any object, it updates cleanDate.
I use the query myCollection.aggregate({ $project : { delta : { $cmp : ['$updateDate', '$cleanDate'] } } }, { $match : { delta : { $gt : 0 } } }, { $limit : 10000 }) to get the next portion for cleaning.
The query execution takes a lot of time (sometimes 10 minutes or more), especially after the first elements in the collection have been cleaned.
How can I speed up my 2nd app?

The point is that your whole collection is projected first, and the $match runs only after that. If you know the last updateDate or cleanDate (maybe you store the last scheduled task run time somewhere), you can limit the initial aggregation set by that field before projecting, as in the sketch below.
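A minimal sketch of that idea, assuming you keep the last run time somewhere (lastRunTime below is illustrative) and add an index on updateDate so the leading $match can use it:
db.myCollection.ensureIndex({ updateDate : 1 });
var lastRunTime = ISODate("2015-01-01T00:00:00Z");   // e.g. the previous scheduled run
db.myCollection.aggregate(
    // match first, so the indexed updateDate narrows the input before $project
    { $match : { updateDate : { $gt : lastRunTime } } },
    { $project : { delta : { $cmp : ['$updateDate', '$cleanDate'] } } },
    { $match : { delta : { $gt : 0 } } },
    { $limit : 10000 }
);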

Related

Mongodb aggregation performance with indexes

I have a collection with a structure similar to this.
{
"_id" : ObjectId("59d7cd63dc2c91e740afcdb"),
"dateJoined": ISODate("2014-12-28T16:37:17.984Z"),
"dateActivatedMonth": 15,
"enrollments" : [
{ "month":-10, "enrolled":'00'},
{ "month":-9, "enrolled":'00'},
{ "month":-8, "enrolled":'01'},
//other months
{ "month":8, "enrolled":'11'},
{ "month":9, "enrolled":'11'},
{ "month":10, "enrolled":'00'}
]
}
dateActivatedMonth is the number of months from dateJoined.
month in the enrollments sub-document is a relative month from dateJoined.
I am using the MongoDB aggregation framework to process queries like "all enrollments with enrolled as '01' and 'enrolled month is 10 months before activation and 25 months after activation'".
In my aggregation, first I apply all possible filters in the $match stage and then apply the condition on "month" in the $project stage.
db.getCollection("enrollments").aggregate(
    { $match : { /* initial conditions */ } },
    { $project : {
        // Here I will apply the month filter by considering the number of months before and after.
    }}
    // other pipeline operations
)
All the filters that I am applying in my condition are optional, so in some cases $match will not filter anything. The only condition that is guaranteed to be there is the one on "month" in the "enrollments" sub-document.
This query is slow; it takes about 6-7 seconds (the data is also huge). I am looking for ways to improve this. The first thing I am looking at is creating indexes.
Now my questions are:
Can $project use indexes? I tried creating an index on "month", but I don't see anything about index usage in the queryPlanner output of explain().
I would like to move the "month" condition to $match so that it uses the index. How can I use a field value in $match? Something like this:
db.getCollection('collection').aggregate([
{
$match:{
dateActivatedMonth: {$exists: true}
//,'enrollment.month': dateActivatedMonth -- not working
//,'enrollment.month': '$dateActivatedMonth' -- not working
}
}])
Thank you for your patience.
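For what it's worth, on MongoDB 3.6+ the $expr operator lets $match compare fields of the same document, which is what the commented-out lines above attempt. A rough sketch of the month condition (a window of -10/+25 months around activation, as described in the question); note that field-to-field comparisons inside $expr generally cannot use an index, so this addresses the syntax rather than the indexing part of the question:
db.getCollection('enrollments').aggregate([
    { $match : {
        dateActivatedMonth : { $exists : true },
        // keep documents with at least one enrollment whose relative month falls
        // in [dateActivatedMonth - 10, dateActivatedMonth + 25]
        $expr : {
            $gt : [
                { $size : {
                    $filter : {
                        input : "$enrollments",
                        as : "e",
                        cond : { $and : [
                            { $gte : ["$$e.month", { $subtract : ["$dateActivatedMonth", 10] }] },
                            { $lte : ["$$e.month", { $add : ["$dateActivatedMonth", 25] }] }
                        ] }
                    }
                } },
                0
            ]
        }
    }}
])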

How to store an ordered set of documents in MongoDB without using a capped collection

What's a good way to store a set of documents in MongoDB where order is important? I need to easily insert documents at an arbitrary position and possibly reorder them later.
I could assign each item an increasing number and sort by that, or I could sort by _id, but I don't know how I could then insert another document in between other documents. Say I want to insert something between an element with a sequence of 5 and an element with a sequence of 6?
My first guess would be to increment the sequence of all of the following elements so that there would be space for the new element using a query something like db.items.update({"sequence":{$gte:6}}, {$inc:{"sequence":1}}). My limited understanding of Database Administration tells me that a query like that would be slow and generally a bad idea, but I'm happy to be corrected.
I guess I could set the new element's sequence to 5.5, but I think that would get messy rather quickly. (Again, correct me if I'm wrong.)
I could use a capped collection, which has a guaranteed order, but then I'd run into issues if I needed to grow the collection. (Yet again, I might be wrong about that one too.)
I could have each document contain a reference to the next document, but that would require a query for each item in the list. (You'd get an item, push it onto the results array, and get another item based on the next field of the current item.) Aside from the obvious performance issues, I would also not be able to pass a sorted mongo cursor to my {#each} spacebars block expression and let it live update as the database changed. (I'm using the Meteor full-stack javascript framework.)
I know that everything has its advantages and disadvantages, and I might just have to use one of the options listed above, but I'd like to know if there is a better way to do things.
Based on your requirements, one approach could be to design your schema in such a way that each document can hold more than one document and in itself act as a capped container.
{
"_id":Number,
"doc":Array
}
Each document in the collection will act as a capped container, and the documents will be stored as an array in the doc field. The doc field, being an array, will maintain the order of insertion.
You can limit the number of documents per container to n, so the _id field of each container document will increase in steps of n, indicating the number of documents a container can hold.
By doing this you avoid adding extra fields to the documents, extra indexes, and unnecessary sorts.
Inserting the very first record
i.e when the collection is empty.
var record = {"name" : "first"};
db.col.insert({"_id":0,"doc":[record]});
Inserting subsequent records
Identify the last container document's _id, and the number of
documents it holds.
If the number of documents it holds is less than n, then update the
container document with the new document, else create a new container
document.
Say that each container document can hold 5 documents at most, and we want to insert a new document.
var record = {"name" : "newlyAdded"};
// using aggregation, get the _id of the last inserted container, and the
// number of records it currently holds.
db.col.aggregate([
    {
        $group : {
            "_id" : null,
            "max" : { $max : "$_id" },
            "lastDocSize" : { $last : "$doc" }
        }
    },
    {
        $project : {
            "currentMaxId" : "$max",
            "capSize" : { $size : "$lastDocSize" },
            "_id" : 0
        }
    }
    // once obtained, check if you need to update the last container or
    // create a new container and insert the document in it.
]).forEach(function(check) {
    if (check.capSize < 5) {
        print("updating");
        // UPDATE the last container in place
        db.col.update(
            { "_id" : check.currentMaxId },
            { $push : { "doc" : record } }
        );
    } else {
        print("inserting");
        // INSERT a new container
        db.col.insert({
            "_id" : check.currentMaxId + 5,
            "doc" : [ record ]
        });
    }
})
Note that the aggregation runs on the server side and is very efficient. Also note that the aggregation returns a document rather than a cursor in versions prior to 2.6, so there you would need to modify the above code to select from the single result document rather than iterating a cursor.
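A hedged sketch of that pre-2.6 variant (same $group and $project stages as above, reading the single result document from the returned result array instead of iterating a cursor):
var res = db.col.aggregate([ /* the same $group and $project stages as above */ ]);
var check = res.result[0];   // pre-2.6: aggregate() returns { result: [...], ok: 1 }
if (check.capSize < 5) {
    db.col.update({ "_id" : check.currentMaxId }, { $push : { "doc" : record } });
} else {
    db.col.insert({ "_id" : check.currentMaxId + 5, "doc" : [ record ] });
}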
Inserting a new document in between documents
Now, if you would like to insert a new document between documents 1 and 2, we know that the document should fall inside the container with _id=0 and should be placed in the second position in the doc array of that container.
So we make use of the $each and $position operators to insert at specific positions.
var record = {"name" : "insertInMiddle"};
db.col.update(
{
"_id" : 0
}, {
$push : {
"doc" : {
$each : [record],
$position : 1
}
}
}
);
Handling Over Flow
Now we need to take care of documents overflowing in each container. Say we insert a new document in between, in the container with _id=0. If the container already has 5 documents, we need to move its last document to the next container, and do so until all the containers hold documents within their capacity; if required, at the end we need to create a new container to hold the overflowing documents.
This complex operation should be done on the server side. To handle this, we can create a script such as the one below and register it with MongoDB.
db.system.js.save({
    "_id" : "handleOverFlow",
    "value" : function handleOverFlow(id) {
        var currDocArr = db.col.find({ "_id" : id })[0].doc;
        print(currDocArr);
        var count = currDocArr.length;
        var nextColId = id + 5;
        // check if the container size has been exceeded
        if (count <= 5)
            return;
        else {
            // need to take the last doc and push it to the next capped
            // container's array
            print("updating container: " + id);
            var record = currDocArr.splice(currDocArr.length - 1, 1);
            // update the next container
            db.col.update(
                { "_id" : nextColId },
                { $push : { "doc" : { $each : record, $position : 0 } } }
            );
            // write the trimmed doc array back to the original container (full replace)
            db.col.update(
                { "_id" : id },
                { "doc" : currDocArr }
            );
            // check overflow for the subsequent containers, recursively.
            handleOverFlow(nextColId);
        }
    }
});
So after every insertion in between, we can invoke this function by passing the container id: handleOverFlow(containerId).
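For example, the stored function can be invoked from the shell after loading the server-side scripts (the container id 0 is illustrative):
db.loadServerScripts();   // makes functions saved in db.system.js callable in this shell
handleOverFlow(0);        // re-balance starting from container _id = 0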
Fetching all the records in order
Just use the $unwind operator in the aggregate pipeline.
db.col.aggregate([{$unwind:"$doc"},{$project:{"_id":0,"doc":1}}]);
Re-Ordering Documents
You can store each document in a capped container with an "_id" field:
.."doc":[{"_id":0,"name":"xyz",...}..]..
Get hold of the "doc" array of the capped container of which you want
to reorder items.
var docArray = db.col.find({"_id":0})[0].doc;
Update their _ids so that after sorting, the order of the items will change.
Sort the array based on their _ids.
docArray.sort( function(a, b) {
return a._id - b._id;
});
Update the capped container back with the new doc array.
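A minimal sketch of that write-back, using $set on the container from the example (_id 0):
db.col.update({ "_id" : 0 }, { $set : { "doc" : docArray } });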
But then again, everything boils down to which approach is feasible and suits your requirement best.
Coming to your questions:
What's a good way to store a set of documents in MongoDB where order is important?I need to easily insert documents at an arbitrary
position and possibly reorder them later.
Documents as Arrays.
Say I want to insert something between an element with a sequence of 5 and an element with a sequence of 6?
use the $each and $position operators in the db.collection.update() function as depicted in my answer.
My limited understanding of Database Administration tells me that a
query like that would be slow and generally a bad idea, but I'm happy
to be corrected.
Yes. It would impact performance, unless the collection has very little data.
I could use a capped collection, which has a guaranteed order, but then I'd run into issues if I needed to grow the collection. (Yet
again, I might be wrong about that one too.)
Yes. With Capped Collections, you may lose data.
An _id field in MongoDB is a unique, indexed key similar to a primary key in relational databases. If there is an inherent order in your documents, ideally you should be able to associate a unique key to each document, with the key value reflecting the order. So while preparing your document for insertion, explicitly add an _id field as this key (if you do not, Mongo creates it automatically as a BSON ObjectId).
As far as retrieving the results is concerned, MongoDB does not guarantee the order of returned documents unless you explicitly use .sort(). If you do not use .sort(), the results are usually returned in natural order (order of insertion). Again, there is no guarantee of this behavior.
I'd advise you to override _id with your order while inserting, and use a sort while retrieving. Since _id is a mandatory and automatically indexed field, you will not be wasting any space on a separate sort key or on storing an index for it.
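A minimal sketch of that approach (the values are illustrative): the order value is supplied as _id at insert time, and reads sort on it explicitly.
db.items.insert({ _id : 5, name : "fifth item" });   // _id doubles as the order key
db.items.find().sort({ _id : 1 });                   // explicit sort, since natural order is not guaranteed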
For arbitrary sorting of any collection, you'll need a field to sort on. I call mine "sequence".
schema:
{
_id: ObjectID,
sequence: Number,
...
}
db.items.ensureIndex({sequence:1});
db.items.find().sort({sequence:1})
Here is a link to some general sorting database answers that may be relevant:
https://softwareengineering.stackexchange.com/questions/195308/storing-a-re-orderable-list-in-a-database/369754
I suggest going with the floating-point solution: adding a position field.
Use a floating-point number for the position field.
You can then reorder the list by changing only the position value of the "moved" document.
If your user wants to position "red" after "blue" but before "yellow", then you just need to calculate
red.position = ((yellow.position - blue.position) / 2) + blue.position
After a few re-positions in the same place (cutting the gap in half every time) you might hit the limits of floating-point precision, so it is better, once you reach a certain threshold, to re-sort the list and reassign positions.
When retrieving, you can simply sort on the position field to get the list in order, with no client-side code needed (unlike the linked-list solution). A rough sketch follows.
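A rough shell sketch of that idea, assuming a colors collection with name and position fields (both names are illustrative):
var blue = db.colors.findOne({ name : "blue" });
var yellow = db.colors.findOne({ name : "yellow" });
// place "red" halfway between its new neighbours
db.colors.update(
    { name : "red" },
    { $set : { position : blue.position + (yellow.position - blue.position) / 2 } }
);
// read the list back in order
db.colors.find().sort({ position : 1 });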

MongoDB: phantom records in $unwind results under heavy load

I have a simple collection of elements like this
{_id: n, xs: [...]}
I'm trying to count total number of elements in all arrays
db.testRace.aggregate([{ $unwind : "$xs" }, { $group : { _id : null, count : { $sum : 1 } } }])
And it works great unless I start doing massive updates on this collection. Under a heavy load of update operations I get a wrong total, slightly bigger than it should be.
It can be easily reproduced.
First generate some test data
for(var i = 1; i <= 1000000; i++) {
db.testRace.insert({_id: i, xs: [i]});
}
Then simulate a lot of updates
while(true) {
var id = Math.floor((Math.random() * 1000000) + 1);
var obj = db.testRace.find({_id: id}).next();
obj.some="change";
db.testRace.update({_id: id}, obj);
}
And while it is running, run the aggregate/unwind query.
Without load I get the right result, 1000000. But when there are a lot of updates I get bigger numbers, like 1001456.
And if I run query like this
db.testRace.aggregate([{ $unwind : "$xs" }, {$group: {_id:"$xs", count:{$sum: 1}}}, { $sort : { count : -1 } }, { $limit : 2 }]);
I get
"result" : [
{
"_id" : 996972,
"count" : 2
},
{
"_id" : 997789,
"count" : 2
}
],
So it seems the aggregation counts some records twice.
Is this expected behaviour, or am I doing the aggregation wrong?
I tested on a local MongoDB instance, version 2.4.9.
It's expected behavior, due to the way MongoDB handles read isolation. When you have a long-running query (and an aggregation that reads every single document is a long-running query) with updates to that data during the query, the updates may affect whether the updated data is returned by the query: depending on what happens when, you could miss a document, receive it once, or receive it twice.
From the source code:
Any data inserted, deleted, or modified during a yield that should be
returned by a query may or may not be returned by that query. The
query could return: nothing; the data before; the data after; or both
the data before and the data after.
In short, there is no isolation between a query and an
insert/delete/update. AKA, READ_UNCOMMITTED.
https://github.com/mongodb/mongo/blob/master/src/mongo/db/exec/plan_stage.h
Your aggregation query is yielding mid-query, during which some of the data is updated. This impacts the results of the query.
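As an aside, and not a fix for the isolation issue, this particular total can also be computed without $unwind by summing each document's array length, which at least shortens the window during which concurrent updates can interfere:
db.testRace.aggregate([
    { $group : { _id : null, count : { $sum : { $size : "$xs" } } } }
]);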

MongoDB - Aggregation Framework (Total Count)

When running a normal "find" query on MongoDB I can get the total result count (regardless of limit) by running "count" on the returned cursor. So, even if I limit to result set to 10 (for example) I can still know that the total number of results was 53 (again, for example).
If I understand it correctly, the aggregation framework, however, doesn't return a cursor but simply the results. And so, if I used the $limit pipeline operator, how can I know the total number of results regardless of said limit?
I guess I could run the aggregation twice (once to count the results via $group, and once with $limit for the actual limited results), but this seems inefficient.
An alternative approach could be to attach the total number of results to the documents (via $group) prior to the $limit operation, but this also seems inefficient as this number will be attached to every document (instead of just returned once for the set).
Am I missing something here? Any ideas? Thanks!
For example, if this is the query:
db.article.aggregate(
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } },
{ $limit : 5 }
);
How would I know how many results are available (before $limit)? The result isn't a cursor, so I can't just run count on it.
There is a solution using $push and $slice: https://stackoverflow.com/a/39784851/4752635 (#emaniacs mentions it here as well).
But I prefer using 2 queries. The solution that pushes $$ROOT and uses $slice runs into the 16 MB document size limit for large collections. Also, for large collections two queries together seem to run faster than the one with $$ROOT pushing. You can run them in parallel as well, so you are limited only by the slower of the two queries (probably the one that sorts).
The first query filters and then groups by ID to get the number of filtered elements; do not sort or paginate here, it is unnecessary.
The second query filters, sorts and paginates.
I have settled on this solution using 2 queries and the aggregation framework (note: I use node.js with mongoose in this example):
var aggregation = [
{
// If you can match fields at the begining, match as many as early as possible.
$match: {...}
},
{
// Projection.
$project: {...}
},
{
// Some things you can match only after projection or grouping, so do it now.
$match: {...}
}
];
// Copy the filtering stages from the pipeline - these are the same for both the count query and the pagination query.
var aggregationPaginated = aggregation.slice(0);
// Count filtered elements.
aggregation.push(
{
$group: {
_id: null,
count: { $sum: 1 }
}
}
);
// Sort in pagination query.
aggregationPaginated.push(
{
$sort: sorting
}
);
// Paginate.
aggregationPaginated.push(
{
$limit: skip + length
},
{
$skip: skip
}
);
// I use mongoose.
// Get total count.
model.count(function(errCount, totalCount) {
// Count filtered.
model.aggregate(aggregation)
.allowDiskUse(true)
.exec(
function(errFind, documents) {
if (errFind) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_counting'
});
}
else {
// Number of filtered elements.
var numFiltered = documents[0].count;
// Filter, sort and paginate.
model.aggregate(aggregationPaginated)
.allowDiskUse(true)
.exec(
function(errFindP, documentsP) {
if (errFindP) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_pagination'
});
}
else {
return res.json({
'success': true,
'recordsTotal': totalCount,
'recordsFiltered': numFiltered,
'response': documentsP
});
}
});
}
});
});
Assaf, there are going to be some enhancements to the aggregation framework in the near future that may allow you to do your calculations in one pass easily, but right now it is best to perform your calculations by running two queries in parallel: one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors. Also, note that if all you need to do is a count of documents, using the count function is a very efficient way of performing the calculation. MongoDB caches counts within btree indexes, allowing for very quick counts on queries.
If these aggregations turn out to be slow, there are a couple of strategies. First off, keep in mind that you want to start the query with a $match if applicable to reduce the result set; $match stages can also be sped up by indexes. Secondly, you can perform these calculations as pre-aggregations. Instead of possibly running these aggregations every time a user accesses some part of your app, have the aggregations run periodically in the background and store the results in a collection of pre-aggregated values. This way, your pages can simply query the pre-calculated values from this collection.
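A rough sketch of that pre-aggregation idea, assuming MongoDB 2.6+ for the $out stage (the collection name top_authors is illustrative): a scheduled job writes the aggregated values, and the application pages read them back.
// run periodically (cron job, scheduler, etc.)
db.article.aggregate([
    { $group : { _id : "$author", posts : { $sum : 1 } } },
    { $sort : { posts : -1 } },
    { $out : "top_authors" }          // replaces the pre-aggregated collection
]);
// pages then read the pre-computed values
db.top_authors.find().limit(5);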
The $facet aggregation stage can be used for MongoDB versions >= 3.4.
It allows you to fork the pipeline at a particular stage into multiple sub-pipelines, in this case building one sub-pipeline to count the number of documents and another one for sorting, skipping and limiting.
This avoids running the same stages multiple times in multiple requests.
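A hedged sketch of what that $facet pipeline could look like for the example query above (the sub-pipeline contents are illustrative):
db.article.aggregate([
    { $group : { _id : "$author", posts : { $sum : 1 } } },
    { $facet : {
        // one sub-pipeline counts everything that reached this point...
        totalCount : [ { $count : "count" } ],
        // ...the other sorts and limits the page that is actually returned
        results : [ { $sort : { posts : -1 } }, { $limit : 5 } ]
    }}
]);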
If you don't want to run two queries in parallel (one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors), you can just remove $limit from the pipeline, and on the results you can use:
totalCount = results.length;
results.slice(number of skip,number of skip + number of limit);
ex:
db.article.aggregate([
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } }
//{$skip : yourSkip}, //--remove this
//{ $limit : yourLimit }, // remove this too
]).exec(function(err, results){
var totalCount = results.length; // get total count here
results.slice(yourSkip,yourSkip+yourLimit);
});
I had the same problem, and solved it with $project, $slice and $$ROOT.
db.article.aggregate(
    { $group : {
        _id : '$author',
        posts : { $sum : 1 },
        articles : { $push : '$$ROOT' }
    }},
    { $sort : { posts : -1 } },
    { $project : { total : '$posts', articles : { $slice : ['$articles', from, to] } } }
).toArray(function(err, result){
    var articles = result[0].articles;
    var total = result[0].total;
});
You need to declare the from and to variables.
https://docs.mongodb.com/manual/reference/operator/aggregation/slice/
In my case, we use the $out stage to dump the result set from the aggregation into a temp/cache collection, then count it. And since we need to sort and paginate results, we add an index on the temp collection and save the collection name in the session, removing the collection on session close/cache timeout.
I get total count with aggregate().toArray().length

Fast way to find duplicates on indexed column in mongodb

I have a collection of md5 values in MongoDB. I'd like to find all duplicates. The md5 field is indexed. Do you know any fast way to do that using map/reduce?
Or should I just iterate over all records and check for duplicates manually?
My current approach using map/reduce iterates over the collection almost twice (assuming there is a very small number of duplicates):
res = db.files.mapReduce(
function () {
emit(this.md5, 1);
},
function (key, vals) {
return Array.sum(vals);
}
)
db[res.result].find({value: {$gt: 1}}).forEach(  // value > 1 means the md5 appears more than once
function (obj) {
out.duplicates.insert(obj)
});
I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:
db.places.aggregate(
{ $group : {_id : "$extra_info.id", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} },
{ $limit : 5 }
);
It searches for documents whose extra_info.id is used two or more times, sorts the results in descending order of the count, and prints the first 5 of them.
The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;
db.files.find( {"md5" : {$exists:true} }, {"md5" : 1} ).sort( { "md5" : 1} ).forEach( function(current) {
if(current.md5 == previous_md5){
db.duplicates.update( {"_id" : current.md5}, { "$inc" : {count:1} }, true);
}
previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, then the entries will be "back-to-back" after sorting. So we just keep a pointer to previous_md5 and compare it to current.md5. If we find a duplicate, I'm dropping it into the duplicates collection (and using $inc to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
You can do a group by that field and then query to get the duplicates (having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
Although, the fastest thing might be to just do a query which only returns that field and then to do the aggregation in the client. Group/Map-Reduce need to provide access to the whole document which is much more costly than just providing the data from the index (which is now covered in 1.7.3+).
If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5:value, count:value} so you can skip the aggregation, and it will be extremely fast when you need to cull duplicates.
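A minimal sketch of that pre-computed approach, assuming a side collection named md5_counts that is maintained whenever a file document is written (fileDoc and the collection name are illustrative):
// on every insert of a file document:
db.md5_counts.update({ _id : fileDoc.md5 }, { $inc : { count : 1 } }, { upsert : true });
// finding duplicates then becomes a cheap indexed query:
db.md5_counts.find({ count : { $gt : 1 } });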