The docs for MongoDB seem to suggest that in order to sort the results of an aggregate call you should specify a dictionary/object like this:
db.users.aggregate([
    { $sort : { age : -1, posts: 1 } }
]);
This is supposed to sort by age and then by posts.
What do I do if I want to sort by posts and then by age? Changing the order of the keys seems to have no effect, probably because these are properties of a JS object. In other words, sorting seems to always follow the lexical order of the keys, which seems rather odd as a design choice...
Am I missing something? Is there a way to specify an ordered list of keys to sort by?
From the docs:
As python dictionaries don’t maintain order you should use SON or
collections.OrderedDict where explicit ordering is required eg
"$sort"
from bson.son import SON
db.users.aggregate([
{"$sort": SON([("posts", 1), ("age", -1)])}
])
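If you'd rather not depend on bson.SON, collections.OrderedDict works the same way, and on Python 3.7+ plain dicts preserve insertion order too. A minimal PyMongo sketch, with the connection and database names assumed:

from collections import OrderedDict
from pymongo import MongoClient

db = MongoClient().mydb  # assumed connection and database name

# Key order is preserved, so this sorts by posts first, then by age.
db.users.aggregate([
    {"$sort": OrderedDict([("posts", 1), ("age", -1)])}
])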
My use case is to get all the datatypes of _id present in a collection; I use these _id types in further downstream processes. This is primarily required because MongoDB doesn't itself return all datatypes with a $gte/$lte operation, only the values of the datatype mentioned. I need the datatypes of _id to achieve the necessary parallelization. Currently, I'm using a DB aggregation call on the collection:
db.collection.aggregate([{ $group: { _id: { $type: "$_id" } } }])
However, this is an extremely expensive call, especially when the size of the collection is in the billions. I was hoping that, since there is no internal comparator in MongoDB to distinguish between, say, an ObjectID and a numeric type, I should be able to get the datatypes of _id from some metadata, ideally. Is there any such way to get this information, or perhaps a faster way to get all datatypes of _id in a collection?
You can hint the query planner to use the _id index in the aggregation query as follows:
db.collection.aggregate(
    [
        { $group: { _id: { $type: "$_id" } } }
    ],
    {
        hint: "_id_"
    }
)
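If you're calling this from PyMongo, recent driver and server versions accept the same hint as a keyword argument on aggregate. A sketch, with connection and collection names assumed:

from pymongo import MongoClient

coll = MongoClient().mydb.collection  # assumed names

# Hint the default _id index by name so the $group can scan the index.
types = list(coll.aggregate(
    [{"$group": {"_id": {"$type": "$_id"}}}],
    hint="_id_",
))
print(types)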
I want to query against a subdocument. The problem is that MongoDB seems to respect the order of the object keys. So if I do the following query, I won't receive any results:
db.getCollection('test').find({docId:'tLDmtdeYuG9DiGGrL',optimizeParams:{"deviceType":"mobile","daytime":"night"}})
If I change the order of optimizeParams, I will get a result:
db.getCollection('test').find({docId:'tLDmtdeYuG9DiGGrL',optimizeParams:{"daytime":"night","deviceType":"mobile"}})
Now a solution would be to use dot notation, but in this case, I will receive ALL documents that contain both keys. But I only want the documents that ONLY have both keys (no others):
db.getCollection('test').find({docId:'tLDmtdeYuG9DiGGrL',"optimizeParams.deviceType":"mobile","optimizeParams.daytime":"night"})
Is there a way how I can execute a query without respecting the key order?
Even if there were a way to respect the key order, I would recommend against using it. JSON documents are not meant to depend on key order, and you should design your schema with this in mind.
For your particular use case, you could do this if you're using version 3.6 or higher:
db.test.aggregate([{
$match: {
"optimizeParams.daytime": "night",
"optimizeParams.deviceType": "mobile"
},
},{
$addFields: {
count: {
$size: {
$objectToArray: "$optimizeParams"
}
}
}
}, {
$match: {
count: 2
}
}])
This answer helped me with the above solution. Please keep indexes in mind when using it.
Although this could be done, "find documents with only these two keys" seems to be irrelevant from any application's perspective. There must be something meaningful that this situation represents. You could use another field to indicate that meaning, and use that to filter instead.
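For example, a hedged PyMongo sketch of that idea: tag the documents with an explicit marker field (the name optimizeParamsKind is made up for illustration; ideally you would set it at write time, when the document's meaning is known) and filter on that instead of on key order or key count:

from pymongo import MongoClient

db = MongoClient().mydb  # assumed connection and database name

# One-off backfill: mark documents whose optimizeParams carry exactly this
# meaning; "optimizeParamsKind" is a hypothetical field name.
db.test.update_many(
    {"optimizeParams.deviceType": "mobile", "optimizeParams.daytime": "night"},
    {"$set": {"optimizeParamsKind": "deviceType+daytime"}},
)

# The query then becomes a plain (and indexable) equality match.
doc = db.test.find_one(
    {"docId": "tLDmtdeYuG9DiGGrL", "optimizeParamsKind": "deviceType+daytime"}
)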
Imagine a collection of movies (stored in a MongoDB collection), with each one looking something like this:
{
_id: 123456,
name: 'Blade Runner',
buyers: [1123, 1237, 1093, 2910]
}
I want to get a list of movies, each one with an indication whether buyer 2910 (for example) bought it.
Any ideas?
I know I can change [1123, 1237, 1093, 2910] to [{id:1123}, {id:1237}, {id:1093}, {id:2910}] to allow the use of $elemMatch in the projection, but would prefer not to touch the structure.
I also know I can perhaps use the $unwind operator (within the aggregation framework), but that seems very wasteful in cases where buyers has thousands of values (basically exploding each document into thousands of copies in memory before matching).
Any other ideas? Am I missing something really simple here?
You can use the $setIsSubset aggregation operator to do this:
var buyer = 2910;
db.movies.aggregate([
    {$project: {
        name: 1,
        buyers: 1,
        boughtIt: {$setIsSubset: [[buyer], '$buyers']}
    }}
])
That will give you all movie docs with a boughtIt field added that indicates whether buyer is contained in the movie's buyers array.
This operator was added in MongoDB 2.6.
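As an aside, on MongoDB 3.4+ the $in aggregation operator tests membership of a single value directly, which reads a little more naturally than wrapping the buyer in a one-element set. A PyMongo sketch of the same projection (connection and database names assumed):

from pymongo import MongoClient

db = MongoClient().mydb  # assumed connection and database name

buyer = 2910
# $in here is the aggregation operator, not the query operator: it checks
# whether `buyer` appears anywhere in the buyers array.
pipeline = [
    {"$project": {
        "name": 1,
        "buyers": 1,
        "boughtIt": {"$in": [buyer, "$buyers"]},
    }}
]
for movie in db.movies.aggregate(pipeline):
    print(movie["name"], movie["boughtIt"])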
Not really sure of your intent here, but you don't need to change the structure just to use $elemMatch in projection. You can just issue it like this:
db.movies.find({},{ "buyers": { "$elemMatch": { "$eq": 2910 } } })
That would filter the returned array elements to just the "buyer" that was indicated, or return nothing where this was not present. It is fair to point out that the $eq operator used here is not actually documented, but it does exist, so it may not be immediately clear that you can construct a condition in that way.
It seems a little wasteful to me though as you are returning "everything" regardless of whether the "buyer" is present or not. So a "query" seems more logical than a projection:
db.movies.find({ "buyers": 2910 })
And optionally keep just the matched element:
db.movies.find({ "buyers": 2910 },{ "buyers.$": 1})
Set operators in the aggregation framework give you more options with $project, which can do more to alter the document. But if you just want to know if someone "bought" the item, then a "query" seems to be the logical and fastest way to do so.
I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think maybe something like the map/reduce functionality would possibly be suited for this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
# print item
items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
{ "$match" :
{
"basePath" : pathRE
}
},
# Group the keys
{"$group":
{
"_id": "$basePath"
}
},
# Output to a collection "tmp_unique_coll"
{"$out": "tmp_unique_coll"}
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
retItems += 1
items.add(item["_id"])
self.log.info("Recieved items = %d", retItems)
self.log.info("total unique items = %s", len(items))
General performance compared to my previous solution is about 2X in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method (retrieve, stuff into a Python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method:
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are very much the wrong database type for what I'm trying to do here.
From the discussion points so far I'm going to take a stab at this. And I'm also noting that as of writing, the 2.6 release for MongoDB should be just around the corner, good weather permitting, so I am going to make some references there.
Oh, and an FYI that didn't come up in chat: .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such it is subject to many limitations.
And this is finally a solution for 2.6 and up, or any current dev release over 2.5.3.
The alternative for now is to use mapReduce, because the only restriction there is the output size.
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate is doing this more efficiently [and even more so in the upcoming release].
db.collection.aggregate([
// Group the key and increment the count per match
{$group: { _id: "$basePath", count: {$sum: 1} }},
// Hey you can even sort it without breaking things
{$sort: { count: 1 }},
// Output to a collection "output"
{$out: "output"}
])
So we are using the $out pipeline stage to get the final result that is over 16MB into a collection of its own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
The main point here is that this is nearly live for production. And the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for your own use now.
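For reference, that runCommand form looks roughly like this from PyMongo. This is a sketch: collection and database names are assumed, and modern servers also require the cursor document in the command:

from pymongo import MongoClient

db = MongoClient().mydb  # assumed database name

# Field order in the command document matters; on Python 3.7+ plain dicts
# preserve it (use bson.SON on older Pythons).
result = db.command({
    "aggregate": "collection",  # assumed collection name
    "pipeline": [
        {"$group": {"_id": "$basePath", "count": {"$sum": 1}}},
        {"$sort": {"count": 1}},
        {"$out": "output"},
    ],
    "allowDiskUse": True,  # let each stage spill to disk
    "cursor": {},          # required on newer server versions
})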
A mapReduce, in the current version of Mongo, would avoid the problem of the results exceeding 16MB.
map = function() {
if(this['basePath']) {
emit(this['basePath'], 1);
}
// if basePath always exists you can skip the check and just emit:
// emit(this.basePath, 1);
};
reduce = function(key, values) {
return Array.sum(values);
};
For each document the basePath is emitted with a single value representing the count of that value. The reduce simply creates the sum of all the values. The resulting collection would have all unique values for basePath along with the total number of occurrences.
And, as you'll need to store the results to prevent an error, use the out option, which specifies a destination collection:
db.yourCollectionName.mapReduce(
map,
reduce,
{ out: "distinctMR" }
)
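If you're driving this from PyMongo (as the question suggests), the same mapReduce can be issued like so. Note this is a PyMongo 3.x sketch, since map_reduce was removed in PyMongo 4; the database name is assumed:

from bson.code import Code
from pymongo import MongoClient

db = MongoClient().mydb  # assumed database name

map_fn = Code("function() {"
              "  if (this.basePath) { emit(this.basePath, 1); }"
              "}")
reduce_fn = Code("function(key, values) { return Array.sum(values); }")

# Results land in the "distinctMR" collection, sidestepping the 16MB limit.
db.yourCollectionName.map_reduce(map_fn, reduce_fn, out="distinctMR")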
@Neil Lunn's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $skip and $limit:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$skip': Y}, {'$limit': X}])
I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query will give you the "next" unique value based on the previous query result. The idea is that the query will return you one single document, that will contain the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can define this strategy using $gt operator in mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy using multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values sorted ascendingly.
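A rough PyMongo sketch of that strategy for the basePath field from the earlier example, assuming an ascending index on it (connection and collection names assumed; null or empty values would need the $ne/$nin handling mentioned above):

from pymongo import MongoClient

coll = MongoClient().mydb.collection  # assumed names

last = None
while True:
    # First pass matches everything; later passes ask for the next value up.
    query = {"basePath": {"$gt": last}} if last is not None else {}
    # Projecting only the indexed field keeps the query covered by the index.
    doc = coll.find_one(query, {"basePath": 1, "_id": 0},
                        sort=[("basePath", 1)])
    if doc is None:
        break  # no values left above `last`: we have seen them all
    last = doc["basePath"]
    print(last)  # one distinct value per iteration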
Suppose I have a query that looks something like this:
db.things.find({
deleted: false,
type: 'thing',
$or: [{
'creator._id': someid
}, {
'parent._id': someid
}, {
'somerelation._id': someid
}]
}).sort({
'date.created': -1
})
That is, I want to find documents that meet one of those three conditions and sort them by newest. However, $or queries do not use indexes in parallel when used with a sort. Thus, how would I index this query?
http://docs.mongodb.org/manual/core/indexes/#index-behaviors-and-limitations
You can assume the following selectivity:
deleted - 99%
type - 25%
creator._id, parent._id, somerelation._id - < 1%
Now you are going to need more than one index for this query; there is no doubt about that.
The question is what indexes?
Now you have to take into consideration that none of your $ors will be able to sort their data optimally using the index, due to a bug in MongoDB's query optimizer: https://jira.mongodb.org/browse/SERVER-1205 .
So you know that the $or will have some performance problems with a sort and that putting the sort field into the $or clause indexes is useless atm.
So considering this, the first index you want is one that covers the base query you are making. As @Leonid said, you could make this into a compound index; however, I would not order it the way he has. Instead, I would do:
db.col.ensureIndex({ type: -1, deleted: -1, "date.created": -1 })
I am very unsure about the deleted field being in the index at all, due to its super low selectivity; having it in the index could, in fact, make the operation less performant than leaving it out (this is true for most databases, including SQL ones). This part will need testing by you; maybe the field should be last (?).
As to the order of the index, again I have just guessed. I have said DESC for all fields because your sort is DESC, but you will need to verify this with explain() yourself.
So that should be able to handle the master clause of your query. Now to deal with those $ors.
Each $or clause will use an index separately, and the MongoDB query optimizer will look for indexes for them separately too, as though they were separate queries altogether. Something worth noting here is a little snag about compound indexes ( http://docs.mongodb.org/manual/core/indexes/#compound-indexes ): they work upon prefixes ( an example note here: http://docs.mongodb.org/manual/core/indexes/#id5 ), so you can't make one single compound index to cover all three clauses. A more optimal way of declaring indexes on the $or (considering the bug above) is:
db.col.ensureIndex({ "creator._id": 1 });
db.col.ensureIndex({ "parent._id": 1 });
db.col.ensureIndex({ "somerelation._id": 1 });
It should be able to get you started on making optimal indexes for your query.
I should stress however that you need to test this yourself.
MongoDB can use only one index per query, so I can't see a way to use indexes to query someid in your model.
So the best approach is to add a special field for this task:
ids = [creator._id, parent._id, somerelation._id]
In this case you'll be able to query without using $or operator:
db.things.find({
deleted: false,
type: 'thing',
ids: someid
}).sort({
'date.created': -1
})
In this case your index will look something like this:
{deleted:1, type:1, ids:1, 'date.created': -1}
If you have the flexibility to adjust the schema, I would suggest adding a new field, associatedIds : [ ], which would hold creator._id, parent._id, somerelation._id. You can update that field atomically when you update the main corresponding field, but now you can have a compound index on this field, type, and created_date, which eliminates the need for $or in your query entirely.
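A hedged sketch of keeping such a field in sync at write time (PyMongo; the field name associatedIds comes from the answer above, everything else is assumed):

from pymongo import MongoClient

db = MongoClient().mydb  # assumed database name

def set_parent(thing_id, someid):
    # Update the real field and the denormalized array in one atomic
    # document-level update, so the two can never drift apart.
    db.things.update_one(
        {"_id": thing_id},
        {
            "$set": {"parent._id": someid},
            "$addToSet": {"associatedIds": someid},
        },
    )

The query then becomes a plain equality match on associatedIds plus the sort, served by the compound index described above.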
Considering your requirement for indexing, I would suggest using the $orderBy operator alongside your $or query. By that I mean you should be able to index on the criteria in your $or expressions and then use $orderBy to sort the result.
For example:
db.things.find({
    $query: {
        deleted: false,
        type: 'thing',
        $or: [{
            'creator._id': someid
        }, {
            'parent._id': someid
        }, {
            'somerelation._id': someid
        }]
    },
    $orderBy: { 'date.created': -1 }
})
The above query would require compound indexes on each of the fields in the $or expressions, combined with the sort field specified in the $orderBy criteria.
for example:
db.things.ensureIndex({ 'parent._id': 1, "date.created": -1 })
and so on for other fields.
It is good practice to specify a "limit" for the result, to prevent MongoDB from performing a huge in-memory sort.
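In driver terms that just means chaining a limit onto the sorted cursor; a minimal PyMongo sketch, with database name and id value as placeholders:

from pymongo import MongoClient
from bson import ObjectId

db = MongoClient().mydb  # assumed database name
someid = ObjectId()      # placeholder for the real id

# Cap the result set so the server never sorts more than it has to return.
cursor = (
    db.things.find({
        "deleted": False,
        "type": "thing",
        "$or": [
            {"creator._id": someid},
            {"parent._id": someid},
            {"somerelation._id": someid},
        ],
    })
    .sort("date.created", -1)
    .limit(100)  # arbitrary page size
)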
Read more on the $orderBy operator here