Iterating over distinct items in one field in MongoDB - mongodb

I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think maybe something like the map/reduce functionality would possibly be suited for this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
# print item
items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
{ "$match" :
{
"basePath" : pathRE
}
},
# Group the keys
{"$group":
{
"_id": "$basePath"
}
},
# Output to a collection "tmp_unique_coll"
{"$out": "tmp_unique_coll"}
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
retItems += 1
items.add(item["_id"])
self.log.info("Received items = %d", retItems)
self.log.info("total unique items = %s", len(items))
General performance compared to my previous solution is about 2X in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method (retrieve, stuff into a Python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method :
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are very much the wrong database type for what I'm trying to do here.

From the discussion points so far I'm going to take a stab at this. And I'm also noting that as of writing, the 2.6 release for MongoDB should be just around the corner, good weather permitting, so I am going to make some references there.
Oh and the FYI that didn't come up in chat, .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such is subject to many limitations.
Finally, note that this solution requires 2.6 and up, or any current dev release over 2.5.3.
The alternative for now is to use mapReduce, because its only restriction is the output size.
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate is doing this more efficiently [and even more so in upcoming release].
db.collection.aggregate([
// Group the key and increment the count per match
{$group: { _id: "$basePath", count: {$sum: 1} }},
// Hey you can even sort it without breaking things
{$sort: { count: 1 }},
// Output to a collection "output"
{$out: "output"}
])
So we are using the $out pipeline stage to get the final result, which is over 16MB, into a collection of its own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
The main point here is that this is nearly live for production, and the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for your own use now.

A MapReduce in the current version of MongoDB would avoid the problem of the results exceeding 16MB.
map = function() {
if(this['basePath']) {
emit(this['basePath'], 1);
}
// if basePath always exists you can just call the emit (it still needs a value):
// emit(this.basePath, 1);
};
reduce = function(key, values) {
return Array.sum(values);
};
For each document the basePath is emitted with a single value representing the count of that value. The reduce simply creates the sum of all the values. The resulting collection would have all unique values for basePath along with the total number of occurrences.
To prevent that error, you'll need to store the results using the out option, which specifies a destination collection.
db.yourCollectionName.mapReduce(
map,
reduce,
{ out: "distinctMR" }
)
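For readers new to the paradigm, here is a rough pure-Python emulation of what the map and reduce functions above compute. It never talks to MongoDB: the sample documents are borrowed from the question, and the names map_doc, reduce_values and map_reduce are made up for illustration only.

```python
from collections import defaultdict

# Sample documents borrowed from the question (itemhash omitted for brevity).
docs = [
    {"basePath": "foo", "internalPath": "Neque"},
    {"basePath": "foo", "internalPath": "porro"},
    {"basePath": "bar", "internalPath": "quisquam"},
    {"basePath": "baz", "internalPath": "est"},
    {"basePath": "foo", "internalPath": "qui"},
]


def map_doc(doc):
    """Mirror of the map function: emit (basePath, 1) when the field is set."""
    if doc.get("basePath"):
        yield doc["basePath"], 1


def reduce_values(key, values):
    """Mirror of the reduce function: sum the emitted counts for one key."""
    return sum(values)


def map_reduce(documents):
    """Group emitted values by key, then reduce each group to one number."""
    emitted = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            emitted[key].append(value)
    return {key: reduce_values(key, values) for key, values in emitted.items()}


counts = map_reduce(docs)
print(counts)
```

The keys of the result are the distinct basePath values. In the real thing the server may call reduce more than once for the same key on partial results, which is why the reduce function has to return the same kind of value it receives.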

@Neil Lunn's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $skip and $limit ($skip has to come first, otherwise it is applied to the already limited batch):
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$skip': Y}, {'$limit': X}])

I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query will give you the "next" unique value based on the previous query result. The idea is that the query will return you one single document, that will contain the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can define this strategy using $gt operator in mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy using multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values sorted ascendingly.
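A minimal sketch of that loop, assuming a PyMongo-style collection object that exposes find_one(filter, projection, sort=...) and an index on the field. The function name iter_distinct is hypothetical, and restricting the first query with $type "string" is just one way to skip null or missing values; it is not something the strategy itself prescribes.

```python
def iter_distinct(coll, field):
    """Yield the distinct string values of `field` in ascending order.

    `coll` is assumed to be a PyMongo-style collection. Each round trip asks
    for the smallest value strictly greater than the previous one, so the
    number of queries is the number of distinct values plus one.
    """
    # First query: restrict to strings so null/missing values are skipped.
    query = {field: {"$type": "string"}}
    # Project only the indexed field so the query can be answered from the index.
    projection = {field: 1, "_id": 0}
    while True:
        doc = coll.find_one(query, projection, sort=[(field, 1)])
        if doc is None:
            return
        value = doc[field]
        yield value
        # Next query: anything that sorts after the value we just saw.
        query = {field: {"$gt": value}}
```

With a real collection this would be used as `for value in iter_distinct(db.coll, "basePath"): ...`. Empty strings are still returned by this sketch; add a $ne condition if they should be dropped.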

Related

This question is about match and sort optimization in MongoDB

{
"_id" : ObjectId("62c3aa311984f666ef75d1n7"),
"eventCode" : "332",
"time" : 1657008013000.0,
"dat" : "61558575921c023a93f81362",
}
This is what a document looks like. I need to calculate some values using an aggregation pipeline, and the first two stages I use are $match and $sort:
$match: {
dat: { $regex: "^" + eventStat.dat },
time: {
$gte: eventStat.time.from,
$lte: eventStat.time.to,
},
},
$sort: { time: 1 }
So I am using these two operators first in the pipeline.
Now the MongoDB documentation says that aggregation will always implement match first before sort, but in some cases it performs sort first. I am not sure, but I think that happens when there is an index on the field used in sort that is not present in match, and MongoDB decides it is better to sort first.
Here I am using time in both match and sort, so I want to know: is there still any case where sort might happen before match?
If yes, I read that a dummy project operator can force it to match first, but what exactly is a dummy project operator?
Most questions about how the database is executing a query can be answered (or at least further reasoned about) by inspecting the explain plan(s) associated with the operation(s). Let's first address a few of your statements directly before turning to inspect explain plans ourselves.
Now Mongodb Document says that aggregation will always implement match first before sort
Where does it say this?
In general, all databases are required to provide results that are semantically valid relative to the query that the client issued. This gets mentioned often when SQL is being discussed as it is a "declarative language". This means that users describe what data they want rather than how to retrieve that data.
MongoDB's aggregation framework is a bit less declarative than SQL. Or said another way, the aggregation framework is a little more descriptive in how to do things. This is because the order in which the stages are defined for a pipeline helps define the semantics of the results. If, for example, one were to $project out a field first and then attempt to use that (no longer present) field in a subsequent stage (such as a $match or $group), MongoDB would not make any adjustments to how it processes the pipeline to make that field available to that later stage. This is because the user specifically requested the removal of that field earlier in the pipeline, which is part of the semantics for the overall pipeline.
Based on this (and another factor that we will talk about next), I would be surprised to see any documentation suggesting that the database always performs a match stage before a sort stage.
but in some cases it performs sort first, I am not sure but I think that happens when there is a index on field key used in sort not present in match and Mongodb decides it better to sort first.
Again returning to generalizations about all databases, one of their primary jobs is to return data to clients as efficiently as possible. So as long as their approach at executing the query does not logically change the results based on the semantics expressed by the client in the query, the database can gather the results in any manner that it thinks will be the most effective.
For aggregation specifically, this most commonly means that stages will either get reordered or combined altogether for execution. Some of the changes that the database will attempt to do are outlined on the Aggregation Pipeline Optimization page.
Logically, filtering data and then sorting it yields the same results as sorting the data and then filtering it. So indeed, one of the optimizations outlined on that page is reordering $match and $sort stages.
The important thing to keep in mind here is mentioned at the very top of that page. The database "attempts to reshape the pipeline for improved performance", but how effective these adjustments are depends on other factors. The biggest factor for many of these is the presence (or absence) of an associated index to support the (reordered) pipeline.
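The claim that filtering then sorting gives the same result as sorting then filtering can be checked with a few lines of plain Python. The documents and the two helper functions below are invented stand-ins for the $match and $sort stages; nothing here touches MongoDB.

```python
# Hypothetical documents with only the fields that matter here.
docs = [
    {"time": 30, "dat": "ABC1"},
    {"time": 10, "dat": "XYZ"},
    {"time": 20, "dat": "ABC2"},
    {"time": 5, "dat": "ABC3"},
]


def matches(doc):
    """Stand-in for the $match stage: dat prefix plus a time range."""
    return doc["dat"].startswith("ABC") and 0 <= doc["time"] <= 25


def by_time(doc):
    """Stand-in for the $sort stage key."""
    return doc["time"]


match_then_sort = sorted(filter(matches, docs), key=by_time)
sort_then_match = list(filter(matches, sorted(docs, key=by_time)))

# Both orders keep the same documents in the same order.
assert match_then_sort == sort_then_match
print([d["time"] for d in match_then_sort])
```

The results are identical either way; what differs is the amount of work, since sorting first has to order documents that the filter would have thrown away.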
Here I am using time in both match and sort so I want to know that is there still any case possible where sort might happen before match?
Unless you are explicitly forcing the database to use a particular plan (such as by hinting it), there is always a chance that it will choose to do something unexpected. Databases are quite good at picking optimal plans though and are always improving with each new release, so ideally we'd leave the system to do its work and not try to do that work for the database (with hints or otherwise). In your particular situation, I believe we can design an approach that is highly optimized for both the $match and the $sort, setting the database up for success.
If yes, I read that a dummy project operator can force it to match first but what exactly is a dummy project operator?
It sounds like this is also asking about other ways in which we could manually influence plan selection. We are going to stay away from that as it is fragile, not something we should rely on long term, and unnecessary for our purposes anyway.
Inspecting Explain
So what happens if we have an index on { time: 1 } and we run the aggregation? Well, the explain output (on 6.0) shows us the following:
queryPlanner: {
parsedQuery: {
'$and': [
{ time: { '$lte': 100 } },
{ time: { '$gte': 0 } },
{ dat: { '$regex': '^ABC' } }
]
},
...
winningPlan: {
stage: 'FETCH',
filter: { dat: { '$regex': '^ABC' } },
inputStage: {
stage: 'IXSCAN',
keyPattern: { time: 1 },
indexBounds: { time: [ '[0, 100]' ] }
...
}
},
Notice that there is no $sort stage at all. What has happened is that the database realized that it could use the { time: 1 } index to do two things at the same time:
Filter the data according to the range predicates on the time field.
Walk the index in the requested sort order without having to manually do so.
So if we go back to the original question of whether aggregation will perform the match or the sort first, we now see that a third option is for the database to do both at the same time!
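When comparing candidate indexes, it can help to check a plan for a blocking sort programmatically. The sketch below walks a winningPlan document that has already been loaded into a Python dict. The helper name plan_stages is made up, and where the winningPlan sits inside the full explain output depends on the server version and on how the explain was requested, so treat the lookup of that dict as something to adapt.

```python
def plan_stages(plan):
    """Return stage names from the outermost stage down to the leaf.

    Only the single-child `inputStage` link is followed, which is enough for
    the plan shape shown above; plans that use `inputStages` are not handled.
    """
    stages = []
    while plan is not None:
        stages.append(plan["stage"])
        plan = plan.get("inputStage")
    return stages


# Trimmed copy of the winningPlan shown above.
winning_plan = {
    "stage": "FETCH",
    "filter": {"dat": {"$regex": "^ABC"}},
    "inputStage": {
        "stage": "IXSCAN",
        "keyPattern": {"time": 1},
        "indexBounds": {"time": ["[0, 100]"]},
    },
}

stages = plan_stages(winning_plan)
# A SORT stage in the plan means the database sorts manually instead of
# reading the index in the requested order.
needs_blocking_sort = "SORT" in stages
print(stages, needs_blocking_sort)
```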
At the very least, you should have an index on { time: 1 }.
Ideally you would instead have a compound index on the other field (dat) as well. There is a bit of a wrinkle here in that you are currently applying a regex operator against the field. If the filter were a direct equality match, the guidance would be easy (prepend dat: 1 as the first key in the compound index).
Without knowing more about your situation, it's unclear which of the two compound indexes the database could use more effectively to support this operation. If the regex filter on dat is highly selective, then { dat: 1, time: 1 } will probably be ideal. It will require a manual sort, but that can all be done after scanning the index before retrieving the full documents. If the regex filter on dat is not very selective, then { time: 1, dat: 1 } may be ideal. This would prevent the need to manually sort, but will result in some additional index key scanning.
In either case, examining explain output may be helpful in finding the approach that is best suited for your particular situation.

MongoDB query subset of array of DBRefs

My main object looks like this:
{
"_id" : ObjectId("56eb0a06560fd7047318465a"),
...
"intervalAbsenceDates" : [
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a06560fd7047318463b")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184467")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184468")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184469")),
An embedded ScheduleIntervalContainer object looks like this:
{
"_id" : ObjectId("56eb0a06560fd7047318463b"),
"end" : ISODate("2022-08-23T07:06:00Z"),
"available" : true,
"confirmation" : true,
"start" : ISODate("2022-08-19T09:33:00Z")
}
Now I want to query all ScheduleIntervalContainers where start and end are in a range.
I have tried a lot, but I cannot even query one ScheduleIntervalContainer by id.
This is my approach:
db.InstitutionUserConnection.find( {
"intervalAbsenceDates" : {
"$ref" : "ScheduleIntervalContainer",
"$id" : ObjectId("56eb0a05560fd7047318446d")
}
})
Could anyone give me a hint on how to query all ScheduleIntervalContainers which have start and end in a time range?
Using DBRef is fraught with problems and the usage is somewhat "antiquated" as it was more of a "shoehorn" solution to requests to provide a mechanism for providing references to external collection data than a really well thought solution.
The general recommendation is to not use DBRef and rather simply use a plain ObjectId or other unique identifier and resolve the actual "collection" or even "database container" with other information, being either another standard property stored in the document or just a plain external reference in your code which identifies the target collection.
The Basic BSON Problem
A good reason for this is that people often make the common mistake (just like you have) of interpreting the "serialized" output from a stored object containing a DBRef to actually be "properties" present on the object. This is not true, as in fact the object has its own BSON type, just like Date and ObjectId. The only time that serialized form is valid is when used with a "strict mode" JSON parser, which would yield the actual DBRef object.
The manual itself is not particularly helpful with that fact and says "DBRefs have the following fields". This is misleading since those properties are not actually available as fields for query, and would not be exposed other than object inspection available to JavaScript processing of $where or mapReduce. And not a good approach to use either of those for "query" purposes.
So those properties are not available for query, but you can of course specify the BSON form directly from your API. Such as:
db.InstitutionUserConnection.find({
"intervalAbsenceDates": DBRef(
"ScheduleIntervalContainer",
ObjectId("56eb0a05560fd70473184469")
)
})
Which resolves and matches correctly since the correct BSON form was sent in a query and the element will actually match.
From this principle we can then move on to the concept of "ranges" as asked in the question.
MongoDB does not "really" do Joins
Until recent releases ( MongoDB 3.2.x series ) the statement was actually "MongoDB does not do joins", and this has been a general distinction in design philosophy away from relational databases.
The general mantra "has" been that "joins are costly" and therefore do not scale well in distributed data systems such as what MongoDB is principally designed for.
So if you are asking for referencing a property in the document which the DBRef refers to, then you are basically out of luck. The only possible action for filtering results based on an external property in this case would be:
Look at all the data in the master collection, either loaded to query in whole or processed individually
For each retrieved document, lookup and expand the DBRef values to their target collection data.
Filter out documents whose expanded data from external references do not actually meet the conditions.
This means all the "expansion" and "filtering" must take place on the "client" to the database rather than on the server itself. There simply is no mechanism to do this, so you end up pulling a lot of data over the network connection.
With a DBRef in place this is purely not possible to perform on the server even with modern releases. Again it's the same BSON type problem, since the "source" contains DBRef types and the "target" contains ObjectId types.
If however you can simply live with the fact that your "range" is looking at the "creation date" data that would be inherently present in any ObjectId, then there is of course another approach that does not involve a "join".
Filtering on ObjectId "range"
Every ObjectId starts with 4-bytes that represents the current timestamp value ( excluding milliseconds ) at the time the ObjectId was created. This is generally a good indicator of the time of "insertion" for the document in question where it was used.
From this you can determine that the "created date" of the documents referenced with DBRef is approximately equal to that portion of the ObjectId value used by that document in the target collection. This allows you to basically construct a "range" value for ObjectId values that would fall between the given range. As a consequence, you can then construct DBRef BSON objects that would work with range operators:
// Define start and end dates to query
var dateStart = new Date("2016-03-17T19:48:21Z"), // equal to 56eb0a05
dateEnd = new Date("2016-03-17T19:48:25Z"); // equal to 56eb0a09
// Convert to hex and pad to ObjectId length
var startRange = new ObjectId(
( dateStart.valueOf() / 1000 ).toString(16) + "0000000000000000"
),
// Yields ObjectId("56eb0a050000000000000000")
endRange = new ObjectId(
( dateEnd.valueOf() / 1000 ).toString(16) + "ffffffffffffffff"
);
// Yields ObjectId("56eb0a09ffffffffffffffff")
// Now query with constructed DBRef values
db.InstitutionUserConnection.find({
"intervalAbsenceDates": {
"$elemMatch": {
"$gte": DBRef("ScheduleIntervalContainer",startRange),
"$lt": DBRef("ScheduleIntervalContainer",endRange),
}
}
})
So as long as "created" is what you are looking for, then that method should suffice for selecting the matching parents without first expanding the DBRef values in the array for further inspection.
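The boundary arithmetic from the shell example can be reproduced in plain Python, which makes the hex values easy to verify. This sketch only builds the 24-character hex strings; the helper name objectid_bound is hypothetical, and wrapping the strings in your driver's ObjectId and DBRef types is left out so the snippet runs without any MongoDB packages.

```python
from datetime import datetime, timezone


def objectid_bound(moment, fill):
    """Build a 24-character ObjectId hex string for a timestamp boundary.

    The first 8 hex digits encode whole seconds since the Unix epoch; the
    remaining 16 digits are padded with `fill` ("0" for a lower bound,
    "f" for an upper bound).
    """
    seconds = int(moment.timestamp())
    return format(seconds, "08x") + fill * 16


# Same dates as the shell example; they must be timezone-aware (UTC).
date_start = datetime(2016, 3, 17, 19, 48, 21, tzinfo=timezone.utc)
date_end = datetime(2016, 3, 17, 19, 48, 25, tzinfo=timezone.utc)

start_hex = objectid_bound(date_start, "0")
end_hex = objectid_bound(date_end, "f")
print(start_hex, end_hex)
```

The two strings match the values noted in the shell comments above, 56eb0a05 followed by zeros and 56eb0a09 followed by f's.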
The Reverse Case Lookup
Of course the other case here is to simply query the "joined" collection first and then look for documents in the "master" collection that contain the ObjectId values within the DBRef. This does of course mean multiple queries to be issued, but it does cure the case of expanding every DBRef just to match the related properties:
// Create array of matching DBRef values
var refs = db.ScheduleIntervalContainer.find({
"start": { "$lte": targetDate },
"end": { "$gte": targetDate }
}).map(function(doc) {
return DBRef("ScheduleIntervalContainer",doc._id)
});
// Find documents that match the DBRef's within the array
db.InstitutionUserConnection.find({
"intervalAbsenceDates": { "$in": refs }
})
The practicality of this depends on the number of matches from the related collection, which determines the size of the array passed to $in, but it does actually yield the desired result.
Actually doing Joins
I mentioned earlier that "modern" MongoDB releases now have an approach to "joining" data from different collections. This is the $lookup aggregation pipeline operator.
But while this can be used to "join" data, the current usage of DBRef does not work here. I did also mention earlier that the basic problem is the data in the array is a DBRef, but the data in the referenced collection is instead an ObjectId.
So if you wanted to use a $lookup approach, then you would first need to use plain ObjectId values in place of the existing DBRef values:
{
"_id" : ObjectId("56eb0a06560fd7047318465a"),
"intervalAbsenceDates" : [
ObjectId("56eb0a06560fd7047318463b"),
ObjectId("56eb0a05560fd70473184467"),
ObjectId("56eb0a05560fd70473184468"),
ObjectId("56eb0a05560fd70473184469")
]
}
With data in that structure you could then use $lookup and other aggregation pipeline methods to just return the documents that actually match the value of a related property, for example "end" within the related object:
db.InstitutionUserConnection.aggregate([
// Presently you need to unwind the array first
{ "$unwind": "$intervalAbsenceDates" },
// Then $lookup to get a resulting array of matches for each member
{ "$lookup": {
"from": "ScheduleIntervalContainer",
"localField": "intervalAbsenceDates",
"foreignField": "_id",
"as": "absenceDates"
}},
// unwind the array result field as well
{ "$unwind": "$absenceDates" },
// Now reform the documents
{ "$group": {
"_id": "$_id",
"intervalAbsenceDates": { "$push": "$absenceDates" }
}},
// Then query on the "end" property for the range
{ "$match": {
"intervalAbsenceDates": {
"$elemMatch": {
"end": {
"$gte": new Date("2016-03-23"),
"$lt": new Date("2016-03-24")
}
}
}
}}
])
The current behaviour of $lookup is that you cannot directly process on an array property in the document, so the procedure shown in "$lookup on ObjectId's in an array" is used to replace the current array with the expanded objects from the other collection.
Once the operations here actually produce a document that now has that related data embedded, it's a straightforward process of looking at the properties of the documents within the array to see if they match the query conditions.
Conclusion
This all should show that DBRef is not a good idea for storing references. Whilst it is possible as demonstrated to work around the problem using the ObjectId values, you generally want to have a plain ObjectId or other key value as the reference and resolve them by other means. And even if the workaround is sufficient, it works just the same with plain ObjectId values or anything else that presents a natural range.
When it comes to using the "values" of referenced properties in such a "join", then of course regardless of using DBRef or other value, it is not a possibility without $lookup to use that in query conditions on the server. All data would first need to be loaded to the client and then resolved with additional queries to the database before those properties can be inspected for filtering.
Since the mechanism of $lookup will in fact result in a form that will look exactly like if you "embedded" the data in the first place, then "embedding" is most often the correct approach, since the data is already present in the source collection and available for query.
There is a lot of "scare media" around regarding the BSON limit of 16MB and saying this is why you keep data in another collection. Sometimes this does indeed apply, but most of the time it does not. After all, 16MB is really quite a large amount of data, and more than most would actually use in general applications.
Citation from MongoDB: The Definitive Guide
To give you an idea of how much 16MB is, the entire text of War and Peace is just 3.14MB.
Queries need to get to an "embedded form" anyway in order to examine the related data, and it is arguable that if you can store an array of DBRef or ObjectId or whatever as embedded data, then storing all of the content they are actually pointing to is not really that much more of a stretch.
The general lesson is that you should be designing based on the actual usage patterns your application applies to the data. If you are querying "related data" all of the time, then it makes most sense to keep that data all in the one collection. Of course other factors apply, but always keep in mind the trade-off in performance considerations by what you are doing.

MongoDB Aggregation as slow as MapReduce?

I'm just starting out with MongoDB and trying to do some simple things. I filled my database with a collection of documents containing an "item" property. I wanted to count how many times each item occurs in the collection.
example of a document:
{ "_id" : ObjectId("50dadc38bbd7591082d920f0"), "item" : "Pons", "lines" : 37 }
So I designed these two functions for doing MapReduce (written in python using pymongo)
all_map = Code("function () {"
" emit(this.item, 1);"
"}")
all_reduce = Code("function (key, values) {"
" var sum = 0;"
" values.forEach(function(value){"
" sum += value;"
" });"
" return sum;"
"}")
This worked like a charm, so I began filling the collection. At around 30.000 documents, the mapreduce already lasts longer than a second... Because NoSQL is bragging about speed I thought I must have been doing something wrong!
A Question here at Stack Overflow made me check out the Aggregation feature of mongodb. So I tried to use the group + sum + sort thingies. Came up with this:
db.wikipedia.aggregate(
{ $group: { _id: "$item", count: { $sum: 1 } } },
{ $sort: {count: 1} }
)
This code works just fine and gives me the same results as the mapreduce set, but it is just as slow. Am I doing something wrong? Do I really need to use other tools like hadoop to get a better performance?
I will place an answer basically summing up my comments. I cannot speak for other techs like Hadoop since I have not yet had the pleasure of finding time to use them but I can speak for MongoDB.
Unfortunately you are using two of the worst operators for any database: computed fields and grouping (or distinct) on a full table scan. The aggregation framework in this case must compute the field, group and then in-memory ( http://docs.mongodb.org/manual/reference/aggregation/#_S_sort ) sort the computed field. This is an extremely inefficient task for MongoDB to perform, in fact most likely any database.
There is no easy way to do this in real-time in line to your own application. Map reduce could be a way out if you didn't need to return the results immediately but since I am guessing you don't really want to wait for this kind of stuff the default method is just to eradicate the group altogether.
You can do this by pre-aggregation. So you can create another collection of grouped_wikipedia and in your application you manage this using an upsert() with atomic operators like $set and $inc (to count the occurrences) to make sure you only get one row per item. This is probably the most sane method of solving this problem.
This does however raise another problem of having to manage this extra collection alongside the detail collection wikipedia, but I believe this to be an unavoidable side effect of getting the right performance here. The benefits will be greater than the loss of having to manage the extra collection.
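As a rough sketch of the pre-aggregation idea, assuming PyMongo and a counter collection whose documents use the item name as _id (both the function name record_item and that document shape are illustrative choices, not something the approach dictates):

```python
def record_item(grouped, item):
    """Bump the pre-aggregated counter for `item`, creating it if needed.

    `grouped` is assumed to be a PyMongo collection (for example
    db.grouped_wikipedia). With upsert=True the counter document is inserted
    the first time an item is seen, and $inc increments it atomically on
    every later call.
    """
    grouped.update_one({"_id": item}, {"$inc": {"count": 1}}, upsert=True)
```

Calling this wherever the application inserts into the detail collection keeps the counts current, so reading them becomes a plain find() on the small collection (sorted by count if needed) instead of a group over every document.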

Apply function and sort in MongoDB without MapReduce

I have an interesting problem. I have a working M/R version of this but it's not really a viable solution in a small-scale environment since it's too slow and the query needs to be executed real-time.
I would like to iterate over each element in a collection and score it, sort by descending, limit to top 10 and return the results to the applications.
Here is the function I'd like applied to each document in pseudo code.
var score = 0;
foreach(tag in document.Tags) {
score += someMap[tag];
}
return score;
Since your someMap is changing each time, I don't see any alternative other than to score all the documents and return the highest-scoring ones. Whatever method you adopt for this type of operation, you'll have to consider all the documents in the collection, which is going to be slow, and will become more and more costly as the collection you're scanning grows.
One issue with map reduce is that each mongod instance can only run one concurrent map reduce. This is a limitation of the javascript engine, which is single-threaded. Multiple map reduces will be interleaved, but they cannot run concurrently with one another. This means that if you're relying on map reduce for "real-time" uses, that is, if your web page has to run a map reduce to render, you'll eventually hit a limit where page load times become unacceptably slow.
You can work around this by querying all the documents into your application, and doing the scoring, sorting, and limiting in your application code. Queries in MongoDB can run concurrently, unlike map reduce, though of course this means that your application servers will have to do a lot of work.
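A minimal sketch of that application-side approach, using plain dicts in place of the documents a query would return. The Tags field name follows the pseudocode in the question; the function names and the sample data are invented for illustration.

```python
import heapq


def score(doc, some_map):
    """Sum the weights of the document's tags; unknown tags count as 0."""
    return sum(some_map.get(tag, 0) for tag in doc.get("Tags", []))


def top_scored(docs, some_map, limit=10):
    """Return (score, doc) pairs for the highest-scoring documents."""
    scored = ((score(doc, some_map), doc) for doc in docs)
    # nlargest keeps only `limit` candidates in memory while scanning.
    return heapq.nlargest(limit, scored, key=lambda pair: pair[0])


some_map = {"a": 5, "b": 2}
docs = [
    {"_id": 1, "Tags": ["a", "b"]},
    {"_id": 2, "Tags": ["b"]},
    {"_id": 3, "Tags": ["a", "a", "c"]},
]
best = top_scored(docs, some_map, limit=2)
print([(s, d["_id"]) for s, d in best])
```

Since top_scored accepts any iterable, a driver cursor that projects only the tags field could presumably be passed in directly, which keeps the amount of data transferred per document small.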
Finally, if you are willing to wait for MongoDB 2.2 to be released (which should be within a few months), you can use the new aggregation framework in place of map reduce. You'll have to massage the someMap to generate the correct pipeline steps. Here's an example of what this might look like if someMap were {"a": 5, "b": 2}:
db.runCommand({aggregate: "foo",
  pipeline: [
    {$unwind: "$tags"},
    {$project: {
      tag1score: {$cond: [{$eq: ["$tags", "a"]}, 5, 0]},
      tag2score: {$cond: [{$eq: ["$tags", "b"]}, 2, 0]}
    }},
    {$project: {score: {$add: ["$tag1score", "$tag2score"]}}},
    {$group: {_id: "$_id", score: {$sum: "$score"}}},
    {$sort: {score: -1}},
    {$limit: 10}
  ]})
This is a little complicated, and bears explaining:
1. First, we "unwind" the tags array, so that the following steps in the pipeline process documents where "tags" is a scalar (the value of the tag from the array) and all the other document fields (notably _id) are duplicated for each unwound element.
2. We use a projection operator to convert from tags to named score fields. The $cond/$eq expression for each roughly means (for the tag1score example) "if the value in the document's 'tags' field is equal to 'a', then return 5 and assign that value to a new field tag1score, else return 0 and assign that". This expression would be repeated for each tag/score combination in your someMap. At this point in the pipeline, each document will have N tagNscore fields, but at most one of them will have a non-zero value.
3. Next we use another projection operator to create a score field whose value is the sum of the tagNscore fields in the document.
4. Next we group the documents by their _id, and sum up the value of the score field from the previous step across all documents in each group.
5. We sort by score, descending (i.e. greatest scores first).
6. We limit to only the top 10 scores.
I'll leave it as an exercise to the reader how to convert someMap into the correct set of projections in step 2, and the correct set of fields to add in step 3.
This is essentially the same set of steps that your application code or map reduce would go through, but it has distinct advantages: unlike map reduce, the aggregation framework is fully implemented in C++ and is faster and more concurrent; and unlike querying all the documents into your application, the aggregation framework works with the data on the server side, saving network load. But like the other two approaches, it still has to consider each document, and can only limit the result set once the score has been calculated for all of them.
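For the conversion of someMap into pipeline stages mentioned above, one possible sketch in plain JavaScript follows. The function name buildScorePipeline is hypothetical, and the sketch assumes that someMap has at least one entry, that the tags live in a "tags" array, and that no tag name begins with "$" (such a string would be read as a field path inside $eq).

```javascript
// Build the aggregation pipeline stages for a given tag-to-weight map.
// Assumptions: non-empty map, array field named "tags", no tag starting with "$".
function buildScorePipeline(someMap, limit) {
  var scoreFields = {}; // one $cond projection per tag
  var addends = [];     // the projected fields to $add together

  Object.keys(someMap).forEach(function (tag, i) {
    var field = "tag" + (i + 1) + "score";
    scoreFields[field] = { $cond: [{ $eq: ["$tags", tag] }, someMap[tag], 0] };
    addends.push("$" + field);
  });

  return [
    { $unwind: "$tags" },
    { $project: scoreFields },
    { $project: { score: { $add: addends } } },
    { $group: { _id: "$_id", score: { $sum: "$score" } } },
    { $sort: { score: -1 } },
    { $limit: limit }
  ];
}

// The result is meant to be used as the `pipeline` value of the aggregate command.
var pipeline = buildScorePipeline({ a: 5, b: 2 }, 10);
```

The pipeline has to be rebuilt whenever someMap changes, which is cheap compared with running it.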

How to set array length after updating it via $addToSet in mongodb?

Document structure looks like this,
{
  blacklists: [],       // elements should be unique
  blacklistsLength: 0   // length of blacklists
}
Adding a set of values to blacklists is easy.
db.posts.update({_id:...}, {$addToSet:{blacklists:{$each:['peter', 'bob', 'steven']}}});
But how can I update blacklistsLength at the same time to reflect the changes?
This is not possible. You can either:
Update the length separately using a subsequent findAndModify command, or
Do it per name, rewriting the query to use a negation in your criteria and $push rather than $addToSet (not strictly needed, but a lot faster with large blacklists, since $addToSet is always O(n) regardless of indexes):
db.posts.update({_id:..., blacklists:{$ne:'peter'}}, {$push:{blacklists:'peter'}, $inc:{blacklistsLength: 1}});
The latter is perfectly safe, since the list and the length are adjusted atomically, though it costs one update per name rather than a single batched one. It also performs better overall on large arrays because of the $push versus $addToSet difference: blacklists tend to become huge, and the $push version of the update uses an index on blacklists in the update criteria, while $addToSet will NOT use an index during its set scan. It is therefore generally the best solution.
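To apply the per-name variant to a whole list of names, the query and update documents can be generated in a loop. The following is only a sketch: buildBlacklistUpdates is a hypothetical helper, and "post-1" stands in for the _id that is elided in the examples above.

```javascript
// Build one (query, update) pair per name, following the $ne + $push + $inc pattern.
// `postId` is a placeholder for the real _id of the post.
function buildBlacklistUpdates(postId, names) {
  return names.map(function (name) {
    return {
      // Matches only if `name` is not already in the array.
      query: { _id: postId, blacklists: { $ne: name } },
      // Append the name and bump the counter in the same update.
      update: { $push: { blacklists: name }, $inc: { blacklistsLength: 1 } }
    };
  });
}

var ops = buildBlacklistUpdates("post-1", ["peter", "bob", "steven"]);
// In the mongo shell each pair would then be applied with something like:
//   ops.forEach(function (op) { db.posts.update(op.query, op.update); });
```

A name that is already present simply fails to match its query, so neither the array nor the counter changes for it.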
Would the following not work?
db.posts.update({_id:...}, {
  $addToSet: {blacklists: {$each: ['peter', 'bob', 'steven']}},
  $set: {blacklistsLength: ['peter', 'bob', 'steven'].length}
});
I had a similar problem; please see the discussion here: google groups mongo
As you can see, a bug was opened following that discussion:
Mongo Jira
As you upsert items into the database, simply query the item to see if it's in your embedded array. That way, you're avoiding pushing duplicate items, and only incrementing the counter as you add new items.
// match only when the item is not already in the array
q = {'blacklists': {'$nin': ['blacklist_to_insert']}}
u = {
  '$push': {'blacklists': 'blacklist_to_insert'},
  '$inc': {'total_blacklists': 1}
}
o = {'upsert': true}
db.posts.update(q, u, o)