We have a very large MongoDB collection of documents with some predefined fields that may or may not have a value.
We need to gather the fill rate of each of those fields. We wrote a script that iterates over all documents and counts the fill rates, but the problem is that it takes a long time to process the whole collection.
Is there a way to use db.collection.aggregate or db.collection.mapReduce to run such a script server-side?
Should it have significant performance improvements?
Will it slow down other usages of that collection (e.g. holding a major lock)?
Answering my own question: I was able to migrate my script, which used a cursor to scan the whole collection, to a map-reduce query. Running both on a sample of the collection, the map-reduce version appears to be at least twice as fast.
Here's how the old script worked (in node.js):
var cursor = collection.find(query, projection).sort({_id: 1}).limit(limit);
var next = function() {
    cursor.nextObject(function(err, doc) {
        processDoc(doc, next);
    });
};
next();
and this is the new script:
collection.mapReduce(
    function () {
        var processDoc = function(doc) {
            ...
        };
        processDoc(this);
    },
    function (key, values) {
        return Array.sum(values);
    },
    {
        query: query,
        out: {inline: 1}
    },
    function (error, results) {
        // print results
    }
);
processDoc stayed basically the same, but instead of incrementing a counter on a global stats object, I do:
emit(field_name, 1);
Running both scripts on a sample of 100k documents, the old one took 20 seconds and the new one took 8.
Some notes:
map-reduce's limit option doesn't work on sharded collections; I had to query for _id : { $gte, $lte } to create the sample size needed (see the sketch after these notes).
map-reduce's performance-boost option jsMode : true doesn't work on sharded collections either (it might have improved performance even more); it might be possible to run the job manually on each shard to gain that feature.
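Here's a minimal sketch (node.js, same driver style as the script above) of how an _id bound for the sample could be derived when limit is ignored; sampleSize and the callback are hypothetical names, and a lower $gte bound can be produced the same way:
collection.find({}, { _id: 1 })
    .sort({ _id: 1 })
    .skip(sampleSize - 1)
    .limit(1)
    .toArray(function (err, docs) {
        if (err || docs.length === 0) return callback(err);
        // Everything up to this _id is the sample; use it as the mapReduce `query` option.
        var sampleQuery = { _id: { $lte: docs[0]._id } };
        callback(null, sampleQuery);
    });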
As I understand it, what you want to achieve is to compute something over your documents; after that you have a new "document" that can be queried. You don't need to store the computed "new values".
If you don't need to write these "new values" back into the documents, you can use the Aggregation Framework.
Aggregations operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
https://docs.mongodb.com/manual/aggregation/
Since the Aggregation Framework has a lot of features, I can't give you more detailed guidance on how to resolve your issue, but the sketch below shows the general idea.
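As a hedged illustration (not from the original question), a single $group stage could count fill rates for several fields in one pass; fieldA and fieldB are placeholder names for the pre-defined fields:
db.collection.aggregate([
    { $group: {
        _id: null,
        total:  { $sum: 1 },
        // In aggregation expressions, an existing non-null value compares greater than null,
        // while missing or null fields do not, so each $cond counts documents where the field is filled.
        fieldA: { $sum: { $cond: [ { $gt: ["$fieldA", null] }, 1, 0 ] } },
        fieldB: { $sum: { $cond: [ { $gt: ["$fieldB", null] }, 1, 0 ] } }
    } }
])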
Related
I have the following code for querying ideas in my web app. I'm trying to figure out how I can optimize this code to run faster. Currently, for 100 entries it takes around 350ms, for 1000 entries it takes around 1000ms, and for 10,000 entries it takes around 6000ms for it to send a response. I've looked into indexing, but it doesn't seem to make it any faster. I used MongoLabs to create the index on tool, username, and rating_level for now. I was thinking about using Redis, but not sure how to implement it.
app.get('/api/ideas', function(req, res, next) {
    var query;
    if (req.query.tool) {
        query = Idea.find({ tool: req.query.tool });
    }
    else if (req.query.username) {
        query = Idea.find({ username: req.query.username });
    }
    else {
        query = Idea.find();
    }
    if (req.query.sortRank) {
        query = query.sort({ rating_level: req.query.sortRank });
    }
    else if (req.query.sortDate) {
        query = query.sort({ datetime: req.query.sortDate });
    }
    else {
        query = query.sort({ rating_level: -1 });
    }
    query.exec(function(err, ideas) {
        if (err) return next(err);
        res.send(ideas);
    });
});
You are generating several different query patterns and sort combinations in your code:
tool, rating_level
tool, datetime
username, rating_level
username, datetime
(none), rating_level
(none), datetime
Performant queries will match the keys and sort order (asc/desc) defined for a compound index, or a proper subset of the keys from left to right.
If you only have a single index on "tool, username, rating_level", this will not efficiently support any of your queries as listed above.
The index you've suggested would be useful for queries on:
tool
tool, username
tool, username, rating_level
To return results without having to perform an in-memory sort, the sort order in the compound index should match the sort order in the queries.
For example:
db.collection.find({tool:1}).sort({rating_level: -1})
...would ideally want an index of
{tool:1, rating_level:-1}
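For example, a minimal sketch in the mongo shell, assuming the collection behind the Idea model is called ideas and "some-tool" stands in for a real value (createIndex is the modern helper; older shells use ensureIndex):
db.ideas.createIndex({ tool: 1, rating_level: -1 })
// explain() shows whether the query uses the index and avoids an in-memory sort
db.ideas.find({ tool: "some-tool" }).sort({ rating_level: -1 }).explain()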
Given it looks like your sort order could include both ascending & descending variations, there are potentially a lot of indexes required to cover all your query variations efficiently.
You'll have to consider the tradeoffs for query performance vs index maintenance and storage overhead.
Some approaches to consider include:
Add all the required indexes. With only 10,000 documents it's probably reasonable to do so.
Optimise index coverage for your most common queries. Use the explain() command to get a better understanding of how queries are using indexes.
Adjust your data model to simplify indexes. For example, you could consider using a multikey array so that tool and username are key/value pairs that can be included in the same index.
Adjust your application user interface to simplify the number of query and sort permutations.
Some helpful references:
Use Indexes to Sort Query Results
Optimizing MongoDB Compound Indexes
Indexing Schemaless Documents in Mongo
I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think maybe something like the map/reduce functionality would possibly be suited for this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
    # print item
    items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
    { "$match" :
        {
            "basePath" : pathRE
        }
    },
    # Group the keys
    { "$group":
        {
            "_id": "$basePath"
        }
    },
    # Output to a collection "tmp_unique_coll"
    { "$out": "tmp_unique_coll" }
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
    retItems += 1
    items.add(item["_id"])
self.log.info("Received items = %d", retItems)
self.log.info("total unique items = %s", len(items))
General performance compared to my previous solution is about 2X in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method (retrieve, stuff into a Python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method:
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are very much the wrong database type for what I'm trying to do here.
From the discussion points so far I'm going to take a stab at this. And I'm also noting that as of writing, the 2.6 release for MongoDB should be just around the corner, good weather permitting, so I am going to make some references there.
Oh and the FYI that didn't come up in chat, .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such is subject to many limitations.
And this solution is ultimately a solution for 2.6 and up, or any current dev release over 2.5.3.
The alternative for now is to use mapReduce, because the only restriction there is the output size.
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate does this more efficiently [and even more so in the upcoming release].
db.collection.aggregate([
// Group the key and increment the count per match
{$group: { _id: "$basePath", count: {$sum: 1} }},
// Hey you can even sort it without breaking things
{$sort: { count: 1 }},
// Output to a collection "output"
{$out: "output"}
])
So we are using the $out pipeline stage to get the final result, which is over 16MB, into a collection of its own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
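A minimal sketch of that runCommand form, reusing the pipeline above (the collection and output names are the same placeholders as before):
db.runCommand({
    aggregate: "collection",
    pipeline: [
        { $group: { _id: "$basePath", count: { $sum: 1 } } },
        { $sort: { count: 1 } },
        { $out: "output" }
    ],
    allowDiskUse: true
})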
The main point here is that this is nearly live for production. And the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for your own use now.
A MapReduce, in the current version of Mongo, would avoid the problem of the results exceeding 16MB.
map = function() {
    if (this['basePath']) {
        emit(this['basePath'], 1);
    }
    // if basePath always exists you can skip the check and just call:
    // emit(this.basePath, 1);
};

reduce = function(key, values) {
    return Array.sum(values);
};
For each document the basePath is emitted with a single value representing the count of that value. The reduce simply creates the sum of all the values. The resulting collection would have all unique values for basePath along with the total number of occurrences.
And, since you'll need to store the results to prevent an error, use the out option, which specifies a destination collection:
db.yourCollectionName.mapReduce(
    map,
    reduce,
    { out: "distinctMR" }
)
@Neil Lunn's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $limit and $skip:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$limit': X}, {'$skip': Y}])
I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query will give you the "next" unique value based on the previous query result. The idea is that the query will return you one single document, that will contain the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can define this strategy using $gt operator in mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy using multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values in ascending order.
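A minimal sketch of this strategy in the mongo shell; basePath and the collection name are placeholders, and the field is assumed to be an indexed string (null or empty values would need the extra $ne / $nin filter mentioned above):
var distinctValues = [];
// Seed with the smallest non-null value.
var cursor = db.collection.find({ basePath: { $ne: null } }, { basePath: 1, _id: 0 })
                          .sort({ basePath: 1 }).limit(1);
while (cursor.hasNext()) {
    var value = cursor.next().basePath;
    distinctValues.push(value);
    // Each follow-up query returns the next value strictly greater than the last one.
    cursor = db.collection.find({ basePath: { $gt: value } }, { basePath: 1, _id: 0 })
                          .sort({ basePath: 1 }).limit(1);
}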
What's the easiest way to get all the documents from a collection that are unique based on a single field.
I know I can use db.collections.distrinct to get an array of all the distinct values of a field, but I want to get the first (or really any one) document for every distinct value of one field.
e.g. if the database contained:
{number:1, data:'Test 1'}
{number:1, data:'This is something else'}
{number:2, data:'I\'m bad at examples'}
{number:3, data:'I guess there\'s room for one more'}
it would return (based on number being unique):
{number:1, data:'Test 1'}
{number:2, data:'I\'m bad at examples'}
{number:3, data:'I guess there\'s room for one more'}
Edit: I should add that the server is running Mongo 2.0.8 so no aggregation and there's more results than group will support.
Update to 2.4 and use aggregation :)
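For reference, a minimal sketch of that aggregation, assuming the collection is called docs and has the fields from the example; $first keeps one arbitrary data value per distinct number:
db.docs.aggregate([
    { $group: { _id: "$number", data: { $first: "$data" } } }
])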
When you really need to stick to the old version of MongoDB due to too much red tape involved, you could use MapReduce.
In MapReduce, the map function transforms each document of the collection into a new document and a distinctive key. The reduce function is used to merge documents with the same distinctive key into one.
Your map function would emit your documents as-is and with the number-field as unique key. It would look like this:
var mapFunction = function(document) {
    emit(document.number, document);
};
Your reduce-function receives arrays of documents with the same key, and is supposed to somehow turn them into one document. In this case it would just discard all but the first document with the same key:
var reduceFunction = function(key, documents) {
    return documents[0];
};
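Running it would look something like this (the collection name and the output collection distinct_docs are placeholders); each resulting document has the number as _id and the kept document as value:
db.collection.mapReduce(
    mapFunction,
    reduceFunction,
    { out: "distinct_docs" }
)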
Unfortunately, MapReduce has some problems. It can't use indexes, so at least two JavaScript functions are executed for every single document in the collection (this can be limited by pre-excluding some documents with the query argument to the mapReduce command). When you have a large collection, this can take a while. You also can't fully control how the documents created by MapReduce are formed. They always have two fields: _id with the key and value with the document you returned for the key.
MapReduce is also hard to debug and troubleshoot.
tl;dr: Update to 2.4
In my db I have a collection where documents have a field score, which is a float (-1..1). I can query the db to return the first 20 results ordered by score.
My problem is that I want to modify the score of a doc with a time penalty, based on the field time_updated: the older the doc is, the lower the score should be. And the big problem is that I have to do this at runtime. I could iterate over all documents, update the score, and then order by score, but this would cost too much time, since there is a huge number of documents in the collection.
So my question is: With MongoDB, can I order by a computed property? Is there any way to do that? Or is there a feature in planning for next versions of MongoDB?
Exactly how is the score updated?
If it's simple and can be put in $add, $multiply, etc. terms, then the aggregation pipeline will work well (a sketch follows at the end of this answer). Otherwise you'll need to use a simple MapReduce for doing the score updating.
var mapFunction = function() {
    emit(this._id, <compute score here from this.score and this.time_updated>);
};

var reduceFunction = function (key, values) {
    return values[0]; // trivial reduce function since incoming ids are unique.
};
For 10000 rows either the aggregation pipeline or a simple MapReduce will probably be sufficiently performant.
For much bigger datasets you may need to use a more complex MapReduce (that actually does a reduce) to be memory efficient. You might also want to take advantage of Incremental MapReduce.
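For the aggregation-pipeline case, a minimal sketch assuming a simple linear decay; penaltyPerDay is a hypothetical weight, and score / time_updated are the fields from the question:
var penaltyPerDay = 0.01;  // hypothetical weight
db.collection.aggregate([
    { $project: {
        score: 1,
        time_updated: 1,
        adjustedScore: {
            $subtract: [
                "$score",
                { $multiply: [
                    penaltyPerDay,
                    // subtracting two dates yields milliseconds; convert to days
                    { $divide: [ { $subtract: [ new Date(), "$time_updated" ] },
                                 1000 * 60 * 60 * 24 ] }
                ] }
            ]
        }
    } },
    { $sort: { adjustedScore: -1 } },
    { $limit: 20 }
])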
I'm just starting out with MongoDB and trying out some simple things. I filled up my database with a collection of data containing the "item" property. I wanted to try to count how many times each item appears in the collection.
example of a document:
{ "_id" : ObjectId("50dadc38bbd7591082d920f0"), "item" : "Pons", "lines" : 37 }
So I designed these two functions for doing MapReduce (written in python using pymongo)
all_map = Code("function () {"
" emit(this.item, 1);"
"}")
all_reduce = Code("function (key, values) {"
" var sum = 0;"
" values.forEach(function(value){"
" sum += value;"
" });"
" return sum;"
"}")
This worked like a charm, so I began filling the collection. At around 30,000 documents, the mapreduce already takes longer than a second... Because NoSQL brags about speed, I thought I must have been doing something wrong!
A Question here at Stack Overflow made me check out the Aggregation feature of mongodb. So I tried to use the group + sum + sort thingies. Came up with this:
db.wikipedia.aggregate(
{ $group: { _id: "$item", count: { $sum: 1 } } },
{ $sort: {count: 1} }
)
This code works just fine and gives me the same results as the mapreduce set, but it is just as slow. Am I doing something wrong? Do I really need to use other tools like hadoop to get a better performance?
I will place an answer basically summing up my comments. I cannot speak for other techs like Hadoop since I have not yet had the pleasure of finding time to use them but I can speak for MongoDB.
Unfortunately you are using two of the worst operators for any database: computed fields and grouping (or distinct) on a full table scan. The aggregation framework in this case must compute the field, group and then in-memory ( http://docs.mongodb.org/manual/reference/aggregation/#_S_sort ) sort the computed field. This is an extremely inefficient task for MongoDB to perform, in fact most likely any database.
There is no easy way to do this in real time inline in your own application. Map reduce could be a way out if you didn't need to return the results immediately, but since I am guessing you don't really want to wait for this kind of stuff, the default method is just to eradicate the group altogether.
You can do this by pre-aggregation. So you create another collection, grouped_wikipedia, and in your application you maintain it using an upsert() with atomic operators like $set and $inc (to count the occurrences), making sure you only get one row per item, as sketched below. This is probably the most sane method of solving this problem.
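A minimal sketch of that upsert in the mongo shell (the grouped_wikipedia name comes from this answer; the item value is taken from the question's example document); run something like this each time a document is written to the detail collection:
db.grouped_wikipedia.update(
    { _id: "Pons" },           // one row per item
    { $inc: { count: 1 } },    // atomically count the occurrence
    { upsert: true }
)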
This does, however, raise another problem of having to manage this extra collection alongside the detail collection wikipedia, but I believe this to be an unavoidable side effect of getting the right performance here. The benefits will be greater than the loss of having to manage the extra collection.