How to make distinct operation more quickly in mongodb - mongodb

There are 30,000,000 records in one collection.
when I use distinct command on this collection by java, it takes about 4 minutes, the result's count is about 40,000.
Is mongodb's distinct operation so inefficiency?
and how can I make it more efficient?

Is mongodb's distinct operation so inefficiency?
At 30m records? I would say 4 minutes is actually quite good, I think that's just as fast, maybe a little faster than SQL does it.
I would probably test this in other databases before saying it is inefficient.
However, one way of looking at performance is to see if the field is indexed first and if that index is in RAM or can be loaded without page thrashing. Distinct() can use an index so long as the field has an index.
and how can I make it more efficient?
You could use a couple of methods:
Incremental map reduce to distinct the main collection once every, say, 5 mins to a unique collection
And Pre-aggregate the unique collection on save by saving to two collections, one detail and one unique
Those are the two most viable methods of getting around this performantly.
Edit
Distinct() is not outdated and if it fits your needs is actually more performant than $group since it can use an index.

The .distinct() operation is an old one, as is .group(). In general these have been superseded by .aggregate() which should be generally used in preference to these actions:
db.collection.aggregate([
{ "$group": {
"_id": "$field",
"count": { "$sum": 1 }
}
)
Substituting "$field" with whatever field you wish to get a distinct count from. The $ prefixes the field name to assign the value.
Look at the documentation and especially $group for more information.

Related

Parse-Server source _rperm and _wperm mongo queries, and should I index them?

I'm going through my mongo logs to add indexes for my unindexed queries. By default, mongo only logs queries that take over 100ms to complete.
I've found that I have several on _wperm and _rperm keys. I see that is how the ACL gets broken down. But what type of Parse.Query call might create a query like this in the logs?
query: { orderby: {}, $query: { _rperm: { $in: [ null, "*", "[UserId]" ] } } }
I'm even noticing that this query is on a class that has only 8 total objects, yet is taking 133ms to complete, which seems really slow for such a small class, even if it had to do an in memory sort and scan.
Should I solve this at the code level, modifying my query to avoid this type of mongo query? Or should I add an index for these types of queries?
I notice I also have a few that are showing up in the Slow Queries tab on mLab. The query looks like {"_id":"<val>","_wperm":{"$in":["<vals>"]}}, with the suggested index {"_id": 1, "_wperm": 1}, but it has the following note:
"_id" is in the existing {"_id": 1} unique index. The following index recommendation should only be necessary in certain circumstances.
Yet, this is one of my slower queries, taking 320 ms to complete. It's on the _User class. Is that just because the _User class has a lot of rows? Since the _id is unique I feel like it shouldn't really make a difference adding a _wperm index, since I end up with only a single object.
I'm curious if I will see a benefit from taking action on these queries or if I should safely ignore them.
You should index your collections following the mongodb recommendations. In the good old parse.com days, those indexes were created automatically based on the seen workloads. Now you need to create them. Both make sense. The _rperm will be hit on every queries run without a masterKey. The _wperm in every write. In the future, we could automatically create the _id + _wperm index as all writes use this index

MongoDB returning latest documents from each category in one find statement?

Let's say I have a status and a createdAt field and a few other fields like title and content.
The status field can take on open, closed, or pending values.
I basically want to return the 5 latest documents from each status value in one find statement. Is this possible? Also, can this be done without aggregate?
If so, how?
Try following
db.collectionName.find().sort({"_id":-1}).limit(5).pretty()
It return's latest five documents
Bad news: that's not possible with basic queries. For a simple reason: you can only limit on a complete result set. So even when you use a logic OR on the status field, the first twenty documents of your result set may well all be of status "closed". When you apply whatever limit on this result set, up to the first twenty documents will have the status "closed".
I am not even sure wether this can be done easily with aggregation, I have to think about that.
Markus right. Unfortunatelly, it's not possible to do it in one aggregation query due to the lack of slice operation in a project phase of an aggregation pipeline.
It means, that you could easily group documents by status like this:
db.collectionName.aggregate([
{ "$group": {
"_id": "$status",
"docs": { "$push": "$title" }
}}
]);
BUT there is not way to slice resulting sub-arrays during this aggregation. This feature was requested, but not implemented yet.
You could slice it after aggregation using a separate command. Or you always have an option to use Map-Reduce.
Another option is to iterate over statuses in your application and query documents by each status. I would go with this solution in your scenario, because you have just 3 different statuses. It means only 3 queries. It is not fatal(especially if you have index on status field).

Iterating over distinct items in one field in MongoDB

I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think maybe something like the map/reduce functionality would possibly be suited for this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
# print item
items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
{ "$match" :
{
"basePath" : pathRE
}
},
# Group the keys
{"$group":
{
"_id": "$basePath"
}
},
# Output to a collection "tmp_unique_coll"
{"$out": "tmp_unique_coll"}
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
retItems += 1
items.add(item["_id"])
self.log.info("Recieved items = %d", retItems)
self.log.info("total unique items = %s", len(items))
General performance compared to my previous solution is about 2X in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method(retreive, stuff into a python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method :
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are vary much the wrong database type for what I'm trying to do here.
From the discussion points so far I'm going to take a stab at this. And I'm also noting that as of writing, the 2.6 release for MongoDB should be just around the corner, good weather permitting, so I am going to make some references there.
Oh and the FYI that didn't come up in chat, .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such is subject to many limitations.
And this soltion is finally a solution for 2.6 up, or any current dev release over 2.5.3
The alternative for now is use mapReduce because the only restriction is the output size
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate is doing this more efficiently [and even more so in upcoming release].
db.collection.aggregate([
// Group the key and increment the count per match
{$group: { _id: "$basePath", count: {$sum: 1} }},
// Hey you can even sort it without breaking things
{$sort: { count: 1 }},
// Output to a collection "output"
{$out: "output"}
])
So we are using the $out pipeline stage to get the final result that is over 16MB into a collection of it's own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
The main point here, is that this is nearly live for production. And the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for you own use now.
A MapReduce, in the current version of Mongo would avoid the problems of the results exceeding 16MB.
map = function() {
if(this['basePath']) {
emit(this['basePath'], 1);
}
// if basePath always exists you can just call the emit:
// emit(this.basePath);
};
reduce = function(key, values) {
return Array.sum(values);
};
For each document the basePath is emitted with a single value representing the count of that value. The reduce simply creates the sum of all the values. The resulting collection would have all unique values for basePath along with the total number of occurrences.
And, as you'll need to store the results to prevent an error using the out option which specifies a destination collection.
db.yourCollectionName.mapReduce(
map,
reduce,
{ out: "distinctMR" }
)
#Neil Lunn 's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $limit and $skip:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$limit': X}, {'$skip': Y}])
I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query will give you the "next" unique value based on the previous query result. The idea is that the query will return you one single document, that will contain the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can define this strategy using $gt operator in mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy using multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values sorted ascendingly.

how to build index in mongodb in this situation

I have a mongodb database, which has following fields:
{"word":"ipad", "date":20140113, "docid": 324, "score": 98}
which is a reverse index for a log of docs(about 120 millions).
there are two kinds of queries in my system:
one of which is :
db.index.find({"word":"ipad", "date":20140113}).sort({"score":-1})
this query fetch the word "ipad" in date 20140113, and sort the all docs by score.
another query is:
db.index.find({"word":"ipad", "date":20140113, "docid":324})
to speed up these two kinds of query, what index should I build?
Should I build two indexes like this?:
db.index.ensureIndex({"word":1, "date":1, "docid":1}, {"unique":true})
db.index.ensureIndex({"word":1, "date":1, "score":1}
but I think build the two index use two much hard disk space.
So do you have some good ideas?
You are sorting by score descending (.sort({"score":-1})), which means that your index should also be descending on the score-field so it can support the sorting:
db.index.ensureIndex({"word":1, "date":1, "score":-1});
The other index looks good to speed up that query, but you still might want to confirm that by running the query in the mongo shell followed with .explain().
Indexes are always a tradeoff of space and write-performance for read-performance. When you can't afford the space, you can't have the index and have to deal with it. But usually the write-performance is the larger concern, because drive space is usually cheap.
But maybe you could save one of the three indexes you have. "Wait, three indexes?" Yes, keep in mind that every collection must have an unique index on the _id field which is created implicitely when the collection is initialized.
But the _id field doesn't have to be an auto-generated ObjectId. It can be anything you want. When you have another index with an uniqueness-constraint and you have no use for the _id field, you can move that unique-constraint to the _id field to save an index. Your documents would then look like this:
{ _id: {
"word":"ipad",
"date":20140113,
"docid": 324
},
"score": 98
}

MongoDB - Query embbeded documents

I've a collection named Events. Each Eventdocument have a collection of Participants as embbeded documents.
Now is my question.. is there a way to query an Event and get all Participants thats ex. Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
Store them in a subdocument inside of the event document called "Over18" or something. Insert them into that document (and possibly the other if you want) and then when you query the collection, you can instruct the database to only return the "Over18" subdocument. The downside to this is that you store your participants in two different subdocuments and you will have to figure out their age before inserting. This may or may not be feasible depending on your application. If you need to be able to check on arbitrary ages (i.e. sometimes its 18 but sometimes its 21 or 25, etc) then this will not work.
Query the collection and retreive the Participants subdocument and then filter it in your application code. Despite what some people may believe, this isnt terrible because you dont want your database to be doing too much work all the time. Offloading the computations to your application could actually benefit your database because it now can spend more time querying and less time filtering. It leads to better scalability in the long run.
Short answer: no. I tried to do the same a couple of months back, but mongoDB does not support it (at least in version <= 1.8). The same question has been asked in their Google Group for sure. You can either store the participants as a separate collection or get the whole documents and then filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate(
{ $unwind: '$participants' },
{ $match: {'age': {$gte: 18}}},
{ $project: {participants: 1}
)
This will return a list of n documents where n is the number of participants > 18 where each entry looks like this (note that the "participants" array field now holds a single entry instead):
{
_id: objectIdOfTheEvent,
participants: { firstName: 'only one', lastName: 'participant'}
}
It could probably even be flattened on the server to return a list of participants. See the officcial documentation for more information.