Is searching by _id in MongoDB more efficient?

In my use case, I want to look up a document by a given unique string in MongoDB. However, I want my queries to be fast, and searching by _id would add some overhead. I want to know whether MongoDB gives any benefit to searching for a document by _id over any other unique value.
To my knowledge, ObjectIds are similar to any other unique value in a document [point made for the case of searching only].
As for the overhead, you can assume I am caching the string-to-ObjectId mapping, and the cache is very small and in memory [almost negligible], though the DB is large.

Analyzing your query performance
I advise you to use .explain(), provided by MongoDB, to analyze your query performance.
Let's say we are trying to execute this query
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
   "queryPlanner" : {
      "plannerVersion" : 1,
      ...
      "winningPlan" : {
         "stage" : "COLLSCAN",
         ...
      }
   },
   "executionStats" : {
      "executionSuccess" : true,
      "nReturned" : 3,
      "executionTimeMillis" : 0,
      "totalKeysExamined" : 0,
      "totalDocsExamined" : 10,
      "executionStages" : {
         "stage" : "COLLSCAN",
         ...
      },
      ...
   },
   ...
}
More details about this can be found in the MongoDB documentation on explain results.
How efficient is searching by _id, and how do indexes help?
To answer your question: using an index is almost always more efficient for this kind of lookup. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. Since MongoDB creates a unique index on _id by default, searching by _id is efficient out of the box.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
So, YES, using indexes like _id is better!
You can also create your own indexes using createIndex():
db.collection.createIndex( <key and index type specification>, <options> )
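For the case in the question, a minimal sketch (the field name uniqueKey is a placeholder for wherever your unique string lives): a unique index on that field makes equality lookups on it index-backed, just like _id lookups.
db.collection.createIndex( { uniqueKey: 1 }, { unique: true } )   // uniqueKey is hypothetical
db.collection.find( { uniqueKey: "some-string" } ).explain("executionStats")
// winningPlan should now show an IXSCAN stage instead of COLLSCAN,
// with totalDocsExamined equal to 1 (or 0 if there is no match)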
Optimize your MongoDB query
In case you want to optimize your query, there are multiple ways to do that.
Creating custom indexes to support your queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use hint() to Select a Particular Index
db.users.find().hint( { age: 1 } )

Short answer: yes, _id is the primary key and it is indexed, so of course it's fast.
But you can put an index on other fields too and get equally efficient queries on them.
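For what it's worth, explain() on an _id equality match often reports a dedicated fast-path stage (shown as IDHACK on many server versions; the ObjectId below is just a placeholder):
db.collection.find( { _id: ObjectId("507f1f77bcf86cd799439011") } ).explain()
// winningPlan.stage is typically "IDHACK" — the special plan MongoDB uses
// for exact-match queries on _id (the exact stage name varies by version)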

Related

Mongo DB, document count mismatch for a collection

I have a collection User in Mongo. When I do a count on this collection, I get 13204951 documents:
> db.User.count()
13204951
But when I try to find the count of non-stale documents like this, I get a count of 13208778:
> db.User.find({"_id": {$exists: true, $ne: null}}).count()
13208778
> db.User.find({"UserId": {$exists: true, $ne: null}}).count()
13208778
I even tried to get the count of this collection using MongoEngine:
user_list = set(User.objects().values_list('UserId'))
len(user_list)
13208778
Here are the indexes of this User collection
> db.User.getIndexes()
[
   {
      "v" : 1,
      "key" : {
         "_id" : 1
      },
      "name" : "_id_",
      "ns" : "user_db.User"
   },
   {
      "v" : 1,
      "unique" : true,
      "key" : {
         "UserId" : 1
      },
      "name" : "UserId_1",
      "ns" : "user_db.User",
      "sparse" : false,
      "background" : true
   }
]
Any pointers on how to debug the mismatch in counts from these different queries?
Refer to the MongoDB documentation for db.collection.count():
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress.
Even if you are not using a sharded cluster, the basic idea is that db.{collection}.count() may take shortcuts (such as reading collection metadata) to return a count quickly, and that count can be inaccurate; a count() with a query predicate has to evaluate the documents and should be accurate.
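A hedged way to cross-check, using the collection from the question: an aggregation count has to actually process the documents, so it avoids the metadata shortcut (the $group form below works on old servers too; on MongoDB 3.4+ a { $count: "n" } stage does the same thing):
db.User.count()                                    // may be served from collection metadata
db.User.find({ _id: { $exists: true } }).count()   // evaluates the predicate per document
db.User.aggregate([ { $group: { _id: null, n: { $sum: 1 } } } ])   // scans and counts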

Can sorting before grouping improve query performance in Mongo using the aggregate framework?

I'm trying to aggregate data for 100 accounts for a 14-15 month period, grouping by year and month.
However, the query performance is horrible: it takes 22-27 seconds. There are currently over 15 million records in the collection, I have an index on the match criteria, and I can see from explain() that the optimizer uses it.
I tried adding another index on the sort criteria in the query below, and after adding the index the query takes over 50 seconds! This happens even when I remove the sort from the query.
I'm extremely confused. I thought that because grouping can't utilize an index, the grouping could be much faster if the collection was sorted beforehand. Is this assumption correct? If not, what other options do I have? I could live with query times of up to 5 seconds, but nothing more than that.
//Document Structure
{
   Acc: 1,
   UIC: true,
   date: ISODate("2015-12-01T05:00:00Z"),
   y: 2015,
   mm: 12,
   value: 22.3
}
//Query
db.MyCollection.aggregate([
   { "$match" : { "UIC" : true, "Acc" : { "$in" : [1, 2, 3, ..., 99, 100] }, "date" : { "$gte" : ISODate("2015-12-01T05:00:00Z"), "$lt" : ISODate("2017-02-01T05:00:00Z") } } },
   //{ "$sort" : { "UIC" : 1, "Acc" : 1, "y" : -1, "mm" : 1 } },
   { "$group" : { "_id" : { "Acc" : "$Acc", "Year" : "$y", "Month" : "$mm" }, "Sum" : { "$sum" : "$value" } } }
])
What I would suggest is writing a script (it can be in Node.js) that aggregates the data into a different collection. When you have long-running queries like this, it is advisable to maintain a separate collection containing the pre-aggregated data and to query that instead.
My second piece of advice is to give that aggregated collection a composite key and to search it by regular expression. In your case I would build a key of the form accountId:period. For example, for account 1 and February of 2016, the key would be 1:201602.
Then you can query by account and period using regular expressions. For instance, if you wanted the records of account 1 for all of 2016, you could do something like:
db.aggregatedCollection.find({ _id : /^1:2016/ })
Because the regular expression is anchored with ^, it is a prefix match and can still use the index on _id.
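A minimal sketch of such a pre-aggregation pass in the shell, under a couple of assumptions: MongoDB 4.0+ (for $toString), the field names from the document structure above, and aggregatedCollection as the target collection ($out replaces it on every run):
db.MyCollection.aggregate([
   { $match : { UIC : true } },
   { $group : {
      _id : { Acc : "$Acc", y : "$y", mm : "$mm" },
      Sum : { $sum : "$value" }
   } },
   { $project : {
      _id : { $concat : [
         { $toString : "$_id.Acc" }, ":",
         { $toString : "$_id.y" },
         // zero-pad the month so February 2016 becomes "201602"
         { $cond : [ { $lt : [ "$_id.mm", 10 ] }, "0", "" ] },
         { $toString : "$_id.mm" }
      ] },
      Sum : 1
   } },
   { $out : "aggregatedCollection" }
])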
Hope my answer was helpful

MongoDB - aggregation performance tuning

One of my aggregation pipelines is running rather slowly.
About the collection
The collection is named Document, and each document can belong to multiple campaigns and be in one of five statuses, 'a' to 'e'. A small portion of documents may belong to no campaign at all; their campaigns field is set to null.
Sample document:
{_id: id, campaigns: ['c1', 'c2'], status: 'a', ...other fields...}
Some collection stats
Number of documents: 2 million only :(
Size: 2GB
Average document size: 980 bytes.
Storage Size: 780MB
Total index size: 134MB
Number of indexes: 12
Number of fields per document: 30-40; some fields are arrays or objects.
About the Query
The query counts the number of documents per campaign, per status, for documents whose status is in ['a', 'b', 'c']:
[
{$match:{campaigns:{$ne:null}, status:{$in:['a','b','c']}}},
{$unwind:'$campaigns'},
{$group:{_id:{campaign:'$campaigns', status:'$status'}, total:{$sum:1}}}
]
It's expected that the aggregation is going to hit almost the whole collection.
Without an index, the aggregation takes around 8 seconds to complete.
I tried to create an index on
{campaigns: 1, status: 1}
The explain plan shows that the index was scanned, but the aggregation took nearly 11 seconds to complete.
Question
The index contains all the fields required by the aggregation to do the counting. Shouldn't the aggregation hit the index only? The index is only 10MB in size. How could it be slower? If an index is not the answer, is there any other recommendation to tune the query?
Winning plan shows:
{
   "stage" : "FETCH",
   "filter" : { "$not" : { "campaigns" : { "$eq" : null } } },
   "inputStage" : {
      "stage" : "IXSCAN",
      "keyPattern" : { "campaigns" : 1.0, "status" : 1.0 },
      "indexName" : "campaigns_1_status_1",
      "isMultiKey" : true,
      "isUnique" : false,
      "isSparse" : false,
      "isPartial" : false,
      "indexVersion" : 1,
      "direction" : "forward",
      "indexBounds" : {
         "campaigns" : [ "[MinKey, null)", "(null, MaxKey]" ],
         "status" : [ "[\"a\", \"a\"]", "[\"b\", \"b\"]", "[\"c\", \"c\"]" ]
      }
   }
}
If no index, the winning plan is:
{
   "stage" : "COLLSCAN",
   "filter" : {
      "$and" : [
         { "status" : { "$in" : ["a", "b", "c"] } },
         { "$not" : { "campaigns" : { "$eq" : null } } }
      ]
   },
   "direction" : "forward"
}
Update
As requested by @Kevin, here are some details about all the other indexes; sizes are in MB.
"indexSizes" : {
"_id_" : 32,
"team_1" : 8, //Single value field of ObjectId
"created_time_1" : 16, //Document publish time in source system.
"parent_1" : 2, //_id of parent document.
"by.id_1" : 13, //_id of author from a different collection.
"feedids_1" : 8, //Array, _id of ETL jobs contributing to sync of this doc.
"init_-1" : 2, //Initial load time of the doc.
"campaigns_1" : 10, //Array, _id of campaigns
"last_fetch_-1" : 13, //Last sync time of the doc.
"categories_1" : 8, //Array, _id of document categories.
"status_1" : 8, //Status
"campaigns_1_status_1" : 10 //Combined index of campaign _id and status.
},
After reading the docs from MongoDB, I found this:
The inequality operator $ne is not very selective since it often matches a large portion of the index. As a result, in many cases, a $ne query with an index may perform no better than a $ne query that must scan all documents in a collection. See also Query Selectivity.
Looking at a few different articles, it seems that using the $type operator might solve the problem.
Could you use this query:
db.data.aggregate([
   { $match : { campaigns : { $type : 2 }, status : { $in : ["a", "b", "c"] } } },
   { $unwind : '$campaigns' },
   { $group : { _id : { campaign : '$campaigns', status : '$status' }, total : { $sum : 1 } } }
])
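A hedged note on why this can help: BSON type 2 is "string", so with campaigns: { $type: 2 } the planner can constrain the index bounds on campaigns to the string range only, instead of the near-total "everything except null" range that $ne: null produces. You can check the bounds like this (exact output varies by server version):
db.data.find({ campaigns: { $type: 2 } }).explain().queryPlanner.winningPlan
// indexBounds.campaigns should look like [ "[\"\", {})" ]   (strings only)
// versus [ "[MinKey, null)", "(null, MaxKey]" ]   as seen above with $ne: null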

Optimize query performance in MongoDB

I have a collection named App, and I need to query the active (active: true) apps that either belong to a particular user (user_id) or are available to all users (by their _id). I use a query like this:
{
   "active" : true,
   "$or" : [
      { "user_id" : "111111111111111111111111" },
      {
         "_id" : {
            "$in" : [
               ObjectId("222222222222222222222222"),
               ObjectId("333333333333333333333333"),
               ObjectId("444444444444444444444444")
            ]
         }
      }
   ]
}
However, in db.currentOp(true) I see that this query runs very slowly: lockStats.timeLockedMicros.r is about 3000.
How can I optimize performance of this query? I already have the following indexes on App:
> db.App.getIndexes()
[
   {
      "v" : 1,
      "key" : {
         "_id" : 1
      },
      "name" : "_id_",
      "ns" : "mydb.App"
   },
   {
      "v" : 1,
      "key" : {
         "active" : 1,
         "created_at" : -1
      },
      "name" : "active_1_created_at_-1",
      "ns" : "mydb.App",
      "background" : true
   },
   {
      "v" : 1,
      "key" : {
         "active" : 1,
         "user_id" : 1
      },
      "name" : "active_1_user_id_1",
      "ns" : "mydb.App",
      "background" : true
   }
]
Two issues I see here:
1) You do not need an index on the boolean field active: it has low selectivity and does not benefit query performance.
"If overall selectivity is low, and if MongoDB must read a number of documents to return results, then some queries may perform faster without indexes." (source: MongoDB documentation on query selectivity)
2) You need an index on user_id, because the user_id clause inside the $or cannot use the compound index active_1_user_id_1 on its own; user_id is not a leading prefix of that index.
Edit: You can always check index efficiency by running explain(true) and looking at which indexes are used for that query.
I would try to do the following (see the sketch after this list):
Remove the extra indexes: your active field has low cardinality (it's a boolean) and does not help you at all, and you are not using created_at, so there is no reason to keep that index either.
Add an index on the user_id key only.
Store values that are numbers as numbers instead of strings.
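A minimal sketch of the first two suggestions in the shell (index names taken from the getIndexes() output above). With $or, each clause can be served by its own index, so the user_id clause uses the new index while the _id $in clause uses the built-in _id index:
db.App.dropIndex("active_1_created_at_-1")
db.App.dropIndex("active_1_user_id_1")
db.App.createIndex({ user_id: 1 })
db.App.find({
   active: true,
   $or: [
      { user_id: "111111111111111111111111" },
      { _id: { $in: [ ObjectId("222222222222222222222222"),
                      ObjectId("333333333333333333333333"),
                      ObjectId("444444444444444444444444") ] } }
   ]
}).explain()
// the winning plan should show a SUBPLAN over two index scans, one per $or clause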

Using Spring Data Mongodb, is it possible to get the max value of a field without pulling and iterating over an entire collection?

Using mongoTemplate.find(), I specify a Query with which I can call .limit() or .sort():
.limit() returns a Query object
.sort() returns a Sort object
Given this, I can write Query().limit(int).sort(), but this does not perform the desired operation; it merely sorts a limited result set.
I cannot call Query().sort().limit(int) either, since .sort() returns a Sort.
So, using Spring Data, how do I perform the following, as shown in the MongoDB shell? Maybe there's a way to pass a raw query that I haven't found yet?
I would be ok with extending the Paging interface if need be...just doesn't seem to help any. Thanks!
> j = { order: 1 }
{ "order" : 1 }
> k = { order: 2 }
{ "order" : 2 }
> l = { order: 3 }
{ "order" : 3 }
> db.test.save(j)
> db.test.save(k)
> db.test.save(l)
> db.test.find()
{ "_id" : ObjectId("4f74d35b6f54e1f1c5850f19"), "order" : 1 }
{ "_id" : ObjectId("4f74d3606f54e1f1c5850f1a"), "order" : 2 }
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
> db.test.find().sort({ order : -1 }).limit(1)
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
You can do this in Spring Data MongoDB. Mongo will optimize sort/limit combinations if the sort field is indexed (or is the @Id field). This produces very fast, O(log N) or better, results. Otherwise it is still O(N), as opposed to O(N log N), because Mongo uses a top-k algorithm and avoids the global sort (see the MongoDB sort documentation). This is from Mkyong's example, but I do the sort first and then set the limit to one.
Query query = new Query();
query.with(new Sort(Sort.Direction.DESC, "idField"));
query.limit(1);
MyObject maxObject = mongoTemplate.findOne(query, MyObject.class);
Normally, things that are done with aggregate SQL queries can be approached in (at least) three ways in NoSQL stores:
with map/reduce. This effectively goes through all the records, but in a more optimized way (it works with multiple threads, and in clusters). See the map/reduce tutorial for MongoDB.
pre-calculate the max value on each insert and store it separately. Whenever you insert a record, compare it to the previous max value, and if it's greater, update the max value in the db.
fetch everything into memory and do the calculation in the code. That's the most trivial solution, and it would probably work well for small data sets.
Choosing one over the other depends on your usage of this max value. If it is needed rarely, for example for occasional reporting, you can go with map/reduce. If it is used often, then store the current max, as in the sketch below.
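For the second option, a minimal shell sketch of keeping a running max on every insert (the collection name maxima and the field names are hypothetical); the $max update operator only writes when the new value is greater than the stored one:
// after inserting a record with order: 42, fold it into the running max
db.maxima.updateOne(
   { _id: "maxOrder" },
   { $max: { value: 42 } },   // no-op unless 42 exceeds the stored value
   { upsert: true }
)
db.maxima.findOne({ _id: "maxOrder" })   // -> { "_id" : "maxOrder", "value" : <current max> }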
As far as I am aware, Mongo fully supports sort followed by limit: see http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
Getting the max/min via map/reduce is going to be very slow and should be avoided at all costs.
I don't know anything about Spring Data, but I can recommend Morphia to help with queries. Otherwise, a basic way with the Java driver would be:
DBCollection coll = db.getCollection("...");
DBCursor curr = coll.find(new BasicDBObject()).sort(new BasicDBObject("order", -1))
        .limit(1);
if (curr.hasNext())
    System.out.println(curr.next());
Use the aggregation accumulator $max.
As $max is an accumulator operator available only in the $group stage, you need to do a trick: in the group stage, use any constant as the _id.
Let's take the example given on the MongoDB site. Consider a sales collection with the following documents:
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z") }
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z") }
If you want to find out the max price among all the items:
db.sales.aggregate(
   [
      {
         $group:
         {
            _id: "1",   // ** This is the trick
            maxPrice: { $max: "$price" }
         }
      }
   ]
)
Please note the value of "_id": it is the constant "1". You can put any constant there...
Since the first answer is correct but the code is obsolete, I'm replying with a similar solution that worked for me:
Query query = new Query();
query.with(Sort.by(Sort.Direction.DESC, "field"));
query.limit(1);
Entity maxEntity = mongoTemplate.findOne(query, Entity.class);