Querying a Sub Object in MongoDB is not using the Index - mongodb

I am recording site usage events in a sub object of a (visitor). here is a basic example of the data structure:
{ "_id" : ObjectId("4d4c695794b332a0740009bd"), "evs" : [
{
"ev" : "Visit Home Page",
"d" : 1,
"s" : 1
},
{
"ev" : "Buy Product",
"d" : "110.10",
"upc" : 1234,
"s" : 1
},
{
"ev" : "Sign up to newsletter",
"d" : "1",
"s" : 1
}
]}
I have an index on 'evs.s', but when I search on evs.s, the index is not used:
db.visitors.find({'evs.s':0}).explain()
{
"cursor" : "BtreeCursor evs.s_1",
"nscanned" : 33361,
"nscannedObjects" : 33361,
"n" : 33361,
"millis" : 311,
"nYields" : 105,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"evs.s" : [
[
0,
0
]
]
}
}
That query takes 311 milliseconds and scans through every object.
Here is the index: db.visitors.getIndexes()
{
"ns" : "tracking.visitors",
"unique" : false,
"key" : {
"evs.s" : 1
},
"name" : "evs.s_1",
"v" : 0
}

Your query actually is using an index, as indicated by the cursor type in the explain output ("BtreeCursor evs.s_1"). If you were not using a an index, it would be "BasicCursor".
From your input data, it looks like evs.s might not be a very efficient key to index on. If all of the values of evs.s are either 1 or 0, your index will always hit a large number of matches.
My guess is that your query did not do a full table scan, but that there are actually that many records with a value of evs.s = 0 in your index.
You might compare the output of
db.visits.find({evs.s: 0}).count();
db.visits.find({evs.s: 1}).count();
db.visits.find().count();
to verify this.
There are several things you can do to speed this up:
1) You can use a different index that has more distinct values. This will reduce the search space on each query.
2) You can add a limit statement to your query. This will stop scanning the index once limit documents have been found.

"cursor" : "BtreeCursor evs.s_1"
means that the index is used.

Related

Getting rid of _id in mongodb collection

I know it is not possible to remove the _id field in a mongodb collection. However, the size of my collections is large, that the index on the _id field prevents me from loading the other indices in the RAM. My machine has 125GB of RAM and my collection stats is as follows:
db.call_records.stats()
{
"ns" : "stc_cdrs.call_records",
"count" : 1825338618,
"size" : 438081268320,
"avgObjSize" : 240,
"storageSize" : 468641284752,
"numExtents" : 239,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 165290709024,
"indexSizes" : {
"_id_" : 73450862016,
"caller_id_1" : 45919923504,
"receiver_id_1" : 45919923504
},
"ok" : 1
}
When I do a query like the following:
db.call_records.find({ "$or" : [ { "caller_id": 125091840205 }, { "receiver_id" : 125091840205 } ] }).explain()
{
"clauses" : [
{
"cursor" : "BtreeCursor caller_id_1",
"isMultiKey" : false,
"n" : 401,
"nscannedObjects" : 401,
"nscanned" : 401,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"caller_id" : [
[
125091840205,
125091840205
]
]
}
},
{
"cursor" : "BtreeCursor receiver_id_1",
"isMultiKey" : false,
"n" : 383,
"nscannedObjects" : 383,
"nscanned" : 383,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"receiver_id" : [
[
125091840205,
125091840205
]
]
it takes more than 15 seconds on average to return the results. The indices for both caller_id and receiver_id should be around 90GB, which is OK. However, the 73GB index on the _id makes this query very slow.
You correctly told that you can not remove _id field from your document. You also can not remove an index from this field, so this is something you have to live with.
For some reason you start with the assumption that _id index makes your query slow, which is completely unjustifiable and most probably is wrong. This index is not used and just stays there untouched.
Few things I would try to do in your situation:
You have 400 billion documents in your collection, have you thought that this is a right time to start sharding your database? In my opinion you should.
use explain with your query to actually figure out what slows it down.
Looking at your query, I would also try to do the following:
change your document from
{
... something else ...
receiver_id: 234,
caller_id: 342
}
to
{
... something else ...
participants: [342, 234]
}
where your participants are [caller_id, receiver_id] in this order, then you can put only one index on this field. I know that it will not make your indices smaller, but I hope that because you will not use $or clause, you will get results faster. P.S. if you will do this, do not do this in production, test whether it give you a significant improvement and only then change in prod.
There are a lot of potential issues here.
The first is that your indexes do not include all of the data returned. This means Mongo is getting the _id from the index and then using the _id to retrieve and return the document in question. So removing the _id index, even if you could, would not help.
Second, the query includes an OR. This forces Mongo to load both indexes so that it can read them and then retrieve the documents in question.
To improve performance, I think you have just a few choices:
Add the additional elements to the indexes and restrict the data returned to what is available in the index (this would change indexOnly = true in the explain results)
Explore sharding as Skooppa.com mentioned.
Rework the query and/or the document to eliminate the OR condition.

MongoDB fulltext search not using index

We use mongoDB fulltext search to find products in our database.
Unfortunately it is incredible slow.
The collection contains 89.114.052 documents and I have the suspicion, that the full text index is not used.
Performing a search with explain(), nscannedObjects returns 133212.
Shouldn't this be 0 if an index is used?
My index:
{
"v" : 1,
"key" : {
"_fts" : "text",
"_ftsx" : 1
},
"name" : "textIndex",
"ns" : "search.products",
"weights" : {
"brand" : 1,
"desc" : 1,
"ean" : 1,
"name" : 3,
"shop_product_number" : 1
},
"default_language" : "german",
"background" : false,
"language_override" : "language",
"textIndexVersion" : 2
}
The complete test search:
> db.products.find({ $text: { $search: "playstation" } }).limit(100).explain()
{
"cursor" : "TextCursor",
"n" : 100,
"nscannedObjects" : 133212,
"nscanned" : 133212,
"nscannedObjectsAllPlans" : 133212,
"nscannedAllPlans" : 133212,
"scanAndOrder" : false,
"nYields" : 1041,
"nChunkSkips" : 0,
"millis" : 105,
"server" : "search2:27017",
"filterSet" : false
}
Please have a look at the question you asked:
".... The collection contains 89.114.052 documents and I have the suspicion, that the full text index is not used ...."
You are only "nScanned" for 133212 documents. Of course the index is used. If it was not then 89,114,052 documents ( because this is English locale and not German ) would have otherwise been reported in "nScanned" which means an index is not used.
Your query is slow. Well it seems your hardware is not up to the task of keeping 1333212 documents in memory or otherwise having the super fast disk to "page" effectively. But this is not a MongoDB problem but yours.
You have over 100,000 documents that match your query and even if you just want 100 then you need to accept this is how this works and MongoDB does not "give up" once you have matched 100 documents and yield control. The query pattern here finds all of the matches and then applies the "limit" to the cursor in order just to return the most recent.
Maybe some time in the future the "text" functionality might allow you do do things like you can do in the aggregate version of $geoNear and specify "minimum" and "maximum" values for a "score" in order to improve results. But right now it does not.
So either upgrade your hardware or use an external text search solution if your problem is the slow results on matching over 100,000 documents out of over 89,000,000 documents.

mongodb query should be covered by index but is not

the query:
db.myColl.find({"M.ST": "mostrepresentedvalueinthecollection", "M.TS": new Date(2014,2,1)}).explain()
explain output :
"cursor" : "BtreeCursor M.ST_1_M.TS_1",
"isMultiKey" : false,
"n" : 587606,
"nscannedObjects" : 587606,
"nscanned" : 587606,
"nscannedObjectsAllPlans" : 587606,
"nscannedAllPlans" : 587606,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 9992,
"nChunkSkips" : 0,
"millis" : 174820,
"indexBounds" : {
"M.ST" : [
[
"mostrepresentedvalueinthecollection",
"mostrepresentedvalueinthecollection"
]
],
"M.TS" : [
[
ISODate("2014-03-01T00:00:00Z"),
ISODate("2014-03-01T00:00:00Z")
]
]
},
"server" : "myServer"
additional details: myColl contains about 40m documents, average object size is 300b.
I don't get why indexOnly is not set to true, I have a compound index on {"M.ST":1, "M.TS":1}
The mongo host is a unix box with 16gb RAM and 500gb disk space (spinning disk).
The total index size of the database is 10gb, we've got around 1k upserts/sec, on those 1K 20 are inserts the rest are Increments.
We have another query that adds a third field in the find query (called "M.X"), and also a compound index on "M.ST", "M.X", "M.TS". That one is lightning fast and scans only 330 documents.
Any idea what could be wrong ?
Thanks.
EDIT : here's the structure of a sample document:
{
"_id" : "somestring",
"D" : {
"20140301" : {
"IM" : {
"CT" : 143
}
},
"20140302" : {
"IM" : {
"CT" : 44
}
},
"20140303" : {
"IM" : {
"CT" : 206
}
},
"20140314" : {
"IM" : {
"CT" : 5
}
}
},
"Y" : "someotherstring",
"IM" : {
"CT" : 1
},
"M" : {
"X" : 99999,
"ST" : "mostrepresentedvalueinthecollection",
"TS" : ISODate("2014-03-01T00:00:00.000Z")
},
}
The idea is to store some analytics metrics by month, the "D" field represents an array of documents containing data for each day of the month.
EDIT:
This feature is not currently implemented. Corresponding JIRA ticket is SERVER-2104. You can upvote for it, but for now, to utilize covered index queries you need to avoid use of dot-notation/embedded document.
I think you need to set a projection on that query, to tell mongo what indexes it covers.
Try this..
db.myColl.find({"M.ST": "mostrepresentedvalueinthecollection", "M.TS": new Date(2014,2,1)},{ M.ST:1, M.TS:1, _id:0 }).explain()

Searching for ranges in mongo

What's the most efficient way to find data in Mongo, when the input data is a single value, and the collection data contains min/max ranges? E.g:
record = { min: number, max: number, payload }
Need to locate a record for a number that falls within the min/max range of the record. The ranges never intersect. There is no predictability about the size of the ranges.
The collection has ~6M records in it. If I unpack the ranges (have records for each value in range), I would be looking at about 4B records instead.
I've created the compound index of {min:1,max:1}, but attempt to search using:
db.block.find({min:{$lte:value},max:{$gte:value})
... takes anywhere from few to tens of seconds. Below are the output of explain() and getIndexes(). Is there any trick I can apply to make the search execute significantly faster?
NJmongo:PRIMARY> db.block.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "mispot.block",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"min" : 1,
"max" : 1
},
"ns" : "mispot.block",
"name" : "min_1_max_1"
}
]
NJmongo:PRIMARY> db.block.find({max:{$gte:1135194602},min:{$lte:1135194602}}).explain()
{
"cursor" : "BtreeCursor min_1_max_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1199049,
"nscannedObjectsAllPlans" : 1199050,
"nscannedAllPlans" : 2398098,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 7534,
"nChunkSkips" : 0,
"millis" : 5060,
"indexBounds" : {
"min" : [
[
-1.7976931348623157e+308,
1135194602
]
],
"max" : [
[
1135194602,
1.7976931348623157e+308
]
]
},
"server" : "ccc:27017"
}
If the ranges of your block records never overlap, then you can accomplish this much faster with:
db.block.find({min:{$lte:value}}).sort({min:-1}).limit(1)
This query will return almost instantly since it can find the record with a simple lookup in the index.
The query you are running is slow because the two clauses each match on millions of records that must be merged. In fact, I think your query would run faster (maybe much faster) with separate indexes on min and max since the max part of your compound index can only be used for a given min -- not to search for documents with a specific max.

Understanding an index on an array of subdocuments

I've been looking into array (multi-key) indexing on MongoDB and I have the following questions that I haven't been able to find much documentation on:
Indexes on an array of subdocuments
So if I have an array field that looks something like:
{field : [
{a : "1"},
{b : "2"},
{c : "3"}
]
}
I am querying only on field.a and field.c individually (not both together), I believe I have a choice between the following alternatives:
db.Collection.ensureIndex({field : 1});
db.Collection.ensureIndex({field.a : 1});
db.Collection.ensureIndex({field.c : 1});
That is: an index on the entire array; or two indexes on the embedded fields. Now my questions are:
How do you visualize an index on the entire array in option 1 (is it even useful)? What queries is such an index useful for?
Given the querying situation I have described, which of the above two options is better, and why?
You are correct that if you are querying only on the value of a in the field array, both indexes will, in a sense, help you make your query more performant.
However, have a look at the following 3 queries:
> db.zaid.save({field : [{a: 1}, {b: 2}, {c: 3}] });
> db.zaid.ensureIndex({field:1});
> db.zaid.ensureIndex({"field.a":1});
#Query 1
> db.zaid.find({"field.a":1})
{ "_id" : ObjectId("50b4be3403634cff61158dd0"), "field" : [ { "a" : 1 }, { "b" : 2 }, { "c" : 3 } ] }
> db.zaid.find({"field.a":1}).explain();
{
"cursor" : "BtreeCursor field.a_1",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
"field.a" : [
[
1,
1
]
]
}
}
#Query 2
> db.zaid.find({"field.b":1}).explain();
{
"cursor" : "BasicCursor",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 0,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
}
}
#Query 3
> db.zaid.find({"field":{b:1}}).explain();
{
"cursor" : "BtreeCursor field_1",
"nscanned" : 0,
"nscannedObjects" : 0,
"n" : 0,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
"field" : [
[
{
"b" : 1
},
{
"b" : 1
}
]
]
}
}
Notice that the second query doesn't have an index on it, even though you indexed the array, but the third query does. Choosing your indexes based on how you intend to query your data is as important as considering whether the index itself is what you need. In Mongo, the structure of your index can and does make a very large difference on the performance of your queries if you aren't careful. I think that explains your first question.
Your second question is a bit more open ended, but I think the answer, again, lies in how you expect to query your data. If you will only ever be interested in matching on values of "fields.a", then you should save room in memory for other indexes which you might need down the road. If, however, you are equally likely to query on any of those items in the array, and you are reasonably certain that the array will no grow infinitely (never index on an array that will potentially grow over time to an unbound size. The index will be unable to index documents once the array reaches 1024 bytes in BSON.), then you should index the full array. An example of this might be a document for a hand of playing cards which contains an array describing each card in a users hand. You can index on this array without fear of overflowing beyond the index size boundary since a hand could never have more than 52 cards.