Mongo not using index - mongodb

I have the following indexes within a collection:
db.JobStatusModel.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "jobs.JobStatusModel"
},
{
"v" : 1,
"key" : {
"peopleId" : 1,
"jobId" : 1
},
"name" : "peopleId_jobId_compounded",
"ns" : "jobs.JobStatusModel"
},
{
"v" : 1,
"key" : {
"jobId" : 1
},
"name" : "jobId_1",
"ns" : "jobs.JobStatusModel",
"background" : true
},
{
"v" : 1,
"key" : {
"peopleId" : 1,
"disInterested" : 1
},
"name" : "peopleId_1_disInterested_1",
"ns" : "jobs.JobStatusModel",
"background" : true
}
]
I'm trying to work out some slow-running queries against the compound indexes; however, even simple queries aren't making use of indexes:
db.JobStatusModel.find({ jobId : '1f940601ff7385931ec04dca88c853dd' }).explain(true)
{
"cursor" : "BtreeCursor jobId_1",
"isMultiKey" : false,
"n" : 221,
"nscannedObjects" : 221,
"nscanned" : 221,
"nscannedObjectsAllPlans" : 221,
"nscannedAllPlans" : 221,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 1,
"indexBounds" : {
"jobId" : [
[
"1f940601ff7385931ec04dca88c853dd",
"1f940601ff7385931ec04dca88c853dd"
]
]
},
"allPlans" : [
{
"cursor" : "BtreeCursor jobId_1",
"isMultiKey" : false,
"n" : 221,
"nscannedObjects" : 221,
"nscanned" : 221,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"jobId" : [
[
"1f940601ff7385931ec04dca88c853dd",
"1f940601ff7385931ec04dca88c853dd"
]
]
}
}
],
"server" : "mongo3.pilot.dice.com:27017",
"filterSet" : false,
"stats" : {
"type" : "FETCH",
"works" : 222,
"yields" : 1,
"unyields" : 1,
"invalidates" : 0,
"advanced" : 221,
"needTime" : 0,
"needFetch" : 0,
"isEOF" : 1,
"alreadyHasObj" : 0,
"forcedFetches" : 0,
"matchTested" : 0,
"children" : [
{
"type" : "IXSCAN",
"works" : 222,
"yields" : 1,
"unyields" : 1,
"invalidates" : 0,
"advanced" : 221,
"needTime" : 0,
"needFetch" : 0,
"isEOF" : 1,
"keyPattern" : "{ jobId: 1.0 }",
"isMultiKey" : 0,
"boundsVerbose" : "field #0['jobId']: [\"1f940601ff7385931ec04dca88c853dd\", \"1f940601ff7385931ec04dca88c853dd\"]",
"yieldMovedCursor" : 0,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0,
"keysExamined" : 221,
"children" : [ ]
}
]
}
}
As we can see from the output, I am getting "indexOnly" : false, meaning it cannot just do an index scan even though my field is indexed. How can I ensure queries are running only against indexes?

even simple queries aren't making use of indexes:
Your query did use an index, as indicated by the IXSCAN stage and index cursor ("cursor" : "BtreeCursor jobId_1").
I'm trying to work out some slow-running queries against the compound indexes; however,
Based on the provided getIndexes() output, your query on the single field jobId only has one candidate index to consider: {jobId:1}. This query ran in 1 millisecond ("millis" : 1) and returned 221 documents looking at 221 index keys -- an ideal 1:1 hit ratio for key comparisons to matches.
The compound index of {peopleId:1, jobId:1} would only be considered if you also provided a peopleId value in your query. However, you could potentially create a compound index with these fields in the opposite order if you sometimes query solely on jobId but also frequently query on both peopleId and jobId. A compound index on {jobId:1, peopleId:1} would obviate the need for the {jobId:1} index since it could satisfy the same queries.
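For example, a sketch of that reordering (the background option mirrors your existing indexes; run it during a quiet period):
db.JobStatusModel.ensureIndex({ jobId : 1, peopleId : 1 }, { background : true })
db.JobStatusModel.dropIndex("jobId_1")  // now redundant: the compound index can satisfy the same queries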
For more information see Create Indexes to Support Your Queries in the MongoDB manual and the blog post Optimizing MongoDB Compound Indexes.
Note: You haven't mentioned what version of MongoDB server you are using but the format of your explain() output indicates that you're running an older version of MongoDB that has reached End-of-Life (i.e. anything older than MongoDB 3.0 as at Jan-2017). I strongly recommend upgrading to a newer and supported version (eg. MongoDB 3.2 or 3.4) as there are significant improvements. End-of-Life server release series are no longer maintained and may potentially expose your application to known bugs and vulnerabilities that have been addressed in subsequent production releases.
As we can see from the output, I am getting "indexOnly" : false, meaning it cannot just do an index scan even though my field is indexed. How can I ensure queries are running only against indexes?
The indexOnly value will only be true in the special case of a covered query. A covered query is one where all of the fields in the query are part of an index and all of the fields projected in the results are in the same index. Typically indexed queries are not covered: index lookups are used to find matching documents which are then retrieved and filtered to the fields requested in the query projection.

To be sure you get indexOnly : true, you need to return only fields that are in the index, using a projection:
db.collection.find( <query filter>, <projection> )
db.JobStatusModel.find({ jobId : '1f940601ff7385931ec04dca88c853dd' }, {jobId:1, _id:0})
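Re-running explain on that projected query should then report a covered plan (a sketch of the expected result, abbreviated):
db.JobStatusModel.find({ jobId : '1f940601ff7385931ec04dca88c853dd' }, { jobId : 1, _id : 0 }).explain()
// expect "indexOnly" : true and "nscannedObjects" : 0, since no documents need to be fetched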

Related

MongoDB object field and range query index

I have the following structure in the database:
{
"_id" : {
"user" : 14197,
"date" : ISODate("2014-10-24T00:00:00.000Z")
},
...
}
I have a performance problem when I try to select data by user & date range. MongoDB doesn't use the index and runs a full scan over the collection.
db.timeuse.daily.find({ "_id.user": 289006, "_id.date" : {$gt: ISODate("2014-10-23T00:00:00Z"), $lte: ISODate("2014-10-30T00:00:00Z")}}).explain()
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 6,
"nscannedObjects" : 66967,
"nscanned" : 66967,
"nscannedObjectsAllPlans" : 66967,
"nscannedAllPlans" : 66967,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 523,
"nChunkSkips" : 0,
"millis" : 1392,
"server" : "mongo-shard0003:27018",
"filterSet" : false,
"stats" : {
"type" : "COLLSCAN",
"works" : 66969,
"yields" : 523,
"unyields" : 523,
"invalidates" : 16,
"advanced" : 6,
"needTime" : 66962,
"needFetch" : 0,
"isEOF" : 1,
"docsTested" : 66967,
"children" : [ ]
}
}
So far I have found only one way: use $in.
db.timeuse.daily.find({"_id": { $in: [
{"user": 289006, "date": ISODate("2014-10-23T00:00:00Z")},
{"user": 289006, "date": ISODate("2014-10-24T00:00:00Z")}
]}}).explain()
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 2,
"nscannedObjects" : 2,
"nscanned" : 2,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
{
"user" : 289006,
"date" : ISODate("2014-10-23T00:00:00Z")
},
{
"user" : 289006,
"date" : ISODate("2014-10-23T00:00:00Z")
}
],
[
{
"user" : 289006,
"date" : ISODate("2014-10-24T00:00:00Z")
},
{
"user" : 289006,
"date" : ISODate("2014-10-24T00:00:00Z")
}
]
]
},
Is there a more elegant way to run this kind of query?
TL;DR: Don't put your data in the _id field and use a compound index: db.timeuse.daily.ensureIndex( { "user" : 1, "date": 1 } ).
Explanation:
You're abusing the _id key convention, or more precisely, the fact that MongoDB can index entire objects. What you want to achieve requires either two separate indexes that can be combined (a feature called index intersection, which by now should be available in MongoDB, though it has limitations) or a single index over the set of keys, which MongoDB calls a compound index.
The _id field is indexed by default, but it's indexed as a whole, i.e. the _id index will only support equality queries on the entire object, rather than on parts of the object. That also explains why the $in query works.
In general, that data structure with the default index will behave oddly. Consider this:
> db.sort.insert({"_id" : {"name" : "foo", value : 4343} });
> db.sort.insert({"_id" : {"name" : "foo", value : 4343, bla : "fooffo"} });
> db.sort.find();
{ "_id" : { "name" : "foo", "value" : 4343 } }
{ "_id" : { "name" : "foo", "value" : 4343, "bla" : "fooffo" } }
> db.sort.find({"_id" : {"name" : "foo", value : 4343} });
{ "_id" : { "name" : "foo", "value" : 4343 } }
// no second result here...
Imagine MongoDB basically hashed the entire object and was simply looking for the object hash - such an index can't support range queries based on some part of the hash.
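Putting the TL;DR together, a minimal sketch of the restructured schema and query (migrating your existing documents is left aside):
db.timeuse.daily.ensureIndex({ "user" : 1, "date" : 1 })
db.timeuse.daily.insert({ user : 289006, date : ISODate("2014-10-24T00:00:00Z") })
db.timeuse.daily.find({
    "user" : 289006,
    "date" : { $gt : ISODate("2014-10-23T00:00:00Z"), $lte : ISODate("2014-10-30T00:00:00Z") }
}).explain()
// explain should now report "cursor" : "BtreeCursor user_1_date_1" instead of "BasicCursor"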

Mongodb query excessively slow

I have a collection of tweets, with indexes on userid and tweeted_at (date). I want to find the dates of the oldest and newest tweets in the collection for a user, but the query runs very slowly.
I used explain, and here's what I got. I tried reading the documentation for explain, but I don't understand what is going on here. Is the explain just on the sort? If so, why does it take so long when it's using the index?
> db.tweets.find({userid:50263}).sort({tweeted_at:-1}).limit(1).explain(1)
{
"cursor" : "BtreeCursor tweeted_at_1 reverse",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 12705,
"nscanned" : 12705,
"nscannedObjectsAllPlans" : 12705,
"nscannedAllPlans" : 12705,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 188,
"nChunkSkips" : 0,
"millis" : 7720,
"indexBounds" : {
"tweeted_at" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
},
"allPlans" : [
{
"cursor" : "BtreeCursor tweeted_at_1 reverse",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 12705,
"nscanned" : 12705,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tweeted_at" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
}
}
],
"server" : "adams-server:27017",
"filterSet" : false,
"stats" : {
"type" : "LIMIT",
"works" : 12807,
"yields" : 188,
"unyields" : 188,
"invalidates" : 0,
"advanced" : 0,
"needTime" : 12705,
"needFetch" : 101,
"isEOF" : 1,
"children" : [
{
"type" : "FETCH",
"works" : 12807,
"yields" : 188,
"unyields" : 188,
"invalidates" : 0,
"advanced" : 0,
"needTime" : 12705,
"needFetch" : 101,
"isEOF" : 1,
"alreadyHasObj" : 0,
"forcedFetches" : 0,
"matchTested" : 0,
"children" : [
{
"type" : "IXSCAN",
"works" : 12705,
"yields" : 188,
"unyields" : 188,
"invalidates" : 0,
"advanced" : 12705,
"needTime" : 0,
"needFetch" : 0,
"isEOF" : 1,
"keyPattern" : "{ tweeted_at: 1.
0 }",
"boundsVerbose" : "field #0['twe
eted_at']: [MaxKey, MinKey]",
"isMultiKey" : 0,
"yieldMovedCursor" : 0,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0,
"keysExamined" : 12705,
"children" : [ ]
}
]
}
]
}
}
>
> db.tweets.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "honeypot.tweets"
},
{
"v" : 1,
"unique" : true,
"key" : {
"tweet_id" : 1
},
"name" : "tweet_id_1",
"ns" : "honeypot.tweets",
"dropDups" : true
},
{
"v" : 1,
"key" : {
"tweeted_at" : 1
},
"name" : "tweeted_at_1",
"ns" : "honeypot.tweets"
},
{
"v" : 1,
"key" : {
"keywords" : 1
},
"name" : "keywords_1",
"ns" : "honeypot.tweets"
},
{
"v" : 1,
"key" : {
"user_id" : 1
},
"name" : "user_id_1",
"ns" : "honeypot.tweets"
}
]
>
By looking at the cursor field you can see which index was used:
"cursor" : "BtreeCursor tweeted_at_1 reverse",
BtreeCursor indicates that the query used an index; tweeted_at_1 is the name of the index that was used, traversed in reverse order.
You should check the explain documentation for a detailed description of each field.
Your query took 7720 ms (millis) and 12705 documents were scanned (nscanned).
The query is slow because MongoDB scanned the entire tweeted_at index without finding a single match ("n" : 0): it used your index for sorting the data, not for filtering on userid.
To create an index that will be used both for querying and for sorting, you should create a compound index. A compound index is a single index structure that references multiple fields; it can contain up to 31 fields. You can create one like this (the order of fields is important):
db.tweets.ensureIndex({userid: 1, tweeted_at: -1});
This index will be used for searching on the userid field and for sorting by the tweeted_at field.
You can read more and see further examples about using indexes for sorting in the MongoDB documentation.
Edit
If you have other indexes, MongoDB may be choosing one of them instead. When you're testing query performance you can use hint() to force a specific index.
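For example (a sketch; the key pattern matches the compound index suggested above):
db.tweets.find({ userid : 50263 }).sort({ tweeted_at : -1 }).limit(1).hint({ userid : 1, tweeted_at : -1 })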
When testing the performance of your queries you should always run multiple tests and average the results.
Also, if your queries are slow even when using indexes, I would check whether you have enough memory on the server. Loading data from disk is an order of magnitude slower than loading it from memory. You should always ensure that you have enough RAM, so that all of your data and indexes fit in memory.
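A quick way to sanity-check that from the shell (a sketch; compare these numbers against the RAM available to mongod):
db.tweets.stats().totalIndexSize   // total size of all indexes on the collection, in bytes
db.serverStatus().mem              // resident/virtual memory used by mongod, in MB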
Looks like you need to create an index on tweeted_at and userid.
db.tweets.ensureIndex({'tweeted_at':1, 'userid':1})
That should make the query very quick indeed (but at a cost of storage and insert time).

MongoDB - can not get a covered query

So I have an empty database 'tests' and a collection named 'test'.
First I ensured that my index was set correctly.
db.test.ensureIndex({t:1})
db.test.getIndices()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "tests.test"
},
{
"v" : 1,
"key" : {
"t" : 1
},
"name" : "t_1",
"ns" : "tests.test"
}
]
After that I inserted some test records.
db.test.insert({t:1234})
db.test.insert({t:5678})
When I query the DB with following command and let Mongo explain the results I get the following output:
db.test.find({t:1234},{_id:0}).explain()
{
"cursor" : "BtreeCursor t_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"t" : [
[
1234,
1234
]
]
},
"server" : "XXXXXX:27017",
"filterSet" : false
}
Can anyone please explain to me why indexOnly is false?
Thanks in advance.
For a covered index query, you need to retrieve only those fields that are in the index:
> db.test.find({ t: 1234 },{ _id: 0, t: 1}).explain()
{
"cursor" : "BtreeCursor t_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 0,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"t" : [
[
1234,
1234
]
]
},
"server" : "ubuntu:27017",
"filterSet" : false
}
Essentially this means that only the index is used to retrieve the data, without the need to go back to the actual document and fetch further information. This can be as many fields as you need (within reason), but they do need to be included within the index and be the only fields that are returned.
Hmm, the reason has not been clearly explained (it confused me, actually), so here is my effort.
Essentially, in order for MongoDB to know that said index covers the query, it has to know which fields you want.
If you just say you don't want _id, how can it know that * - _id = t without looking?
Here * represents all fields, as it does in SQL.
The answer is that it cannot. That is why you need to provide the full field/select/projection definition (whatever word they use for it), so that MongoDB can know that what you return fits within the index.
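To make the contrast concrete (a sketch, using the same collection as above):
db.test.find({ t : 1234 }, { _id : 0 }).explain().indexOnly         // false: the projection is open-ended
db.test.find({ t : 1234 }, { _id : 0, t : 1 }).explain().indexOnly  // true: the fields are enumerated and all live in the index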

Sort on $geoWithin geospatial query in MongoDB

I'm trying to retrieve a bunch of Polygons stored inside my db, and sort them by radius. So I wrote a query with a simple $geoWithin.
So, without sorting the code looks like this:
db.areas.find(
{
"geometry" : {
"$geoWithin" : {
"$geometry" : {
"type" : "Polygon",
"coordinates" : [ [ /** omissis: array of points **/ ] ]
}
}
}
}).limit(10).explain();
And the explain result is the following:
{
"cursor" : "S2Cursor",
"isMultiKey" : true,
"n" : 10,
"nscannedObjects" : 10,
"nscanned" : 367,
"nscannedObjectsAllPlans" : 10,
"nscannedAllPlans" : 367,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 2,
"indexBounds" : {
},
"nscanned" : 367,
"matchTested" : NumberLong(10),
"geoTested" : NumberLong(10),
"cellsInCover" : NumberLong(27),
"server" : "*omissis*"
}
(It shows the cursor as S2Cursor, letting me understand that my compound index has not been used. Even so, it's fast.)
So, whenever I try to add a sort command, simply with .sort({ radius: -1 }), the query becomes extremely slow:
{
"cursor" : "S2Cursor",
"isMultiKey" : true,
"n" : 10,
"nscannedObjects" : 58429,
"nscanned" : 705337,
"nscannedObjectsAllPlans" : 58429,
"nscannedAllPlans" : 705337,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 3,
"nChunkSkips" : 0,
"millis" : 3186,
"indexBounds" : {
},
"nscanned" : 705337,
"matchTested" : NumberLong(58432),
"geoTested" : NumberLong(58432),
"cellsInCover" : NumberLong(27),
"server" : "*omissis*"
}
with MongoDB scanning all the documents. Obviously I tried to add a compound index, like { radius: -1, geometry : '2dsphere' } or { geometry : '2dsphere' , radius: -1 }, but nothing helped. Still very slow.
I would like to know whether I'm using the compound index the wrong way, whether the S2Cursor tells me something I should change in my indexing strategy, and, overall, what I am doing wrong.
(PS: I'm using MongoDB 2.4.5+, so the problem is NOT caused by second field ascending in compound index when using 2dsphere index as reported here https://jira.mongodb.org/browse/SERVER-9647)
First of all, S2Cursor means that the query used a geographic index.
There can be multiple reasons why the sort operation is slow: sort operations require memory, and your server may have very little of it. You should consider executing sort operations in application code, not on the server side.
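A minimal sketch of sorting in code from the mongo shell (this assumes each document stores a numeric radius field, as the .sort({ radius: -1 }) in the question implies; note it pulls all matches to the client before limiting):
var areas = db.areas.find({
    "geometry" : {
        "$geoWithin" : {
            "$geometry" : { "type" : "Polygon", "coordinates" : [ [ /* array of points */ ] ] }
        }
    }
}).toArray();
areas.sort(function (a, b) { return b.radius - a.radius; });  // descending by radius
var top10 = areas.slice(0, 10);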

In mongodb, why is it faster to query indexed subdocument array than indexed first level documents?

So this is what my database looks like:
> show dbs
admin 0.203125GB
local 0.078125GB
profiler 63.9228515625GB
> use profiler
switched to db profiler
> show collections
documents
mentions
A document in mentions is like this:
> db.mentions.findOne()
{
"_id" : ObjectId("51ec29ef1b63042f6a9c6fd2"),
"corpusID" : "GIGAWORD",
"docID" : "WPB_ENG_20100226.0044",
"url" : "http://en.wikipedia.org/wiki/Taboo",
"mention" : "taboos",
"offset" : 4526
}
A document in documents looks like this:
> db.documents.findOne()
{
"_id" : ObjectId("51ec2d981b63042f6ae4ca0b"),
"sentence_offsets" : [
..................
],
"docID" : "WPB_ENG_20101020.0002",
"text" : ".........",
"verb_offsets" : [
.............
],
"mentions" : [
{
"url" : "http://en.wikipedia.org/wiki/Washington,_D.C.",
"mention" : "Washington",
"ner" : "ORG",
"offset" : 122
},
...................
],
"corpusID" : "GIGAWORD",
"chunk_offsets" : [
.................
]
}
There are 100 million documents in mentions and 1.3 million in documents. Each mention appearing in mentions should also appear once in some document's mentions array. The reason I store mention info in documents was to avoid going into mentions to retrieve context. Yet when querying mentions alone, I thought it would be faster to have the independent mentions collection.
However, after I experimented with indexes on both mentions.url/mentions.mention and documents.mentions.url/documents.mentions.mention, and queried the same url/mention in both collections, I found it was twice as fast to get a response from the documents collection as from the mentions collection.
I am not sure how the index works internally, but I assume both indexes have the same size since there is an equal number of mentions in both collections. Thus they should have the same response time?
I was trying something like
> db.mentions.find({url: "http://en.wikipedia.org/wiki/Washington,_D.C."}).explain()
so there shouldn't be any difference in network overhead.
Here is the output of
> db.mentions.find({mention: "Illinois"}).explain()
{
"cursor" : "BtreeCursor mention_1",
"isMultiKey" : false,
"n" : 4342,
"nscannedObjects" : 4342,
"nscanned" : 4342,
"nscannedObjectsAllPlans" : 4342,
"nscannedAllPlans" : 4342,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 14,
"nChunkSkips" : 0,
"millis" : 18627,
"indexBounds" : {
"mention" : [
[
"Illinois",
"Illinois"
]
]
},
"server" : "----:----"
}
and that of
> db.documents.find({"mentions.mention": "Illinois"}).explain()
{
"cursor" : "BtreeCursor mentions.mention_1",
"isMultiKey" : true,
"n" : 3102,
"nscannedObjects" : 3102,
"nscanned" : 3102,
"nscannedObjectsAllPlans" : 3102,
"nscannedAllPlans" : 3102,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 8,
"nChunkSkips" : 0,
"millis" : 7862,
"indexBounds" : {
"mentions.mention" : [
[
"Illinois",
"Illinois"
]
]
},
"server" : "----:----"
}
And the stats (yeah, I restored the collection and haven't indexed documents.url yet):
> db.documents.stats()
{
"ns" : "profiler.documents",
"count" : 1302957,
"size" : 23063622656,
"avgObjSize" : 17700.985263519826,
"storageSize" : 25188048768,
"numExtents" : 31,
"nindexes" : 2,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 3432652720,
"indexSizes" : {
"_id_" : 42286272,
"mentions.mention_1" : 3390366448
},
"ok" : 1
}
> db.mentions.stats()
{
"ns" : "profiler.mentions",
"count" : 97458884,
"size" : 15299979084,
"avgObjSize" : 156.98906509128506,
"storageSize" : 17891127216,
"numExtents" : 29,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 0,
"totalIndexSize" : 15578411408,
"indexSizes" : {
"_id_" : 3162125232,
"mention_1" : 4742881248,
"url_1" : 7673404928
},
"ok" : 1
}
I would appreciate it if someone could tell me why this is happening. :]
There are 100 million documents in mentions and 1.3 million in documents.
There are an equal number of index entries in both indexes because you told us you store mention both in documents and in mentions.
So index access time should be the same. The way you can measure that is by running a covered index query on both, meaning you only get back values that are stored in the index: db.x.find({url:"xxx"}, {_id:0, "url":1}) says find the matching document and return only the value of url from it. If those two are not equal across the two collections, maybe something is unusual about your set-up, one of the indexes can't fit in RAM, or there is another measurement-related issue.
If those two are the same but fetching full documents is consistently faster from the documents collection, I would check and see why: the complete explain output can show where the time is being spent, and whether more of one collection than the other happens to live in RAM, for example.
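A sketch of that comparison, using the indexes shown in the stats above (one caveat: an index on an array field such as mentions.mention is multikey, and a multikey index generally cannot fully cover a query, so compare millis rather than expecting indexOnly on the documents side):
db.mentions.find({ mention : "Illinois" }, { _id : 0, mention : 1 }).explain()
db.documents.find({ "mentions.mention" : "Illinois" }, { _id : 0, "mentions.mention" : 1 }).explain()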