I'm running a MongoDB query and it's taking too long. I'm querying the collection "play_sessions" for the data of 9 users, as seen in (1). My documents each contain the data for one gameplay session for a user, as seen in (2). I have an index on "user_id", and this index is being used, as seen in the .explain() output in (3). My indexes in the .stats() output are shown in (4).
The MongoDB version is 2.6.1. There are approximately 4 million entries in "play_sessions" and 43,000 distinct users.
This example query takes around 2 minutes and the actual query of 800 users takes a lot longer. I'd like to know why this query is slow and what I can do to speed it up.
(1) The query:
db.play_sessions.find({user_id: {$in: users}}, {play_data: -1})
(2) Example document:
{
"_id" : 1903200,
"score" : 1,
"user_id" : 60538,
"time" : ISODate("2014-02-12T03:49:59.919Z"),
"level" : 1,
"user_attempt_no" : 2,
"game_id" : 181,
"play_data" : [
**Some JSON in here**
],
"time_sec" : 7.989
}
(3) .explain() output:
{
"cursor" : "BtreeCursor user_id_1",
"isMultiKey" : false,
"n" : 13724,
"nscannedObjects" : 13724,
"nscanned" : 13732,
"nscannedObjectsAllPlans" : 14128,
"nscannedAllPlans" : 14140,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 4463,
"nChunkSkips" : 0,
"millis" : 123631,
"indexBounds" : {
"user_id" : [
[
41930,
41930
],
...,
[
67112,
67112
]
]
},
"server" : "...",
"filterSet" : false
}
(4) .stats() output for the collection:
{
"ns" : "XXX.play_sessions",
"count" : 3957328,
"size" : 318453446112,
"avgObjSize" : 80471,
"storageSize" : 319917328096,
"numExtents" : 169,
"nindexes" : 10,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 1,
"totalIndexSize" : 1962280880,
"indexSizes" : {
"_id_" : 184205280,
"game_id_1" : 167681584,
"user_id_1" : 113997968,
"user_id_1_game_id_1_level_1_time_1" : 288972544,
"game_id_1_level_1" : 141027824,
"game_id_1_level_1_user_id_1_time_1" : 301645344,
"user_id_1_game_id_1_level_1" : 228674544,
"game_id_1_level_1_user_id_1" : 245549808,
"user_id_1_user_attempt_no_1" : 135958704,
"user_id_1_time_1" : 154567280
},
"ok" : 1
}
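A side note on the query in (1), assuming the intent of the projection was to exclude the large play_data array rather than include it: in a MongoDB projection a value of 0 excludes a field, so the exclusion form would be (a sketch of that assumption, not the original query):
// Exclude the big play_data array from each returned session
db.play_sessions.find({user_id: {$in: users}}, {play_data: 0})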
I have a large set of documents like this:
{
"_id" : ObjectId("572e33baf082e29c46cadb7b"),
"nodes" : [
[
39.2035598754883,
51.6601028442383
],
[
39.2038421630859,
51.6602439880371
],
[
39.2038688659668,
51.6602249145508
]
]
}
And I want to search documents by coordinates. My query is:
db.points.find({nodes: {$elemMatch: {$geoWithin:
{$box: [[39.1981, 51.660], [39.206, 51.664]]}}}})
I also added an index
db.points.createIndex( { "nodes": "2d" } )
but it has no effect. Collection stats:
{
"ns" : "base.points",
"count" : 215583,
"size" : 61338720,
"avgObjSize" : 284,
"storageSize" : 86310912,
"numExtents" : 10,
"nindexes" : 3,
"lastExtentSize" : 27869184,
"paddingFactor" : 1.0,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 39759888,
"indexSizes" : {
"_id_" : 7006832,
"nodes_2d" : 26719168
},
"ok" : 1.0
}
And explain:
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 61,
"nscannedObjects" : 215583,
"nscanned" : 215583,
"nscannedObjectsAllPlans" : 215583,
"nscannedAllPlans" : 215583,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1684,
"nChunkSkips" : 0,
"millis" : 1466,
"server" : "DEVELOP-PC:27017",
"filterSet" : false
}
The solution was very simple: $elemMatch should be removed from the query, and then the index works correctly:
db.points.find({nodes: {$geoWithin: {$box: [[39.1981, 51.660],
[39.206, 51.664]]}}})
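To verify, re-running explain after removing $elemMatch should no longer show BasicCursor (a sketch; the exact geo cursor name varies by MongoDB version):
// Expect a geo cursor instead of BasicCursor, with nscanned far below 215583
db.points.find({nodes: {$geoWithin: {$box: [[39.1981, 51.660], [39.206, 51.664]]}}}).explain()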
I have a collection of tweets, with indexes on userid and tweeted_at (date). I want to find the dates of the oldest and newest tweets in the collection for a user, but the query runs very slowly.
I used explain, and here's what I got. I tried reading the documentation for explain, but I don't understand what is going on here. Is the explain just on the sort? If so, why does it take so long when it's using the index?
> db.tweets.find({userid:50263}).sort({tweeted_at:-1}).limit(1).explain(1)
{
"cursor" : "BtreeCursor tweeted_at_1 reverse",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 12705,
"nscanned" : 12705,
"nscannedObjectsAllPlans" : 12705,
"nscannedAllPlans" : 12705,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 188,
"nChunkSkips" : 0,
"millis" : 7720,
"indexBounds" : {
"tweeted_at" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
},
"allPlans" : [
{
"cursor" : "BtreeCursor tweeted_at_1 reverse",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 12705,
"nscanned" : 12705,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tweeted_at" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
}
}
],
"server" : "adams-server:27017",
"filterSet" : false,
"stats" : {
"type" : "LIMIT",
"works" : 12807,
"yields" : 188,
"unyields" : 188,
"invalidates" : 0,
"advanced" : 0,
"needTime" : 12705,
"needFetch" : 101,
"isEOF" : 1,
"children" : [
{
"type" : "FETCH",
"works" : 12807,
"yields" : 188,
"unyields" : 188,
"invalidates" : 0,
"advanced" : 0,
"needTime" : 12705,
"needFetch" : 101,
"isEOF" : 1,
"alreadyHasObj" : 0,
"forcedFetches" : 0,
"matchTested" : 0,
"children" : [
{
"type" : "IXSCAN",
"works" : 12705,
"yields" : 188,
"unyields" : 188,
"invalidates" : 0,
"advanced" : 12705,
"needTime" : 0,
"needFetch" : 0,
"isEOF" : 1,
"keyPattern" : "{ tweeted_at: 1.
0 }",
"boundsVerbose" : "field #0['twe
eted_at']: [MaxKey, MinKey]",
"isMultiKey" : 0,
"yieldMovedCursor" : 0,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0,
"keysExamined" : 12705,
"children" : [ ]
}
]
}
]
}
}
>
> db.tweets.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "honeypot.tweets"
},
{
"v" : 1,
"unique" : true,
"key" : {
"tweet_id" : 1
},
"name" : "tweet_id_1",
"ns" : "honeypot.tweets",
"dropDups" : true
},
{
"v" : 1,
"key" : {
"tweeted_at" : 1
},
"name" : "tweeted_at_1",
"ns" : "honeypot.tweets"
},
{
"v" : 1,
"key" : {
"keywords" : 1
},
"name" : "keywords_1",
"ns" : "honeypot.tweets"
},
{
"v" : 1,
"key" : {
"user_id" : 1
},
"name" : "user_id_1",
"ns" : "honeypot.tweets"
}
]
>
By looking at the cursor field you can see which index was used:
"cursor" : "BtreeCursor tweeted_at_1 reverse",
BtreeCursor indicates that the query used an index; tweeted_at_1 is the name of the index that was used, and reverse means it was traversed in reverse order.
You should check the documentation for explain for a detailed description of each field.
Your query took 7720 ms (millis) and scanned 12705 index entries (nscanned).
The query is slow because MongoDB walked the entire tweeted_at_1 index in reverse without ever finding a matching document (n is 0): it didn't use your index for filtering, only for sorting the data.
To create an index that can be used for both filtering and sorting, you should create a compound index. A compound index is a single index structure that references multiple fields and can contain up to 31 fields. You can create one like this (the order of the fields is important):
db.tweets.ensureIndex({userid: 1, tweeted_at: -1});
This index will be used for searching on the userid field and for sorting by the tweeted_at field.
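With that index in place, the same query should be answered from a handful of index entries instead of a full index scan. A sketch of what to look for (exact numbers depend on your data):
// Re-run with explain; expect cursor "BtreeCursor userid_1_tweeted_at_-1",
// nscanned close to n, and millis dropping sharply
db.tweets.find({userid: 50263}).sort({tweeted_at: -1}).limit(1).explain()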
You can read more and see further examples of using indexes for sorting in the MongoDB documentation.
Edit
If you have other indexes, MongoDB may be choosing one of them instead. When you're testing query performance you can use hint to force a specific index.
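For example, a sketch using the compound index suggested above:
// Force the planner to use a specific index while testing
db.tweets.find({userid: 50263}).sort({tweeted_at: -1}).limit(1).hint({userid: 1, tweeted_at: -1}).explain()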
When testing the performance of your queries you should always run multiple tests and average the results.
Also, if your queries are slow even when using indexes, I would check whether you have enough memory on the server. Loading data from disk is orders of magnitude slower than loading it from memory. You should always ensure that you have enough RAM so that all of your data and indexes fit in memory.
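A quick sanity check from the shell (a sketch; what counts as "enough" depends on your working set):
// Total index size for the collection, in bytes
db.tweets.stats().totalIndexSize
// Resident and virtual memory of the mongod process, in MB
db.serverStatus().mem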
Looks like you need to create an index on tweeted_at and userid.
db.tweets.ensureIndex({'tweeted_at':1, 'userid':1})
That should make the query very quick indeed (but at the cost of storage space and insert time).
I am having a hard time with something that is supposed to be trivial...
I have the following profile document structure:
{
pid:"profileId",
loc : {
"lat" : 32.082156661684621,
"lon" : 34.813229013156551,
"locTime" : NumberLong(0)
},
age: 29
}
A common use-case in my app is to retrieve nearby profiles filtered by age.
{ "loc" : { "$near" : [ 32.08290052711715 , 34.80888522811172] , "$maxDistance" : 179.98560115190784}, "age" : { "$gte" : 0 , "$lte" : 33}}
So I have created the following compound index:
{ 'loc': "2d", age: 1 }
And no matter what I do, I can't make the query run with the created index (I also tried hint).
This is the generated explain for the query:
{
"cursor" : "GeoSearchCursor" ,
"isMultiKey" : false ,
"n" : 4 ,
"nscannedObjects" : 4 ,
"nscanned" : 4 ,
"nscannedObjectsAllPlans" : 4 ,
"nscannedAllPlans" : 4 ,
"scanAndOrder" : false ,
"indexOnly" : false ,
"nYields" : 0 ,
"nChunkSkips" : 0 ,
"millis" : 0 ,
"indexBounds" : { } ,
"allPlans" : [ { "cursor" : "GeoSearchCursor" , "n" : 4 , "nscannedObjects" : 4 , "nscanned" : 4 , "indexBounds" : { }
}
I am using MongoDB version 2.4.4.
What am I doing wrong? Your answer is highly appreciated.
The explain output says "cursor" : "GeoSearchCursor". This indicates your query used a geospatial index.
See the following for details:
http://docs.mongodb.org/manual/reference/method/cursor.explain/
2d indexes support a compound index with only one additional field, as a suffix of the 2d index field.
http://docs.mongodb.org/manual/applications/geospatial-indexes
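In other words, an index like the one below is valid, while a 2d index followed by two or more extra fields is not (a minimal sketch):
// Valid: the 2d field plus exactly one trailing non-geo field
db.profile.ensureIndex({ loc: "2d", age: 1 })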
As @stennie mentioned in a comment on your question, the problem might be the ordering of the coordinates. They should be ordered long, lat. If that doesn't work, try storing loc as an array with long as the first element and lat as the second.
Here is a worked example:
I created three profile objects with the location as an array and locTime separate from loc.
> db.profile.find()
{ "_id" : ObjectId("52cd54f1c43bb3a468b9fd0d"), "loc" : [ -6, 50 ], "age" : 29, "pid" : "001", "locTime" : NumberLong(0) }
{ "_id" : ObjectId("52cd5507c43bb3a468b9fd0f"), "loc" : [ -6, 53 ], "age" : 30, "pid" : "002", "locTime" : NumberLong(1) }
{ "_id" : ObjectId("52cd5515c43bb3a468b9fd10"), "loc" : [ -1, 51 ], "age" : 31, "pid" : "003", "loctime" : NumberLong(2) }
Finding using large distance and age
> db.profile.find({ "loc" : { "$near" : [ -1, 50] , "$maxDistance" : 5}, "age" : { "$gte" : 0 , "$lte" : 33}})
{ "_id" : ObjectId("52cd5515c43bb3a468b9fd10"), "loc" : [ -1, 51 ], "age" : 31, "pid" : "003", "loctime" : NumberLong(2) }
{ "_id" : ObjectId("52cd54f1c43bb3a468b9fd0d"), "loc" : [ -6, 50 ], "age" : 29, "pid" : "001", "locTime" : NumberLong(0) }
The explain shows the index is being used:
> db.profile.find({ "loc" : { "$near" : [ -1, 50] , "$maxDistance" : 5}, "age" : { "$gte" : 0 , "$lte" : 33}}).explain()
{
"cursor" : "GeoSearchCursor",
"isMultiKey" : false,
"n" : 2,
"nscannedObjects" : 2,
"nscanned" : 2,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
},
}
Narrow the distance with the same age range
> db.profile.find({ "loc" : { "$near" : [ -1, 50] , "$maxDistance" : 1}, "age" : { "$gte" : 0 , "$lte" : 33}})
Here is the explain, again the index is used:
> db.profile.find({ "loc" : { "$near" : [ -1, 50] , "$maxDistance" : 1}, "age" : { "$gte" : 0 , "$lte" : 33}}).explain()
{
"cursor" : "GeoSearchCursor",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
},
}
Here are the indexes:
> db.profile.getIndices()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "test.profile",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"loc" : "2d",
"age" : 1
},
"ns" : "test.profile",
"name" : "loc_2d_age_1"
}
]
I store a very long string in my documents on a MongoDB server; a document looks like
{
"_id": ObjectId("5280efdbe4b062c93b582118"),
"viewID": 1,
"content": "a very long string"
}
and there are about a hundred thousand such documents in the same collection. I have indexed the collection.
The problem is that when I execute a very simple query
db.blog.find({viewID:1});
I then have to wait a very long time (over several hours) for the response, with no error message displayed, and the server status is OK. But if I just run db.blog.find({viewID:1}).limit(1); instead, MongoDB returns the result at once.
What can I do to solve the problem or improve the performance?
Here is my explain:
db.blog.find({viewID:1}).explain();
{
"cursor" : "BtreeCursor viewID_1",
"isMultiKey" : false,
"n" : 377,
"nscannedObjects" : 377,
"nscanned" : 377,
"nscannedObjectsAllPlans" : 377,
"nscannedAllPlans" : 377,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 45,
"indexBounds" : {
"viewID" : [
[
1,
1
]
]
},
"server" : "Server:27017"
}
Here are the collection stats:
db.blog.stats();
{
"ns" : "***.blog",
"count" : 98582,
"size" : 2330759632,
"avgObjSize" : 23642.851960804204,
"storageSize" : 2958364672,
"numExtents" : 18,
"nindexes" : 2,
"lastExtentSize" : 778276864,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 6540800,
"indexSizes" : {
"_id_" : 4055296,
"viewID_1" : 2485504
},
"ok" : 1
}
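One thing that might help, assuming the caller doesn't always need the long string itself (my assumption, not something stated in the thread): project the content field out, so only the small fields cross the wire.
// Return the matching documents without the very long "content" string
db.blog.find({viewID: 1}, {content: 0})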
So this is what my database looks like:
> show dbs
admin 0.203125GB
local 0.078125GB
profiler 63.9228515625GB
> use profiler
switched to db profiler
> show collections
documents
mentions
A document in mentions is like this:
> db.mentions.findOne()
{
"_id" : ObjectId("51ec29ef1b63042f6a9c6fd2"),
"corpusID" : "GIGAWORD",
"docID" : "WPB_ENG_20100226.0044",
"url" : "http://en.wikipedia.org/wiki/Taboo",
"mention" : "taboos",
"offset" : 4526
}
A document in documents looks like this:
> db.documents.findOne()
{
"_id" : ObjectId("51ec2d981b63042f6ae4ca0b"),
"sentence_offsets" : [
..................
],
"docID" : "WPB_ENG_20101020.0002",
"text" : ".........",
"verb_offsets" : [
.............
],
"mentions" : [
{
"url" : "http://en.wikipedia.org/wiki/Washington,_D.C.",
"mention" : "Washington",
"ner" : "ORG",
"offset" : 122
},
...................
],
"corpusID" : "GIGAWORD",
"chunk_offsets" : [
.................
]
}
There are 100 million documents in mentions and 1.3 million in documents. Each mention appearing in mentions should also appear exactly once in some document's mentions array. The reason I store the mention info in documents as well was to avoid having to go into mentions to retrieve context. Yet for queries that touch only mentions, I thought it would be faster to have the independent collection, mentions.
However, after I created indexes on both mentions.url/mentions.mention and documents.mentions.url/documents.mentions.mention and queried the same url/mention in both collections, I found it was twice as fast to get a response from the documents collection as from the mentions collection.
I am not sure how the index works internally, but I assume both indexes have the same size, since there is an equal number of mentions in both collections. Shouldn't they then have the same response time?
I was trying something like
> db.mentions.find({url: "http://en.wikipedia.org/wiki/Washington,_D.C."}).explain()
so there shouldn't be any difference in network overhead.
Here is the output of
> db.mentions.find({mention: "Illinois"}).explain()
{
"cursor" : "BtreeCursor mention_1",
"isMultiKey" : false,
"n" : 4342,
"nscannedObjects" : 4342,
"nscanned" : 4342,
"nscannedObjectsAllPlans" : 4342,
"nscannedAllPlans" : 4342,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 14,
"nChunkSkips" : 0,
"millis" : 18627,
"indexBounds" : {
"mention" : [
[
"Illinois",
"Illinois"
]
]
},
"server" : "----:----"
}
and that of
> db.documents.find({"mentions.mention": "Illinois"}).explain()
{
"cursor" : "BtreeCursor mentions.mention_1",
"isMultiKey" : true,
"n" : 3102,
"nscannedObjects" : 3102,
"nscanned" : 3102,
"nscannedObjectsAllPlans" : 3102,
"nscannedAllPlans" : 3102,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 8,
"nChunkSkips" : 0,
"millis" : 7862,
"indexBounds" : {
"mentions.mention" : [
[
"Illinois",
"Illinois"
]
]
},
"server" : "----:----"
}
And the stats (yeah, I restored the collection and haven't indexed documents.url yet):
> db.documents.stats()
{
"ns" : "profiler.documents",
"count" : 1302957,
"size" : 23063622656,
"avgObjSize" : 17700.985263519826,
"storageSize" : 25188048768,
"numExtents" : 31,
"nindexes" : 2,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 3432652720,
"indexSizes" : {
"_id_" : 42286272,
"mentions.mention_1" : 3390366448
},
"ok" : 1
}
> db.mentions.stats()
{
"ns" : "profiler.mentions",
"count" : 97458884,
"size" : 15299979084,
"avgObjSize" : 156.98906509128506,
"storageSize" : 17891127216,
"numExtents" : 29,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 0,
"totalIndexSize" : 15578411408,
"indexSizes" : {
"_id_" : 3162125232,
"mention_1" : 4742881248,
"url_1" : 7673404928
},
"ok" : 1
}
I would appreciate it if someone could tell me why this is happening. :]
"There are 100 million documents in mentions and 1.3 million in documents."
There is an equal number of index entries in both indexes, because you told us you store each mention both in documents and in mentions.
So index access time is the same. The way you can measure that is by running a covered index query against both, meaning you ask only for values that are stored in the index itself: db.x.find({url: "xxx"}, {_id: 0, url: 1}) says "find the matching documents and return only the value of url from each". If those two are not equal across the two collections, maybe something is unusual about your setup, one of the indexes doesn't fit in RAM, or there is a measurement-related issue.
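For these two collections the covered queries would look like the sketch below. One caveat worth flagging: a multikey index such as the one on documents.mentions.mention generally cannot fully cover a query on the array field, so the cleanest covered-query comparison is on the standalone mentions collection.
// Covered query: only the indexed field is returned; explain should show indexOnly: true
db.mentions.find({mention: "Illinois"}, {_id: 0, mention: 1}).explain()
// The multikey counterpart will likely still fetch documents (indexOnly: false)
db.documents.find({"mentions.mention": "Illinois"}, {_id: 0, "mentions.mention": 1}).explain()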
If those two are the same, but fetching the documents is consistently faster on the documents collection, I would check and see why: the complete explain output can show where the time is being spent, and whether more of one collection than the other happens to live in RAM, for example.
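For example, the verbose form of explain gives the fuller picture (a sketch):
// explain(true) in the 2.x shell adds allPlans and, on 2.6, per-stage stats
db.mentions.find({mention: "Illinois"}).explain(true)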