MongoDB performance slow even when using an index - mongodb

We are trying to build a notification application for our users with MongoDB. We set up a single MongoDB instance on a Xen VM with 10 GB RAM, a 150 GB 15K RPM SAS HDD, and a 4-core 2.9 GHz Intel Xeon.
DB schema:
{
    "_id" : ObjectId("5178c458e4b0e2f3cee77d47"),
    "userId" : NumberLong(1574631),
    "type" : 2,
    "text" : "a user connected to B",
    "status" : 0,
    "createdDate" : ISODate("2013-04-25T05:51:19.995Z"),
    "modifiedDate" : ISODate("2013-04-25T05:51:19.995Z"),
    "metadata" : "{\"INVITEE_NAME\":\"2344\",\"INVITEE\":1232143,\"INVITE_SENDER\":1574476,\"INVITE_SENDER_NAME\":\"123213\"}",
    "opType" : 1,
    "actorId" : NumberLong(1574630),
    "actorName" : "2344"
}
DB stats:
db.stats()
{
    "db" : "UserNotificationDev2",
    "collections" : 3,
    "objects" : 78597973,
    "avgObjSize" : 489.00035699393925,
    "dataSize" : 38434436856,
    "storageSize" : 41501835008,
    "numExtents" : 42,
    "indexes" : 2,
    "indexSize" : 4272393328,
    "fileSize" : 49301946368,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
    },
    "ok" : 1
}
Indexes: userId and _id.
We are trying to select the latest 21 notifications for one user:
db.userNotification.find({ "userId" : 53 }).limit(21).sort({ "_id" : -1 });
But this query is taking too much time:
Fri Apr 26 05:39:55.563 [conn156] query UserNotificationDev2.userNotification query: { query: { userId: 53 }, orderby: { _id: -1 } } cursorid:225321382318166794 ntoreturn:21 ntoskip:0 nscanned:266025 keyUpdates:0 numYields: 2 locks(micros) r:4224498 nreturned:21 reslen:10295 2581ms
Even a count is taking a very long time:
Fri Apr 26 05:47:46.005 [conn159] command UserNotificationDev2.$cmd command: { count: "userNotification", query: { userId: 53 } } ntoreturn:1 keyUpdates:0 numYields: 11 locks(micros) r:9753890 reslen:48 5022ms
Are we doing something wrong in the query?
Please help!
Also, please suggest whether our schema is suitable for storing user notifications. We originally tried embedding notifications, i.e. one document per user with that user's notifications nested inside it, but the document size limit restricted us to only ~50k notifications per user, so we changed to this design.

You are querying by userId and sorting by _id, but no single index covers both. My suggestion is to create an index on { "userId" : 1, "_id" : -1 }. This creates an index tree that starts with userId, then _id, which is almost exactly what your query is doing. This is the simplest and most flexible way of speeding up your query.
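A minimal sketch of that change in the shell (ensureIndex was the index-creation helper in the 2.4 era; createIndex is the modern name), with explain() used to verify the index is actually chosen:
// compound index: equality on userId, then descending _id for the sort
db.userNotification.ensureIndex({ userId: 1, _id: -1 })
// the original query; explain() should now report nscanned close to nreturned
db.userNotification.find({ userId: 53 }).sort({ _id: -1 }).limit(21).explain()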
Another, more memory-efficient approach is to store your userId and timestamp as a string in _id, like _id : "USER_ID:DATETIME". For example:
{_id : "12345:20120501123000"}
{_id : "15897:20120501124000"}
{_id : "15897:20120501125000"}
Notice that _id is a string, not an ObjectId. Then your query above becomes a regex:
db.userNotification.find({ "_id" : /^53:/ }).limit(21).sort({ "_id" : -1 });
As expected, this will return all notifications for userId 53 in descending order. The memory-efficient part is twofold:
You only need one index field. (Indexes compete with data for memory and are often several gigs in size.)
If your queries mostly fetch recent data, a right-balanced index keeps the most frequently used part of the index in memory even when the index as a whole is too large to fit.
Re: count. Count does take time because it has to scan every matching index entry (or every document, if no suitable index applies).
Re: your schema. I'm guessing that for your data set this is the best way to utilize your memory. When objects get large and your queries scan across multiple objects, they need to be loaded into memory in their entirety (I've had the OOM killer take down my mongod instance when I sorted 2000 2 MB objects on a 2 GB RAM machine). With large objects your RAM usage will fluctuate greatly (not to mention they are capped at 16 MB anyway). With your current schema mongo will have a much easier time loading only the data you're querying, resulting in less swapping and more consistent memory usage patterns.

One option is to try sharding: you can then distribute notifications evenly between shards, so when you query you will scan a smaller subset of the data. You need to decide what your shard key will be, however. To me it looks like operationType or userName, but I do not know your data well enough. Another question: why do you sort by _id?
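For reference, enabling sharding on a cluster (via mongos) looks roughly like the following; the shard key below is purely a placeholder, since the right choice depends on your data and query patterns:
sh.enableSharding("UserNotificationDev2")
// placeholder shard key for illustration; pick one that matches your actual access pattern
sh.shardCollection("UserNotificationDev2.userNotification", { userId: 1 })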

I have just tried to replicate your problem. I created 140,000,000 documents in userNotifications.
Without an index on userId I got response times of 3-4 seconds. After I created an index on userId, responses were almost instant.
db.userNotifications.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "test.userNotifications",
        "name" : "id"
    },
    {
        "v" : 1,
        "key" : {
            "userId" : 1
        },
        "ns" : "test.userNotifications",
        "name" : "userId_1"
    }
]
Another thing: while your select is running, is the system constantly writing to the userNotification collection? MongoDB takes a write lock when that happens, which blocks your reads. If that is the case, I would split reads and writes between the primary and a secondary (see replication) and also do some sharding. By the way, what language do you use for your app?

The most important thing is that you currently don't seem to have an index that supports the query for a user's latest notifications.
You need a compound index on userId, _id. This will support queries that filter only by userId, and it will also be used by queries that filter by userId and sort/limit by _id.
When you add the {userId:1, _id:-1} index, don't forget to drop the index on just userId, as it will become redundant; a sketch of both steps follows.
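Assuming the existing single-field index carries the default name userId_1, the two steps would look roughly like this:
db.userNotification.ensureIndex({ userId: 1, _id: -1 })
db.userNotification.dropIndex("userId_1")   // {userId:1} is now a redundant prefix of the compound index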
As far as count() goes, make sure you are using 2.4.3 (the latest version at the time of writing); there were significant improvements in how count() uses indexes, which resulted in much better performance.
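With the compound index in place, a per-user count should also be served largely from the index (the 2.4 count() improvements mentioned above help here), e.g.:
db.userNotification.count({ userId: 53 })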

Related

MongoDB - performance and collection size

I have a question with regard to collection sizes and query performance.
There are 2 DBs: DB1 and DB2. DB1 has 1 collection, and here's the output from stats() on this collection:
{
    …
    "count" : 2085217,
    "size" : 17048734192,
    "avgObjSize" : 8176,
    "capped" : false,
    "nindexes" : 3,
    "indexDetails" : {},
    "totalIndexSize" : 606299456,
    "indexSizes" : {
        "_id_" : 67664576,
        "id_1" : 284165056,
        "id_2" : 254469824
    },
    …
}
A query on this collection, using index id_1, comes back in 0.012 seconds. Here's the output from explain():
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
….
"indexName" : "id_1",
}
In DB2, I have 4 collections, and here’s the output from stats on DB2 –
{
    …
    "collections" : 4,
    "objects" : 152655935,
    "avgObjSize" : 8175.998307514215,
    "dataSize" : 1248114666192,
    "storageSize" : 1257144933456,
    "indexes" : 12,
    "indexSize" : 19757688272,
    "fileSize" : 1283502112768,
    …
}
A query on any collection in DB2, using the index, which I confirmed via explain(), takes at least double the time that it does for the previous query against DB1.
Since MongoDB should scale well, why is there this difference? I read that MongoDB keeps all of its indexes in memory, and since DB2 has a much higher data volume than DB1, is that why it's taking so much longer?
Any insights would be greatly helpful. Thanks.
Edit 1:
Adding more info re. the collection definition, index definitions and the queries executed...
All collections (in both DBs) contain the same fields; only the values and the size of documents differ between them.
And, here's the relevant index -
"1" : {
"v" : 1,
"unique" : true,
"key" : {
"id" : 1
},
"name" : "id_1",
"ns" : "ns.coll1"
}
And this is what the id field looks like:
"_id" : ObjectId("55f9b6548aefbce6b2fa2fac"),
"id" : {
"pid" : {
"f1" : "val1",
"f2" : "val2"
}
},
And, here's a sample query -
db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}})
Edit 2:
Here's some more info on the hard disk & RAM -
$ free -m
              total        used        free      shared  buff/cache   available
Mem:         386758        3750        1947       25283      381060      355675
Swap:        131071        3194      127877
The hard disk is around 3.5T, out of which 2.9T is already used.
Scaling
MongoDB scales very well. The thing is, it is designed to scale horizontally, not vertically. This means that if your DBs are holding a lot of data, you should shard the collections in order to achieve better parallelization.
Benchmark results
Regarding the difference in query time, I don't think your profiling is conclusive. The DBs are possibly on different machines (with different specs). Supposing the hardware is the same, DB2 apparently holds more documents in its collections, and the size of the documents is not the same in both DBs. The same query can return result sets of different sizes, which inevitably has an impact on data serialization and other low-level aspects. Unless you profile the queries in a more controlled setup, I think your results are pretty much expected.
Suggestions
Take care if you are using DBRefs in your documents. It's possible Mongo will automatically dereference them; that means more data to serialize and more overhead.
Try running the same queries with a limit specification. You have defined the index to be unique, but I don't know if that automatically makes Mongo stop the index traversal once it has found a value. Check whether db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}) and db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}).limit(1) run in the same time.
Take a look at indexes on embedded fields versus indexes on embedded documents. Embedded documents seem to impose even more overhead.
Finally, if your document has no embedded documents, only embedded fields (which seems to be the case), then define your index more specifically. Create this index
db.coll1.createIndex({"id.pid.f1": 1, "id.pid.f2": 1}, {unique: true})
and run the query again. If this index doesn't improve performance, then I believe you have done everything properly and it may be time to start sharding.
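One caveat worth checking: an index on dotted fields is generally only used when the query itself uses dotted notation, rather than matching the whole embedded document as in the original query. A sketch of the dotted form, under that assumption:
db.coll1.createIndex({ "id.pid.f1": 1, "id.pid.f2": 1 }, { unique: true })
// dotted-notation query that the index above can serve
db.coll1.find({ "id.pid.f1": "val1", "id.pid.f2": "val2" })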

Index strategy for queries with dynamic match criteria

I have a collection which is going to hold machine data as well as mobile data. The data is captured per channel and is kept at a single level, with no embedded objects. The structure is as follows:
{
    "Id": ObjectId("544e4b0ae4b039d388a2ae3a"),
    "DeviceTypeId": "DeviceType1",
    "DeviceTypeParentId": "Parent1",
    "DeviceId": "D1",
    "ChannelName": "Login",
    "Timestamp": ISODate("2013-07-23T19:44:09Z"),
    "Country": "India",
    "Region": "Maharashtra",
    "City": "Nasik",
    "Latitude": 13.22,
    "Longitude": 56.32,
    // and 10-15 more fields
}
Most of the queries are aggregation queries, used for an analytics dashboard and real-time analysis; the $match stage looks like this:
{$match:{"DeviceTypeId":{"$in":["DeviceType1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
or
{$match:{"DeviceTypeParentId":{"$in":["Parent1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
and many of my DAL-layer find() and findOne() queries are mostly on the criteria DeviceType or DeviceTypeParentId.
The collection is huge and it is growing. I have used compound indexes to support these queries; the indexes are as follows:
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "DB.channel_data"
    },
    {
        "v" : 1,
        "key" : {
            "DeviceType" : 1,
            "Timestamp" : 1
        },
        "name" : "DeviceType_1_Timestamp_1",
        "ns" : "DB.channel_data"
    },
    {
        "v" : 1,
        "key" : {
            "DeviceTypeParentId" : 1,
            "Timestamp" : 1
        },
        "name" : "DeviceTypeParentId_1_Timestamp_1",
        "ns" : "DB.channel_data"
    }
]
Now we are going to add support for match criteria on DeviceId. If I follow the same strategy I used for DeviceType and DeviceTypeParentId, it doesn't feel right: with my current approach I'm creating many indexes, and almost all of them will be similar and huge.
So, is there a better way to do the indexing? I have read a bit about index intersection, but I'm not sure how it would help.
If I am following a wrong approach, please point it out; this is my first project and my first time using MongoDB.
Those indexes all look appropriate for your queries, including the new one you're proposing. Three separate compound indexes supporting your three kinds of queries are the best overall option in terms of query speed. You could put an index on each field and let the planner use index intersection, but it won't be as good as the compound indexes. The indexes are not the same, since they support different queries.
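Assuming the same pattern as the existing two compound indexes, the new one for DeviceId queries would presumably look like this (ensureIndex shown for consistency with the older shell; createIndex is the newer name):
db.channel_data.ensureIndex({ DeviceId: 1, Timestamp: 1 })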
I think the real question is: is the (apparently) large memory footprint of the indexes actually a problem at this point? Are you seeing a lot of page faults because indexes and data are being paged in from disk?

Insert operation became very slow for MongoDB

The client is pymongo.
The program has been running for one week. Inserting data was indeed very fast at first: about 10 million documents per 30 minutes.
But today I found that insert operations have become very, very slow.
There are about 120 million records in the goods collection now.
> db.goods.count()
123535156
And the indexes for the goods collection are as follows:
db.goods.getIndexes();
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "shop.goods",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : {
            "item_id" : 1,
            "updated_at" : -1
        },
        "unique" : true,
        "ns" : "shop.goods",
        "name" : "item_id_1_updated_at_-1"
    },
    {
        "v" : 1,
        "key" : {
            "updated_at" : 1
        },
        "ns" : "shop.goods",
        "name" : "updated_at_1"
    },
    {
        "v" : 1,
        "key" : {
            "item_id" : 1
        },
        "ns" : "shop.goods",
        "name" : "item_id_1"
    }
]
And there is enough RAM and CPU.
Someone told me it is because there are too many records, but didn't tell me how to solve the problem. I was a bit disappointed with MongoDB.
There will be more data to store in the future (about 50 million new records per day). Is there any solution?
I met the same situation on another server (less data this time, about 40 million documents in total); the current insert speed is about 5 records per second.
> db.products.stats()
{
    "ns" : "c2c.products",
    "count" : 42389635,
    "size" : 554721283200,
    "avgObjSize" : 13086.248164203349,
    "storageSize" : 560415723712,
    "numExtents" : 283,
    "nindexes" : 3,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1.0000000000132128,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 4257185968,
    "indexSizes" : {
        "_id_" : 1375325840,
        "product_id_1" : 1687460992,
        "created_at_1" : 1194399136
    },
    "ok" : 1
}
I don't know if it is your problem, but keep in mind that MongoDB has to update every index on each insert. So if you have many indexes and many documents, insert performance can be lower than expected.
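For example, in the goods index list above, item_id_1 is a prefix of the compound item_id_1_updated_at_-1 index, so it is likely redundant and could be dropped to reduce per-insert work (assuming nothing depends on that single-field index specifically):
db.goods.dropIndex("item_id_1")   // queries on item_id alone are still served by the compound index's prefix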
Maybe you can speed up insert operations using sharding. You don't mention it in your question, so I guess you are not using it.
Anyway, could you provide us with more information? You can use db.goods.stats(), db.serverStatus() or other such methods to gather information about the performance of your database.
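A few shell commands commonly used for this kind of diagnosis (illustrative, not exhaustive):
db.goods.stats()            // collection size, index sizes, padding factor
db.serverStatus().mem       // resident vs. virtual/mapped memory
db.currentOp()              // operations currently in flight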
Another possible problem is IO. Depending on your scenario Mongo might be busy trying to grow or allocate storage files for the given namespace (i.e. DB) for the subsequent insert statements. If your test pattern has been add records / delete records / add records / delete records you are likely reusing existing allocated space. If your app is now running longer than before you might be in the situation I described.
Hope this sheds some light on your situation.
I had a very similar problem.
First you need to determine which resource is your bottleneck (CPU, memory, or disk IO). I use several Unix tools (such as top, iotop, etc.) to find the bottleneck. In my case, insertion speed was limited by IO, because mongod was frequently at 99% IO utilization. (Note: my original db used the MMAPv1 storage engine.)
My workaround was to change the storage engine to WiredTiger (either mongodump your original db and then mongorestore it into WiredTiger format, or start a new mongod with the WiredTiger engine and then resync from the other replica set members). My insertion speed returned to normal after doing that.
However, I am still not sure why mongod with MMAPv1 suddenly saturated IO once the data reached a certain size.
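The dump-and-restore path might look roughly like the following; the paths are illustrative, and the new mongod must already be running with the WiredTiger engine before the restore step:
mongodump --out /backup/mmapv1-dump
mongod --dbpath /data/wiredtiger --storageEngine wiredTiger    # start the new instance (separately)
mongorestore /backup/mmapv1-dump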

Unreasonably slow MongoDB query, even though the query is simple and aligned to indexes

I'm running a MongoDB server (that's literally all it has running). The server has 64 GB of RAM and 16 cores, plus 2 TB of hard drive space to work with.
The Document Structure
The database has a collection domains with around 20 million documents. There is a decent amount of data in each document, but for our purposes the document is structured like so:
{
    _id: "abcxyz.com",
    LastUpdated: <date>,
    ...
}
The _id field is the domain name referenced by the document. There is an ascending index on LastUpdated. LastUpdated is updated on hundreds of thousands of records per day; basically, every time new data becomes available for a document, the document is updated and the LastUpdated field is set to the current date/time.
The Query
I have a mechanism that extracts the data from the database so it can be indexed in a Lucene index. The LastUpdated field is the key driver for flagging changes made to a document. In order to search for documents that have been changed and page through those documents, I do the following:
{
    LastUpdated: { $gte: ISODate(<firstdate>), $lt: ISODate(<lastdate>) },
    _id: { $gt: <last_id_from_previous_page> }
}
sort: { _id: 1 }
When no documents are returned, the start and end dates move forward and the _id "anchor" field is reset. This setup is tolerant of documents from previous pages having their LastUpdated value changed, i.e. the paging won't become incorrectly offset by the number of documents in previous pages that are technically no longer in those pages.
The Problem
I would ideally like to select about 25,000 documents at a time, but for some reason the query itself (even when selecting fewer than 500 documents) is extremely slow.
The query I ran was:
db.domains.find({
    "LastUpdated" : {
        "$gte" : ISODate("2011-11-22T15:01:54.851Z"),
        "$lt" : ISODate("2011-11-22T17:39:48.013Z")
    },
    "_id" : { "$gt" : "1300broadband.com" }
}).sort({ _id:1 }).limit(50).explain()
It is so slow in fact that the explain (at the time of writing this) has been running for over 10 minutes and has not yet completed. I will update this question if it ever finishes, but the point of course is that the query is EXTREMELY slow.
What can I do? I don't have the faintest clue what the problem might be with the query.
EDIT
The explain finished after 55 minutes. Here it is:
{
    "cursor" : "BtreeCursor Lastupdated_-1__id_1",
    "nscanned" : 13112,
    "nscannedObjects" : 13100,
    "n" : 50,
    "scanAndOrder" : true,
    "millis" : 3347845,
    "nYields" : 5454,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "LastUpdated" : [
            [
                ISODate("2011-11-22T17:39:48.013Z"),
                ISODate("2011-11-22T15:01:54.851Z")
            ]
        ],
        "_id" : [
            [
                "1300broadband.com",
                {
                }
            ]
        ]
    }
}
I bumped into a very similar problem, and the Indexing Advice and FAQ on MongoDB.org says, quote:
The range query must also be the last column in an index
So if you have the keys a, b and c and run db.ensureIndex({a:1, b:1, c:1}), these are the "guidelines" for using the index as much as possible (a concrete shell rendering of a couple of these patterns follows the lists):
Good:
find(a=1,b>2)
find(a>1 and a<10)
find(a>1 and a<10).sort(a)
Bad:
find(a>1, b=2)
Only use a range query OR sort on one column.
Good:
find(a=1,b=2).sort(c)
find(a=1,b>2)
find(a=1,b>2 and b<4)
find(a=1,b>2).sort(b)
Bad:
find(a>1,b>2)
find(a=1,b>2).sort(c)
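In actual shell syntax, and assuming the {a:1, b:1, c:1} index above (coll is a placeholder collection name), one good pattern and one bad pattern would look roughly like this:
db.coll.find({ a: 1, b: 2 }).sort({ c: 1 })             // good: equality on a and b, sort on the next indexed field
db.coll.find({ a: 1, b: { $gt: 2 } }).sort({ c: 1 })    // bad: range on b plus sort on c forces an in-memory sort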
Hope it helps!
/J
OK, I solved it. The culprit was "scanAndOrder": true, which indicated that the index wasn't being used as intended. The correct composite index has the primary sort field first and then the fields being queried on:
{ "_id":1, "LastUpdated":1 }
Have you tried adding _id to your composite index? As you're using it as part of the query, won't it otherwise still have to do a full scan?

MongoDB - too much data for sort() with no index error

I am using MongoDB 1.6.3 to store a big collection (300k+ records). I added a composite index:
db['collection_name'].getIndexes()
[
    {
        "name" : "_id_",
        "ns" : "db_name.event_logs",
        "key" : {
            "_id" : 1
        }
    },
    {
        "key" : {
            "updated_at.t" : -1,
            "community_id" : 1
        },
        "ns" : "db_name.event_logs",
        "background" : true,
        "name" : "updated_at.t_-1_community_id_1"
    }
]
However, when I try to run this code:
db['collection_name']
.find({:community_id => 1})
.sort(['updated_at.t', -1])
.skip(#skip)
.limit(#limit)
I am getting:
Mongo::OperationFailure (too much data for sort() with no index. add an index or specify a smaller limit)
What am I doing wrong?
Try adding a {community_id: 1, 'updated_at.t': -1} index. MongoDB needs to match on community_id first and then sort; see the sketch below.
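A rough shell version of that index plus the equivalent query (the skip/limit values are placeholders):
db.event_logs.ensureIndex({ community_id: 1, "updated_at.t": -1 })
db.event_logs.find({ community_id: 1 }).sort({ "updated_at.t": -1 }).skip(100).limit(20)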
So it "feels" like you're using the index, but the index is actually a composite index. I'm not sure that the sort is "smart enough" to use only the partial index.
So two problems:
Based on your query, I would put community_id as the first part of the index, not the second. updated_at.t sounds like a field on which you'll do range queries. Indexes work better if the range query is the second bit.
How many entries are going to come back from community_id => 1? If the number is not big, you may be able to get away with just sorting without an index.
So you may have to switch the index around and you may have to change the sort to use both community_id and updated_at.t. I know it seems redundant, but start there and check the Google Groups if it's still not working.
Even with an index, I think you can still get that error if your result set exceeds 4MB.
You can see the size by going into the mongodb console and doing this:
show dbs
# pick yours (e.g., production)
use db-production
db.articles.stats()
I ended up with results like this:
{
    "ns" : "mdalert-production.encounters",
    "count" : 89077,
    "size" : 62974416,
    "avgObjSize" : 706.9660630690302,
    "storageSize" : 85170176,
    "numExtents" : 8,
    "nindexes" : 6,
    "lastExtentSize" : 25819648,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 18808832,
    "indexSizes" : {
        "_id_" : 3719168,
        "patient_num_1" : 3440640,
        "msg_timestamp_1" : 2981888,
        "practice_id_1" : 2342912,
        "patient_id_1" : 3342336,
        "msg_timestamp_-1" : 2981888
    },
    "ok" : 1
}
Having a cursor batch size that is too large can also cause this error. Setting the batch size does not limit the amount of data you can process; it just limits how much data is brought back from the database per round trip. When you iterate through the results and hit the batch limit, the process makes another trip to the database. A sketch of setting it follows.
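In the shell, the batch size can be set per cursor like this (the value is illustrative):
db.event_logs.find({ community_id: 1 }).sort({ "updated_at.t": -1 }).batchSize(500)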