MongoDB - performance and collection size

I have a question regarding collection sizes and query performance.
There are two DBs, DB1 and DB2. DB1 has one collection; here's the output from stats() on that collection:
{
…
"count" : 2085217,
"size" : 17048734192,
"avgObjSize" : 8176,
"capped" : false,
"nindexes" : 3,
"indexDetails" : {},
"totalIndexSize" : 606299456,
"indexSizes" : {
"_id_" : 67664576,
"id_1" : 284165056,
"id_2" : 254469824
},
…
}
A query on this collection, using the index id_1, comes back in 0.012 seconds. Here's the output from explain():
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
….
"indexName" : "id_1",
}
In DB2, I have 4 collections, and here’s the output from stats on DB2 –
{
…
"collections" : 4,
"objects" : 152655935,
"avgObjSize" : 8175.998307514215,
"dataSize" : 1248114666192,
"storageSize" : 1257144933456,
"indexes" : 12,
"indexSize" : 19757688272,
"fileSize" : 1283502112768,
…
}
A query on any collection in DB2, using an index (which I confirmed via explain()), takes at least double the time of the previous query against DB1.
Since Mongo should scale well, why is there this difference? I read that MongoDB loads all indexes into memory; since DB2 holds a higher volume than DB1, is that why it's taking so much longer?
Any insights would be greatly appreciated. Thanks.
Edit 1:
Adding more info about the collection definition, index definitions, and queries executed...
All collections (in both DBs) contain the same fields; only the values and the size of documents differ between them.
And, here's the relevant index -
"1" : {
"v" : 1,
"unique" : true,
"key" : {
"id" : 1
},
"name" : "id_1",
"ns" : "ns.coll1"
}
And this is what the id field looks like:
"_id" : ObjectId("55f9b6548aefbce6b2fa2fac"),
"id" : {
"pid" : {
"f1" : "val1",
"f2" : "val2"
}
},
And, here's a sample query -
db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}})
Edit 2:
Here's some more info on the hard disk & RAM -
$ free -m
total used free shared buff/cache available
Mem: 386758 3750 1947 25283 381060 355675
Swap: 131071 3194 127877
The hard disk is around 3.5T, out of which 2.9T is already used.

Scaling
MongoDB scales very well. The thing is, it is designed to scale horizontally, not vertically. This means that if your DBs are holding a lot of data, you should shard the collections in order to achieve better parallelization.
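As a minimal sketch of what sharding looks like in the mongo shell (the database and collection names are placeholders, and the shard key is purely hypothetical; a good key needs high cardinality and even value distribution):

```javascript
// Run against a mongos, not a standalone mongod
sh.enableSharding("DB2")                               // mark the database as sharded
sh.shardCollection("DB2.coll1", { "id.pid.f1": 1 })    // hypothetical shard key
```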
Benchmark results
Regarding the difference in query time, I don't think your profiling is conclusive. The DBs are possibly on different machines (with different specs). Even supposing the hardware is the same, DB2 apparently holds more documents in its collections, and the sizes of the documents are not the same in both DBs. The same query can return result sets of different sizes, and that will inevitably have an impact on data serialization and other low-level aspects. Unless you profile the queries in a more controlled setup, I think your results are pretty much expected.
Suggestions
Take care if you are using DBRef in your documents. It's possible Mongo will automatically dereference them; that means more data to serialize, and more overhead.
Try running the same queries with a limit specification. You have defined the index to be unique, but I don't know if that automatically makes Mongo stop the index traversal once it has found a value. Check whether db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}) and db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}).limit(1) take the same time.
Take a look at indexes on embedded fields and indexes on embedded documents. Embedded documents seem to incur extra overhead.
Finally, if your document has no embedded documents, only embedded fields (which seems to be the case), then define your index more specifically. Create this index
db.coll1.createIndex({"id.pid.f1": 1, "id.pid.f2": 1}, {unique: true})
and run the query again. If this index doesn't improve performance, then I believe you have done everything properly and it may be time to start sharding.
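One caveat worth adding, assuming standard MongoDB matching semantics: an exact-subdocument query like the original one is not served by an index on the dotted fields; the query itself must use dot notation for that index to be eligible. A mongo shell sketch:

```javascript
// Exact subdocument equality: order-sensitive, served by the existing index on the whole "id" field
db.coll1.find({ id: { pid: { f1: "val1", f2: "val2" } } })

// Dot notation: order-insensitive, and eligible to use the suggested
// {"id.pid.f1": 1, "id.pid.f2": 1} compound index
db.coll1.find({ "id.pid.f1": "val1", "id.pid.f2": "val2" })

// explain() will confirm which index the planner actually picked
db.coll1.find({ "id.pid.f1": "val1", "id.pid.f2": "val2" }).explain("executionStats")
```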

Related

MongoDB extra Collections

I have a db called index with only one collection, named student.
When I run the query db.student.find({}).count()
it shows 1000000 docs in it.
But when I use db.stats(), it shows a result like this:
{
"db" : "index",
"collections" : 3,
"objects" : 1000004,
"avgObjSize" : 59.95997216011136,
"dataSize" : 59960212,
"storageSize" : 87420928,
"numExtents" : 14,
"indexes" : 1,
"indexSize" : 32458720,
"fileSize" : 520093696,
"nsSizeMB" : 16,
"ok" : 1
}
Why 3 collections?
Why 1000004 objects, which is 4 more than expected?
And finally I ran db.getCollectionNames();
it shows [ "student", "system.indexes" ].
What is system.indexes?
Could anybody please elaborate on this?
I am new to the world of Mongo.
The mysterious 2 collections
There are two collections created when a user stores data in a database for the first time or a database is created explicitly.
The first one, system.indexes holds the information about the indices defined in the various collections of the database. You can even access it using
db.system.indexes.find()
The hidden one, system.namespaces, holds some metadata about the database: actually, the names of all existing entities from the point of view of the database management.
Although it is not shown, you can still access it:
db.system.namespaces.find()
Warning: Don't fiddle with either of them. Your database may well become unusable. You have been warned!
There can be even more than those two. Read System Collections in the MongoDB docs for details.
The mysterious 4 objects
Actually, if you have tried to access the system collections as shown above, this one becomes very easy. In a database called foobardb with a collection foo and the default index on _id, querying system.indexes will give a result like this (prettified):
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "foobardb.foo"
}
Note that this is a single document. The prettified output of the second query looks like this:
{ "name" : "foobardb.foo" }
{ "name" : "foobardb.system.indexes" }
{ "name" : "foobardb.foo.$_id_" }
Here we have three documents. Together with the single index document above, that makes 4 additional documents of metadata.

Insert operation became very slow for MongoDB

The client is pymongo.
The program has been running for one week. Inserting data was indeed very fast before: about 10 million records per 30 minutes.
But today I found that the insert operation has become very, very slow.
There are about 120 million records in the goods collection now.
> db.goods.count()
123535156
And the indexes for the goods collection are as follows:
db.goods.getIndexes();
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "shop.goods",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"item_id" : 1,
"updated_at" : -1
},
"unique" : true,
"ns" : "shop.goods",
"name" : "item_id_1_updated_at_-1"
},
{
"v" : 1,
"key" : {
"updated_at" : 1
},
"ns" : "shop.goods",
"name" : "updated_at_1"
},
{
"v" : 1,
"key" : {
"item_id" : 1
},
"ns" : "shop.goods",
"name" : "item_id_1"
}
]
And there is enough RAM and CPU.
Someone told me it is because there are too many records, but didn't tell me how to solve the problem. I was a bit disappointed with MongoDB.
There will be more data to store in the future (about 50 million new records per day). Is there any solution?
I met the same situation on another server (less data this time, about 40 million in total); the current insert speed is about 5 records per second.
> db.products.stats()
{
"ns" : "c2c.products",
"count" : 42389635,
"size" : 554721283200,
"avgObjSize" : 13086.248164203349,
"storageSize" : 560415723712,
"numExtents" : 283,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1.0000000000132128,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 4257185968,
"indexSizes" : {
"_id_" : 1375325840,
"product_id_1" : 1687460992,
"created_at_1" : 1194399136
},
"ok" : 1
}
I don't know if it is your problem, but keep in mind that MongoDB has to update every index on each insert. So if you have many indexes and many documents, performance can be lower than expected.
Maybe you can speed up insert operations using sharding. You don't mention it in your question, so I guess you are not using it.
Anyway, could you provide us more information? You can use db.goods.stats(), db.serverStatus(), or other such methods to gather information about the performance of your database.
Another possible problem is IO. Depending on your scenario, Mongo might be busy trying to grow or allocate storage files for the given namespace (i.e., DB) for subsequent insert statements. If your test pattern has been add records / delete records / add records / delete records, you are likely reusing existing allocated space. If your app is now running longer than before, you might be in the situation I described.
Hope this sheds some light on your situation.
I had a very similar problem.
First you need to determine your bottleneck (CPU, memory, or disk IO). I use several Unix tools (such as top, iotop, etc.) to detect the bottleneck. In my case I found insertion speed was limited by IO speed, because mongod often sat at 99% IO usage. (Note: my original db used the mmapv1 storage engine.)
My workaround was to change the storage engine to WiredTiger, either by mongodump-ing the original db and mongorestore-ing it into WiredTiger format, or by starting a new mongod with the WiredTiger engine and then resyncing from the other replica set members. My insertion speed went back to normal after doing that.
However, I am still not sure why mongod with mmapv1 suddenly saturated IO once the data size reached a certain point.

MongoDb performance slow even using index

We are trying to build a notification application for our users with Mongo. We created one MongoDB instance on a Xen VM with 10 GB RAM, a 150 GB 15K RPM SAS HDD, and a 4-core 2.9 GHz Intel Xeon.
DB schema :-
{
"_id" : ObjectId("5178c458e4b0e2f3cee77d47"),
"userId" : NumberLong(1574631),
"type" : 2,
"text" : "a user connected to B",
"status" : 0,
"createdDate" : ISODate("2013-04-25T05:51:19.995Z"),
"modifiedDate" : ISODate("2013-04-25T05:51:19.995Z"),
"metadata" : "{\"INVITEE_NAME\":\"2344\",\"INVITEE\":1232143,\"INVITE_SENDER\":1574476,\"INVITE_SENDER_NAME\":\"123213\"}",
"opType" : 1,
"actorId" : NumberLong(1574630),
"actorName" : "2344"
}
DB stats :-
db.stats()
{
"db" : "UserNotificationDev2",
"collections" : 3,
"objects" : 78597973,
"avgObjSize" : 489.00035699393925,
"dataSize" : 38434436856,
"storageSize" : 41501835008,
"numExtents" : 42,
"indexes" : 2,
"indexSize" : 4272393328,
"fileSize" : 49301946368,
"nsSizeMB" : 16,
"dataFileVersion" : {
"major" : 4,
"minor" : 5
},
"ok" : 1
}
Indexes: userId and _id
We are trying to select the latest 21 notifications for one user:
db.userNotification.find({ "userId" : 53 }).limit(21).sort({ "_id" : -1 });
But this query is taking too much time:
Fri Apr 26 05:39:55.563 [conn156] query UserNotificationDev2.userNotification query: { query: { userId: 53 }, orderby: { _id: -1 } } cursorid:225321382318166794 ntoreturn:21 ntoskip:0 nscanned:266025 keyUpdates:0 numYields: 2 locks(micros) r:4224498 nreturned:21 reslen:10295 2581ms
Even count is taking a lot of time:
Fri Apr 26 05:47:46.005 [conn159] command UserNotificationDev2.$cmd command: { count: "userNotification", query: { userId: 53 } } ntoreturn:1 keyUpdates:0 numYields: 11 locks(micros) r:9753890 reslen:48 5022ms
Are we doing something wrong in the query?
Please help!
Also, suggest if our schema is not right for storing user notifications. We tried embedding notifications (a user document with its notifications nested inside), but the document size limit allowed us to store only ~50k notifications per user, so we changed to this.
You are querying by userId, but you have no index that supports both the filter and the sort. My suggestion is to create an index on { "userId" : 1, "_id" : -1 }. This will create an index tree that starts with userId, then _id, which is almost exactly what your query is doing. This is the simplest/most flexible way to speed up your query.
Another, more memory-efficient approach is to store your userId and timestamp as a string in _id, like _id : "USER_ID:DATETIME". For example:
{_id : "12345:20120501123000"}
{_id : "15897:20120501124000"}
{_id : "15897:20120501125000"}
Notice _id is a string, not an ObjectId. Then your query above becomes a regex:
db.userNotification.find({ "_id" : /^53:/ }).limit(21).sort({ "_id" : -1 });
As expected, this will return all notifications for userId 53 in descending order. The memory-efficient part is twofold:
You only need one index field. (Indexes compete with data for memory and are often several gigs in size.)
If your queries mostly fetch newer data, right-balanced indexes keep your working set in memory even when the index is too large to fit in RAM whole.
Re: count. Count does take time because it scans through the entire collection.
Re: your schema. I'm guessing that for your data set this is the best way to utilize your memory. When objects get large and your queries scan across multiple objects, they need to be loaded into memory in their entirety (I've had the OOM killer kill my mongod instance when I sorted 2000 2 MB objects on a 2 GB RAM machine). With large objects your RAM usage will fluctuate greatly (not to mention they are limited up to a point). With your current schema, Mongo will have a much easier time loading only the data you're querying, resulting in less swapping and more consistent memory usage patterns.
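The compound-index suggestion above can be sketched in the mongo shell (collection and field names taken from the question; ensureIndex is the shell helper for this pre-2.6 MongoDB version):

```javascript
// Equality field first, sort field second, matching the query's shape
db.userNotification.ensureIndex({ userId: 1, _id: -1 })

// The original query should now walk the index instead of scanning ~266k entries;
// explain() should show nscanned close to nreturned
db.userNotification.find({ userId: 53 }).sort({ _id: -1 }).limit(21).explain()
```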
One option is to try sharding; then you can distribute notifications evenly between shards, so when you need to select you will scan a smaller subset of data. You need to decide, however, what your shard key will be. To me it looks like operationType or userName, but I do not know your data well enough. Another thing: why do you sort by _id?
I have just tried to replicate your problem. I created 140,000,000 documents in userNotifications.
Without an index on userId I got responses of 3-4 seconds. After I created an index on userId, the time dropped to almost instant responses.
db.userNotifications.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "test.userNotifications",
"name" : "id"
},
{
"v" : 1,
"key" : {
"userId" : 1
},
"ns" : "test.userNotifications",
"name" : "userId_1"
}
]
Another thing is: while your select happens, is the system constantly writing to the userNotification collection? Mongo locks the whole collection if that happens. If that is the case,
I would split reads and writes between master and slave (see replication) and also do some sharding. By the way, what language do you use for your app?
The most important thing is that you currently don't seem to have an index to support the query for a user's latest notifications.
You need a compound index on userId, _id. This will support queries that only filter by userId, but it will also be used by queries that filter by userId and sort/limit by _id.
When you add the {userId: 1, _id: -1} index, don't forget to drop the index on just userId, as it becomes redundant.
As for count(), make sure you are using 2.4.3 (the latest version); there were significant improvements in how count() uses indexes, which resulted in much better performance.

Capped Collection Performance Issues

I'm doing some tests to see what kind of throughput I can get from MongoDB. The documentation says that capped collections are the fastest option, but I often find that I can write to a normal collection much faster. Depending on the exact test, I can often get twice the throughput with a normal collection.
Am I missing something? How do I troubleshoot this?
I have a very simple C++ program that writes about 64,000 documents to a collection as fast as possible. I record the total time and the time spent waiting for the database. If I change nothing but the collection name, I can see a clear difference between the capped and normal collections.
> use tutorial
switched to db tutorial
> db.system.namespaces.find()
{ "name" : "tutorial.system.indexes" }
{ "name" : "tutorial.persons.$_id_" }
{ "name" : "tutorial.persons" }
{ "name" : "tutorial.persons.$age_1" }
{ "name" : "tutorial.alerts.$_id_" }
{ "name" : "tutorial.alerts" }
{ "name" : "tutorial.capped.$_id_" }
{ "name" : "tutorial.capped", "options" : { "create" : "capped", "capped" : true, "size" : 100000000 } }
> db.alerts.stats()
{
"ns" : "tutorial.alerts",
"count" : 400000,
"size" : 561088000,
"avgObjSize" : 1402.72,
"storageSize" : 629612544,
"numExtents" : 16,
"nindexes" : 1,
"lastExtentSize" : 168730624,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 12991664,
"indexSizes" : {
"_id_" : 12991664
},
"ok" : 1
}
> db.capped.stats()
{
"ns" : "tutorial.capped",
"count" : 62815,
"size" : 98996440,
"avgObjSize" : 1576,
"storageSize" : 100003840,
"numExtents" : 1,
"nindexes" : 1,
"lastExtentSize" : 100003840,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 2044000,
"indexSizes" : {
"_id_" : 2044000
},
"capped" : true,
"max" : 2147483647,
"ok" : 1
}
linux version: 3.4.11-1.fc16.x86_64
mongo version: db version v2.2.2, pdfile version 4.5
This is a dedicated machine doing nothing but running the Mongodb server and my test client. The machine is ridiculously overpowered for this test.
I see the problem. The web page I quoted above says that a capped collection "without an index" will offer high performance. But…
http://docs.mongodb.org/manual/core/indexes/ says "Before version 2.2 capped collections did not have an _id field. In 2.2, all capped collections have an _id field, except those in the local database."
I created another version of my test which writes to a capped collection in the local database. Sure enough, this collection did not have any indexes, and my throughput was much higher!
Perhaps the overview of capped collections at http://docs.mongodb.org/manual/core/capped-collections/ should clarify this point.
Capped collections guarantee preservation of the insertion order. As a
result, queries do not need an index to return documents in insertion
order. Without this indexing overhead, they can support higher
insertion throughput.
According to the above, if you don't have any indexes, insertion into a capped collection does not have to be faster than insertion into a normal collection. So if you don't need indexes, and you don't have another reason to use a capped collection (such as caching, or showing the last N elements), I would suggest you go with regular collections.
Capped collections guarantee that insertion order is identical to the
order on disk (natural order) and do so by prohibiting updates that
increase document size. Capped collections only allow updates that fit
the original document size, which ensures a document does not change
its location on disk.
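For reference, the capped collection from the namespace listing above could have been created like this (the size option is in bytes):

```javascript
// ~100 MB capped collection; the oldest documents are overwritten once the cap is reached
db.createCollection("capped", { capped: true, size: 100000000 })

// Capped collections preserve insertion order, so no index is needed for this:
db.capped.find().sort({ $natural: -1 }).limit(10)  // ten most recently inserted documents
```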

MongoDB - too much data for sort() with no index error

I am using MongoDB 1.6.3 to store a big collection (300k+ records). I added a composite index.
db['collection_name'].getIndexes()
[
{
"name" : "_id_",
"ns" : "db_name.event_logs",
"key" : {
"_id" : 1
}
},
{
"key" : {
"updated_at.t" : -1,
"community_id" : 1
},
"ns" : "db_name.event_logs",
"background" : true,
"name" : "updated_at.t_-1_community_id_1"
}
]
However, when I try to run this code:
db['collection_name']
.find({:community_id => 1})
.sort(['updated_at.t', -1])
.skip(#skip)
.limit(#limit)
I am getting:
Mongo::OperationFailure (too much data
for sort() with no index. add an
index or specify a smaller limit)
What am I doing wrong?
Try adding a {community_id: 1, 'updated_at.t': -1} index. It needs to search by community_id first and then sort.
So it "feels" like you're using the index, but the index is actually a composite index. I'm not sure that the sort is "smart enough" to use only the partial index.
So two problems:
Based on your query, I would put community_id as the first part of the index, not the second. updated_at.t sounds like a field on which you'll do range queries. Indexes work better if the range query is the second bit.
How many entries are going to come back from community_id => 1? If the number is not big, you may be able to get away with just sorting without an index.
So you may have to switch the index around and you may have to change the sort to use both community_id and updated_at.t. I know it seems redundant, but start there and check the Google Groups if it's still not working.
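A mongo shell sketch of the reordered index and a query whose shape matches it (the collection name comes from the ns field above; the skip/limit values are placeholders, and ensureIndex is the shell helper for this pre-2.6 MongoDB version):

```javascript
// Equality field first, range/sort field second
db.event_logs.ensureIndex({ community_id: 1, "updated_at.t": -1 }, { background: true })

// Equality on community_id, then a sort on updated_at.t, walks this index in order
db.event_logs.find({ community_id: 1 }).sort({ "updated_at.t": -1 }).skip(20).limit(10)
```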
Even with an index, I think you can still get that error if your result set exceeds 4MB.
You can see the size by going into the mongodb console and doing this:
show dbs
# pick yours (e.g., production)
use db-production
db.articles.stats()
I ended up with results like this:
{
"ns" : "mdalert-production.encounters",
"count" : 89077,
"size" : 62974416,
"avgObjSize" : 706.9660630690302,
"storageSize" : 85170176,
"numExtents" : 8,
"nindexes" : 6,
"lastExtentSize" : 25819648,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 18808832,
"indexSizes" : {
"_id_" : 3719168,
"patient_num_1" : 3440640,
"msg_timestamp_1" : 2981888,
"practice_id_1" : 2342912,
"patient_id_1" : 3342336,
"msg_timestamp_-1" : 2981888
},
"ok" : 1
}
Having a cursor batch size that is too large can also cause this error. Setting the batch size does not limit the amount of data you can process; it just limits how much data is brought back from the database per round trip. When you iterate through and hit the batch limit, the process makes another trip to the database.
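A sketch of constraining the batch size on a cursor (the collection name and filter are placeholders; 100 is an arbitrary value):

```javascript
// Each round trip to the server returns at most 100 documents; iterating past
// the batch triggers another getMore, so the total result set size is unaffected
var cursor = db.event_logs.find({ community_id: 1 }).batchSize(100)
while (cursor.hasNext()) {
    printjson(cursor.next())
}
```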