Capped Collection Performance Issues - mongodb

I'm running some tests to see what kind of throughput I can get from MongoDB. The documentation says that capped collections are the fastest option, but I often find that I can write to a normal collection much faster; depending on the exact test, a normal collection can give twice the throughput.
Am I missing something? How do I troubleshoot this?
I have a very simple C++ program that writes about 64,000 documents to a collection as fast as possible. I record the total time, and the time that I'm waiting for the database. If I change nothing but the collection name, I can see a clear difference between the capped and normal collections.
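For reference, a rough shell sketch of the kind of benchmark described (not the author's actual C++ client; the collection names and document shape here are assumptions):

// Hypothetical benchmark: time n inserts into the named collection.
function timeInserts(collName, n) {
    var coll = db.getCollection(collName);
    var start = new Date();
    for (var i = 0; i < n; i++) {
        coll.insert({ seq: i, payload: new Array(1400).join("x") });  // roughly 1.4 KB per document
    }
    db.getLastError();            // force a round trip so the last write is acknowledged
    return new Date() - start;    // elapsed milliseconds
}
print("normal: " + timeInserts("alerts", 64000) + " ms");
print("capped: " + timeInserts("capped", 64000) + " ms");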
> use tutorial
switched to db tutorial
> db.system.namespaces.find()
{ "name" : "tutorial.system.indexes" }
{ "name" : "tutorial.persons.$_id_" }
{ "name" : "tutorial.persons" }
{ "name" : "tutorial.persons.$age_1" }
{ "name" : "tutorial.alerts.$_id_" }
{ "name" : "tutorial.alerts" }
{ "name" : "tutorial.capped.$_id_" }
{ "name" : "tutorial.capped", "options" : { "create" : "capped", "capped" : true, "size" : 100000000 } }
> db.alerts.stats()
{
"ns" : "tutorial.alerts",
"count" : 400000,
"size" : 561088000,
"avgObjSize" : 1402.72,
"storageSize" : 629612544,
"numExtents" : 16,
"nindexes" : 1,
"lastExtentSize" : 168730624,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 12991664,
"indexSizes" : {
"_id_" : 12991664
},
"ok" : 1
}
> db.capped.stats()
{
"ns" : "tutorial.capped",
"count" : 62815,
"size" : 98996440,
"avgObjSize" : 1576,
"storageSize" : 100003840,
"numExtents" : 1,
"nindexes" : 1,
"lastExtentSize" : 100003840,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 2044000,
"indexSizes" : {
"_id_" : 2044000
},
"capped" : true,
"max" : 2147483647,
"ok" : 1
}
Linux version: 3.4.11-1.fc16.x86_64
MongoDB version: db version v2.2.2, pdfile version 4.5
This is a dedicated machine doing nothing but running the MongoDB server and my test client. The machine is ridiculously overpowered for this test.

I see the problem. The web page I quoted above says that a capped collection "without an index" will offer high performance. But…
http://docs.mongodb.org/manual/core/indexes/ says "Before version 2.2 capped collections did not have an _id field. In 2.2, all capped collections have an _id field, except those in the local database."
I created another version of my test which writes to a capped collection in the local database. Sure enough, this collection did not have any indexes, and my throughput was much higher!
Perhaps the overview of capped collections at http://docs.mongodb.org/manual/core/capped-collections/ should clarify this point.
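For comparison, a hedged shell sketch of that check on a 2.2-era server (the collection name is made up):

use local
db.createCollection("capped_noindex", { capped : true, size : 100000000 })
db.capped_noindex.getIndexes()   // expect an empty array: no automatic _id index in the local database
// The same createCollection in the tutorial database would also build the _id_ index.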

Capped collections guarantee preservation of the insertion order. As a result, queries do not need an index to return documents in insertion order. Without this indexing overhead, they can support higher insertion throughput.
According to the definition above, if you don't have any indexes, inserting into a capped collection is not necessarily faster than inserting into a normal collection. So if you don't have any indexes, and you don't have any other reason to use a capped collection (such as caching, or showing the last n elements), I would suggest going with regular collections.
Capped collections guarantee that insertion order is identical to the order on disk (natural order) and do so by prohibiting updates that increase document size. Capped collections only allow updates that fit the original document size, which ensures a document does not change its location on disk.
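A hedged illustration of that constraint on a 2.2-era (MMAPv1) server; the collection name is made up:

db.createCollection("capped_demo", { capped : true, size : 100000000 })
db.capped_demo.insert({ _id : 1, msg : "short" })
db.capped_demo.update({ _id : 1 }, { $set : { msg : "brief" } })                  // same size: allowed
db.capped_demo.update({ _id : 1 }, { $set : { msg : "a much longer message" } })  // grows the document: rejected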

Related

MongoDB - performance and collection size

I have a question with regard to collection sizes and query performance.
There are 2 DBs: DB1 and DB2. DB1 has 1 collection, and here's the output from stats() on this collection:
{
…
"count" : 2085217,
"size" : 17048734192,
"avgObjSize" : 8176,
"capped" : false,
"nindexes" : 3,
"indexDetails" : {},
"totalIndexSize" : 606299456,
"indexSizes" : {
"_id_" : 67664576,
"id_1" : 284165056,
"id_2" : 254469824
},
…
}
A query on this collection, using the index id_1, comes back in 0.012 seconds. Here's the output from explain():
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
….
"indexName" : "id_1",
}
In DB2, I have 4 collections, and here's the output from stats() on DB2:
{
…
"collections" : 4,
"objects" : 152655935,
"avgObjSize" : 8175.998307514215,
"dataSize" : 1248114666192,
"storageSize" : 1257144933456,
"indexes" : 12,
"indexSize" : 19757688272,
"fileSize" : 1283502112768,
…
}
A query on any collection in DB2, using the index (which I confirmed via explain()), takes at least twice as long as the previous query against DB1.
Since Mongo should scale well, why is there this difference? I read that MongoDB loads all the indexes into memory, and since DB2 holds a much higher volume than DB1, is that why it's taking so much longer?
Any insights would be greatly helpful. Thanks.
Edit 1:
Adding more info on the collection definition, index definitions, and the queries executed...
All collections (in both DBs) contain the same fields; only the values and the sizes of the documents differ between them.
And here's the relevant index:
"1" : {
"v" : 1,
"unique" : true,
"key" : {
"id" : 1
},
"name" : "id_1",
"ns" : "ns.coll1"
}
And this is what the id field looks like:
"_id" : ObjectId("55f9b6548aefbce6b2fa2fac"),
"id" : {
"pid" : {
"f1" : "val1",
"f2" : "val2"
}
},
And here's a sample query:
db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}})
Edit 2:
Here's some more info on the hard disk and RAM:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:         386758        3750        1947       25283      381060      355675
Swap:        131071        3194      127877
The hard disk is around 3.5T, out of which 2.9T is already used.
Scaling
MongoDB scales very well. The thing is, it is designed to scale horizontally, not vertically. This means that if your DBs are holding a lot of data, you should shard the collections in order to achieve better parallelization.
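A hedged sketch of what that would look like, run against a mongos in a sharded cluster (the database, collection, and shard key below are placeholders; choosing a shard key needs care):

sh.enableSharding("DB2")
sh.shardCollection("DB2.coll1", { "id.pid.f1" : 1, "id.pid.f2" : 1 })
sh.status()   // confirm chunks are being distributed across the shards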
Benchmark results
Regarding the difference in query time, I don't think your profiling is conclusive. The DBs are possibly on different machines (with different specs). Supposing the hardware is the same, DB2 apparently holds more documents in its collections, and the document sizes are not the same in both DBs. The same query can return result sets of different sizes, and that will inevitably have an impact on data serialization and other low-level aspects. Unless you profile the queries in a more controlled setup, I think your results are pretty much expected.
Suggestions
Take care if you are using DBRefs in your documents. It's possible Mongo will automatically dereference them; that means more data to serialize and more overhead.
Try running the same queries with a limit specification. You have defined the index to be unique, but I don't know if that automatically makes Mongo stop the index traversal once it has found a value. Check whether db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}) and db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}).limit(1) run in the same time.
Take a look at Indexes on embedded fields and Indexes on embedded documents. Embedded documents seem to add even more overhead.
Finally, if your document has no embedded documents, only embedded fields (which seems to be the case), then define your index more specifically. Create this index:
db.coll1.createIndex({"id.pid.f1": 1, "id.pid.f2": 1}, {unique: true})
and run the query again. If this index doesn't improve performance, then I believe you have done everything properly and it may be time to start sharding.
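One hedged caveat: with an index on the dotted fields, the query also has to reference those fields with dot notation; the exact-subdocument form in the question matches on the whole id value instead and would not use the new index.

// Query the embedded fields directly so the new index can be used.
db.coll1.find({ "id.pid.f1" : "val1", "id.pid.f2" : "val2" })
// Compare the plans if in doubt.
db.coll1.find({ "id.pid.f1" : "val1", "id.pid.f2" : "val2" }).explain("executionStats")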

MongoDB count with query return more records than a count all

I've noticed some strange behaviour in MongoDB and I'm trying to work out what the problem could be:
I have a MongoDB instance with a lot of documents in a collection. I ran the following query:
db.mydocuments.count({_id:{$lte:new ObjectId("549010c9e4b06c2f044f27f4")}});
The result is 66,579,389 documents.
Then I ran the following:
db.mydocuments.count();
and surprisingly I got the following total: 32,606,242.
How can this be? How can the count of the whole collection be less than the count with a query? At the very least it should be equal to the query count.
The output of db.mydocuments.stats() is:
{
"ns" : "mydb.documents.photos",
"count" : 32606242,
"size" : 76109891776,
"avgObjSize" : 2334,
"storageSize" : 164665658240,
"numExtents" : 97,
"nindexes" : 1,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 0,
"totalIndexSize" : 1944138336,
"indexSizes" : {
"_id_" : 1944138336
},
"ok" : 1
}
.stats() returns the amount of allocated storage, as well as the average object size, extents, etc.
This call is not meant to give exact information about your collection (such as the precise number of documents).
Example:
you insert 100 docs, then delete 99. The collection is still allocated to hold 100 docs.
Do not rely on the .stats() call unless you need it for informational purposes only.
In your case, though, the behaviour is the exact opposite. Maybe a broken index.
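A hedged way to cross-check the two numbers without trusting the collection metadata or the _id index (itcount() iterates the whole cursor, so it will be slow on a collection this size):

// Count by actually iterating documents rather than using metadata or an index.
db.mydocuments.find().itcount()
db.mydocuments.find({ _id : { $lte : new ObjectId("549010c9e4b06c2f044f27f4") } }).itcount()
// validate() inspects the collection and its indexes; reIndex() rebuilds them if validation finds problems.
db.mydocuments.validate(true)
db.mydocuments.reIndex()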

Insert operation became very slow for MongoDB

The client is pymongo.
The program has been running for one week. Inserting data was indeed very fast at first: about 10 million records per 30 minutes.
But today I found that the insert operation has become very, very slow.
There are about 120 million records in the goods collection now.
> db.goods.count()
123535156
And the indexes for the goods collection are as follows:
db.goods.getIndexes();
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "shop.goods",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"item_id" : 1,
"updated_at" : -1
},
"unique" : true,
"ns" : "shop.goods",
"name" : "item_id_1_updated_at_-1"
},
{
"v" : 1,
"key" : {
"updated_at" : 1
},
"ns" : "shop.goods",
"name" : "updated_at_1"
},
{
"v" : 1,
"key" : {
"item_id" : 1
},
"ns" : "shop.goods",
"name" : "item_id_1"
}
]
And there is enough RAM and CPU.
Someone told me it is because there are too many records, but didn't tell me how to solve the problem. I was a bit disappointed with MongoDB.
There will be more data to store in the future (about 50 million new records per day). Is there any solution?
I met the same situation on another server (less data this time, about 40 million in total); the current insert speed is about 5 records per second.
> db.products.stats()
{
"ns" : "c2c.products",
"count" : 42389635,
"size" : 554721283200,
"avgObjSize" : 13086.248164203349,
"storageSize" : 560415723712,
"numExtents" : 283,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1.0000000000132128,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 4257185968,
"indexSizes" : {
"_id_" : 1375325840,
"product_id_1" : 1687460992,
"created_at_1" : 1194399136
},
"ok" : 1
}
I don't know if this is your problem, but keep in mind that MongoDB has to update every index on each insert. So if you have many indexes and many documents, performance can be lower than expected.
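One hedged observation from the getIndexes() output above: item_id_1 is a prefix of item_id_1_updated_at_-1, so queries on item_id alone can already use the compound index, and the single-field index mostly adds write overhead.

// Hedged suggestion: dropping the redundant prefix index leaves one less index to maintain per insert.
db.goods.dropIndex("item_id_1")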
Maybe you can speed up insert operations using sharding. You don't mention it in your question, so I guess you are not using it.
Anyway, could you provide us with more information? You can use db.goods.stats(), db.serverStatus(), or any of these other methods to gather information about the performance of your database.
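A hedged sketch of the kind of diagnostics worth collecting (standard shell helpers; how to read them depends on your deployment):

db.goods.stats()                       // collection size, index sizes, padding factor
db.serverStatus().mem                  // resident vs. virtual/mapped memory
db.serverStatus().backgroundFlushing   // long average flush times point at disk pressure (mmapv1)
db.currentOp()                         // operations running right now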
Another possible problem is IO. Depending on your scenario, Mongo might be busy trying to grow or allocate storage files for the given namespace (i.e. the DB) for the subsequent insert statements. If your test pattern has been add records / delete records / add records / delete records, you are likely reusing existing allocated space. If your app is now running longer than before, you might be in the situation I described.
Hope this sheds some light on your situation.
I had a very similar problem.
First you need to work out what your bottleneck is (CPU, memory, or disk IO). I use several Unix tools (such as top, iotop, etc.) to find it. In my case I found that insertion speed was limited by IO, because mongod often sat at 99% IO usage. (Note: my original DB used the mmapv1 storage engine.)
My workaround was to change the storage engine to WiredTiger (either mongodump your original DB and then mongorestore it into WiredTiger format, or start a new mongod with the WiredTiger engine and resync from the other replica set members). My insertion speed went back to normal after doing that.
However, I am still not sure why mongod with mmapv1 suddenly saturated IO once the data grew past a certain point.
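If it helps, a hedged one-liner to confirm which engine a mongod (3.0+) is actually running:

db.serverStatus().storageEngine   // e.g. { "name" : "wiredTiger", ... } or { "name" : "mmapv1", ... }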

Why are my mongodb indexes so large

I have 57M documents in my MongoDB collection, which is 19G of data.
My indexes are taking up 10G. Does this sound normal, or could I be doing something very wrong? My primary key index alone is 2G.
{
"ns" : "myDatabase.logs",
"count" : 56795183,
"size" : 19995518140,
"avgObjSize" : 352.0636272974065,
"storageSize" : 21217578928,
"numExtents" : 39,
"nindexes" : 4,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 10753999088,
"indexSizes" : {
"_id_" : 2330814080,
"type_1_playerId_1" : 2999537296,
"type_1_time_-1" : 2344582464,
"type_1_tableId_1" : 3079065248
},
"ok" : 1
}
The index size is determined by the number of documents being indexed, as well as the size of the key (compound keys store more information and will be larger). In this case, the _id index size divided by the number of documents (2330814080 / 56795183) comes to roughly 40 bytes per document, which seems relatively reasonable.
If you run db.collection.getIndexes(), you can find the index version. If it is {v : 0}, the index was created prior to MongoDB 2.0, in which case you should upgrade it to {v : 1}. This process is documented here: http://www.mongodb.org/display/DOCS/Index+Versions
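A hedged shell sketch of that check, using the collection name from the stats above (reIndex() rebuilds every index on the collection and blocks while it runs):

// Print the name and version of each index; pre-2.0 indexes report v 0 (or no v field at all).
db.logs.getIndexes().forEach(function (idx) { printjson({ name : idx.name, v : idx.v }); })
// Rebuild the indexes at the current version if any of them are still v 0.
db.logs.reIndex()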

MongoDB - too much data for sort() with no index error

I am using MongoDB 1.6.3 to store a big collection (300k+ records). I added a compound index.
db['collection_name'].getIndexes()
[
{
"name" : "_id_",
"ns" : "db_name.event_logs",
"key" : {
"_id" : 1
}
},
{
"key" : {
"updated_at.t" : -1,
"community_id" : 1
},
"ns" : "db_name.event_logs",
"background" : true,
"name" : "updated_at.t_-1_community_id_1"
}
]
However, when I try to run this code:
db['collection_name']
.find({:community_id => 1})
.sort(['updated_at.t', -1])
.skip(#skip)
.limit(#limit)
I am getting:
Mongo::OperationFailure (too much data for sort() with no index. add an index or specify a smaller limit)
What am I doing wrong?
Try adding a {community_id: 1, 'updated_at.t': -1} index. It needs to filter by community_id first and then sort.
So it "feels" like you're using the index, but the index is actually a compound index. I'm not sure that the sort is "smart enough" to use only part of the index.
So two problems:
Based on your query, I would put community_id as the first part of the index, not the second. updated_at.t sounds like a field on which you'll do range queries. Indexes work better if the range query is the second bit.
How many entries are going to come back from community_id => 1? If the number is not big, you may be able to get away with just sorting without an index.
So you may have to switch the index around and you may have to change the sort to use both community_id and updated_at.t. I know it seems redundant, but start there and check the Google Groups if it's still not working.
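A hedged shell sketch of the reordered index and a matching query (ensureIndex is the 1.6-era method name; the skip and limit values are placeholders):

// Equality field first, then the sort field, so one index serves both the filter and the sort.
db.event_logs.ensureIndex({ community_id : 1, "updated_at.t" : -1 }, { background : true })
db.event_logs.find({ community_id : 1 }).sort({ "updated_at.t" : -1 }).skip(0).limit(20)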
Even with an index, I think you can still get that error if your result set exceeds 4MB.
You can see the size by going into the MongoDB console and doing this:
show dbs
# pick yours (e.g., production)
use db-production
db.articles.stats()
I ended up with results like this:
{
"ns" : "mdalert-production.encounters",
"count" : 89077,
"size" : 62974416,
"avgObjSize" : 706.9660630690302,
"storageSize" : 85170176,
"numExtents" : 8,
"nindexes" : 6,
"lastExtentSize" : 25819648,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 18808832,
"indexSizes" : {
"_id_" : 3719168,
"patient_num_1" : 3440640,
"msg_timestamp_1" : 2981888,
"practice_id_1" : 2342912,
"patient_id_1" : 3342336,
"msg_timestamp_-1" : 2981888
},
"ok" : 1
}
Having a cursor batch size that is too large will cause this error. Setting the batch size does not limit the amount of data you can process; it just limits how much data is brought back from the database per round trip. When you iterate through and hit the batch limit, the process makes another trip to the database.
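A hedged shell example of capping the batch size (the cursor still walks the full result set; it just fetches it from the server in smaller chunks):

db.event_logs.find({ community_id : 1 }).sort({ "updated_at.t" : -1 }).batchSize(100)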