Capped collections and _id auto index - mongodb

What is the rationale for having an automatic index on the _id field for capped collections by default? We can find in the docs that:
Without this indexing overhead, capped collections can support higher insertion throughput.
There was a post about capped collection insert performance, and my own tests also show that for inserts a capped collection without an index is the fastest option, then comes the normal collection, and the slowest option is a capped collection with an index. So why was the auto index on _id added in version 2.2 if it hurts performance, while capped collections are promoted as fast alternatives to normal collections in certain scenarios?

Well, we certainly can't rule out the benefits of _id in capped collections either. It helps you, and it is in fact required for replication.
MongoDB turned it on by default because deploying MongoDB in a replica set configuration is very common nowadays. You can find more information in the documentation; look for autoIndexId.
I believe the reason for the slowness is the index, not the _id field itself. So if your requirements warrant it, you can always disable the auto index.
But...
you still need to supply autoIndexId with a false (0) value when creating the collection,
e.g. a 2 GB capped collection with the auto index disabled:
db.createCollection("people", { capped: true, size: 2147483648, autoIndexId: false })
I'm sure this trick will bring the insertion speed back.
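As a rough illustration of that claim, here is the kind of shell micro-benchmark you could run; the collection name, document shape, and loop count are hypothetical, and the absolute numbers depend entirely on your hardware:

// capped collection with the automatic _id index disabled
db.createCollection("bench", { capped: true, size: 2147483648, autoIndexId: false })
var start = new Date()
for (var i = 0; i < 100000; i++) {
    db.bench.insert({ n: i, payload: "some data" })
}
print("inserted 100000 docs in " + (new Date() - start) + " ms")
// recreate the collection without autoIndexId: false and rerun to compare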

Related

Indexing in MongoDB [duplicate]

I need to know how indexing in MongoDB improves query performance. Currently my DB is not indexed; how can I index an existing DB? Also, do I need to create a new field only for indexing?
Fundamentally, indexes in MongoDB are similar to indexes in other database systems. MongoDB supports indexes on any field or sub-field contained in documents within a MongoDB collection.
Indexes are covered in detail here and I highly recommend reading this documentation.
There are sections on indexing operations, strategies and creation options, as well as detailed explanations of the various index types, such as compound indexes (i.e. an index on multiple fields).
One thing to note is that by default, creating an index is a blocking operation. Creating an index is as simple as:
db.collection.ensureIndex({ zip: 1 })
Something like this will be returned, indicating the index was correctly inserted:
Inserted 1 record(s) in 7ms
Building an index on a large collection can take a long time to complete. To address this, the background option allows you to continue to use your mongod instance during the index build.
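For a large collection you might instead build the index in the background, reusing the zip field from the example above:

db.collection.ensureIndex({ zip: 1 }, { background: true })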
Limitations on indexing in MongoDB are covered here.

Should capped collection or collection with TTL indexes still be compacted regularly?

We have a write-heavy workload that we handle with MongoDB, and we often delete big portions of data. Because MongoDB does not really release disk space unless you invoke the compact operation, we are sometimes faced with disk space problems.
So my question is the following: if I use TTL indexes or capped collections, do I still need to invoke the 'compact' operation to really reclaim the space, or will it be handled automatically?
For a capped collection, compact is meaningless because the collection is of fixed size. Also, taking natural order into account, documents in a capped collection don't move, have no padding, and can't grow in size (in fact, operations that would grow a document fail), hence there is no fragmentation.
For a collection with a TTL index, compacting still makes sense; in fact, more fragmentation can be expected in such collections, since expired documents are removed from arbitrary positions in the data files.
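For completeness, a TTL index is just a regular single-field index with an expireAfterSeconds option; the collection and field names here are hypothetical:

// documents are removed roughly one hour after their createdAt timestamp
db.events.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })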

Mongodb model for Uniqueness

Scenario:
10,000,000 records/day
Records:
Visitor, day of visit, cluster (Where do we see it), metadata
What we want to know from this information:
Unique visitors on one or more clusters for a given range of dates.
Unique visitors by day.
Metadata grouped for a given range (platform, browser, etc.).
The model I settled on in order to query this information easily is:
{
  VisitorId: 1,
  ClusterVisit: [
    { clusterId: 1, dates: [date1, date2] },
    { clusterId: 2, dates: [date1, date3] }
  ]
}
Indexes (sketched in shell form just below):
by VisitorId (to ensure uniqueness)
by ClusterVisit.clusterId + ClusterVisit.dates (for searching)
by VisitorId + ClusterVisit.clusterId (for updating)
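Assuming the collection is called visits (a hypothetical name), those three indexes might be created like this:

db.visits.ensureIndex({ VisitorId: 1 }, { unique: true })
db.visits.ensureIndex({ "ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1 })
db.visits.ensureIndex({ VisitorId: 1, "ClusterVisit.clusterId": 1 })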
I also have to split groups of clusters into different collections in order to access the data more efficiently.
Importing:
First, we search for a combination of VisitorId and clusterId and $addToSet the date.
Second, if the first step doesn't match, we upsert:
$addToSet: { VisitorId: 1,
  ClusterVisit: [{ clusterId: 1, dates: [date1] }]
}
With the first and second import steps I cover the cases where the clusterId doesn't exist or the VisitorId doesn't exist.
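A minimal shell sketch of that two-step import, assuming a collection named visits and concrete visitorId, clusterId, and visitDate variables (all hypothetical):

// Step 1: the visitor already has an entry for this cluster -> add the date
db.visits.update(
    { VisitorId: visitorId, "ClusterVisit.clusterId": clusterId },
    { $addToSet: { "ClusterVisit.$.dates": visitDate } }
)
// Step 2: no matching cluster entry -> push a new one; the upsert flag
// also creates the whole document if the visitor is new
db.visits.update(
    { VisitorId: visitorId, "ClusterVisit.clusterId": { $ne: clusterId } },
    { $addToSet: { ClusterVisit: { clusterId: clusterId, dates: [visitDate] } } },
    true  // upsert
)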
Problems:
Updates/inserts/upserts become totally inefficient (near impossible) as the collection grows, I guess because the documents get bigger every time a new date is added.
Difficult to maintain (unsetting dates, mostly).
I have a collection with more than 50,000,000 documents that I can't grow any more; it sustains only ~100 updates/sec.
I think the model I'm using is not the best for this volume of information. What do you think would work best to get more upserts/sec and query the information fast, before I get into sharding, which will take more time while I learn it and gain confidence with it?
I have an x1.large instance on AWS
RAID 10 with 10 disks
Arrays are expensive on large collections: mapreduce, aggregate...
Try .explain() (see the sketch after this list):
MongoDB 'count()' is very slow. How do we refine/work around with it?
Add explicit hints for index:
Simple MongoDB query very slow although index is set
A full heap?:
Insert performance of node-mongodb-native
The end of memory space for collection:
How to improve performance of update() and save() in MongoDB?
Special read clustering:
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
Global write lock?:
mongodb bad performance
Slow logs performance track:
Track MongoDB performance?
Rotate your logs:
Does logging output to an output file affect mongoDB performance?
Use profiler:
http://www.mongodb.org/display/DOCS/Database+Profiler
Move some collection caches to RAM:
MongoDB preload documents into RAM for better performance
Some ideas about collection allocation size:
MongoDB data schema performance
Use separate collections:
MongoDB performance with growing data structure
A single query can use only one index (a compound index is better):
Why is this mongodb query so slow?
A missing key?:
Slow MongoDB query: can you explain why?
Maybe shards:
MongoDB's performance on aggregation queries
Improving performance stackoverflow links:
https://stackoverflow.com/a/7635093/602018
A good starting point for further sharding and replication education is:
https://education.10gen.com/courses
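To make the .explain() and hint suggestions above concrete, a rough shell sketch (collection, query, and index are hypothetical):

// show which index the query used and how many documents were scanned
db.visits.find({ "ClusterVisit.clusterId": 1 }).explain()
// force a specific index when the optimizer picks a poor plan
db.visits.find({ "ClusterVisit.clusterId": 1 })
  .hint({ "ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1 })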

Duplicate documents on _id (in mongo)

I have a sharded mongo collection with over 1.5 million documents. I use the _id field as the shard key, and the values in this field are integers (rather than ObjectIds).
I do a lot of write operations on this collection, using the Perl driver (insert, update, remove, save) and mongoimport.
My problem is that somehow I have duplicate documents with the same _id. From what I've read, this shouldn't be possible.
I've removed the duplicates, but others keep appearing.
Do you have any ideas where could they come from, or what should I start looking at?
(Also, I've tried to replicate this on a smaller test collection, but no duplicates are inserted, no matter what write operation I perform.)
This actually isn't a problem with the Perl driver; it is related to the characteristics of sharding. MongoDB can only enforce uniqueness among the documents located on a single shard at the time of creation, so the default index does not require uniqueness.
In the MongoDB: Configuring Sharding documentation there is specific mention that:
When you shard a collection, you must specify the shard key. If there is data in the collection, mongo will require an index to be created upfront (it speeds up the chunking process); otherwise, an index will be automatically created for you.
You can use the { unique: true } option to ensure that the underlying index enforces uniqueness, so long as the shard key is a prefix of the unique index.
If the { unique: true } option is not used, the shard key does not have to be unique.
How have you implemented generating the integer Ids?
If you use a system like the one suggested on the MongoDB website, you should be fine. For reference:
function counter(name) {
  var ret = db.counters.findAndModify({
    query: { _id: name },
    update: { $inc: { next: 1 } },
    "new": true,
    upsert: true
  });
  return ret.next;
}
db.users.insert({_id:counter("users"), name:"Sarah C."}) // _id : 1
db.users.insert({_id:counter("users"), name:"Bob D."}) // _id : 2
If you are generating your IDs by reading the most recent record in the document store, incrementing the number in the Perl code, and then inserting with the incremented number, you could be running into race conditions between concurrent writers.

Mongodb natural sort on non-capped collections, how wrong is it?

The mongo docs explain that natural sort is not guaranteed to work on non-capped collections:
http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
But how wrong is it? For non-critical use cases, a 0.1% inaccuracy is totally fine, especially if there are performance/size savings.
Thanks.
There is nothing wrong with using the $natural sort (order) on non-capped collections.
The meaning of $natural is dramatically different between a capped collection and a normal one (where updates/removes can occur). With a regular collection, the order of the documents may change over time.
If you want to return the documents in insertion order, then the $natural "index" (not really an index) is only useful on a capped collection. This is because only capped collections guarantee that documents are never removed or moved within the collection.
As said and documented: you have no guarantees, and therefore no numbers can be given.
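For reference, a quick shell sketch of natural-order reads on a capped collection (names and sizes hypothetical):

db.createCollection("log", { capped: true, size: 1048576 })
db.log.insert({ msg: "first" })
db.log.insert({ msg: "second" })
db.log.find().sort({ $natural: 1 })   // insertion order: first, second
db.log.find().sort({ $natural: -1 })  // reverse insertion order: second, first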