How can I specify the natural ordering in MongoDB?

Is there a way to specify the natural ordering of data in mongodb, similar to how a primary index would order data in a RDBMS table?
My use case is that all my queries return data sorted by a date field, say birthday. According to MongoDB: Sorting and Natural Order, the natural order of a standard (non-capped) collection is roughly the insertion order, but this is not guaranteed, which implies the data would need to be sorted after retrieval.

I believe what you are referring to is a clustered index, not a primary index.
MongoDB 2.0 does not have a clustered index feature, so a regular index on the date field is your most efficient option for retrieval.
It's probably premature optimization to think about the physical order on disk with MongoDB. MongoDB uses memory-mapped files, so depending on your working set, your queries, and the available RAM, you may not need to load data from disk as often as you expect.
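A minimal sketch in the mongo shell (the users collection and birthday field here stand in for your own names):
// Create a regular index on the date field:
db.users.ensureIndex({birthday: 1})
// Queries sorting on that field can then be answered from the index:
db.users.find().sort({birthday: 1})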

If you are looking for something to act like the primary index in an RDBMS, then sort by _id. It will be roughly insertion order, since the _id is prefixed with a timestamp. If you try to use $natural order instead, the query will bypass indexes entirely.
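For illustration, assuming a hypothetical users collection:
// Roughly insertion order, served by the always-present _id index:
db.users.find().sort({_id: 1})
// Forcing $natural order instead bypasses indexes:
db.users.find().sort({$natural: 1})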

Also, I would add that you should look into using the timestamp built into the document _id instead of relying on a separate date field, as it lets you store less data and removes the need for an extra index.
Jason
MongoHQ

I think it would be difficult to achieve what you want without the help of indexes. To support sharding, the _id field in MongoDB takes values based on the timestamp at the moment the document is created; as a consequence, you cannot rely on it being monotonically increasing the way the identity column of an RDBMS table is. If all your queries return documents sorted by birthday, you should create an index on that field. Once the index is created, the queries become efficient enough.
Refer this:
MongoDB capped collection and monotonically increasing index

Related

MongoDB update query performance

I would like to understand which of the queries below would be faster when doing updates in MongoDB. I want to update a few thousand records in one go.
Accumulating the object ids of those records and firing a single update using $in?
Or using one or two fields in the collection that are common to those few thousand records, akin to a WHERE clause in SQL, and firing an update on those fields? These fields might or might not be indexed.
I know that the query will be much smaller in the second case, as every single _id (oid) is not accumulated. Does accumulating _ids and using those to update documents offer any practical performance advantages?
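To make the two variants concrete, here is roughly what I mean (the collection, field names, and ids are placeholders):
// 1. Accumulate the _ids and fire one update with $in:
db.items.update({_id: {$in: [id1, id2, id3]}}, {$set: {status: "done"}}, {multi: true})
// 2. Filter on shared fields instead, akin to a WHERE clause:
db.items.update({category: "toys", region: "EU"}, {$set: {status: "done"}}, {multi: true})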
Does accumulating _ids and using those to update documents offer any practical performance advantages?
Yes, because MongoDB will certainly use the _id index (the fast path known as idhack).
In the second method - as you observed - you can't tell whether or not an index will be used for a certain field.
So the answer will be: it depends.
If your collection has millions of documents or more, and/or the number of search fields is quite large, you should prefer the first method, especially if the id list is not small and/or the id values are adjacent.
If your collection is pretty small and you can tolerate a full scan, you may prefer the second approach.
In any case, you should verify both methods using explain().
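For example, you can run explain() on finds with the same filters (reusing the placeholder names from above) and compare which plan each method gets:
db.items.find({_id: {$in: idList}}).explain()
db.items.find({category: "toys", region: "EU"}).explain()
// Compare the reported plan (e.g. whether an index is used) and the number of scanned documents.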

How do I sort a MongoDB collection in MeteorJS permanently?

From the tutorials out there I know that I can sort a MongoDB collection in Meteor on request like this:
// Sorted by createdAt descending
Users.find({}, {sort: {createdAt: -1}})
But I feel like this solution is not optimal from a performance point of view.
If I understand it correctly, every time there is a request for Users, the raw collection is fetched and then sorted all over again.
So wouldn't it be better to sort the whole collection once and for all and then access the already sorted collection with Users.find()?
The question is: How do I sort the whole collection permanently not just the found results?
This is a known limitation of MiniMongo, Meteor's client-side implementation of (a subset of) the MongoDB functionality.
"Sorting" a MongoDB collection does not really have a coherent meaning. It does not translate into a concrete set of operations. What would you sort it by? Is there a "natural" way to sort a set of documents whose structure may vary?
The mechanism that is used for making data retrieval more efficient is an index. On the server, indices are used to assist sorting, if possible:
In MongoDB, sort operations can obtain the sort order by retrieving documents based on the ordering in an index. If the query planner cannot obtain the sort order from an index, it will sort the results in memory. Sort operations that use an index often have better performance than those that do not use an index. In addition, sort operations that do not use an index will abort when they use 32 megabytes of memory.
(Source: MongoDB documentation)
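So on the server side, the usual fix is to back the sort with an index; a minimal sketch, assuming the Users collection from the question maps to a users collection in MongoDB:
// In the mongo shell (or via Users._ensureIndex on the Meteor server):
db.users.ensureIndex({createdAt: -1})
// A single-field index can be walked in both directions, so this
// supports both ascending and descending sorts on createdAt.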
As a collection does not have an inherent order to it, the entity that holds information about the order requirements in MongoDB is a Cursor. A cursor can be fetched multiple times, and in theory could be made into an efficient ordered data fetcher.
Unfortunately, this is not the case at the moment. The way it is currently implemented, MiniMongo does not have indices and does not cache the documents by order. They are re-sorted every time the cursor is fetched.
The sorting is reasonably efficient (as efficient as sorting can be, O(n log n) comparisons), but for a large data set it could take a while and degrade the user experience.
At the moment, if you have a special use case that requires repeated access to a large data set ordered the same way, you could keep your own cache of ordered documents by observing the cursor and updating the cache when there are changes.
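A minimal sketch of that idea on the client, assuming the Users collection from the question:
// Keep a cached array that mirrors the sorted cursor.
var sortedUsers = [];
var handle = Users.find({}, {sort: {createdAt: -1}}).observe({
  addedAt: function (doc, atIndex) { sortedUsers.splice(atIndex, 0, doc); },
  changedAt: function (newDoc, oldDoc, atIndex) { sortedUsers[atIndex] = newDoc; },
  removedAt: function (oldDoc, atIndex) { sortedUsers.splice(atIndex, 1); },
  movedTo: function (doc, fromIndex, toIndex) {
    sortedUsers.splice(fromIndex, 1);
    sortedUsers.splice(toIndex, 0, doc);
  }
});
// Read from sortedUsers instead of re-fetching and re-sorting the cursor.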

MongoDB - Compound Secondary Index vs Concatenated _id Index

I am designing my database with MongoDB with future scalability in mind. My main concern right now is how to represent the indexes; from what I have read, this is a crucial factor when scaling huge collections, in terms of RAM consumption and sharding efficiency.
For simplicity, I have two different collections: a user collection, which stores the username, email, and some metadata, and a devices collection, which contains a device name and some metadata and should be related to its owner. One user can have millions of devices (so it is not feasible to store them all in a single user document).
The devices collection should support queries by the whole device identifier (username, device_name), and also by username alone.
In this case I see a few different approaches for storing the indexes (sketched in shell form after the list):
Use a secondary compound index with username and device_name (in this order)
Use a primary index with an _id containing a string of the form username#device_name
Use an object in the _id field with both values {owner:username, device:device_name}
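In shell terms, the three alternatives look roughly like this (collection and field values are illustrative):
// 1. Auto-generated _id plus a secondary compound index:
db.devices1.ensureIndex({username: 1, device_name: 1})
// 2. Concatenated string as _id; a prefix regex finds all devices of a user:
db.devices2.insert({_id: "alice#thermostat", firmware: "1.2"})
db.devices2.find({_id: /^alice#/})
// 3. Embedded document as _id, plus a compound index on its fields:
db.devices3.insert({_id: {owner: "alice", device: "thermostat"}, firmware: "1.2"})
db.devices3.ensureIndex({"_id.owner": 1, "_id.device": 1})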
To test these indexes, I put some load on the server: I created three different collections, one per alternative, and filled each with 5M documents. Some data:
1. I do not use the automatically generated _id created by Mongo, since all my queries require username/device, so this approach takes some extra space for indexing. The index size is 524MB. It is efficient when querying both by user and by user/device.
2. As I replace the _id with my own string, the index takes less space, in this case 352MB. I am still able to query efficiently by user (with a regex like /^username#/ the explain() output is almost the same as in case 1) and by the exact username/device combination.
3. The _id index cannot be turned into a compound index, so a secondary compound index on {_id.owner, _id.device} is required. This results in a huge index size of 1059MB! Queries perform well, as in the previous cases.
So I can discard alternative 3, as it is not efficient enough. Between alternatives 1 and 2, I prefer 1 as the cleaner approach, but it maintains an _id field I will never use. So at the moment the winning approach seems to be number 2, as it lets me query efficiently by username or username/device while taking less index space.
Is there a good reason not to use number 2 and to stay with number 1 instead, for example when selecting the shard key? Is there something I am missing? I am new to MongoDB and do not want to run into problems when scaling my schema.

MongoDB 3.x: Does it make sense to have only one collection per database?

Since MongoDB 3.x introduces document-level locking rather than locks on a collection or database, does it make sense to write all of your data to a single collection, with one extra identifier field, "documentType"?
It would help simulate a "join" through a map-reduce operation.
Couchbase does the same thing with "buckets" instead of collections.
Does anybody see any disadvantages with this approach?
There's one big general-case disadvantage: indexes.
With Mongo, you generally want to set up indexes so that most, if not all, queries you make, use them. So in addition to the one on _id, you'll set up indexes on the primary fields you search by (often compounded with those you sort by).
If you're storing everything in one single collection, that means you need to have all those indexes on that collection. Which means two things:
The indexes will be bigger, since there are more documents to index. Granted, this can be somewhat mitigated by using sparse indexes.
Inserting or modifying documents in the collection requires Mongo to update all these indexes (where it'd just update the relevant indexes in the standard use-many-collections approach). This kills your write performance.
Furthermore, if you have in your application a query that somehow doesn't use one of those many indexes, it needs to scan through the entire collection, which is O(n) where n is the number of documents in the collection -- in your case, that means the number of documents in the entire database.
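To illustrate, a hypothetical single-collection layout quickly accumulates indexes (the field names here are made up):
// Every documentType brings its own search fields along:
db.everything.ensureIndex({userEmail: 1}, {sparse: true})     // only "user" documents
db.everything.ensureIndex({deviceSerial: 1}, {sparse: true})  // only "device" documents
db.everything.ensureIndex({orderDate: 1}, {sparse: true})     // only "order" documents
// Every write still has to be checked against each of these indexes.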
Collections are cheap. Use them ;)

MongoDB and composite primary keys

I'm trying to determine the best way to deal with a composite primary key in a mongo db. The main key for interacting with the data in this system is made up of 2 uuids. The combination of uuids is guaranteed to be unique, but neither of the individual uuids is.
I see a couple of ways of managing this (sketched in shell form after the list):
Use an object for the primary key that is made up of 2 values (as suggested here)
Use a standard auto-generated mongo object id as the primary key, store my key in two separate fields, and then create a composite index on those two fields
Make the primary key a hash of the 2 uuids
Some other awesome solution that I currently am unaware of
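In shell terms, the first three options might look something like this (collection and field names are illustrative):
// 1. An object _id holding both values:
db.links.insert({_id: {u1: "aaa-111", u2: "bbb-222"}})
// 2. Auto-generated _id plus a unique compound index on two fields:
db.links.insert({u1: "aaa-111", u2: "bbb-222"})
db.links.ensureIndex({u1: 1, u2: 1}, {unique: true})
// 3. A hash of the two uuids as _id (hex_md5 is built into the shell):
db.links.insert({_id: hex_md5("aaa-111" + "bbb-222"), u1: "aaa-111", u2: "bbb-222"})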
What are the performance implications of these approaches?
For option 1, I'm worried about the insert performance due to having non-sequential keys. I know this can kill traditional RDBMS systems and I've seen indications that this could be true in MongoDB as well.
For option 2, it seems a little odd to have a primary key that would never be used by the system. Also, it seems that query performance might not be as good as in option 1. In a traditional RDBMS a clustered index gives the best query results. How relevant is this in MongoDB?
For option 3, this would create a single id field, but again it wouldn't be sequential at insert time. Are there any other pros/cons to this approach?
For option 4, well... what is option 4?
Also, there's some discussion of possibly using CouchDB instead of MongoDB at some point in the future. Would using CouchDB suggest a different solution?
MORE INFO: some background about the problem can be found here
You should go with option 1.
The main reason is that you say you are worried about performance: the _id index is always there and already unique, so using it saves you from having to maintain a second unique index.
For option 1, I'm worried about the insert performance due to having non-sequential keys. I know this can kill traditional RDBMS systems and I've seen indications that this could be true in MongoDB as well.
Your other options do not avoid this problem; they just shift it from the _id index to the secondary unique index - but now you have two indexes: one that is right-balanced and another that is random access.
There is only one reason to question option 1 and that is if you plan to access the documents by just one or just the other UUID value. As long as you are always providing both values and (this part is very important) you always order them the same way in all your queries, then the _id index will be efficiently serving its full purpose.
As an elaboration on why you have to make sure you always order the two UUID values the same way: when comparing subdocuments, { a:1, b:2 } is not equal to { b:2, a:1 }, so you could have a collection where two documents had those values for _id. Therefore, if you store _id with field a first, you must always keep that order in all of your documents and queries.
The other caution is that the index on _id:1 will be usable for the query:
db.collection.find({_id: {a: 1, b: 2}})
but it will not be usable for the query:
db.collection.find({"_id.a": 1, "_id.b": 2})
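A quick shell illustration of both cautions (the pairs collection is hypothetical):
// Field order matters inside an embedded _id, so both inserts succeed:
db.pairs.insert({_id: {a: 1, b: 2}})
db.pairs.insert({_id: {b: 2, a: 1}})  // no duplicate key error
db.pairs.find({_id: {a: 1, b: 2}})    // matches only the first document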
I have an option 4 for you:
Use the automatic _id field and add two single-field indexes for the two UUIDs, instead of a single compound index.
The _id index would be sequential (although that's less important in MongoDB), easily shardable, and you can let MongoDB manage it.
The two UUID indexes let you make any kind of query you need (on the first field, on the second, or on both in any order), and they take up less space than one compound index.
If you use both indexes (and others as well) in the same query, MongoDB will intersect them (new in v2.6), as if you were using a compound index.
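A rough sketch of this option (collection and field names are illustrative):
db.things.ensureIndex({uuid1: 1})
db.things.ensureIndex({uuid2: 1})
// Each field can be queried on its own, and with index intersection
// (MongoDB 2.6+) a query on both fields can combine the two indexes:
db.things.find({uuid1: "aaa-111", uuid2: "bbb-222"})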
I'd go for option 2, and here is why:
Having two separate fields, instead of one concatenated from both uuids as suggested in the first option, leaves you the flexibility to create other index combinations to support future query patterns, or to adapt if it turns out that the cardinality of one key is higher than the other.
Having non-sequential keys can help you avoid hotspots while inserting in a sharded environment, so it is not such a bad option. Sharding is, in my opinion, the best way to scale inserts and updates on collections, since write locking is at the database level (prior to 2.6) or collection level (version 2.6).
I would've gone with option 2. You can still make an index that handles both UUID fields, and performance should be the same as with a compound primary key, except it'll be much easier to work with.
Also, in my experience, I've never regretted giving something a unique ID, even if it wasn't strictly required. Perhaps that's an unpopular opinion though.