Choosing the right shard key in MongoDB

Choosing the right shard key in MongoDB - mongodb

We are building our first MongoDB and currently we are trying to choose the right shard key.
Each document in our main collection contain around 40 voice call related fields and the main field that we use in queries is the UserId field. This is why we are thinking about compound shard key of userid and CallStartTime.
They are not sure regarding the second field since StartTime is always advancing and one might argue that it is not random enough. This led us to consider replace it with UserId and hashed _id (mongo internal id after hash).
Is the first option is ok or do we better use the latter?

Consider the recommendations in the documentation here: http://docs.mongodb.org/manual/core/sharded-cluster-internals/#shard-keys
Or, if there is no natural choice, consider using a hashed shard key (mongodb 2.4+)
http://docs.mongodb.org/manual/reference/glossary/#term-hashed-shard-key

What sort of queries are you performing? What are the access patterns.
Ideally you want a key with good cardinality, write scaling and query isolation.
In your examples above you would need to know the callstarttime or hash to avoid scatter-gather operations.

Related

Uniqueness of _id within a shard

I'm looking into sharding using mongodb, and most if it is rather straight forward. I have some experience with sharding in other databases, so I'm not asking about the concept itself. There's one thing I'm confused by, and there doesn't seem to be anything in the documentation about this, so here goes.
Is _id required to be unique within the shard, regardless of shard key?
A small scale (single shard) test seems to confirm that this is the case. It does however seem like a less than stellar approach to sharding, which has me confused. To me it would make more sense to require shard-key + _id to be unique (i.e. use a compound key), or you'll have inconsistent behavior depending on where your shard-keys end up being routed to. My data model uses deterministic keys, and the shard key is an intrinsic part of it. So I guess it comes down to, did I do something wrong in my small scale test? Do I need to store the shard-key twice, once as a shard-key field and once as part of _id? Or is there some special case where I can somehow declare a compound key using shard-key and _id?
Update
For completeness, this is the trivial case I'm testing, inserting the following two documents:
{"_id": 1, "shardkey": 1}
{"_id": 1, "shardkey": 2}
First one obviously goes through, second one fails. If I would've had two shards, and the shard keys would've been routed to different shards, I assume both would've succeeded.
I can obviously just combine the shard-key and the id to create the _id field for mongodb, since this is really the key I'm using, but it seems like a weird way to approach the problem from a database architectural standpoint.

_id needs to be unique, always, whether the collection is sharded or not. The shard key does not need to be unique. It is used to split the collection into chunks which can be split onto the shards making up the database. The shard key needs to provide enough granularity to split the documents in the collection into chunks. Its obviously a good idea to link the shard key to how you query the data, and use a shard key which relates to the fields that you query on. This way the queries you run will be easily directed to the relevant shards to satisfy the query. If the shard key isnt selective enough then the query will need to go to multiple shards to find the correct documents. You can create a compound index on _id + shard-key and make it unique if you want.
I realise this doesnt fully answer the question. tbh I am struggling to understand what you're asking. Perhaps if you could post an example of the documents you're storing and the queries you're running it would help.

Generating shard key field for multi tenant mongodb app

I'm working on a multi-tenant application running on mongodb. Each tenant can create multiple applications. The schema for most of the collections reference other collections via ObjectIDs. I'm thinking of manually creating a shard key with every record insertion in the following format:
(v3 murmurhash of the record's ObjectId) + (app_id.toHexString())
Is this good enough to ensure that records for any particular application will likely end up on the same shard?
Also, what happens if a particular application grows super large compared to all others on the shard?

If you use a hash based shard key with the input constantly changing (ObjectID can generally be considered to be unique for each record), then you will get no locality of data on shards at all (except by coincidence), though it will give you great write throughput by randomly distributing writes across all shards. That's basically the trade off with this kind of approach, the same is true of the built in hash based sharding, those trade offs don't change just because it is a manual hash constructed of two fields.
Basically because MongoDB uses range based chunks to split up the data for a given shard key you will have sequential ranges of hashes used as chunks in this case. Assuming your hash is not buggy in some way, then the data in a single sequential range will basically be random. Hence, even within a single chunk you will have no data locality, let alone on a shard, it will be completely random (by design).
If you wanted to be able to have applications grouped together in ranges, and hence more likely to be on a particular shard then you would be better off to pre-pend the app_id to make it the leftmost field in a compound shard key. Something like sharding on the following would (based on the limited description) be a good start:
{app_id : 1, _id : 1}
Though the ObjectID is monotonically increasing (more discussion on that here) over time, if there are a decent number of application IDs and you are going to be doing any range based or targeted queries on the ObjectID, then it might still work well though. You may also want to have other fields included based on your query pattern.
Remember that whatever your most common query pattern is, you want to have the shard key (ideally) satisfy it if at all possible. It has to be indexed, it has be used by the mongos to decide to route the query (if not, then it is scatter/gather), so if you are going to constantly query on app_id and _id then the above shard key makes a lot of sense.
If you go with the manual hashed key approach not only will you have a random distribution, but unless you are going to be querying on that hash it's not going to be very useful.

MongoDB and composite primary keys

I'm trying to determine the best way to deal with a composite primary key in a mongo db. The main key for interacting with the data in this system is made up of 2 uuids. The combination of uuids is guaranteed to be unique, but neither of the individual uuids is.
I see a couple of ways of managing this:
Use an object for the primary key that is made up of 2 values (as suggested here)
Use a standard auto-generated mongo object id as the primary key, store my key in two separate fields, and then create a composite index on those two fields
Make the primary key a hash of the 2 uuids
Some other awesome solution that I currently am unaware of
What are the performance implications of these approaches?
For option 1, I'm worried about the insert performance due to having non sequential keys. I know this can kill traditional RDBMS systems and I've seen indications that this could be true in MongoDB as well.
For option 2, it seems a little odd to have a primary key that would never be used by the system. Also, it seems that query performance might not be as good as in option 1. In a traditional RDBMS a clustered index gives the best query results. How relevant is this in MongoDB?
For option 3, this would create one single id field, but again it wouldn't be sequential when inserting. Are there any other pros/cons to this approach?
For option 4, well... what is option 4?
Also, there's some discussion of possibly using CouchDB instead of MongoDB at some point in the future. Would using CouchDB suggest a different solution?
MORE INFO: some background about the problem can be found here

You should go with option 1.
The main reason is that you say you are worried about performance - using the _id index which is always there and already unique will allow you to save having to maintain a second unique index.
For option 1, I'm worried about the insert performance do to having
non sequential keys. I know this can kill traditional RDBMS systems
and I've seen indications that this could be true in MongoDB as well.
Your other options do not avoid this problem, they just shift it from the _id index to the secondary unique index - but now you have two indexes, once that's right-balanced and the other one that's random access.
There is only one reason to question option 1 and that is if you plan to access the documents by just one or just the other UUID value. As long as you are always providing both values and (this part is very important) you always order them the same way in all your queries, then the _id index will be efficiently serving its full purpose.
As an elaboration on why you have to make sure you always order the two UUID values the same way, when comparing subdocuments { a:1, b:2 } is not equal to { b:2, a:1 } - you could have a collection where two documents had those values for _id. So if you store _id with field a first, then you must always keep that order in all of your documents and queries.
The other caution is that index on _id:1 will be usable for query:
db.collection.find({_id:{a:1,b:2}})
but it will not be usable for query
db.collection.find({"_id.a":1, "_id.b":2})

I have an option 4 for you:
Use the automatic _id field and add 2 single field indexes for both uuid's instead of a single composite index.
The _id index would be sequential (although that's less important in MongoDB), easily shardable, and you can let MongoDB manage it.
The 2 uuid indexes let you to make any kind of query you need (with the first one, with the second or with both in any order) and they take up less space than 1 compound index.
In case you use both indexes (and other ones as well) in the same query MongoDB will intersect them (new in v2.6) as if you were using a compound index.

I'd go for the 2 option and there is why
Having two separate fields instead of the one concatenated from both uuids as suggested in 1st, will leave you the flexibility to create other combinations of indexes to support the future query requests or if turns out, that the cardinality of one key is higher then another.
having non sequential keys could help you to avoid the hotspots while inserting in sharded environment, so its not such a bad option. Sharding is the best way, for my opinion, to scale inserts and updates on the collections, since the write locking is on database level (prior to 2.6) or collection level (2.6 version)

I would've gone with option 2. You can still make an index that handles both the UUID fields, and performance should be the same as a compound primary key, except it'll be much easier to work with.
Also, in my experience, I've never regretted giving something a unique ID, even if it wasn't strictly required. Perhaps that's an unpopular opinion though.

strategy for creating MongoDB short ids that scale

I want to have a friendlier facing ids (ie Youtube style: /posts/cxB6Ey6) than MongoDB's ObjectID.
I read that for scalability its best to leave _id as an ObjectID so I thought about two solutions:
1) add an indexed postid field to each document
2) create a mapping collection between _id and the postid
in both cases use something like https://github.com/dylang/shortid to generate the short id, and while generating make sure that the id is unique by querying the database.
(can this query-generate-insert be an atomic operation?)
will those solutions have a noticeable impact on performance ?
what's the best strategy for doing this ?

The normal method of doing this is to base64 encode a unique id but:
add an indexed postid field to each document
You definitely want to go for this method. Out of the two I would say this method is easily the most scalable and performant, for one it would only need one round trip to get a short URLs details where as the second option would take 2. Another consideration is the shortage of index overhead of maintaining an extra collection, this is a bit of a no-brainer.
I would not replace the _id field within the document either since the default ObjectId could still be useful in the foreseeable future.
So this limits it down to a separate field and index (unique key) for the short code of a URL.
The next thing is that you don't want an ID which forces you to query the database for uniqueness prior to every insert. This is where the ObjectId shines. The ObjectId is good at being made within the client application while being unique in the database without having to specifically query those assumptions.
Unique ids that do not require querying the database first are normally time based. In PHP ( http://php.net/manual/en/function.uniqid.php ) and in the MongoDB Drivers ( http://docs.mongodb.org/manual/core/object-id/ ) and even the plug-in you linked on github ( https://github.com/dylang/shortid/blob/master/lib/shortid.js#L50 ) they all use time as a basis for being unique.
Considering the plug-in you linked does not query the database to check its own IDs uniqueness I would say that this plug-in probably is quite performant and if you use it with the first solution you stated you should get a good benchmark out of it.

If you want to replace build-in ObjectID with custom user-friendly short id's then do it. You can either use build-in _id field or add a new unique-indexed field id for your custom ids. The benefit of using build-in ObjectID's is than they won't duplicate even if your database is extremely large. So, by replacing them with short id's you take the risk of id duplication.
Now about the performance. I think that the best solution is not to query DB for id's, because with properly adjusted ids length the probability of duplication is extremely small. So, the best way to handle ids duplication in this model is to check Mongo responses. If it responded with "duplicate key error" then you shall generate a new one.
And now about scaling. To scale your custom ids you can just add a few more symbols to it. "Duplicate key error" shall be a trigger for making that change. Normally there shall be no such errors. So, if they started to appear then its time to scale.

I don't think that generating ObjectId for _id field affect directly scalability or performance. Whereby this can be happen?
Main difference is that ObjectIds are created by MongoDB and you don't burden yourself with responsibility for this. Otherwise you must by yourself to determine optimal size of id and to ensure unique value for each _id field of documents stored in collection. It's required because _id used as primary key. This can be justified if you have not very big collection and custom value of identifier is need for you.
But you have such additional benefits with _id field that stores ObjectId values as opportunity to create object id's from time and use this fact to your advantage in queries. Also you can get timestamp of ObjectId’s creation with getTimestamp() method. And sorting on _id in this case is equivalent to sorting by creation time.
But if you're going to use ObjectId in URLs or HTML then for security concerns you can encrypt it. To prevent leakage of information and access to object's creation time. It may be security risk.
About your solutions:
1) I suppose this's very convenient and flexible solution. In this case you can specify any value in postId which doesn't depend directly on _id.
But little disadvantage of this solution is that you have to have extra field and to create extra index. While _id is automatically indexed.
2) I don't think this's good solution from the point of view of performance and philosophy of noSQL approach.

Good Shard Keys in MongoDB

From the book Scaling MongoDB:
The general case
We can generalize this to a formula for shard keys:
{coarseLocality : 1, search : 1}
So my question is, is that correct? shouldn't be the oposite for better writing?
Also from the book:
This pattern continues: everything will always be added to the “last”
chunk, meaning everything will be added to one shard. This shard key
gives you a single, undistributable hot spot.
So saying that my app always search by user_id, and last entries in the collection.
What is the best shard key i should have, this:
{_id:1, user_id:1}
or:
{user_id:1,_id:1}

Kristina (author of Scaling MongoDB) wrote a blog post which has some example strategies explained in the guise of a game: How to Choose a Shard Key: The Card Game.
There are many considerations to choosing a good shard key based on your application requirements and use cases.
The general advice of {coarseLocality : 1, search : 1} order is to ensure there is some locality of your data for reading.
So in your case, you would most likely want: {user_id:1,_id:1}.
That will provide some locality of data for the same user_id when querying, and ideally your common queries will be able to get their data from a single shard.
The opposite order may provide for better write distribution (assuming _id is not a monotonically increasing key like a default ObjectId) but a potential downside is reliability: if your data for a read query is scattered across all shards, you will have retrieval problems if any one shard is down.
So saying that my app always search by user_id, and last entries in the collection.
If you commonly search by user_id (and without _id) this will also affect your choice of shard key and index optimization. To find the last entries MongoDB will have to do a sort; you will want to be doing that sort on a single shard rather than having to gather the data from all shards and sorting. If your _id happens to be date-based that would be beneficial as part of the shard key in order to find the last entries.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse