Generating shard key field for multi-tenant MongoDB app

I'm working on a multi-tenant application running on MongoDB. Each tenant can create multiple applications. The schemas for most of the collections reference other collections via ObjectIDs. I'm thinking of manually creating a shard key with every record insertion in the following format:
(v3 murmurhash of the record's ObjectId) + (app_id.toHexString())
Is this good enough to ensure that records for any particular application will likely end up on the same shard?
Also, what happens if a particular application grows super large compared to all others on the shard?

If you use a hash-based shard key whose input constantly changes (an ObjectID can generally be considered unique for each record), you will get no locality of data on shards at all, except by coincidence. In exchange, you get great write throughput, because writes are distributed randomly across all shards. That is the basic trade-off of this approach, and it is the same trade-off as with the built-in hash-based sharding; it doesn't change just because the hash is constructed manually from two fields.
Because MongoDB uses range-based chunks to split up the data for a given shard key, the chunks in this case will be sequential ranges of hashes. Assuming your hash is not buggy in some way, the data in any single sequential range will be effectively random. Hence you will have no data locality even within a single chunk, let alone on a shard; the distribution is completely random (by design).
If you want applications grouped together in ranges, and hence more likely to be on a particular shard, you would be better off prepending the app_id so that it is the leftmost field in a compound shard key. Based on the limited description, sharding on the following would be a good start:
{app_id : 1, _id : 1}
The ObjectID is monotonically increasing over time (more discussion on that here), but if there are a decent number of application IDs and you are going to be doing range-based or targeted queries on the ObjectID, this key might still work well. You may also want to include other fields based on your query pattern.
Remember that, whatever your most common query pattern is, you ideally want the shard key to satisfy it. The shard key has to be indexed, and it has to be used by the mongos to route the query (if it can't be, the query is scatter/gather), so if you are going to constantly query on app_id and _id then the above shard key makes a lot of sense.
If you go with the manual hashed key approach, not only will you have a random distribution, but the key will not be very useful unless you are going to be querying on that hash.
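The locality difference between the two keys can be illustrated with a small simulation. This is only a sketch under stated assumptions: md5 stands in for murmurhash, chunks are modelled as fixed-size slices of the sorted key space rather than MongoDB's real size-based chunks, and the record layout, `chunk_spread`, `hashed_key` and `compound_key` are hypothetical names invented for the example.

```python
import hashlib
from collections import defaultdict


def chunk_spread(records, key_fn, chunk_size):
    """Sort records by shard key, cut the sorted list into fixed-size range
    chunks, and return how many distinct chunks each app_id ends up in."""
    ordered = sorted(records, key=key_fn)
    chunks_per_app = defaultdict(set)
    for position, (app_id, _record_id) in enumerate(ordered):
        chunks_per_app[app_id].add(position // chunk_size)
    return {app: len(chunks) for app, chunks in chunks_per_app.items()}


def hashed_key(record):
    """(hash of the record id) + (app id as hex), as proposed in the question.
    md5 is an assumption standing in for murmurhash."""
    app_id, record_id = record
    digest = hashlib.md5(str(record_id).encode()).hexdigest()
    return digest + format(app_id, "024x")


def compound_key(record):
    """Equivalent of sharding on {app_id: 1, _id: 1}."""
    app_id, record_id = record
    return (app_id, record_id)


# Hypothetical data: 4 applications with 250 records each, unique record ids.
records = [(app, app * 1000 + n) for app in range(4) for n in range(250)]

hashed = chunk_spread(records, hashed_key, chunk_size=50)
compound = chunk_spread(records, compound_key, chunk_size=50)

print("chunks per app, hashed key:  ", hashed)
print("chunks per app, compound key:", compound)
```

With the compound key each application occupies a contiguous run of chunks, while with the hash in the leftmost position every application is scattered over most of the key space.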

Related

The point of Cosmos DB value uniqueness only per shard key (partition key)

Microsoft's documentation of Managing indexing in Azure Cosmos DB's API for MongoDB states that:
Azure Cosmos DB's API for MongoDB server version 3.6 automatically
indexes the _id field, which can't be dropped. It automatically
enforces the uniqueness of the _id field per shard key.
I'm confused about the reasoning behind the "per shard key" part. I read it as "your unique field won't be globally unique at all" because, if I understand it correctly, if I have the Guid field _id as unique and the userId field as the partition key, then I can have 2 elements with the same ID provided that they happen to belong to 2 different users.
Is it that I fail to pick the right partition key? Because in my understanding partition key should be the field that is the most frequently used for filtering the data. But what if I need to select the data from the database only by having the ID field value? Or query the data for all users?
Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?
TL;DR
Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?
Yes. Mostly.
Longer version
While the id field not being globally unique is not intuitive at first sight, it actually makes sense, considering that CosmosDB seeks unlimited scale for pinpoint GET/PUT operations. This requires the partitions to act independently, and this is where a lot of the magic comes from. If uniqueness of id (or of any other unique constraint) were enforced globally, every document change would have to coordinate with all other partitions, and that would no longer be optimal or predictable at unlimited scale.
I also think this design decision of separating data is in alignment with the schemaless, distributed mindset of CosmosDB. If you use CosmosDB, embrace this and avoid trying to force cross-document relation constraints onto it. Manage them in the data/API design and client logic layer instead, for example by using a guid for id.
About partition key..
Is it that I fail to pick the right partition key? [...] partition key should be the field that is the most frequently used for filtering the data.
It depends ;). You also have to think about worst-case query performance, not only about the "most frequently" used queries. Make sure MOST queries can go directly to the correct partition, meaning you MUST know the exact target partition key before making those queries, even for "get by id" queries. Measure the cost of the remaining cross-partition queries on a realistic data set.
It is difficult to say whether userId is a good key or not. It most likely is known in advance and could be included in get-by-id queries, so it's good in that sense. But you should also consider:
hot partition - all queries for a single user would go to a single partition, so there is no scale-out there.
partition size - a single user's data most likely grows and grows. Partitions have a maximum size limit, and working within those partitions will become costlier over time.
So, if possible, I would define smaller partitions to distribute the load further. Consider using a composite partition key or similar tactics to split a user partition into multiple smaller ones. At the very extreme, you could make id itself the partition key, which is good for writes and get-by-id but less optimal for everything else.
.. just always make sure to have the chosen partition key at hand.
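To make the "unique per partition" behaviour concrete, here is a toy in-memory model. It is not the Cosmos DB API; `PartitionedCollection` and its methods are hypothetical names, and the only point is that each partition checks uniqueness on its own, so a lookup by _id alone has to visit every partition.

```python
class PartitionedCollection:
    """Toy model: every partition enforces _id uniqueness independently."""

    def __init__(self, partition_field):
        self.partition_field = partition_field
        self.partitions = {}  # partition key value -> {_id: document}

    def insert(self, doc):
        key = doc[self.partition_field]
        partition = self.partitions.setdefault(key, {})
        if doc["_id"] in partition:
            # Only the owning partition is consulted, never the others.
            raise KeyError(f"duplicate _id {doc['_id']!r} in partition {key!r}")
        partition[doc["_id"]] = doc

    def get(self, partition_key, doc_id):
        """Point read: touches exactly one partition."""
        return self.partitions[partition_key][doc_id]

    def find_by_id(self, doc_id):
        """Lookup without the partition key: every partition is checked."""
        return [p[doc_id] for p in self.partitions.values() if doc_id in p]


items = PartitionedCollection(partition_field="userId")
items.insert({"_id": "abc", "userId": "alice", "value": 1})
items.insert({"_id": "abc", "userId": "bob", "value": 2})  # accepted: other partition

try:
    items.insert({"_id": "abc", "userId": "alice", "value": 3})
    duplicate_rejected = False
except KeyError:
    duplicate_rejected = True

print(items.get("alice", "abc"))
print(len(items.find_by_id("abc")), "documents share the same _id")
```

In this model the effective identifier is the pair (userId, _id), which is the remodelling the question describes.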

Uniqueness of _id within a shard

I'm looking into sharding using MongoDB, and most of it is rather straightforward. I have some experience with sharding in other databases, so I'm not asking about the concept itself. There's one thing I'm confused by, and there doesn't seem to be anything in the documentation about this, so here goes.
Is _id required to be unique within the shard, regardless of shard key?
A small scale (single shard) test seems to confirm that this is the case. It does however seem like a less than stellar approach to sharding, which has me confused. To me it would make more sense to require shard-key + _id to be unique (i.e. use a compound key), or you'll have inconsistent behavior depending on where your shard-keys end up being routed to. My data model uses deterministic keys, and the shard key is an intrinsic part of it. So I guess it comes down to, did I do something wrong in my small scale test? Do I need to store the shard-key twice, once as a shard-key field and once as part of _id? Or is there some special case where I can somehow declare a compound key using shard-key and _id?
Update
For completeness, this is the trivial case I'm testing, inserting the following two documents:
{"_id": 1, "shardkey": 1}
{"_id": 1, "shardkey": 2}
First one obviously goes through, second one fails. If I would've had two shards, and the shard keys would've been routed to different shards, I assume both would've succeeded.
I can obviously just combine the shard-key and the id to create the _id field for mongodb, since this is really the key I'm using, but it seems like a weird way to approach the problem from a database architectural standpoint.
_id needs to be unique, always, whether the collection is sharded or not. The shard key does not need to be unique. It is used to split the collection into chunks, which can be distributed across the shards making up the database. The shard key needs to provide enough granularity to split the documents in the collection into chunks. It's obviously a good idea to link the shard key to how you query the data, and to use a shard key which relates to the fields that you query on. This way the queries you run will be easily directed to the relevant shards. If the shard key isn't selective enough, then the query will need to go to multiple shards to find the correct documents. You can create a compound index on _id + shard-key and make it unique if you want.
I realise this doesn't fully answer the question. To be honest, I am struggling to understand what you're asking. Perhaps if you could post an example of the documents you're storing and the queries you're running, it would help.
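The workaround mentioned in the question, combining the shard key and the id into the _id field, could look roughly like the sketch below. The helper name `make_compound_id` and the string format with a separator are assumptions made for illustration; nothing here talks to MongoDB.

```python
def make_compound_id(shard_key, local_id, separator=":"):
    """Build a deterministic _id string that embeds the shard key.

    Assumes neither part contains the separator; otherwise two different
    (shard_key, local_id) pairs could collapse into the same string.
    """
    shard_part, local_part = str(shard_key), str(local_id)
    if separator in shard_part or separator in local_part:
        raise ValueError("parts must not contain the separator")
    return f"{shard_part}{separator}{local_part}"


# The two documents from the test case, keyed by the compound value instead.
docs = [
    {"_id": make_compound_id(1, 1), "shardkey": 1},
    {"_id": make_compound_id(2, 1), "shardkey": 2},
]
unique_ids = {doc["_id"] for doc in docs}

print(docs)
```

Because the compound value differs whenever the shard key differs, both documents can coexist no matter which shard each one is routed to.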

MongoDB shard key

I've been thinking about selecting the best shard key (through a compound index) for my data and thought that the document creation date combined with a customer no. (or invoice no.) would be a good combination, if MongoDB would treat the customer no. as a reversed string, i.e.:
90043 => 34009
90044 => 44009
90045 => 54009
etc.
An index on the creation date would ensure that relatively new data are kept in memory, and the reversed customer no. would help MongoDB to distribute the data/load across the cluster.
Is this a correct assumption? And if so, would I need to save my customer no. reversed for it to be distributed the way I expect?
Regarding your specific question of "would I need to save my customer no reversed for it to be distributed the way I expect?", no - you would not.
Even with the relatively narrow spread of customer number values you listed, if you use customerNumber in your compound key, MongoDB will break apart the data into chunks and distribute these accordingly. As long as the data associated with customerNumber are relatively evenly distributed (e.g., one user doesn't dominate the system), you will get the shard balancing you desire.
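A rough way to see why the reversal is unnecessary: chunk boundaries are taken from the data itself, so a narrow, sequential range of customer numbers splits just as evenly as the reversed strings do. The sketch below is a simplification that splits by document count (MongoDB splits by chunk size), and `split_points` and `chunk_sizes` are hypothetical helpers.

```python
from bisect import bisect_right


def split_points(keys, docs_per_chunk):
    """Pick chunk boundaries from the data: every docs_per_chunk-th sorted key."""
    ordered = sorted(keys)
    return ordered[docs_per_chunk::docs_per_chunk]


def chunk_sizes(keys, boundaries):
    """Count how many keys fall into each [lower, upper) chunk range."""
    sizes = [0] * (len(boundaries) + 1)
    for key in keys:
        sizes[bisect_right(boundaries, key)] += 1
    return sizes


# Hypothetical customer numbers in a narrow, sequential range.
customers = [str(n) for n in range(90000, 90100)]
reversed_customers = [c[::-1] for c in customers]

plain_sizes = chunk_sizes(customers, split_points(customers, 25))
reversed_sizes = chunk_sizes(reversed_customers, split_points(reversed_customers, 25))

print("plain:   ", plain_sizes)
print("reversed:", reversed_sizes)
```

Both variants end up with equally sized chunks; what matters is that no single customer dominates the data, not how the digits are ordered.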
I would consider either your original choice (minus the string reversal) or Dan's choice (using the built-in ObjectId instead of timestamp) as good candidates for your compound key.
From what I have read in the documentation, the MongoId is already time-based.
Therefore you can add the _id to your compound key like this: (_id, customerid). If you don't need the date in your application, you can just drop the field, which would save you some storage.
MongoDB stores the datasets recently used in memory.
MongoDB will always try to keep the indexes of a collection in RAM.
When an index is too large to fit into RAM, MongoDB must read the
index from disk, which is a much slower operation than reading from
RAM. Keep in mind an index fits into RAM when your server has RAM
available for the index combined with the rest of the working set.
Hope this helps.
Cheers dan
I think the issue with your thinking is that, somehow, you feel Node 1 would be faster than Node 2. Unless the hardware is drastically different, Node 1 and Node 2 would be accessed equally fast, and thus reversing the strings would not help you out.
The main issue I see has to do with the number of customers in your system. This can lead to monotonic sharding wherein the last shard is the one always being hit and that can cause excessive splitting and migration. If you have a large number of customers then there is no issue, otherwise you might want to add another key on top of the customer id and date fields to more evenly divide up your content. I have heard of people using random identifiers, hashing the _id or using a GUID to overcome this issue.

Good Shard Keys in MongoDB

From the book Scaling MongoDB:
The general case
We can generalize this to a formula for shard keys:
{coarseLocality : 1, search : 1}
So my question is: is that correct? Shouldn't it be the opposite for better writes?
Also from the book:
This pattern continues: everything will always be added to the “last”
chunk, meaning everything will be added to one shard. This shard key
gives you a single, undistributable hot spot.
So, say that my app always searches by user_id and for the last entries in the collection.
What is the best shard key I should have, this:
{_id:1, user_id:1}
or:
{user_id:1,_id:1}
Kristina (author of Scaling MongoDB) wrote a blog post which has some example strategies explained in the guise of a game: How to Choose a Shard Key: The Card Game.
There are many considerations to choosing a good shard key based on your application requirements and use cases.
The general advice of {coarseLocality : 1, search : 1} order is to ensure there is some locality of your data for reading.
So in your case, you would most likely want: {user_id:1,_id:1}.
That will provide some locality of data for the same user_id when querying, and ideally your common queries will be able to get their data from a single shard.
The opposite order may provide for better write distribution (assuming _id is not a monotonically increasing key like a default ObjectId) but a potential downside is reliability: if your data for a read query is scattered across all shards, you will have retrieval problems if any one shard is down.
So, say that my app always searches by user_id and for the last entries in the collection.
If you commonly search by user_id (and without _id) this will also affect your choice of shard key and index optimization. To find the last entries MongoDB will have to do a sort; you will want to be doing that sort on a single shard rather than having to gather the data from all shards and sorting. If your _id happens to be date-based that would be beneficial as part of the shard key in order to find the last entries.
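The effect of field order on query routing can be sketched with a small simulation that counts how many shards hold documents for one user_id under each key order. Everything here is an illustrative assumption: the documents, the fixed-size chunks, the round-robin placement of chunks on shards, and the helper name `shards_touched`.

```python
def shards_touched(docs, key_fn, user_id, docs_per_chunk=100, num_shards=4):
    """Sort docs by the shard key, cut them into fixed-size range chunks,
    place chunks on shards round-robin, and count the shards that hold
    at least one document for the given user_id."""
    ordered = sorted(docs, key=key_fn)
    shards = set()
    for position, doc in enumerate(ordered):
        if doc["user_id"] == user_id:
            chunk = position // docs_per_chunk
            shards.add(chunk % num_shards)
    return len(shards)


# Hypothetical data: 8 users whose documents are interleaved over time,
# with _id increasing in insertion order.
docs = [{"_id": i, "user_id": i % 8} for i in range(800)]

user_first = shards_touched(docs, lambda d: (d["user_id"], d["_id"]), user_id=3)
id_first = shards_touched(docs, lambda d: (d["_id"], d["user_id"]), user_id=3)

print("shards to visit with {user_id: 1, _id: 1}:", user_first)
print("shards to visit with {_id: 1, user_id: 1}:", id_first)
```

With user_id leading, one user's documents sit together and a query (including a sort on _id for the last entries) can be served by one shard; with _id leading, the same user's documents are spread over every shard.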

Why does ObjectId make sharding easier in MongoDB?

I keep reading that using an ObjectId as the unique key makes sharding easier, but I haven't seen a relatively detailed explanation as to why that is. Could someone shed some light on this?
The reason I ask is that I want to use an English string (which will be unique, obviously) as the unique key, but want to make sure that it won't tie my hands later on.
I've just recently been getting familiar with MongoDB myself, so take this with a grain of salt, but I suspect that sharding is probably more efficient when using ObjectId rather than your own key values because part of the ObjectId will point out which machine or shard the document was created on. The bottom of this page in the mongo docs explains what each portion of the ObjectId means.
I asked this question on Mongo user list and basically the reply was that it's OK to generate your own value of _id and it will not make sharding more difficult. For me sometimes it's necessary to have numeric values on _id like when I'm going to use them in url, so I'm generating my own _id in some collections.
ObjectId is designed to be globally unique. So, when used as a primary key and a new record is appended to the dataset without primary key value, then each shard can generate a new objectid and not worry about collisions with other shards. This somewhat simplifies life for everyone :)
A shard key does not have to be unique. We can't conclude that sharding a collection on the ObjectId is always efficient.
Actually, ObjectID is probably a poor choice for a shard key.
From the docs (http://docs.mongodb.org/manual/core/sharded-cluster-internals/ the section on "Write Scaling"):
"[T]he most significant bits of [an ObjectID] represent a time stamp, which means that they increment in a regular and predictable pattern. [Therefore] all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster."
In other words, because every OID sorts "bigger" than the one created immediately before it, any inserts that are keyed by OID will land on the same machine, and the write I/O capacity of that one machine will be the total I/O of your entire cluster. (This is true not just of OIDs, but of any predictable key -- timestamps, autoincrementing numbers, etc.)
Contrariwise, if you chose a random string as your shard key, writes would tend to distribute evenly over the cluster, and your throughput would be the total I/O of the whole cluster.
(EDIT to be complete: with an OID shard key, as new records landed on the "rightmost" shard, the balancer would handle moving them elsewhere, so they would eventually end up on other machines. But that doesn't solve the I/O problem; it actually makes it worse.)
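The hot-spot effect can be reproduced with a short simulation. The sketch builds 12-byte values laid out like an ObjectId (4-byte big-endian timestamp, 5 bytes that stay constant per process, 3-byte counter) without using any MongoDB driver, and compares them with random 12-byte keys. The chunk model (boundaries taken from quartiles of the existing keys) and the names `fake_object_id` and `chunk_hits` are assumptions made for illustration.

```python
import os
import random
import struct
from bisect import bisect_right
from collections import Counter

PROCESS_RANDOM = os.urandom(5)  # stays constant for the lifetime of the process


def fake_object_id(timestamp, counter):
    """12 bytes laid out like an ObjectId: timestamp, process bytes, counter."""
    return struct.pack(">I", timestamp) + PROCESS_RANDOM + counter.to_bytes(3, "big")


def chunk_hits(existing_keys, new_keys, num_chunks=4):
    """Derive chunk boundaries from the existing keys, then count which
    chunk each new key would be inserted into."""
    ordered = sorted(existing_keys)
    step = len(ordered) // num_chunks
    boundaries = [ordered[i * step] for i in range(1, num_chunks)]
    return Counter(bisect_right(boundaries, key) for key in new_keys)


# Time-ordered keys: every new id sorts after all existing ones.
existing_ids = [fake_object_id(1_700_000_000 + n, n) for n in range(1000)]
new_ids = [fake_object_id(1_700_001_000 + n, 1000 + n) for n in range(200)]
oid_hits = chunk_hits(existing_ids, new_ids)

# Random keys: new values fall anywhere in the key space.
rng = random.Random(42)
existing_random = [rng.getrandbits(96).to_bytes(12, "big") for _ in range(1000)]
new_random = [rng.getrandbits(96).to_bytes(12, "big") for _ in range(200)]
random_hits = chunk_hits(existing_random, new_random)

print("inserts per chunk, ObjectId-like keys:", dict(oid_hits))
print("inserts per chunk, random keys:       ", dict(random_hits))
```

All inserts with the time-ordered keys land in the last chunk, whereas the random keys are spread over every chunk, which mirrors the write-scaling argument quoted from the docs.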