MongoDB shard key

I've been thinking about selecting the best shard key (through a compound index) for my data, and I thought the document creation date combined with a customer no. (or invoice no.) would be a good choice, if MongoDB would treat the customer no. as a string reversed, i.e.:
90043 => 34009
90044 => 44009
90045 => 54009
etc.
An index on the creation date would ensure that relatively new data is kept in memory, and the reversed customer no. would help MongoDB distribute the data/load across the cluster.
Is this a correct assumption? And if so, would I need to save my customer no. reversed for it to be distributed the way I expect?

Regarding your specific question of "would I need to save my customer no reversed for it to be distributed the way I expect?", no - you would not.
Even with the relatively narrow spread of customer number values you listed, if you use customerNumber in your compound key, MongoDB will break apart the data into chunks and distribute these accordingly. As long as the data associated with customerNumber are relatively evenly distributed (e.g., one user doesn't dominate the system), you will get the shard balancing you desire.
I would consider either your original choice (minus the string reversal) or Dan's choice (using the built-in ObjectId instead of timestamp) as good candidates for your compound key.
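For illustration, a minimal mongo shell sketch of the first option; the database, collection, and field names here are assumptions, not taken from the question:
// Enable sharding on the (hypothetical) database, then shard on the compound key.
sh.enableSharding("mydb")
use mydb
// A supporting index is required if the collection already contains data.
db.invoices.createIndex({ creationDate: 1, customerNumber: 1 })
sh.shardCollection("mydb.invoices", { creationDate: 1, customerNumber: 1 })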

From what I have read in the documentation, the MongoDB ObjectId is already time-based.
Therefore you can add the _id to your compound key like this: (_id, customerid). If you don't need the date in your application, you can just drop that field, which would save you some storage.
MongoDB keeps the most recently used data in memory, and it will always try to keep a collection's indexes in RAM:
When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
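As a rough way to check whether your indexes are likely to fit in RAM, you can compare index sizes with available memory from the mongo shell (a sketch; the collection name is an assumption):
db.invoices.totalIndexSize()      // total size of all indexes on the collection, in bytes
db.invoices.stats().indexSizes    // per-index breakdown
db.serverStatus().mem             // memory currently used by mongod, in MB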
Hope this helps.
Cheers, Dan

I think the issue with your thinking is that, somehow, you feel Node 1 would be faster than Node 2. Unless the hardware is drastically different, Node 1 and Node 2 will be accessed equally fast, so reversing the strings would not help you.
The main issue I see has to do with the number of customers in your system. This can lead to monotonic sharding, wherein the last shard is the one always being hit, and that can cause excessive splitting and migration. If you have a large number of customers then there is no issue; otherwise you might want to add another key on top of the customer id and date fields to divide up your content more evenly. I have heard of people using random identifiers, hashing the _id, or using a GUID to overcome this issue.
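"Hashing the _id" can also be done with MongoDB's built-in hashed sharding rather than a hand-rolled field. A sketch, with assumed names:
// Writes are spread pseudo-randomly across chunks; the trade-off is that
// range queries on _id can no longer be targeted to a single shard.
sh.shardCollection("mydb.orders", { _id: "hashed" })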

The point of Cosmos DB value uniqueness only per shard key (partition key)

Microsoft's documentation of Managing indexing in Azure Cosmos DB's API for MongoDB states that:
Azure Cosmos DB's API for MongoDB server version 3.6 automatically indexes the _id field, which can't be dropped. It automatically enforces the uniqueness of the _id field per shard key.
I'm confused about the reasoning behind the "per shard key" part. I read it as "your unique field won't be globally unique at all", because if I understand it correctly, if I have the Guid field _id as unique and the userId field as the partition key, then I can have 2 elements with the same ID provided that they happen to belong to 2 different users.
Is it that I fail to pick the right partition key? Because in my understanding partition key should be the field that is the most frequently used for filtering the data. But what if I need to select the data from the database only by having the ID field value? Or query the data for all users?
Are these the inherent limits of distributed systems that I need to accept, and should I therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query the data in this collection not only by the _id field but first by the userId field? And not treat my _id field alone as an identifier, but rather see the identifier as a compound of userId and _id?
TL;DR
Are these the inherent limits of distributed systems that I need to accept, and should I therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query the data in this collection not only by the _id field but first by the userId field? And not treat my _id field alone as an identifier, but rather see the identifier as a compound of userId and _id?
Yes. Mostly.
Longer version
While this _id field not being unique is not intuitive at first sight, it actually makes sense, considering CosmosDB seeks unlimited scale for pinpoint GET/PUT operations. This requires the partitions to act independently, and this is where a lot of the magic comes from. If the uniqueness of id or other unique constraints were enforced globally, then every document change would have to coordinate with all other partitions, and that would no longer be optimal or predictable at endless scale.
I also think this design decision of separating data is in alignment with the schemaless, distributed mindset of CosmosDB. If you use CosmosDB, then embrace this and avoid trying to force cross-document relational constraints onto it. Manage them in the data/API design and client logic layers instead. For example, by using a guid for id.
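For instance, generating the id client-side as a GUID while still scoping documents by the partition key might look like this through the MongoDB API (collection and field names are assumptions):
// _id is a client-generated GUID; userId is the partition key.
// Uniqueness of _id is only enforced within each userId partition.
db.items.insertOne({ _id: UUID(), userId: "user-1", payload: { } })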
About partition key..
Is it that I fail to pick the right partition key? [...] partition key should be the field that is the most frequently used for filtering the data.
It depends ;). You also have to think about worst-case query performance, not only the "most frequently" used queries. Make sure MOST queries can go directly to the correct partition, meaning you MUST know the exact target partition key before making those queries, even for the "get by id" queries. Measure the cost of the remaining cross-partition queries on a realistic data set.
It is difficult to say whether userId is a good key or not. It most likely is known in advance and could be included in get-by-id queries, so it's good in that sense. But you should also consider:
hot partition - all queries for a single user would go to a single partition, so no scale-out there.
partition size - a single user's data most likely grows and grows and grows. Partitions have a max size limit, and working within those large partitions will become costlier over time.
So, if possible, I would define smaller partitions to distribute the load further. Maybe consider using a composite partition key or similar tactics to split a user partition into multiple smaller ones. Or go to the extreme of making id itself the partition key, which is good for writes and get-by-id but less optimal for everything else.
... just always make sure to have the chosen partition key at hand.
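In practice that means always including the partition key in the filter so the query can be routed to a single partition. A sketch with assumed names (someId stands for a previously stored _id value):
db.items.findOne({ userId: "user-1", _id: someId })   // targeted: routed to one partition
db.items.findOne({ _id: someId })                     // cross-partition: fans out and costs more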

MongoDB Sharding Key

We have a large MongoDB collection that we'd like to start sharding. The collection has 3.4B records and is ~14.6TB in size (5.3TB compressed on disk). This collection typically sees writes on the order of ~5M per hour, but we expect this to continue to grow year over year. The indexes on this collection are ~220GB in size.
All records have a feedId and all queries will be for records belonging to a specific feedId. There are currently ~200 unique feedId values, but the distribution across each value is highly non-linear. On the low end, some feedIds may only see dozens of records per day. On the other hand, the top 5 feedIds make up ~75% of the dataset.
Records also have a timestamp and queries will always be for a given date range. The timestamp field is more-or-less monotonic.
There is already an existing compound index on feedId and timestamp.
The typical working set for this collection is only the last few weeks' worth of data, and is therefore only a very small percentage of the actual data. Queries for this data must be very fast, with slower queries for the historical data being acceptable. As such, we're planning to use "tags" and/or "zones" to move older data to nodes with larger HDDs and use nodes with SSDs for the "hot" data.
Based on these factors, is using a shard key of {feedId: 1, timestamp: 1} reasonable? My feeling is that it may lead to "hot" nodes due to the non-linearity of feedId and the monotonic nature of timestamp. Would adding a "hashed" field to the key make it better/worse?
So let's take this bit by bit!
The collection has 3.4B records and is ~14.6TB in size (5.3TB compressed on disk)
The nature of sharding is such that it's important to get this right the first time through. I'm going to go into more detail here, but the TL;DR is:
Extract a portion of your dataset (e.g. using mongodump --query) to a staging cluster (e.g. using mongorestore); a rough sketch follows this list
Point a sample workload at the staging cluster to emulate your production environment
Test one or more shard key combinations. Dump/reload as needed until you're satisfied with performance.
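A rough sketch of the first step; every host name, path, and the query itself are assumptions:
# Dump only a recent slice of the collection from production
mongodump --host prod-host --db mydb --collection events \
  --query '{ "timestamp": { "$gte": { "$date": "2018-01-01T00:00:00Z" } } }' \
  --out /backups/staging-sample
# Restore that sample into the staging cluster
mongorestore --host staging-mongos --db mydb /backups/staging-sample/mydb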
Now, let's dig in:
There are currently ~200 unique feedId values, but the distribution across each value is highly non-linear. On the low end, some feedId's may only see dozens of records per day. On the other hand, the top 5 feedId's make up ~75% of the dataset.
So one field that supports a good chunk of your queries has pretty low cardinality and a very skewed distribution. You are definitely likely to see hotspotting if you were sharding on this field alone.
Records also have a timestamp and queries will always be for a given date range. The timestamp field is more-or-less monotonic.
So another field that supports the majority of your queries, but one that is also not great for sharding on its own.
Records also have a timestamp and queries will always be for a given date range. The timestamp field is more-or-less monotonic.
This to me kind of implies that the primary field you're querying against is time-based: for a given period of time, give me the documents with the specified feedId. You're also going to get targeted queries, because you're querying on the shard key more often than not (e.g. either on a range of time, or a range of time + the feedId).
This also supports your idea for zoning:
As such, we're planning to use "tags" and/or "zones" to move older data to nodes with larger HDD's and use nodes with SSD's for the "hot" data.
With zoning, you can use any key in the shard key, as long as you include the entire prefix leading up to that key. So { feedId: 1, timestamp: 1 } would principally support zones on feedId and timestamp, which isn't quite what you are looking for.
Based on that alone, I would venture that { timestamp: 1, feedId: 1 } would be a good selection. What your testing would need to look into is whether adding a low-frequency field to a monotonically increasing field provides good chunk distribution.
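A zoning sketch for { timestamp: 1, feedId: 1 }; the zone names, shard names, and cut-off date are assumptions, and with a time-based boundary the zone ranges have to be moved forward periodically:
// Assign shards to zones (SSD-backed vs HDD-backed)
sh.addShardToZone("shardSSD1", "hot")
sh.addShardToZone("shardHDD1", "cold")
// Everything older than the cut-off lives in the "cold" zone, the rest in "hot"
sh.updateZoneKeyRange("mydb.events",
  { timestamp: MinKey, feedId: MinKey },
  { timestamp: ISODate("2018-01-01"), feedId: MinKey },
  "cold")
sh.updateZoneKeyRange("mydb.events",
  { timestamp: ISODate("2018-01-01"), feedId: MinKey },
  { timestamp: MaxKey, feedId: MaxKey },
  "hot")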
Now, as far as hashing:
Would adding a "hashed" field to the key make it better/worse?
If you mean your documents already have some hashed field, then you could definitely add that just for randomness. But if you're talking about a hashed shard key, then that's a different story.
Zones and hashed shard keys don't play together. The nature of a hashed shard key means that the chunk ranges (and therefore zones) represent the hashed shard key values. So even if you have two documents with values that are very near to each other, they are likely to end up on completely different chunks. So creating a zone on a range of hashed shard key values probably won't do what you want it to do. You could do something like using zones with hashed sharding to move the entire collection onto a subset of shards in the cluster, but that's not what you want to do.
Now there is one critical issue you might run into - you have a huge collection. Your choice of shard key might cause issues for the initial split where MongoDB attempts to divide your data into chunks. Please take a look at the following section in our documentation: Sharding an Existing Collection. There is a formula there for you to use to estimate the max collection size your shard key can support with the configured chunk size (64MB by default). I'm going to guess that you'll need to increase your chunk size to 128MB or possibly 256MB initially. This is only required for the initial sharding procedure. Afterwards you can reduce chunk size back to defaults and let MongoDB handle the rest.
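For reference, the chunk size is changed in the config database from a mongos (a sketch following the documented pattern):
// Temporarily raise the chunk size to 128 MB for the initial sharding pass
use config
db.settings.save({ _id: "chunksize", value: 128 })
// ...and later drop it back to the 64 MB default
db.settings.save({ _id: "chunksize", value: 64 })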
Mind you, this is going to have a performance impact. You'll have chunks migrating across shards, plus the overhead for the actual chunk split. I would recommend you post to our Google Group for more specific guidance here.

Generating shard key field for multi tenant mongodb app

I'm working on a multi-tenant application running on mongodb. Each tenant can create multiple applications. The schema for most of the collections reference other collections via ObjectIDs. I'm thinking of manually creating a shard key with every record insertion in the following format:
(v3 murmurhash of the record's ObjectId) + (app_id.toHexString())
Is this good enough to ensure that records for any particular application will likely end up on the same shard?
Also, what happens if a particular application grows super large compared to all others on the shard?
If you use a hash-based shard key with the input constantly changing (an ObjectID can generally be considered to be unique for each record), then you will get no locality of data on shards at all (except by coincidence), though it will give you great write throughput by randomly distributing writes across all shards. That's basically the trade-off with this kind of approach; the same is true of the built-in hash-based sharding, and those trade-offs don't change just because it is a manual hash constructed from two fields.
Basically, because MongoDB uses range-based chunks to split up the data for a given shard key, you will have sequential ranges of hashes used as chunks in this case. Assuming your hash is not buggy in some way, the data in a single sequential range will basically be random. Hence, even within a single chunk you will have no data locality, let alone on a shard; it will be completely random (by design).
If you wanted applications grouped together in ranges, and hence more likely to be on a particular shard, then you would be better off prepending the app_id to make it the leftmost field in a compound shard key. Something like sharding on the following would (based on the limited description) be a good start:
{app_id : 1, _id : 1}
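As a sketch (the database and collection names are assumptions):
// A supporting index is required before sharding a non-empty collection
db.records.createIndex({ app_id: 1, _id: 1 })
sh.shardCollection("mydb.records", { app_id: 1, _id: 1 })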
Though the ObjectID is monotonically increasing over time (more discussion on that here), if there is a decent number of application IDs and you are going to be doing any range-based or targeted queries on the ObjectID, it might still work well. You may also want to include other fields based on your query pattern.
Remember that whatever your most common query pattern is, you want the shard key to (ideally) satisfy it if at all possible. It has to be indexed, and it has to be usable by the mongos to route the query (if not, the query is scatter/gather), so if you are going to constantly query on app_id and _id then the above shard key makes a lot of sense.
If you go with the manual hashed key approach not only will you have a random distribution, but unless you are going to be querying on that hash it's not going to be very useful.

Use Native Java UUID.getRandom() as Sharding key and as the _id?

I understand that with a write-heavy application, using the ObjectId is a really bad idea for a sharding key. However, would it be a good idea to use a native UUID.randomUUID() from Java as a shard key, since such UUIDs are truly random and won't cause hotspotting on a single shard?
These IDs are 128 bit ID and look like :
5842fa92557947f1b020041ff74868a4
308947443e564d80b97dd8411b4b727e
f8a7ee765bed4ce3bcc5800ac3a2a710
1bcfd08b89e94c58ae7695b3e7a1bc4f
It's very similar to an ObjectId (a 96-bit value).
Plus, since an index on _id is mandatory, using it as the shard key means we would save the RAM that another index on a separate shard key would need. Every collection would be ready for sharding.
Is it a performance issue within mongod, or a disk/RAM space problem?
The collision rate for a UUID is (from Wikipedia):
only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
Using a UUID would distribute your write access across the shards, but you would have no query isolation, so you'll get less than optimum results from your queries. The fastest queries are the ones answered by only one shard.
http://docs.mongodb.org/manual/core/sharding-internals/#sharding-shard-key-query-isolation
It would help to know what is in your collection so we could help you more efficiently.
Using UUIDs is perfectly ok (provided that you are only going to lookup those documents by their primary/shard key). One of the purposes of shard key is to group related documents together. If we're building, say, flickr, our shard key would start with user_id, so that photos of a user sit together on one shard. If your documents are not related and primary key is also a shard key, then there's no problem.
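A minimal sketch of that setup in the mongo shell (the collection name is an assumption; UUID() generates a random UUID):
// The random UUID serves as both _id and shard key: no extra index is needed,
// writes spread across shards, but lookups must supply the _id.
sh.shardCollection("mydb.photos", { _id: 1 })
db.photos.insertOne({ _id: UUID(), owner: "user-1", title: "sunset" })
db.photos.findOne({ _id: someUuid })   // someUuid: the UUID saved at insert time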
You may run into a problem due to https://jira.mongodb.org/browse/JAVA-403, which is slated to be fixed in the next release.

Why does ObjectId make sharding easier in MongoDB?

I keep reading that using an ObjectId as the unique key makes sharding easier, but I haven't seen a relatively detailed explanation as to why that is. Could someone shed some light on this?
The reason I ask is that I want to use an English string (which will obviously be unique) as the unique key, but I want to make sure that it won't tie my hands later on.
I've just recently been getting familiar with MongoDB myself, so take this with a grain of salt, but I suspect that sharding is probably more efficient when using an ObjectId rather than your own key values, because part of the ObjectId will point out which machine or shard the document was created on. The bottom of this page in the Mongo docs explains what each portion of the ObjectId means.
I asked this question on the Mongo user list and basically the reply was that it's OK to generate your own value of _id and it will not make sharding more difficult. For me it's sometimes necessary to have numeric values in _id, like when I'm going to use them in a URL, so I generate my own _id in some collections.
ObjectId is designed to be globally unique. So, when it is used as a primary key and a new record is appended to the dataset without a primary key value, each shard can generate a new ObjectId and not worry about collisions with other shards. This somewhat simplifies life for everyone :)
A shard key does not have to be unique. We can't conclude that sharding a collection based on ObjectId is always efficient.
Actually, ObjectID is probably a poor choice for a shard key.
From the docs (http://docs.mongodb.org/manual/core/sharded-cluster-internals/, the section on "Write Scaling"):
"[T]he most significant bits of [an ObjectID] represent a time stamp, which means that they increment in a regular and predictable pattern. [Therefore] all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster."
In other words, because every OID sorts "bigger" than the one created immediately before it, inserts that are keyed by OID will all land on the same machine, and the write I/O capacity of that one machine will be the total I/O of your entire cluster. (This is true not just of OIDs, but of any predictable key -- timestamps, autoincrementing numbers, etc.)
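You can see the embedded timestamp in the shell, which is why consecutive ObjectIds sort next to each other and pile into the same chunk (a quick illustration):
// The leading 4 bytes of an ObjectId are a Unix timestamp
var id = ObjectId()
id.getTimestamp()   // returns an ISODate close to the current time
// Two ObjectIds created back to back differ mainly in their trailing bytes,
// so they fall into the same (right-most) chunk range.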
Contrariwise, if you chose a random string as your shard key, writes would tend to distribute evenly over the cluster, and your throughput would be the total I/O of the whole cluster.
(EDIT to be complete: with an OID shard key, as new records landed on the "rightmost" shard, the balancer would handle moving them elsewhere, so they would eventually end up on other machines. But that doesn't solve the I/O problem; it actually makes it worse.)