Suppose I have a very large table that needs to be sharded across an RDBMS cluster, and I need to decide on the partitioning key to shard it on. This partition key can't be an artificial key (for example, an auto-generated primary key column), because the application has to figure out the target shard from the natural key in the request data. Consider the following situation:
If the natural key is not evenly distributed in the system:
a) Is it a good idea to even consider this table for sharding?
b) Is there a way to generate a GUID based on the natural key and distribute it evenly across the cluster?
c) If so, what would be an efficient algorithm to generate a GUID based on the natural key?
If the key is not evenly distributed, it may make little difference whether the table is partitioned or not: the database will still have to read almost the same number of rows to fulfil the query. Remember, partitioning will not always increase performance; reading across partitions may well be slower. So make sure you analyse all the query needs before selecting the partition key.
I can't think of a function that can generate a good partition key for this case. There are functions to generate GUIDs or MD5 hashes from your data, but the result will be worse than the natural key you already have, since the values will tend toward being unique. It will also hurt performance, because every request has to run that additional logic.
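For illustration only, this is the kind of logic such a function would add to every request (the shard count and names below are made up); note that hashing only spreads distinct key values evenly and does nothing for a skew where a few values dominate.

import hashlib
import uuid

NUM_SHARDS = 16  # hypothetical cluster size

def shard_for(natural_key: str) -> int:
    # Hash the natural key and map it onto one of the shards.
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def guid_for(natural_key: str) -> uuid.UUID:
    # UUIDv5 is deterministic: the same natural key always yields the same GUID.
    return uuid.uuid5(uuid.NAMESPACE_OID, natural_key)

print(shard_for("customer-42"), guid_for("customer-42"))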
Also consider purging old or unused data. Once that is done, you might not need partitioning at all.
Consider the following situation:
I have a large PostgreSQL table with a primary key of type UUID. The UUIDs are generated randomly and spread uniformly across the UUID space.
I partition the table on this UUID column into 256 ranges (e.g. based on the first 8 bits of the UUID).
All partitions are stored on the same physical disk.
Basically this means all 256 partitions will be used equally (unlike with time-based partitioning, where the most recent partition would normally be hotter than the others).
Will I see any performance improvement at all by doing this type of partitioning:
For queries based on the UUID, returning a single row (WHERE uuid_key = :id)?
For other queries that must search all partitions?
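For concreteness, this is the mapping I have in mind (illustrative only): each row's UUID falls into one of the 256 partitions according to its first byte.

import uuid

def partition_index(key: uuid.UUID) -> int:
    # First byte = first 8 bits, giving a partition number in 0..255.
    return key.bytes[0]

k = uuid.uuid4()
print(k, "-> partition", partition_index(k))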
Most queries will become slower. For example, if you search by uuid_key, the optimizer has to determine which partition to search, and that cost grows with the number of partitions. The index scan itself will not be notably faster on a small table than on a big one.
You could benefit if you have several tables partitioned alike and you join them on the partitioning key, so that you get a partitionwise join (but remember to set enable_partitionwise_join = on). There are similar speed gains for partitionwise aggregates.
Even though you cannot expect a performance gain for your query, partitioning may still have its use, for example if you need several autovacuum workers to process a single table.
Will I see any performance improvement at all by doing this type of partitioning:
For queries based on the UUID, returning a single row (WHERE uuid_key = :id)?
Yes: PostgreSQL will search only the right partition. You can also gain performance on inserts and updates by reducing page contention.
For other queries that must search all partitions?
Not really, but index design can minimize the problem.
Microsoft's documentation of Managing indexing in Azure Cosmos DB's API for MongoDB states that:
Azure Cosmos DB's API for MongoDB server version 3.6 automatically indexes the _id field, which can't be dropped. It automatically enforces the uniqueness of the _id field per shard key.
I'm confused about the reasoning behind the "per shard key" part. I read it as "your unique field won't be globally unique at all": if I understand correctly, with a GUID field _id marked unique and a userId field as the partition key, I can have 2 elements with the same ID, provided they happen to belong to 2 different users.
Is it that I failed to pick the right partition key? In my understanding, the partition key should be the field most frequently used for filtering the data. But what if I need to select data from the database when I only have the ID field value? Or query the data across all users?
Is this an inherent limit of distributed systems that I need to accept, and should I therefore remodel my process of designing a database and programming access to it? Which in this case would be: ALWAYS query data from this collection not only by the _id field but first by the userId field, and stop treating the _id field alone as an identifier, seeing the identifier instead as a compound of userId and _id?
TL;DR
Is this an inherent limit of distributed systems that I need to accept, and should I therefore remodel my process of designing a database and programming access to it? Which in this case would be: ALWAYS query data from this collection not only by the _id field but first by the userId field, and stop treating the _id field alone as an identifier, seeing the identifier instead as a compound of userId and _id?
Yes. Mostly.
Longer version
While the id field not being globally unique is unintuitive at first sight, it actually makes sense, considering that Cosmos DB seeks unlimited scale for pinpoint GET/PUT operations. This requires the partitions to act independently, which is where a lot of the magic comes from. If the uniqueness of id or any other unique constraint were enforced globally, then every document change would have to coordinate with all other partitions, and that would no longer be optimal or predictable at endless scale.
I also think this design decision to separate data aligns with the schemaless, distributed mindset of Cosmos DB. If you use Cosmos DB, embrace this and avoid trying to force cross-document relational constraints onto it. Manage them in the data/API design and client logic layer instead, for example by using a GUID for id.
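As a sketch of what treating the identifier as the pair (userId, _id) looks like in client code, using pymongo against the Mongo API (connection string, database, collection, and field names here are hypothetical):

from pymongo import MongoClient

client = MongoClient("<connection-string>")   # placeholder
orders = client["mydb"]["orders"]             # hypothetical collection

def get_order(user_id: str, order_id: str):
    # Including the shard key (userId) keeps this a single-partition read
    # instead of a fan-out across all partitions.
    return orders.find_one({"userId": user_id, "_id": order_id})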
About the partition key..
Is it that I failed to pick the right partition key? [...] the partition key should be the field most frequently used for filtering the data.
It depends ;). You also have to think about worst-case query performance, not only the "most frequently" used queries. Make sure MOST queries can go directly to the correct partition, which means you MUST know the exact target partition key before issuing those queries, even the "get by id" ones. Measure the cost of the remaining cross-partition queries on a realistic data set.
It is difficult to say whether userId is a good key or not. It is most likely known in advance and could be included in get-by-id queries, so it's good in that sense. But you should also consider:
Hot partitions - all queries for a single user would go to a single partition, so there is no scale-out there.
Partition size - a single user's data will most likely keep growing. Partitions have a maximum size limit, and working within those target partitions will become costlier over time.
So, if possible, I would define smaller partitions to distribute the load further. Maybe consider a composite partition key or a similar tactic to split a user's partition into multiple smaller ones, or go to the extreme of making id itself the partition key, which is good for writes and get-by-id but less optimal for everything else.
.. just always make sure to have the chosen partition key at hand.
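To make the "split a user's partition into smaller ones" idea concrete, here is a minimal sketch of a synthetic partition key (the bucket count and names are illustrative, not part of the answer above):

import hashlib

BUCKETS = 10  # hypothetical number of sub-partitions per user

def partition_key(user_id: str, doc_id: str) -> str:
    # Derive a small, stable bucket number from the document id.
    bucket = int(hashlib.sha1(doc_id.encode("utf-8")).hexdigest(), 16) % BUCKETS
    return f"{user_id}-{bucket}"

# The bucket is recomputable from (user_id, doc_id), so point reads can still
# target a single partition as long as both values are at hand.
print(partition_key("user-123", "order-001"))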
I have a collection of documents that looks like the following:
There is one document per VIN/SiteID, and our access pattern is to show all documents at a specific site. I see two potential partition keys we could choose from:
SiteID - We only have 75 sites, so the cardinality is not very high. Also, the documents are not very big, so the 10 GB limit is probably OK.
SiteID/VIN - The data is now more evenly distributed, but that means each logical partition will only store one item. Is this an anti-pattern? Also, to support our access pattern we will need to use a cross-partition query. Again, the data set is small, so is this a problem?
Based on what I am describing, which partition key makes more sense?
Any other suggestions would be greatly appreciated!
Your first option makes a lot of sense and could be a good partition key, but the words "probably OK" don't really breed confidence. Remember, the only way to change the partition key is to migrate to a new collection. If you can take that risk, then SiteId (which I'm guessing you will always have) is a good partition key.
If you have both VIN and SiteId at hand when you are reading or querying, then this is the safer combination. There is no problem per se with each logical partition storing only one item; it's only a problem when you are doing cross-partition queries. If you know both VIN and SiteId in your queries, then it's a great plan.
You also have to remember that your RUs are evenly split between your partitions inside a collection.
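As a rough sketch of that difference using the azure-cosmos Python SDK (the account URL, key, database, and container names below are placeholders):

from azure.cosmos import CosmosClient

client = CosmosClient("<account-url>", credential="<key>")
container = client.get_database_client("fleet").get_container_client("vehicles")

# With SiteID as the partition key, the per-site query stays in one partition.
docs = container.query_items(
    query="SELECT * FROM c WHERE c.SiteID = @site",
    parameters=[{"name": "@site", "value": "site-07"}],
    partition_key="site-07",
)

# With SiteID/VIN as the partition key, the same query must fan out.
docs_fanout = container.query_items(
    query="SELECT * FROM c WHERE c.SiteID = @site",
    parameters=[{"name": "@site", "value": "site-07"}],
    enable_cross_partition_query=True,
)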
I'm working on a multi-tenant application running on MongoDB. Each tenant can create multiple applications. The schemas for most of the collections reference other collections via ObjectIDs. I'm thinking of manually creating a shard key on every record insertion, in the following format:
(v3 murmurhash of the record's ObjectId) + (app_id.toHexString())
Is this good enough to ensure that records for any particular application will likely end up on the same shard?
Also, what happens if a particular application grows super large compared to all others on the shard?
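For concreteness, a sketch of how I would build that key (this assumes the third-party mmh3 package for MurmurHash3 and pymongo's bson; the names are illustrative):

import mmh3
from bson import ObjectId

def manual_shard_key(record_id: ObjectId, app_id: ObjectId) -> str:
    # Hash of the (unique) record ObjectId first, then the app id as a hex
    # string, matching (murmurhash of _id) + (app_id.toHexString()).
    h = mmh3.hash(str(record_id)) & 0xFFFFFFFF   # unsigned 32-bit MurmurHash3
    return f"{h:08x}{str(app_id)}"

print(manual_shard_key(ObjectId(), ObjectId()))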
If you use a hash-based shard key with constantly changing input (an ObjectID can generally be considered unique for each record), then you will get no data locality on the shards at all (except by coincidence), though it will give you great write throughput by randomly distributing writes across all shards. That's basically the trade-off with this kind of approach; the same is true of the built-in hash-based sharding, and those trade-offs don't change just because the hash is constructed manually from two fields.
Because MongoDB uses range-based chunks to split up the data for a given shard key, you will end up with sequential ranges of hashes as chunks in this case. Assuming your hash is not buggy in some way, the data within a single sequential range will essentially be random. Hence, even within a single chunk you will have no data locality, let alone on a shard; it will be completely random (by design).
If you want applications grouped together in ranges, and hence more likely to live on a particular shard, then you would be better off prepending the app_id to make it the leftmost field in a compound shard key. Something like sharding on the following would (based on the limited description) be a good start:
{app_id : 1, _id : 1}
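As a sketch, applying that compound shard key via the driver might look like this with pymongo, run against a mongos router (the database and collection names are hypothetical):

from pymongo import MongoClient

client = MongoClient("<mongos-uri>")
client.admin.command("enableSharding", "mydb")
client.admin.command(
    "shardCollection", "mydb.records",
    key={"app_id": 1, "_id": 1},
)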
The ObjectID is monotonically increasing over time (more discussion on that here), but if there are a decent number of application IDs and you are going to be doing range-based or targeted queries on the ObjectID, it can still work well. You may also want to include other fields based on your query pattern.
Remember that whatever your most common query pattern is, you ideally want the shard key to satisfy it if at all possible. It has to be indexed, and it has to be usable by the mongos to route the query (otherwise the query is scatter/gather), so if you are going to constantly query on app_id and _id, then the above shard key makes a lot of sense.
If you go with the manual hashed-key approach, not only will you get a random distribution, but unless you are going to query on that hash it's not going to be very useful.
I've recently been looking into the new NoSQL service that Amazon provides, more specifically DynamoDB.
Amazon says you should avoid unevenly distributed keys as the primary key; in other words, the more unique the primary keys, the better. Can I take this to mean that having a unique primary key for every item is the best case? What about having some items with duplicate keys?
I want to know how the underlying mechanism works so I know how bad it can be.
Tables are partitioned across multiple machines based on the hash key, so the more random the keys are, the better. In my app I use company_id for the hash key and a unique id for the range key; that way my tables are distributed reasonably evenly.
What they are trying to avoid is you using the same hash key for the majority of your data; the more random the keys are, the easier it is for DynamoDB to keep your data coming back to you quickly.
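As a sketch of that hash + range key layout with boto3 (the table and attribute names are illustrative):

import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="orders",
    AttributeDefinitions=[
        {"AttributeName": "company_id", "AttributeType": "S"},
        {"AttributeName": "id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "company_id", "KeyType": "HASH"},   # partition (hash) key
        {"AttributeName": "id", "KeyType": "RANGE"},          # sort (range) key
    ],
    BillingMode="PAY_PER_REQUEST",
)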