DynamoDB: How is the hash key used?

I've recently been looking into the new NoSQL service that Amazon provides, more specifically DynamoDB.
Amazon says you should avoid unevenly distributed primary keys; in other words, the more unique the primary keys, the better. Can I take this to mean that having a unique primary key for every item is the best case? And how bad is it to have some items share a key?
I want to know how the underlying mechanism works so I can judge how bad it can get.

Tables are partitioned across multiple machines based on the hash key, so the more evenly the keys are distributed, the better. In my app I use company_id for the hash key and a unique id for the range key; that way my tables are distributed reasonably evenly.
What they are trying to avoid is you using the same hash key for the majority of your data: the better your keys spread out, the easier it is for DynamoDB to keep your data coming back to you quickly.
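As a rough illustration of that layout, here is a minimal boto3 sketch of creating such a table. Only company_id comes from the answer above; the table name "orders", the range attribute order_id, and the throughput values are assumptions for the example.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="orders",  # assumed name, for illustration only
    AttributeDefinitions=[
        {"AttributeName": "company_id", "AttributeType": "S"},
        {"AttributeName": "order_id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "company_id", "KeyType": "HASH"},   # partition (hash) key
        {"AttributeName": "order_id", "KeyType": "RANGE"},    # sort (range) key
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```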

Related

The point of Cosmos DB value uniqueness only per shard key (partition key)

Microsoft's documentation of Managing indexing in Azure Cosmos DB's API for MongoDB states that:
Azure Cosmos DB's API for MongoDB server version 3.6 automatically
indexes the _id field, which can't be dropped. It automatically
enforces the uniqueness of the _id field per shard key.
I'm confused about the reasoning behind the "per shard key" part. I read it as "your unique field won't be globally unique at all", because if I understand it correctly, if I have a Guid field _id as unique and a userId field as the partition key, then I can have 2 elements with the same ID provided that they happen to belong to 2 different users.
Is it that I failed to pick the right partition key? In my understanding, the partition key should be the field most frequently used for filtering the data. But what if I need to select data from the database by the ID field value alone? Or query the data for all users?
Is this an inherent limit of distributed systems that I need to accept, one that requires me to remodel how I design a database and program access to it? In this case that would mean: ALWAYS query this collection not only by the _id field but first by the userId field, and treat the identifier not as _id alone but as a compound of userId and _id?
TL;DR
Is this an inherent limit of distributed systems that I need to accept, one that requires me to remodel how I design a database and program access to it? In this case that would mean: ALWAYS query this collection not only by the _id field but first by the userId field, and treat the identifier not as _id alone but as a compound of userId and _id?
Yes. Mostly.
Longer version
While this id field not being globally unique is not intuitive at first sight, it actually makes sense, considering that Cosmos DB seeks unlimited scale for pinpoint GET/PUT operations. This requires the partitions to act independently, and this is where a lot of the magic comes from. If uniqueness of id (or any other constraint) were enforced globally, then every document change would have to coordinate with all other partitions, and that would no longer be optimal or predictable at endless scale.
I also think this design decision of separating the data is in alignment with the schemaless, distributed mindset of Cosmos DB. If you use Cosmos DB, embrace this and avoid trying to force cross-document relational constraints onto it. Manage them in the data/API design and client logic layers instead, for example by using a guid for id.
About the partition key..
Is it that I fail to pick the right partition key? [...] partition key should be the field that is the most frequently used for filtering the data.
It depends ;). You also have to consider worst-case query performance, not only the "most frequently" used queries. Make sure MOST queries can go directly to the correct partition, meaning you MUST know the exact target partition key before making those queries, even for the "get by id" queries. Measure the cost of the remaining cross-partition queries on a realistic data set.
It is difficult to say whether userId is a good key or not. It is most likely known in advance and can be included in get-by-id queries, so it is good in that sense. But you should also consider:
- Hot partitions: all queries for a single user would go to a single partition, so there is no scale-out there.
- Partition size: a single user's data will most likely grow and grow. Partitions have a maximum size limit, and working within such a target partition will become costlier over time.
So, if possible, I would define smaller partitions to distribute the load further. Consider a composite partition key or a similar tactic to split a user's partition into multiple smaller ones (a sketch follows below), or go to the very extreme of using id itself as the partition key, which is good for writes and get-by-id but less optimal for everything else.
.. just always make sure to have the chosen partition key at hand.
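A hypothetical sketch of that composite-key tactic: derive a synthetic partition key from userId plus a small bucket number computed from the document id, so one user's documents spread across several logical partitions. The bucket count and the key format are assumptions for illustration, not anything Cosmos DB prescribes.

```python
import hashlib

BUCKETS = 16  # assumed number of sub-partitions per user

def partition_key(user_id: str, doc_id: str) -> str:
    # Stable bucket derived from the document id, so the same document
    # always maps to the same synthetic partition.
    bucket = int(hashlib.sha1(doc_id.encode("utf-8")).hexdigest(), 16) % BUCKETS
    return f"{user_id}-{bucket}"

# A get-by-id still hits a single partition, provided the caller knows both
# the userId and the doc id (from which the bucket is recomputed).
print(partition_key("user-123", "9f1c2d7a"))
```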

Index instead of primary key on UUID type in PostgreSQL

First, I have read a few posts about this, like this one:
Postgresql: UUID or SEQUENCE for primary key?
My question is quite simple: the IDs in my table are UUID v4 (created in Rails or from an iOS app).
As a UUID is unique by design, can I remove the primary key on ID and just add an index on it? The main (and only?) goal is to save a little time (a few ms) on each insert, since PostgreSQL would no longer have to verify whether the ID is already in use.
Is it a good choice? Or should I keep the PK as an extra verification of uniqueness before inserting?
For info, the table will hold maybe 10 million records.
First: UUIDs are not really unique, but the chance of generating duplicate values is extremely low (How unique is UUID?).
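A quick back-of-envelope check of that claim: a v4 UUID carries 122 random bits, so by the birthday bound the collision probability among n ids is roughly n^2 / 2^123.

```python
n = 10_000_000        # the ~10M rows from the question
p = n * n / 2**123    # approximate birthday-collision bound for 122 random bits
print(p)              # ~9.4e-24 -- negligible at this table size
```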
But there are some other issues with UUIDs. UUIDs are made for exchanging data between different endpoints. So if you think of two databases that communicate with each other, both would share the same data sets with the same UUIDs. Now think of an archive where data sets from many sources are stored: you could end up with data sets carrying the same UUID from some old exchange.
So it depends on your current (and possible future) use cases whether this could create any problems.
Furthermore, I am not sure it has any advantage over a simple integer value as far as the size of your primary key index is concerned. Note that every primary key automatically creates an internal index by default (so there is no need to create a separate index anyway), and a primary key index on an integer column may well be smaller and faster.
Both of the keys you are describing are apparently being used as surrogate keys: surrogate meaning that they are not derived from the incoming data and therefore have no relationship to it other than providing uniqueness.
You do not need two keys for the purpose of providing uniqueness, so the answer to your question is that you can drop one or the other. The size of the table is not really a factor here, as uuid_v4() provides uniqueness for vastly larger datasets than 10M rows.
Having two keys for uniqueness is not just unnecessary; it is also a bottleneck. Both values must be generated at insertion time, and both must be validated for uniqueness. Dropping one of them is clearly the better practice.
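For concreteness, a minimal sketch (assuming psycopg2 and a throwaway database) of the two options being weighed: keep the UUID as the primary key, or drop the uniqueness check and rely on a plain index for lookups. Table and column names are illustrative.

```python
import psycopg2

conn = psycopg2.connect(dbname="testdb")  # assumed local database
with conn, conn.cursor() as cur:
    # Option 1: keep the primary key. PostgreSQL creates a unique index
    # automatically, so no separate index is needed.
    cur.execute("""
        CREATE TABLE items_pk (
            id   uuid PRIMARY KEY,
            body text
        )
    """)
    # Option 2: drop the uniqueness check and trust UUIDv4 collision odds;
    # a plain (non-unique) index still makes lookups by id fast.
    cur.execute("""
        CREATE TABLE items_idx (
            id   uuid NOT NULL,
            body text
        )
    """)
    cur.execute("CREATE INDEX items_idx_id ON items_idx (id)")
conn.close()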

Partition key generation for RDBMS sharding

Consider that I have a very large table that needs to be sharded across an RDBMS cluster, and I need to decide on the partitioning key on which to shard it. Obviously this partition key can't be an artificial key (for example, an auto-generated primary key column), because the application needs to hold the logic for figuring out the shard based on the natural key from the request data. Consider the following situation:
If the natural key is not evenly distributed in the system,
a) Is it a good idea to even consider this table for sharding?
b) Is there a way to generate a GUID based on the natural key that distributes evenly across the cluster?
c) What would be an efficient algorithm to generate such a GUID from the natural key?
If the key is not evenly distributed, it might not make any difference whether the table is partitioned or not: almost the same number of rows will have to be read to fulfill the query. Remember, partitioning does not always increase performance, and reading across partitions may be slower, so make sure you analyse all the query patterns before selecting the partition key.
I can't recall any function that generates a partition key for this case. There are functions to generate GUIDs or MD5 hashes from your data, but the result will be worse than the natural key you already have, as the values tend towards being unique. It will also hurt performance, since each and every request has to run the additional logic.
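To make the last point concrete, here is a hypothetical sketch of routing by a hash of the natural key; the shard count and key are assumptions. Note that hashing spreads distinct key values evenly, but identical natural keys still always land on the same shard, so skew caused by one dominant key value is not fixed by hashing.

```python
import hashlib

NUM_SHARDS = 8  # assumed cluster size

def shard_for(natural_key: str) -> int:
    # Stable hash of the natural key; the same key always routes to the
    # same shard, which the application can compute from request data.
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

print(shard_for("customer-42"))  # same input -> same shard, every time
```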
Also, please consider purging old or unused data; once that is done, you might not need partitioning at all.

SQL-like querying on a DB with more than one key in a table

We know that there is the concept of a primary key in traditional RDBMS systems. This primary key is basically used to index records in the table on that particular key for faster retrieval. I know that there are NoSQL stores like Cassandra which offer secondary indexing, but is there a way, or an existing DB, that follows exactly the same schema as traditional RDBMS systems (i.e. a DB split into various tables holding different kinds of data) but provides indexing on 2 or more keys?
An example of a use case for the same is:
There is a one-to-one mapping between 10 different people's names and their ages. Now if I keep this information in a table with the person's name as the primary key, then retrieving the age given a name is relatively faster than retrieving the name given an age. If I could index both columns, the second case would also be fast.
An alternative in a traditional RDBMS would be to keep 2 tables with the same data, differing only in that the primary key of one is the name and of the other the age, but that would waste a large amount of space for a large number of records.
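For reference, a minimal sqlite3 sketch of the access pattern described above: one index per column, so both lookup directions avoid a full scan without duplicating the table. Table and column names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT PRIMARY KEY, age INTEGER)")
con.execute("CREATE INDEX people_age ON people (age)")  # secondary index
con.executemany("INSERT INTO people VALUES (?, ?)",
                [("alice", 30), ("bob", 41)])

# Both directions now use an index: the PK index for name, people_age for age.
print(con.execute("SELECT age FROM people WHERE name = ?", ("alice",)).fetchone())
print(con.execute("SELECT name FROM people WHERE age = ?", (41,)).fetchone())
```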
It is sad to see no response to this question for a very long time. In all this time of researching the topic, I found FastBit Index to be one plausible solution for indexing virtually every column of a record in a table. It also provides SQL-like semantics for querying the data and delivers performance on the order of a few milliseconds when querying millions of rows of data (on the order of GBs).
Please suggest if there are any other NoSQL or SQL DBs which can deliver similar functionality with a good performance level.

DynamoDB: Get All Items

I'm trying to retrieve all of the keys from a DynamoDB table in an optimized way. There are millions of keys.
In Cassandra I would probably create a single row with a column for every key, which would eliminate the need for a full table scan. DynamoDB's 64 KB limit per item would seemingly preclude this option, though.
Is there a quick way for me to get back all of the keys?
Thanks.
I believe the DynamoDB analogue would be to use a composite key: a hash key of "allmykeys" and a range attribute for each of the original keys being tracked: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/DataModel.html#DataModelPrimaryKey
I suspect this will scale poorly to billions of entries, but should work adequately for a few million.
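A hedged boto3 sketch of that pattern: all tracked keys live as range values under the single fixed hash key, so one paginated Query returns them without scanning the main table. The table name "key_index" and the attribute names are assumptions for the example.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("key_index")  # assumed index table

def all_keys():
    # Query the single "allmykeys" partition, following pagination markers
    # until DynamoDB stops returning LastEvaluatedKey.
    kwargs = {"KeyConditionExpression": Key("bucket").eq("allmykeys")}
    while True:
        page = table.query(**kwargs)
        for item in page["Items"]:
            yield item["tracked_key"]
        last = page.get("LastEvaluatedKey")
        if not last:
            return
        kwargs["ExclusiveStartKey"] = last

for k in all_keys():
    print(k)
```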
Finally, again as with Cassandra, the most straightforward solution is to use map/reduce to get the keys: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html