What does Get without clusterKey return in ScalarDB

In ScalarDB, the library that provides ACID functionality on top of Cassandra, what does Get return if the clustering key is not specified and there are multiple rows in the database for the same partition key?

The Get without clustering key fails and throws InvalidUsageException when there are multiple rows with the same partition key.
You can use Scan for retrieving multiple rows with the same partition key.
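The difference between the two operations can be illustrated with a small in-memory model. This is only a sketch in Python, not ScalarDB's Java API: the `ToyTable` class and its `put`/`get`/`scan` methods are hypothetical names, and a plain `ValueError` stands in for the exception mentioned above.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory table: rows grouped by partition key."""

    def __init__(self):
        # partition key -> {clustering key -> row}
        self._partitions = defaultdict(dict)

    def put(self, partition_key, clustering_key, row):
        self._partitions[partition_key][clustering_key] = row

    def get(self, partition_key, clustering_key=None):
        """Return at most one row; refuse an ambiguous lookup."""
        rows = self._partitions.get(partition_key, {})
        if clustering_key is not None:
            return rows.get(clustering_key)
        if len(rows) > 1:
            # Stand-in for the exception described above (assumption).
            raise ValueError("Get without a clustering key matched multiple rows")
        return next(iter(rows.values()), None)

    def scan(self, partition_key):
        """Return every row in the partition, ordered by clustering key."""
        rows = self._partitions.get(partition_key, {})
        return [rows[key] for key in sorted(rows)]


table = ToyTable()
table.put("user1", 1, {"amount": 10})
table.put("user1", 2, {"amount": 20})
```

In this model, `table.get("user1")` raises because two rows share the partition key, while `table.scan("user1")` returns both rows.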

Related

The point of Cosmos DB value uniqueness only per shard key (partition key)

Microsoft's documentation of Managing indexing in Azure Cosmos DB's API for MongoDB states that:
Azure Cosmos DB's API for MongoDB server version 3.6 automatically
indexes the _id field, which can't be dropped. It automatically
enforces the uniqueness of the _id field per shard key.
I'm confused about the reasoning behind the "per shard key" part. I read it as "your unique field won't be globally unique at all", because, if I understand it correctly, with the Guid field _id as unique and the userId field as the partition key, I can have 2 elements with the same ID provided that they belong to 2 different users.
Is it that I fail to pick the right partition key? Because in my understanding partition key should be the field that is the most frequently used for filtering the data. But what if I need to select the data from the database only by having the ID field value? Or query the data for all users?
Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?
TL;DR
Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?
Yes. Mostly.
Longer version
While the id field not being globally unique is not intuitive at first sight, it actually makes sense, considering that CosmosDB seeks unlimited scale for pinpoint GET/PUT operations. This requires the partitions to act independently, and this is where a lot of the magic comes from. If uniqueness of id, or of any other unique constraint, were enforced globally, then every document change would have to coordinate with all other partitions, and that would no longer be optimal or predictable at unlimited scale.
I also think this design decision of separating data is in alignment with the schemaless, distributed mindset of CosmosDB. If you use CosmosDB, then embrace this and avoid trying to force cross-document relation constraints onto it. Manage them in the data/API design and the client logic layer instead. For example, by using a guid for id.
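The per-partition behaviour can be pictured with a toy model. This is a simplified Python sketch, not Cosmos DB code: `ToyCollection` and its methods are hypothetical, and the only rule it models is that a uniqueness check consults the target partition and nothing else.

```python
class ToyCollection:
    """Hypothetical model: each partition enforces _id uniqueness on its own."""

    def __init__(self, partition_key_field="userId"):
        self._field = partition_key_field
        # partition key value -> {_id -> document}
        self._partitions = {}

    def insert(self, doc):
        partition = self._partitions.setdefault(doc[self._field], {})
        if doc["_id"] in partition:
            # Only the target partition is consulted; no cross-partition check.
            raise KeyError(f"duplicate _id {doc['_id']!r} in this partition")
        partition[doc["_id"]] = doc

    def find_by_id(self, doc_id):
        """Cross-partition lookup: visits every partition, may find several docs."""
        return [p[doc_id] for p in self._partitions.values() if doc_id in p]


coll = ToyCollection()
coll.insert({"_id": "a1", "userId": "alice"})
coll.insert({"_id": "a1", "userId": "bob"})  # accepted: different partition
```

Here `coll.find_by_id("a1")` returns two documents, which is why the effective identifier is the pair of userId and _id.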
About partition key..
Is it that I fail to pick the right partition key? [...] partition key should be the field that is the most frequently used for filtering the data.
It depends ;). You also have to think about worst-case query performance, not only the "most frequently" used queries. Make sure MOST queries can go directly to the correct partition, meaning you MUST know the exact target partition key before making those queries, even for the "get by id" queries. Measure the cost of the remaining cross-partition queries on a realistic data set.
It is difficult to say whether userId is a good key or not. It most likely is known in advance and could be included in get-by-id queries, so it's good in that sense. But you should also consider:
hot partition - all queries for a single user would go to a single partition, with no scale-out there.
partition size - a single user's data most likely grows and grows. Partitions have a maximum size limit, and working within those partitions will become costlier over time.
So, if possible, I would define smaller partitions to distribute the load further. Maybe consider using a composite partition key or similar tactics to split a user's partition into multiple smaller ones. Or, at the very extreme, make id itself the partition key, which is good for writes and get-by-id but less optimal for everything else.
.. just always make sure to have the chosen partition key at hand.
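One way to split a user's partition into smaller ones is to derive a synthetic key from the user id and a bucket number. The sketch below is a Python illustration under assumptions: the `userId#bucket` format, the bucket count, and both function names are made up for the example and are not part of any Cosmos DB API.

```python
import hashlib

BUCKETS_PER_USER = 4  # assumed value; tune against realistic data


def partition_key_for(user_id: str, doc_id: str) -> str:
    """Derive a synthetic partition key such as 'alice#2' (hypothetical format)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % BUCKETS_PER_USER
    return f"{user_id}#{bucket}"


def all_partition_keys_for(user_id: str) -> list:
    """Every partition key to visit for an 'all documents of a user' query."""
    return [f"{user_id}#{bucket}" for bucket in range(BUCKETS_PER_USER)]
```

A get-by-id that knows both the user id and the document id can recompute the key and hit a single partition. A per-user listing has to visit `BUCKETS_PER_USER` partitions, which is the trade-off for smaller partitions.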

DynamoDB equivalent to find().sort()

In mongoDB one can get all the documents sorted asc/desc on a particular field as follows:
db.collection_name.find().sort({field: sort_order})
query() requires one to specify the partition key, and if one wants to query on a non-key attribute, a GSI can be created and queries can be run on it, as explained here: Query on non-key attribute
scan() would do the job but doesn't provide an option to sort on any field.
One of the solutions, described here: Dynamodb scan in sorted order, is to have a common key for all documents and create a GSI on the attribute.
But as noted in the comments on one of the solutions, and I quote: "The performance characteristics of a DynamoDB table apply the same to GSIs. A GSI with a single hash key of "OK" will only ever use one partition. This loses all scaling characteristics of DynamoDB".
Is there a way to achieve the above that scales well?
The only sorting applied by DynamoDB is by range key within a partition. There is no feature for sorting results by an arbitrary field; you are expected to sort the results in your application code, i.e. do a scan and sort the results on the client side.
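A minimal sketch of the client-side step in Python, assuming the scan results arrive as a sequence of pages, each a list of dicts. The `pages` data and the `sort_scanned_items` helper are illustrative stand-ins, not part of any AWS SDK.

```python
from itertools import chain
from operator import itemgetter


def sort_scanned_items(pages, field, descending=False):
    """Flatten pages of scanned items and sort them by one field."""
    items = chain.from_iterable(pages)
    return sorted(items, key=itemgetter(field), reverse=descending)


# Stand-in for the pages a paginated scan would return.
pages = [
    [{"id": "a", "score": 30}, {"id": "b", "score": 10}],
    [{"id": "c", "score": 20}],
]
ordered = sort_scanned_items(pages, "score")
```

The whole result set is held in memory before sorting, so this only suits tables whose scanned output fits on the client.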

Support for queries across documents in DynamoDB

I have been evaluating migration of our datastore from MongoDB to DynamoDB, since it is a well established AWS service.
However, I am not sure if the DynamoDB data model is robust enough to support our use cases. I understand that DynamoDB added document support in 2014, but the examples I have seen do not seem to address queries that work across documents and do not specify a value for the partition key.
For instance if I have a document containing employee info,
{
"name": "John Doe",
"department": "sales",
"date_of_joining": "2017-01-21"
}
and I need to make a query like "give me all the employees who joined after 01-01-2016", then I can't do it with this schema.
I might be able to make this query after creating a secondary index which has a randomly generated partition key (say 0-99) and create a sort key on "date_of_joining", then query for all the partitions and put condition on "date_of_joining". But this is too complex a way to do a simple query, doing something like this in MongoDB is quite straightforward.
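The bucketed-index idea can be simulated in memory to see what the fan-out involves. This Python sketch makes several assumptions: each item carries a hypothetical `bucket` attribute assigned at write time, every per-bucket lookup stands for one query that fixes the bucket and applies a range condition on date_of_joining, and the employees other than John Doe are made-up sample rows.

```python
import heapq
from bisect import bisect_right

NUM_BUCKETS = 4  # the text suggests 0-99; kept small for the example


def build_index(employees):
    """Group items by bucket and keep each bucket sorted by join date."""
    index = {bucket: [] for bucket in range(NUM_BUCKETS)}
    for emp in employees:
        index[emp["bucket"]].append((emp["date_of_joining"], emp["name"]))
    for entries in index.values():
        entries.sort()
    return index


def query_bucket(index, bucket, after_date):
    """One per-bucket range lookup: entries with date strictly after after_date."""
    entries = index[bucket]
    dates = [date for date, _ in entries]
    # ISO-8601 date strings sort chronologically, so bisect works on them.
    return entries[bisect_right(dates, after_date):]


def employees_joined_after(index, after_date):
    """Fan out over every bucket and merge the sorted partial results."""
    partials = [query_bucket(index, b, after_date) for b in range(NUM_BUCKETS)]
    return [name for _, name in heapq.merge(*partials)]


employees = [
    {"name": "John Doe", "date_of_joining": "2017-01-21", "bucket": 0},
    {"name": "Ann Lee", "date_of_joining": "2015-06-01", "bucket": 1},
    {"name": "Raj Rao", "date_of_joining": "2016-03-15", "bucket": 1},
]
index = build_index(employees)
recent = employees_joined_after(index, "2016-01-01")
```

The cost is one lookup per bucket for every range query, plus a merge on the client, which is the complexity the question is concerned about.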
Can someone help with understanding if there is a better way to do such queries in DynamoDB and is DynamoDB really suited for such use cases?
Actually, the partition key of the GSI need not be unique. You can have date_of_joining as a partition key of GSI.
However, when you query by partition key, you cannot use a greater-than condition on the partition key field; only equality is supported for the partition key. I am not sure why you wanted a random number as the partition key of the GSI and date_of_joining as the sort key. Even with that design, I don't think a single Query call will return the expected result, since each Query targets one partition key value. You may end up using the DynamoDB Scan API, which is a costly operation in DynamoDB.
GSI:
date_of_joining - as Partition key
Supported in Query API:-
KeyConditionExpression : 'date_of_joining = :doj'
If multiple items share the same date of joining, the query on the GSI returns all of them.
Not supported in Query API:-
KeyConditionExpression : 'date_of_joining > :doj'
Conclusion:-
You need to use DynamoDB Scan. If you are going to use Scan, then GSI may not be required. You can directly scan the main table using FilterExpression.
FilterExpression : 'date_of_joining > :doj'
Disadvantage:-
Costly
Not efficient
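The cost difference between the two access paths can be shown with an in-memory model. This is a Python sketch, not DynamoDB code: the `gsi` dict stands in for an index keyed on date_of_joining, the two helper functions are hypothetical, and the rows other than John Doe are made-up sample data.

```python
from collections import defaultdict

employees = [
    {"name": "John Doe", "department": "sales", "date_of_joining": "2017-01-21"},
    {"name": "Ann Lee", "department": "sales", "date_of_joining": "2015-06-01"},
    {"name": "Raj Rao", "department": "ops", "date_of_joining": "2017-01-21"},
]

# Model of the GSI: date_of_joining -> items sharing that date.
gsi = defaultdict(list)
for emp in employees:
    gsi[emp["date_of_joining"]].append(emp)


def query_equals(doj):
    """Equality on the partition key: touches only items stored under doj."""
    return list(gsi.get(doj, []))


def scan_joined_after(doj):
    """Range filter: every item is read, then non-matching ones are dropped."""
    scanned, matched = 0, []
    for emp in employees:
        scanned += 1
        # ISO-8601 date strings compare chronologically as plain strings.
        if emp["date_of_joining"] > doj:
            matched.append(emp)
    return matched, scanned
```

`scan_joined_after` always reads the full table regardless of how many items match, which is what makes the filter approach costly on large tables.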
You might decide to support your range queries with an indexing backend. For example, you could stream your table updates in DynamoDB to AWS ElasticSearch with a Lambda function, and then query ES for records matching the range of join dates you choose.

Partition key generation for RDBMS sharding

Consider that I have a very large table that needs to be sharded across an RDBMS cluster. I need to decide on the partitioning key on which to shard the table. Obviously this partition key can’t be an artificial key (for example, an auto-generated primary key column), because the application needs to hold the logic for figuring out the shard from the natural key in the request data. Consider the following situation:
If the natural key is not evenly distributed in the system:
a) Is it a good idea to even consider this table for sharding?
b) Is there a way to generate a GUID based on the natural key and distribute it evenly across the cluster?
c) What would be an efficient algorithm to generate a GUID based on the natural key?
If the key is not evenly distributed, it might not make any difference whether the table is partitioned or not. A query will have to read almost the same number of rows to be fulfilled. Remember, partitioning will not always increase performance. Reading across partitions might be slower. So make sure you analyse all the query needs before selecting the partition key.
I can't recall any function that can generate a partition key for this case. There are functions to generate GUIDs or MD5 hashes for your data, but the result will be worse than the natural key that you have. The results will tend towards unique values. It will also reduce performance, as each and every request has to run additional logic.
Also, please consider purging old or unused data. Once that is done, you might not need partitioning.
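For reference, the GUID-from-natural-key idea raised in the question could look like the following Python sketch. It uses the standard library's name-based UUID (version 5); the shard count, the choice of `uuid.NAMESPACE_OID`, and the modulo mapping are assumptions for illustration, not a recommendation.

```python
import uuid

NUM_SHARDS = 8  # assumed cluster size


def guid_for(natural_key: str) -> uuid.UUID:
    """Deterministic name-based UUID (version 5, SHA-1) for a natural key."""
    return uuid.uuid5(uuid.NAMESPACE_OID, natural_key)


def shard_for(natural_key: str) -> int:
    """Map the derived GUID to a shard number in range(NUM_SHARDS)."""
    return guid_for(natural_key).int % NUM_SHARDS
```

The same natural key always maps to the same shard, so the application can route a request without a lookup table. This evens out distinct key values only: if the skew comes from a few very frequent key values, each of them still lands on a single shard, and every request pays for the extra hashing step.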

DynamoDB: Get All Items

I'm trying to retrieve all of the keys from a DynamoDB table in an optimized way. There are millions of keys.
In Cassandra I would probably create a single row with a column for every key, which would eliminate the need to do a full table scan. DynamoDB's 64k limit per Item would seemingly preclude this option though.
Is there a quick way for me to get back all of the keys?
Thanks.
I believe the DynamoDB analogue would be to use composite keys: have a primary key of "allmykeys" and a range attribute of the originals being tracked: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/DataModel.html#DataModelPrimaryKey
I suspect this will scale poorly to billions of entries, but should work adequately for a few million.
Finally, again as with Cassandra, the most straightforward solution is to use map/reduce to get the keys: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html
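The composite-key layout described above can be modelled in memory to show how the keys would be read back page by page. This is a Python sketch under assumptions: `query_page` is a hypothetical stand-in for one paged read of the "allmykeys" partition, the cursor is simply the last range key of the page, and the page size is tiny to make the paging visible.

```python
from bisect import bisect_right

INDEX_HASH_KEY = "allmykeys"  # value taken from the answer above
PAGE_SIZE = 2  # tiny page size, only to make paging visible


def query_page(index, hash_key, start_after=None, limit=PAGE_SIZE):
    """Return one page of range keys and a cursor (None when exhausted)."""
    range_keys = index.get(hash_key, [])  # assumed to be kept sorted
    start = 0 if start_after is None else bisect_right(range_keys, start_after)
    page = range_keys[start:start + limit]
    more = start + limit < len(range_keys)
    return page, (page[-1] if more else None)


def all_keys(index):
    """Follow the cursor until every tracked key has been read."""
    cursor = None
    while True:
        page, cursor = query_page(index, INDEX_HASH_KEY, cursor)
        yield from page
        if cursor is None:
            break


index = {INDEX_HASH_KEY: sorted(["k3", "k1", "k5", "k2", "k4"])}
keys = list(all_keys(index))
```

Every tracked key lives under one hash key, so all reads and writes of the index go through a single partition, which fits the concern above about scaling to billions of entries.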