Support for queries across documents in DynamoDB

I have been evaluating a migration of our datastore from MongoDB to DynamoDB, since DynamoDB is a well-established AWS service.
However, I am not sure the DynamoDB data model is robust enough to support our use cases. I understand that DynamoDB added document support in 2014, but none of the examples I have seen address queries that work across documents and that do not specify a value for the partition key.
For instance, if I have a document containing employee info,
{
  "name": "John Doe",
  "department": "sales",
  "date_of_joining": "2017-01-21"
}
and I need to make a query like "give me all employees who joined after 2016-01-01", then I can't do it with this schema.
I might be able to make this query by creating a secondary index with a randomly generated partition key (say, 0-99) and a sort key on "date_of_joining", then querying every partition with a condition on "date_of_joining" and merging the results. But that is an overly complex way to do a simple query; doing something like this in MongoDB is quite straightforward.
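Roughly, that scatter approach would look like this in boto3 (the table, index, and attribute names are just placeholders):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("employees")  # hypothetical table name

def employees_joined_after(doj, num_partitions=100):
    # Query each random partition of the GSI and merge the results
    # (pagination omitted for brevity).
    results = []
    for p in range(num_partitions):
        resp = table.query(
            IndexName="random-doj-index",  # hypothetical GSI name
            KeyConditionExpression=Key("random_pk").eq(p)
            & Key("date_of_joining").gt(doj),
        )
        results.extend(resp["Items"])
    return results

print(employees_joined_after("2016-01-01"))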
Can someone help me understand whether there is a better way to do such queries in DynamoDB, and whether DynamoDB is really suited for such use cases?

Actually, the partition key of the GSI need not be unique. You can have date_of_joining as the partition key of a GSI.
However, when you query, you cannot use greater-than on the partition key field; only equality is supported for the partition key. With a random number as the GSI partition key and date_of_joining as the sort key, a single Query cannot return the full result either: you would have to issue one Query per random partition and merge the results yourself. You may end up using the DynamoDB Scan API instead, which is a costly operation in DynamoDB.
GSI:
date_of_joining - as Partition key
Supported in Query API:-
If you have multiple items with the same DOJ, the result will have multiple items (i.e. when you query using the GSI).
KeyConditionExpression : 'date_of_joining = :doj'
Not supported in Query API:-
KeyConditionExpression : 'date_of_joining > :doj'
Conclusion:-
You need to use DynamoDB Scan. If you are going to use Scan, then the GSI may not be required; you can directly scan the main table using a FilterExpression.
FilterExpression : 'date_of_joining > :doj'
Disadvantage:-
Costly
Not efficient
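For reference, the Scan approach looks roughly like this in boto3 (table and attribute names assumed from the question):

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("employees")  # hypothetical table name

# Scan reads every item and filters afterwards, so it consumes read
# capacity for the whole table (pagination omitted for brevity).
resp = table.scan(FilterExpression=Attr("date_of_joining").gt("2016-01-01"))
print(resp["Items"])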

You might decide to support your range queries with an indexing backend. For example, you could stream your table updates in DynamoDB to Amazon Elasticsearch Service with a Lambda function, and then query ES for records matching the range of join dates you choose.
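A rough sketch of such a stream handler, assuming an Elasticsearch endpoint that accepts plain HTTP requests (the endpoint URL, index, and id field are placeholders; a real setup would need request signing or an access policy):

import json
import urllib.request
from urllib.parse import quote

ES_URL = "https://my-es-domain.example.com/employees/_doc"  # hypothetical endpoint

def handler(event, context):
    # Index every inserted or modified record; deletes are ignored in this sketch.
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue
        image = record["dynamodb"]["NewImage"]
        doc = {k: list(v.values())[0] for k, v in image.items()}  # flatten DynamoDB JSON
        req = urllib.request.Request(
            ES_URL + "/" + quote(doc["name"]),  # using "name" as the document id here
            data=json.dumps(doc).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)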

Related

The point of Cosmos DB value uniqueness only per shard key (partition key)

Microsoft's documentation of Managing indexing in Azure Cosmos DB's API for MongoDB states that:
Azure Cosmos DB's API for MongoDB server version 3.6 automatically indexes the _id field, which can't be dropped. It automatically enforces the uniqueness of the _id field per shard key.
I'm confused about the reasoning behind the "per shard key" part. I read it as "your unique field won't be globally unique at all", because if I understand it correctly, if I have a Guid field _id as unique and a userId field as the partition key, then I can have two elements with the same _id, provided they happen to belong to two different users.
Is it that I fail to pick the right partition key? In my understanding, the partition key should be the field most frequently used for filtering the data. But what if I need to select data from the database only by the _id field value? Or query the data for all users?
Is it an inherent limitation of distributed systems that I need to accept, and therefore remodel my process of designing a database and programming access to it? Which in this case would be: ALWAYS query data from this collection not only by the _id field but first by the userId field? And not treat my _id field alone as an identifier, but rather see the identifier as a compound of userId and _id?
TL;DR
Is it an inherent limitation of distributed systems that I need to accept, and therefore remodel my process of designing a database and programming access to it? Which in this case would be: ALWAYS query data from this collection not only by the _id field but first by the userId field? And not treat my _id field alone as an identifier, but rather see the identifier as a compound of userId and _id?
Yes. Mostly.
Longer version
While _id not being globally unique is not intuitive at first sight, it actually makes sense, considering that Cosmos DB seeks unlimited scale for pinpoint GET/PUT operations. This requires the partitions to act independently, and that is where a lot of the magic comes from. If the uniqueness of _id or any other unique constraint were enforced globally, then every document change would have to coordinate with all other partitions, and that would no longer be optimal or predictable at endless scale.
I also think this design decision of separating the data is in alignment with the schemaless, distributed mindset of Cosmos DB. If you use Cosmos DB, then embrace this and avoid trying to force cross-document uniqueness constraints onto it. Manage them in the data/API design and client logic layer instead, for example by using a GUID for id.
About partition key..
Is it that I fail to pick the right partition key? [...] the partition key should be the field most frequently used for filtering the data.
It depends ;). You also have to think about worst-case query performance, not only the "most frequently" used queries. Make sure MOST queries can go directly to the correct partition, meaning you MUST know the exact target partition key before making those queries, even for the "get by id" queries. Measure the cost of the remaining cross-partition queries on a realistic data set.
It is difficult to say whether userId is a good key or not. It most likely is known in advance and could be included in get-by-id queries, so it's good in that sense. But you should also consider:
hot partitions - all queries for a single user would go to a single partition, so no scale-out there.
partition size - a single user's data will most likely grow and grow. Partitions have a maximum size limit, and working within those target partitions will become costlier over time.
So, if possible, I would define smaller partitions to distribute the load further. Maybe consider a composite partition key or a similar tactic to split a user's partition into multiple smaller ones. Or go to the extreme of making id itself the partition key, which is good for writes and get-by-id but less optimal for everything else.
.. just always make sure to have the chosen partition key at hand.
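As an illustration of that composite-key tactic, here is a small sketch that spreads a single user's documents over a handful of synthetic partitions (the bucket count and key layout are assumptions, not a Cosmos DB API):

import hashlib

NUM_BUCKETS = 16  # assumed fan-out per user; tune to your data size

def partition_key(user_id, doc_id):
    # Synthetic key: userId plus a stable bucket derived from the document id,
    # so one user's data spreads over NUM_BUCKETS smaller partitions.
    bucket = int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return "{}-{}".format(user_id, bucket)

# A get-by-id still targets a single partition, because the bucket is
# recomputable from the ids you already have at hand.
print(partition_key("user-42", "0f8fad5b-d9cb-469f-a165-70867728950e"))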

DynamoDB equivalent to find().sort()

In MongoDB one can get all the documents sorted asc/desc on a particular field as follows:
db.collection_name.find().sort({field: sort_order})
query() requires one to specify the partition key, and if one wants to query on a non-key attribute, a GSI can be created and queried instead, as explained here: Query on non-key attribute.
scan() would do the job but doesn't provide an option to sort on any field.
One solution, as described here: Dynamodb scan in sorted order, is to have a common key for all documents and create a GSI on the attribute.
But as noted in the comments on one of the solutions, and I quote: "The performance characteristics of a DynamoDB table apply the same to GSIs. A GSI with a single hash key of "OK" will only ever use one partition. This loses all scaling characteristics of DynamoDB".
Is there a way to achieve the above that scales well?
The only sorting DynamoDB applies is by range key within a partition. There is no feature for sorting results by an arbitrary field; you are expected to sort your results in your application code, i.e. do a Scan and sort the results on the client side.
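A minimal sketch of that client-side sort with boto3 (table and field names are placeholders):

import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name

def scan_all():
    # Scan the whole table, following pagination (each page is at most 1 MB).
    items, kwargs = [], {}
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            return items
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

# DynamoDB only sorts by range key within a partition, so sort on the client.
items = sorted(scan_all(), key=lambda item: item["my_field"])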

How to do basic aggregation with DynamoDB?

How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.
Let's say we are building a tech blog where users can post articles, and articles can be tagged.
user
{
  id : 1235,
  name : "John",
  ...
}
article
{
  id : 789,
  title : "dynamodb use cases",
  author : 1235, // userid
  tags : ["dynamodb", "aws", "nosql", "document database"]
}
In the user interface we want to show, for the current user, the tags and their respective counts.
How do we achieve the following aggregation?
{
  userid : 1235,
  tag_stats : {
    "dynamodb" : 3,
    "nosql" : 8
  }
}
We will provide this data through a REST API, and it will be called frequently, since this information is shown on the app's main page.
I can think of extracting all documents and doing the aggregation at the application level, but I feel my read capacity units would be exhausted.
We could use tools like EMR, Redshift, BigQuery, or AWS Lambda, but I think those are meant for data warehousing.
I would like to know of other, better ways of achieving the same.
How are people achieving simple dynamic queries like these, having chosen DynamoDB as the primary data store, considering cost and response time?
Long story short: DynamoDB does not support this. It's not built for this use case; it's intended for quick data access with low latency. It simply does not support any aggregation functionality.
You have three main options:
Export DynamoDB data to Redshift or EMR Hive. Then you can execute SQL queries on the stale data. The benefit of this approach is that it consumes RCUs just once, but you will be working with outdated data.
Use the DynamoDB connector for Hive and query DynamoDB directly. Again you can write arbitrary SQL queries, but in this case the queries access the data in DynamoDB directly. The downside is that they consume read capacity on every query you run.
Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map of tags and counts as an attribute. On every update to your original data, DynamoDB Streams will execute a Lambda function or some code on your hosts to update the aggregate table. This is the most cost-efficient method, but you will need to implement additional code for each new query.
Of course you can extract data at the application level and aggregate it there, but I would not recommend it. Unless you have a small table, you will need to think about throttling, using only part of your provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation, not 100%), and how to distribute the work among multiple workers.
Both Redshift and Hive already know how to do this: Redshift relies on multiple worker nodes when it executes a query, while Hive is built on top of map-reduce, and both can use a predefined percentage of your RCU throughput.
DynamoDB is pure key/value storage and does not support aggregation out of the box.
If you really want to do aggregation using DynamoDB, here are some hints.
For your particular case, let's have a table named articles.
To do the aggregation, we need an extra table user-stats holding userId and tag_stats.
Enable DynamoDB Streams on the articles table.
Create a new Lambda function user-stats-aggregate that is subscribed to the articles DynamoDB stream and receives NEW_AND_OLD_IMAGES on every create/update/delete operation on the articles table.
The Lambda will perform the following logic:
If there is no old image, take the current tags and increase each one's count by 1 for this user (keep in mind there may be no initial record in user-stats for this user yet).
If there is an old image, see which tags were added or removed and apply +1 or -1 accordingly for each affected tag of the received user.
Stand up an API service that retrieves these user stats.
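A minimal sketch of such a handler, assuming tags are stored as a DynamoDB string set, and keeping one user-stats item per (userId, tag) pair instead of a nested map, since DynamoDB's ADD action only works on top-level number attributes:

import boto3

stats = boto3.resource("dynamodb").Table("user-stats")  # userId (partition key), tag (sort key)

def tags_of(image):
    # Assumes tags are stored as a DynamoDB string set ("SS").
    return set(image.get("tags", {}).get("SS", []))

def handler(event, context):
    for record in event["Records"]:
        data = record["dynamodb"]
        old_tags = tags_of(data.get("OldImage", {}))
        new_tags = tags_of(data.get("NewImage", {}))
        image = data.get("NewImage") or data["OldImage"]
        user_id = image["author"]["N"]  # author is a numeric userid in the example
        # +1 for every tag that appeared, -1 for every tag that disappeared.
        deltas = [(t, 1) for t in new_tags - old_tags] + [(t, -1) for t in old_tags - new_tags]
        for tag, delta in deltas:
            stats.update_item(
                Key={"userId": user_id, "tag": tag},
                UpdateExpression="ADD cnt :d",  # creates the item/attribute if missing
                ExpressionAttributeValues={":d": delta},
            )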
Usually, aggregation in DynamoDB is done using DynamoDB Streams, Lambdas doing the aggregation, and extra tables keeping the aggregated results at different granularities (minutes, hours, days, years, ...).
This gives near-real-time aggregation without the need to compute it on the fly for every request; you query the already-aggregated data.
Basic aggregation can be done using scan() and query() in a Lambda.

Generating a shard key field for a multi-tenant MongoDB app

I'm working on a multi-tenant application running on MongoDB. Each tenant can create multiple applications. The schemas for most of the collections reference other collections via ObjectIDs. I'm thinking of manually creating a shard key with every record insertion, in the following format:
(v3 murmurhash of the record's ObjectId) + (app_id.toHexString())
Is this good enough to ensure that records for any particular application will likely end up on the same shard?
Also, what happens if a particular application grows super large compared to all others on the shard?
If you use a hash-based shard key with input that is constantly changing (an ObjectID can generally be considered unique for each record), then you will get no locality of data on shards at all (except by coincidence), though it will give you great write throughput by randomly distributing writes across all shards. That's basically the trade-off with this kind of approach; the same is true of the built-in hash-based sharding, and those trade-offs don't change just because it is a manual hash constructed from two fields.
Basically, because MongoDB uses range-based chunks to split up the data for a given shard key, you will have sequential ranges of hashes used as chunks in this case. Assuming your hash is not buggy in some way, the data in a single sequential range will be essentially random. Hence, even within a single chunk you will have no data locality, let alone on a shard; it will be completely random (by design).
If you want applications grouped together in ranges, and hence more likely to be on a particular shard, then you would be better off prepending the app_id to make it the leftmost field in a compound shard key. Something like sharding on the following would (based on the limited description) be a good start:
{app_id : 1, _id : 1}
Though the ObjectID is monotonically increasing over time (more discussion on that here), if there is a decent number of application IDs and you are going to be doing range-based or targeted queries on the ObjectID, it might still work well. You may also want to include other fields based on your query pattern.
Remember that whatever your most common query pattern is, you (ideally) want the shard key to satisfy it if at all possible. It has to be indexed and it has to be usable by the mongos to route the query (if it isn't, the query becomes scatter/gather), so if you are going to constantly query on app_id and _id then the above shard key makes a lot of sense.
If you go with the manual hashed-key approach, not only will you have a random distribution, but unless you are going to query on that hash it's not going to be very useful.
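For reference, a pymongo sketch of setting up that compound shard key (database and collection names are made up for the example):

from pymongo import MongoClient

client = MongoClient("mongodb://my-mongos:27017")  # connect through a mongos router

# Shard the (hypothetical) records collection on {app_id: 1, _id: 1} so that
# documents for one application land in contiguous chunk ranges.
client.admin.command("enableSharding", "mydb")
client.admin.command("shardCollection", "mydb.records", key={"app_id": 1, "_id": 1})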

How can I specify the natural ordering in MongoDB?

Is there a way to specify the natural ordering of data in MongoDB, similar to how a primary index would order data in an RDBMS table?
My use case is that all my queries return data sorted by a date field, say birthday. According to MongoDB: Sorting and Natural Order, the natural order for a standard (non-capped) collection is roughly the insertion order, but not guaranteed. This would imply sorting is needed after the data is retrieved.
I believe what you are referring to is a clustered index, not a primary index.
MongoDB 2.0 does not have a clustered-index feature, so a regular index on the date field would be your most efficient option for retrieval.
It's probably premature optimization to think about the physical order on disk with MongoDB. MongoDB uses memory-mapped files, so depending on your working set + queries + RAM you may not need to load data from disk as often as expected.
If you are looking for something to act like the primary index in an RDBMS, then sort by _id. It will be roughly insert order, since the _id is prefixed with a timestamp. If you use $natural order instead, queries will miss indexes.
Also, I would add that you should look into using the timestamp built into the document IDs instead of relying on a separate date field, as it allows you to store less data and removes an index.
Jason
MongoHQ
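A small pymongo sketch of both suggestions above (database, collection, and field names are placeholders):

from pymongo import ASCENDING, MongoClient

coll = MongoClient().mydb.people  # hypothetical database/collection

# Option 1: a regular index on the date field, then sort on it.
coll.create_index([("birthday", ASCENDING)])
by_birthday = coll.find().sort("birthday", ASCENDING)

# Option 2: sort by _id, which is roughly insertion order because
# ObjectIds are prefixed with a creation timestamp.
by_insertion = coll.find().sort("_id", ASCENDING)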
I guess it would be difficult to achieve what you want without the help of indexes. The _id field in MongoDB takes its value from the timestamp at the moment the document is created, but since ObjectIds are generated independently by different clients, you can't rely on them being strictly monotonically increasing like the identity column in an RDBMS table. I think you must create an index on the birthday field if all your queries return documents sorted by it. Once the index is created, the queries become efficient enough.
Refer to this:
MongoDB capped collection and monotonically increasing index