DynamoDB equivalent to find().sort() - mongodb

In MongoDB one can get all the documents sorted asc/desc on a particular field as follows:
db.collection_name.find().sort({field: sort_order})
query() requires you to specify the partition key; if you want to query on a non-key attribute, a GSI can be created and queried instead, as explained here: Query on non-key attribute
scan() would do the job but doesn't provide an option to sort on any field.
One of the solutions, described here: Dynamodb scan in sorted order, is to give all documents a common partition key and create a GSI on that attribute.
But as one of the comments on that solution notes, and I quote: "The performance characteristics of a DynamoDB table apply the same to GSIs. A GSI with a single hash key of "OK" will only ever use one partition. This loses all scaling characteristics of DynamoDB".
Is there a way to achieve the above that scales well?

The only sorting applied by DynamoDB is by range key within a partition. There is no feature for sorting results by an arbitrary field; you are expected to sort the results in your application code, i.e. do a scan and sort the results on the client side.
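A minimal sketch of that client-side approach (the `scan_page` callable stands in for a paginated boto3 `table.scan(ExclusiveStartKey=...)` call; the `Items`/`LastEvaluatedKey` fields mirror the DynamoDB Scan response shape, and the two fake pages below are purely illustrative):

```python
# Sketch of "scan, then sort client-side". `scan_page` stands in for a
# call like boto3's table.scan(ExclusiveStartKey=...).

def scan_sorted(scan_page, sort_field, descending=False):
    items, start_key = [], None
    while True:
        page = scan_page(start_key)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break  # no more pages
    # Scan returns items in no particular order, so sort here
    return sorted(items, key=lambda it: it[sort_field], reverse=descending)

# Fake two-page Scan response for demonstration.
pages = [
    {"Items": [{"id": "a", "ts": 3}, {"id": "b", "ts": 1}],
     "LastEvaluatedKey": "page2"},
    {"Items": [{"id": "c", "ts": 2}]},
]

def fake_scan(start_key):
    return pages[0] if start_key is None else pages[1]

result = scan_sorted(fake_scan, "ts")
```

Note this reads the entire table on every query, which is exactly why it doesn't scale the way a keyed query does.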

Related

The point of Cosmos DB value uniqueness only per shard key (partition key)

Microsoft's documentation of Managing indexing in Azure Cosmos DB's API for MongoDB states that:
Azure Cosmos DB's API for MongoDB server version 3.6 automatically
indexes the _id field, which can't be dropped. It automatically
enforces the uniqueness of the _id field per shard key.
I'm confused about the reasoning behind the "per shard key" part. I see it as "your unique field won't be globally unique at all", because if I understand it correctly, if I have the Guid field _id as unique and the userId field as the partition key, then I can have 2 elements with the same ID provided they happen to belong to 2 different users.
Is it that I fail to pick the right partition key? Because in my understanding partition key should be the field that is the most frequently used for filtering the data. But what if I need to select the data from the database only by having the ID field value? Or query the data for all users?
Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?
TL;DR
Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?
Yes. Mostly.
Longer version
While the _id field not being globally unique is counterintuitive at first sight, it actually makes sense, considering that Cosmos DB seeks unlimited scale for pinpoint GET/PUT operations. This requires the partitions to act independently, and that is where a lot of the magic comes from. If _id (or any other unique constraint) were enforced globally, then every document change would have to coordinate with all other partitions, and that would no longer be optimal or predictable at endless scale.
I also think this design decision of separating the data is in alignment with the schemaless, distributed mindset of Cosmos DB. If you use Cosmos DB, then embrace this and avoid trying to force cross-document relational constraints onto it. Manage them in the data/API design and client logic layer instead; for example, by using a GUID for id.
About partition key..
Is it that I fail to pick the right partition key? [...] partition key should be the field that is the most frequently used for filtering the data.
It depends ;). You also have to think about worst-case query performance, not only the "most frequently" used queries. Make sure MOST queries can go directly to the correct partition, meaning you MUST know the exact target partition key before making those queries, even for the "get by id" queries. Measure the cost of the remaining cross-partition queries on a realistic data set.
It is difficult to say whether userId is a good key or not. It is most likely known in advance and could be included in get-by-id queries, so it's good in that sense. But you should also consider:
hot partition - all of a single user's queries would go to a single partition; no scale-out there.
partition size - a single user's data most likely grows and grows over time. Partitions have a maximum size limit, and working within those partitions will become costlier as they fill.
So, if possible, I would define smaller partitions to distribute the load further. Maybe consider a composite partition key or similar tactics to split a user partition into multiple smaller ones. Or go to the extreme of making id itself the partition key, which is good for writes and get-by-id but less optimal for everything else.
.. just always make sure to have the chosen partition key at hand.
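The per-partition uniqueness and fan-out behavior discussed above can be modeled with a toy in-memory sketch (the 4-partition layout and hash-mod routing are illustrative only, not Cosmos DB's actual mechanism):

```python
# Toy model of per-partition uniqueness: documents are routed to a
# partition by partition key, so _id is only unique within a partition.

N_PARTITIONS = 4
partitions = [dict() for _ in range(N_PARTITIONS)]

def route(user_id):
    return hash(user_id) % N_PARTITIONS

def put(user_id, doc_id, doc):
    # _id uniqueness is only enforced within the target partition
    partitions[route(user_id)][(user_id, doc_id)] = doc

def point_read(user_id, doc_id):
    # knowing both keys means touching exactly one partition
    return partitions[route(user_id)].get((user_id, doc_id))

def read_by_id_only(doc_id):
    # knowing only _id forces a fan-out across every partition
    return [doc
            for part in partitions
            for (uid, did), doc in part.items()
            if did == doc_id]

# Two different users can hold documents with the same _id value.
put("alice", "doc-1", {"owner": "alice"})
put("bob", "doc-1", {"owner": "bob"})
```

This is why treating (userId, _id) as the effective identifier, as the answer suggests, is the scalable access pattern.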

What data structure does Google Firebase Firestore use for its default index

I'm curious if anyone knows, or can guess, the data structure Google's Firestore is using to index arbitrary NoSQL documents by every field. I'm looking to build something similar, making it as efficient as possible.
Some info about how their default index works:
all fields are indexed by default, but this only works for equality searches, not range (<, >)
any range searches require extra indexes
Source: https://firebase.google.com/docs/firestore/query-data/indexing
It's unlikely to be a standard B-tree index per field, because then range searches would work without requiring another index. Plus, if you added a new field (easy with document storage), it would take time to build an index on collections with billions of items.
One theory: one big index across all documents. Index "field_name:value" for every field in every document, with the index mapping to a sorted list of the document IDs which contain that field/value pair. It would be able to do an equality search (by merging the sorted doc IDs for every equality requirement), but not a range search. Basically an inverted index.
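That theory can be sketched in a few lines (pure Python, illustrative names): each `field:value` pair maps to a posting set of document IDs, and an equality query intersects the posting sets:

```python
from collections import defaultdict

# Toy inverted index: every "field:value" pair maps to a set of doc IDs
# (a posting list); equality queries intersect the posting lists.

index = defaultdict(set)

def add_doc(doc_id, doc):
    for field, value in doc.items():
        index[f"{field}:{value}"].add(doc_id)

def equality_query(**filters):
    postings = [index[f"{field}:{value}"] for field, value in filters.items()]
    return sorted(set.intersection(*postings)) if postings else []

add_doc(1, {"city": "NYC", "age": 30})
add_doc(2, {"city": "NYC", "age": 25})
add_doc(3, {"city": "SF", "age": 30})
```

Range queries don't work here because the keys are opaque strings, which matches the observation that inequality filters need a dedicated per-field index.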
Any suggestion for a better ways of implementing a pattern like this?
Clarification: single-field indexes do support range/inequality queries; composite indexes are about combining multiple field filters in a single query. See this page for more on index types:
https://firebase.google.com/docs/firestore/query-data/index-overview
Each field index is stored in its own key range, with contiguous regions assigned to a server, and with compute and storage scaling independently under the covers. Cloud Firestore handles indexes fairly similarly to Cloud Datastore (but not 100% the same).
You can see a basic overview in my Cloud Next conference session from last year.

Support for queries across documents in DynamoDB

I have been evaluating migration of our datastore from MongoDB to DynamoDB, since it is a well established AWS service.
However, I am not sure the DynamoDB data model is robust enough to support our use cases. I understand that DynamoDB added document support in 2014, but the examples I have seen do not appear to address queries that work across documents and that do not specify a value for the partition key.
For instance if I have a document containing employee info,
{
    "name": "John Doe",
    "department": "sales",
    "date_of_joining": "2017-01-21"
}
and I need to make a query like "give me all the employees who joined after 01-01-2016", then I can't do it with this schema.
I might be able to make this query after creating a secondary index which has a randomly generated partition key (say 0-99) and a sort key on "date_of_joining", then querying all the partitions and putting a condition on "date_of_joining". But this is too complex a way to do a simple query; doing something like this in MongoDB is quite straightforward.
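For what it's worth, the write-sharded GSI workaround described above can be sketched like this: run one Query per shard partition with a range condition on the sort key, then merge the already-sorted per-shard results (the `query_shard` callable stands in for a boto3 Query against the GSI; the shard count and data below are made up):

```python
import heapq

# Fan-out query over a write-sharded GSI: each item was written with a
# random shard key 0..N-1 and date_of_joining as the sort key.

N_SHARDS = 4

def fanout_range_query(query_shard, after_date):
    per_shard = [query_shard(shard, after_date) for shard in range(N_SHARDS)]
    # each shard's result is sorted by the sort key; merge keeps order
    return list(heapq.merge(*per_shard,
                            key=lambda it: it["date_of_joining"]))

# Fake per-shard data for demonstration.
shards = [
    [{"name": "A", "date_of_joining": "2016-05-01"}],
    [{"name": "B", "date_of_joining": "2017-01-21"}],
    [],
    [{"name": "C", "date_of_joining": "2016-02-14"}],
]

def fake_query(shard, after_date):
    # stands in for Query with KeyConditionExpression
    # 'shard = :s AND date_of_joining > :doj'
    return [it for it in shards[shard] if it["date_of_joining"] > after_date]

hires = fanout_range_query(fake_query, "2016-01-01")
```

It works, but you pay N queries per logical query, which is the complexity the question is objecting to.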
Can someone help with understanding if there is a better way to do such queries in DynamoDB and is DynamoDB really suited for such use cases?
Actually, the partition key of the GSI need not be unique. You can have date_of_joining as the partition key of a GSI.
However, when you query on the partition key, you cannot use greater-than on the partition key field; only equality is supported for partition keys. I am not sure why you wanted a random number as the GSI partition key and date_of_joining as the sort key. Even with that design, I don't think you will be able to use the DynamoDB Query API alone to get the expected result. You may end up using the DynamoDB Scan API, which is a costly operation in DynamoDB.
GSI:
date_of_joining - as Partition key
Supported in Query API:-
If you have multiple items for the same DOJ, the result will have multiple items (i.e. when you query using the GSI).
KeyConditionExpression : 'date_of_joining = :doj'
Not supported in Query API:-
KeyConditionExpression : 'date_of_joining > :doj'
Conclusion:-
You need to use DynamoDB Scan. If you are going to use Scan, then GSI may not be required. You can directly scan the main table using FilterExpression.
FilterExpression : 'date_of_joining > :doj'
Disadvantage:-
Costly
Not efficient
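The cost point is worth spelling out: a FilterExpression is applied after the items are read, so you consume read capacity for every item in the table regardless of how few items match. A toy model of that accounting (illustrative, not boto3):

```python
# Toy model of Scan + FilterExpression economics: the filter runs after
# the read, so read capacity is billed for the whole table, not the
# matches (data below is made up).

def scan_with_filter(table_items, predicate):
    read_units = len(table_items)   # every item is read and billed
    matches = [it for it in table_items if predicate(it)]
    return matches, read_units

employees = [
    {"name": "John Doe", "date_of_joining": "2017-01-21"},
    {"name": "Jane Roe", "date_of_joining": "2015-07-03"},
    {"name": "Max Poe", "date_of_joining": "2014-11-30"},
]

matches, billed = scan_with_filter(
    employees, lambda it: it["date_of_joining"] > "2016-01-01")
```

One match, but three items' worth of read capacity consumed; on a large table that gap is what makes Scan "costly" and "not efficient".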
You might decide to support your range queries with an indexing backend. For example, you could stream your table updates in DynamoDB to AWS ElasticSearch with a Lambda function, and then query ES for records matching the range of join dates you choose.

Choosing the right shard key in MongoDB

We are building our first MongoDB deployment and are currently trying to choose the right shard key.
Each document in our main collection contains around 40 voice-call-related fields, and the main field we use in queries is the UserId field. This is why we are thinking about a compound shard key of UserId and CallStartTime.
We are not sure about the second field, since CallStartTime is always advancing and one might argue that it is not random enough. This led us to consider replacing it with UserId and a hashed _id (Mongo's internal id, after hashing).
Is the first option OK, or would we be better off with the latter?
Consider the recommendations in the documentation here: http://docs.mongodb.org/manual/core/sharded-cluster-internals/#shard-keys
Or, if there is no natural choice, consider using a hashed shard key (mongodb 2.4+)
http://docs.mongodb.org/manual/reference/glossary/#term-hashed-shard-key
What sort of queries are you performing? What are the access patterns?
Ideally you want a key with good cardinality, write scaling and query isolation.
In your examples above you would need to know the CallStartTime or the hash value to avoid scatter-gather operations.
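The worry about an always-advancing CallStartTime is the classic monotonically-increasing shard key problem: range-based chunking sends every new write to the same "last" chunk, while hashed routing spreads them out. A stand-alone sketch of the contrast (chunk width/count and md5-mod-n routing are illustrative only, not MongoDB's actual chunking algorithm):

```python
import hashlib

# Contrast routing for a monotonically increasing key:
# range-based chunking vs hash-based routing.

def hashed_chunk(key, n_chunks=4):
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_chunks

# Ten consecutive call-start timestamps (strictly increasing).
recent_calls = list(range(1_600_000_000, 1_600_000_010))

CHUNK_WIDTH = 1_000_000
# Range-based routing on the raw value: every new write lands in
# the same chunk (the hot "last" chunk).
range_chunks = {ts // CHUNK_WIDTH for ts in recent_calls}

# Hash-based routing spreads the same writes across chunks.
hash_chunks = {hashed_chunk(ts) for ts in recent_calls}
```

This is the write-scaling argument for a hashed component in the shard key when the natural second field only ever increases.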

How can I specify the natural ordering in MongoDB?

Is there a way to specify the natural ordering of data in mongodb, similar to how a primary index would order data in a RDBMS table?
My use case is that all my queries return data sorted by a date field, say birthday. According to MongoDB: Sorting and Natural Order, the natural order for a standard (non-capped) collection is roughly the insertion order, but not guaranteed. This would imply sorting is needed after the data is retrieved.
I believe what you are referring to is a clustered index, not a primary index.
MongoDB 2.0 does not have a clustered index feature, so a regular index on the date field would be your most efficient option for retrieval.
It's probably premature optimization to think about the physical order on disk with MongoDB. MongoDB uses memory-mapped files, so depending on your working set + queries + RAM you may not need to load data from disk as often as expected.
If you are looking for something that acts like the primary index in an RDBMS, then sort by _id. That will be roughly insert order, since the _id is prefixed with a timestamp. If you try to use $natural order, the query will not use indexes.
Also, I would add that you should look into using the built-in timestamps in the document IDs instead of relying on a separate date field, as it allows you to store less data and removes an index.
Jason
MongoHQ
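The timestamp prefix mentioned above is the first 4 bytes (8 hex characters) of the ObjectId, which is why sorting by _id roughly matches insertion order; decoding it needs no driver at all (the ObjectId below is fabricated for illustration):

```python
import datetime

# The leading 4 bytes of a MongoDB ObjectId are a big-endian Unix
# timestamp; decode it straight from the 24-hex-char string.

def objectid_time(oid_hex):
    seconds = int(oid_hex[:8], 16)
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)

oid = "4f1d7e2c" + "a" * 16      # fabricated 24-hex-char ObjectId
created = objectid_time(oid)     # a datetime in early 2012
```

With a real driver, `bson.ObjectId(...).generation_time` gives the same value, which is what makes the "use the _id timestamp instead of a separate date field" suggestion practical.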
I guess it would be difficult to achieve what you want without the help of indexes. To support sharding, the _id field in MongoDB takes values based on the timestamp at the moment the document is created. As a consequence, you can't rely on _id values being strictly monotonically increasing, unlike the identity column in an RDBMS table. I think you must create an index on the birthday field if all your queries return documents sorted by birthday. Once the index is created, the queries become efficient enough.
Refer this:
MongoDB capped collection and monotonically increasing index