As it says in the DynamoDB documentation, it's recommended that we use only one table to model all our entities:
You should maintain as few tables as possible in a DynamoDB application. Most well-designed applications require only one table.
Now suppose we have a Product and a User entity. Using only one table, we have a schema like this:
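Something like this, for example (the key names and attributes here are just an illustration):

PK          SK           (attributes)
USER#u1     PROFILE      userName, userEmail
USER#u1     PRODUCT#p1   productTitle, price, userName, userEmail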
In DynamoDB, it's recommended that we keep related data together, which is why the user data is "duplicated" on the product entry.
My question is: if one day I update the user's name, will DynamoDB automatically update the copy of that user on my product entry, or does this kind of update have to be made manually?
DynamoDB will not update that copy for you; denormalized data has to be kept in sync by your application. That said, in DynamoDB it is recommended to keep items in denormalized form to get the benefits of DynamoDB, and the table is designed with the application layer in mind: we try to fetch from the single table all the values needed to build an entity with its mappings satisfied. Hence we create the table with attributes that can hold values from related entities. The alternative is to store only the relationship values (keys) that keep the connection to the other related items.
In the above scenario, we can keep the user details in one table and, while creating the product table, store the primary key of the user table in the product table. That way, if the username or other user details change in the future, there wouldn't be any problem.
In DynamoDB, using a sort key for the table keeps related items together. There is also a provision for composite sort keys to deal with one-to-many relationships.
Best practices for using sort keys:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html
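As a sketch of that reference-based approach with boto3 (the table name, key layout, and attribute names are all assumptions): read the product item, then follow the stored user key to get the always-current user details.

import boto3

table = boto3.resource("dynamodb").Table("app")

# 1. Read the product item (it holds the user's key, not a copy of the user).
product = table.get_item(Key={"PK": "PRODUCT#p1", "SK": "PRODUCT#p1"})["Item"]

# 2. Follow the reference to the user item; the name read here is always current.
user = table.get_item(Key={"PK": product["userPK"], "SK": "PROFILE"})["Item"]
print(user["userName"])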
Related
I have a DynamoDB/NoSQL/MongoDB question. I am from an RDBMS background and struggling to get a design right for a NoSQL DB. If anyone can help!
I have the following objects (tables in my terms):
Organisation
Users
Courses
Units
I want the following access points, of which most are achievable:
Get/Create/Update and Delete Organisation
Get/Create/Update and Delete Users
Get/Create/Update and Delete Courses
Which I can achieve.
The issue is that the Users and Courses objects have many ways to retrieve data:
email
username
For example: list users on a course.
List users for an org.
List courses for an org.
List users in a unit.
All of these use secondary indexes, which I semi-understand, but I also seem to need tertiary-ish indexes, though that is probably down to my design.
Coming from a relational methodology, I am also not sure about reporting: how would it work if I wanted to search for all users on courses who have not completed them (call it a status flag)?
From what I understand, I need indexes for everything I want to search by?
AWS DynamoDB is my preference, but I'll happily consider another NoSQL database. I realise that I need more education regarding NoSQL, so if anyone can provide good documentation and examples that help the learning process, that would be awesome.
Regards
Richard
I have watched a few Udemy videos and been Googling for many weeks (oh, and checked here, "obviously").
Things to keep in mind
Partitions
In DynamoDB everything is organized in partitions, which give you hash-based access to items. This is very powerful in terms of performance, but each partition has limits, so, much like the hash function in a hash map, the partition key should distribute items as evenly as possible.
Single Table Design
Don't split the data into multiple tables; doing so makes everything harder and actually limits the capabilities of the DB. Store everything in a single table.
Keys
Keys in dynamo have to be designed around your access patterns. This is the hardest part.
You have the Partition Key (Hash Key) -> this key has to be specified exactly every time; you can't perform a query without knowing the PK. This is why putting things like timestamps into the PK is a really bad idea.
Sort (Range) keys -> these are used for querying as specified in the AWS docs.
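A minimal query sketch with boto3 (the table name and key values are assumptions) showing both rules - the PK must be given exactly, while the SK can be constrained:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app")  # assumed table name

resp = table.query(
    # The partition key must always be an exact match...
    KeyConditionExpression=Key("PK").eq("ORG#acme")
    # ...while the sort key supports range conditions such as begins_with.
    & Key("SK").begins_with("USER#")
)
users = resp["Items"]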
Attribute names
DB migrations are really hard in NoSQL, so you have to use generic names for the key attributes; the names should not carry any meaning.
For example, "UserID" is a bad name for a partition key and "PK" is a good one; the same goes for all keys.
Indexes
You have two types of indexes, local and global.
Local indexes are created once, when you create the table, and can't be changed (easily) afterward. You can only have a few of them. They give you an extra sort key to work with; the main benefit is that they are strongly consistent.
Global indexes can be created at any time. They give you both a new partition key and a new sort key to work with, but they are eventually consistent. Go with global indexes unless you have a good reason to use local ones.
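For instance, a GSI can be added to a live table like this (a boto3 sketch; the names are assumptions, and the table is assumed to use on-demand billing, otherwise the index needs its own ProvisionedThroughput):

import boto3

client = boto3.client("dynamodb")
client.update_table(
    TableName="app",
    AttributeDefinitions=[
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    }],
)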
Going back to your problem, let's focus on one of the tables as an example - Users.
A user can be inserted like this (for example):
PK                 SK                    GSI1PK                GSI1SK            (attributes)
Username#john123   Email#john@gmail.com  Email#john@gmail.com  Username#john123  <User Data>
This way you can query users by both email and username. Keep in mind that (PK, SK) pairs have to be unique. The SK in this case is otherwise free and can be used for other access patterns (which you didn't provide).
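Looking a user up by email then goes through the index (a sketch; the table and index names are as in the layout above):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app")  # assumed table name

resp = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("Email#john@gmail.com"),
)
user = resp["Items"][0]  # remember: GSI reads are eventually consistent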
Another way might be to copy the data:

PK                    SK                    (attributes)
Username#john123      Email#john@gmail.com  <user data>
Email#john@gmail.com  Username#john123      <user data>
This way you avoid having to deal with indexes (which can sometimes be expensive), but you have to keep the user data consistent manually.
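One way to keep the two copies from drifting apart is to write them in a single transaction (a boto3 sketch; table and attribute names are assumptions):

import boto3

client = boto3.client("dynamodb")
client.transact_write_items(
    TransactItems=[
        # Both copies of the user succeed or fail together.
        {"Put": {"TableName": "app", "Item": {
            "PK": {"S": "Username#john123"},
            "SK": {"S": "Email#john@gmail.com"},
            "userName": {"S": "John"},
        }}},
        {"Put": {"TableName": "app", "Item": {
            "PK": {"S": "Email#john@gmail.com"},
            "SK": {"S": "Username#john123"},
            "userName": {"S": "John"},
        }}},
    ]
)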
Further reading
-> https://www.alexdebrie.com/posts/dynamodb-single-table/
-> my medium post
I'm coming at this problem with an RDBMS background, so some of the best practices of document databases are new to me. I'm trying to understand the best way to store shared data and access rights to that data. The schema in SQL Server might look like this:
Project Table
projectId PK
ownerId FK User.userId
title
...
User Table
userId PK
name
...
ProjectShare Table
sharedById FK User.userId
sharedWithId FK User.userId
state
...
With the above tables I could query all projects that a user has access to. I could then query for all the data related to each project. Each project will have many related tables. The hierarchical nature of the data seems well suited for a document database.
How would I best structure something like this in a document database like MongoDB, CouchDB or DocumentDB?
There are indeed multiple approaches to model this data in DocumentDB.
Collections in DocumentDB can host a heterogeneous set of documents and can be partitioned for massive scale.
Depending on the query requirements, the data could be denormalized in several directions: by pivoting on projects (keeping all associated users, including owner, sharedBy, and sharedWith details) or by pivoting on users (keeping all the projects they own, with project details including information about the other users who share each project).
One can also control the level of denormalization by storing a soft reference and keeping the referred information as a separate document. For instance, if we pivot by project, we could store all the user information repeatedly in each project document, or store the userId alone (in which case the user information is stored in a separate document). You can control how much referred data to store based on your query and logical-integrity constraints.
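As a sketch of those two levels of denormalization (all field names here are assumptions):

# Fully denormalized: user details are repeated inside the project document.
project_denormalized = {
    "id": "project-1",
    "title": "Shared project",
    "owner": {"userId": "user-1", "name": "Alice"},
    "sharedWith": [{"userId": "user-2", "name": "Bob", "state": "accepted"}],
}

# Soft reference: only userIds are stored; user details live in their own documents.
project_with_refs = {
    "id": "project-1",
    "title": "Shared project",
    "ownerId": "user-1",
    "sharedWith": [{"userId": "user-2", "state": "accepted"}],
}
user_doc = {"id": "user-2", "name": "Bob"}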
In my PostgreSQL database I have two tables, board and cards, with a one-to-many relationship between them (one board can have multiple cards).
A user can hold a few cards on the board. To implement this functionality I would typically create another table, called for example cards_on_hold, with a one-to-many relationship, and place the IDs of cards on hold into it. To fetch this data for a board I'd use a JOIN between board and cards_on_hold.
Is there a more effective way in PostgreSQL to store the IDs of cards on hold? For example, is there some feature to store this list inline in the board table? I'll need to use this ID list later in an IN SQL clause to filter the card set.
Postgres does support arrays of integers (assuming your ids are integers):
http://www.postgresql.org/docs/9.1/static/arrays.html
However, manipulating that data is a bit harder than with a separate table. For example, with a separate table you can add a uniqueness guarantee so that you won't have duplicate IDs (assuming you'd want that). To achieve the same thing with an array you would have to create a stored procedure to detect duplicates (on insert, for example), and it would be hard (if possible at all) to make that as efficient as a simple unique constraint. Not to mention that you lose the consistency guarantee, because you can't put a foreign key constraint on such an array.
So, in general, consistency would be an issue with an inline list. At the same time, I doubt you would get any noticeable performance gain. After all, arrays should not be used as an "aggregated foreign key", IMHO.
All in all: I suggest you stick to a separate table.
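A sketch of that separate table with the constraints discussed above (psycopg2; the table and column names, including board(id) and cards(id), are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection settings
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS cards_on_hold (
            board_id integer NOT NULL REFERENCES board(id),
            card_id  integer NOT NULL REFERENCES cards(id),
            UNIQUE (board_id, card_id)  -- no duplicate holds, enforced by the DB
        )
    """)
    # The held IDs can then feed the IN-style filter directly:
    cur.execute("""
        SELECT c.* FROM cards c
        WHERE c.id IN (SELECT card_id FROM cards_on_hold WHERE board_id = %s)
    """, (1,))
    held_cards = cur.fetchall()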
I have the following objects: Company, User, and Order (which contains order lines). Users place orders with one or more order lines, and these relate to a Company. The time period for which orders can be placed for a Company is only a week.
What I'm not sure about is where to place the orders array: should it be a collection of its own containing a link to the User and a link to the Company, should it sit under the Company, or should the orders sit under the User?
Numbers-wise, I need to plan for 50k+ orders.
Query-wise, I'll mainly be looking at orders by Company, but I would also need to find the orders for a Company placed by a specific user.
1) For folks coming from the SQL world (such as myself), one of the hardest things to learn about MongoDB is the new style of schema design. In the SQL world, everything goes into third normal form. Folks come to think that there is a single right way to design their schema, because there typically is one.
In the MongoDB world, there is no one best schema design. More accurately, in MongoDB schema design depends on how the application is going to access the data.
2) Here are the key questions that you need to have answered in order to design a good schema for MongoDB:
How much data do you have?
What are your most common operations? Will you be mostly inserting new data, updating existing data, or doing queries?
What are your most common queries?
How many I/O operations do you expect per second?
What you're talking about here is modeling Many-to-One relationships:
Company -> User
User -> Order
Order -> Order Lines
Company -> Order
Using SQL you would create a pair of master/detail tables with a primary key/foreign key relationship. In MongoDB, you have a number of choices: you can embed the data, you can create a linked relationship, you can duplicate and denormalize the data, or you can use a hybrid approach.
The correct approach would depend on a lot of details about the use case of your application, many of which you haven't provided.
3) This is my best guess - and it's only a guess - as to a good schema for you.
a) Have separate collections for Users, Companies, and Orders
If you're looking at 50k+ orders, there are too many to embed in a single document. Having them as a separate collection will allow you to reference them from both the Company and the User documents.
b) Have an array of references to the Order documents in both the Company and the User documents. This makes the query "Find all Orders for this Company" a single-document query
c) If your query pattern supports it, you might also have a duplicate link from Orders back to the owning Company and/or User.
d) Assuming that the order lines are unique to the individual Order, you would embed the Order Lines in an array within the Order documents.
e) If your order lines refer back to individual Products, you might want to have a separate Product collection, and include a reference to the Product document in the order line sub-document
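Putting a) through e) together, a pymongo sketch (all names are illustrative):

from pymongo import MongoClient

db = MongoClient()["shop"]  # assumed database name

company_id = db.companies.insert_one({"name": "Acme", "orderIds": []}).inserted_id
user_id = db.users.insert_one({"name": "Jane", "orderIds": []}).inserted_id

order_id = db.orders.insert_one({
    "companyId": company_id,  # duplicate link back to the owning Company...
    "userId": user_id,        # ...and the owning User (point c)
    "orderLines": [           # unique to this order, so embedded (point d)
        {"productId": "p1", "qty": 2, "price": 9.99},
    ],
}).inserted_id

# Point b: maintain the arrays of Order references on both sides.
db.companies.update_one({"_id": company_id}, {"$push": {"orderIds": order_id}})
db.users.update_one({"_id": user_id}, {"$push": {"orderIds": order_id}})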
4) Here are some good general references on MongoDB schema design.
MongoDB presentations:
http://www.10gen.com/presentations/mongosf2011/schemabasics
http://www.10gen.com/presentations/mongosv-2011/schema-design-by-example
http://www.10gen.com/presentations/mongosf2011/schemascale
Here are a couple of books about MongoDB schema design that I think you would find useful:
http://www.manning.com/banker/ (MongoDB in Action)
http://shop.oreilly.com/product/0636920018391.do
Here are some sample schema designs:
http://docs.mongodb.org/manual/use-cases/
Note that the "MongoDB in Action" book includes a sample schema for an e-commerce application, which is very similar to what you're trying to build -- I recommend you check it out.
I have a SQL database with two tables like this:
Users
Id (PK)
Name
Orders
Id (PK)
UserId (FK - User.Id)
Amount
I'd like to move this to a NoSQL (e.g. MongoDB) key-value store in the interest of avoiding joins (on very large result sets).
Does this structure make sense as-is to be moved to a KV database? If not, should I add another table like User_Orders relating users and orders?
I have a screen that displays Orders in a grid, but I'd also like to display the User name. In SQL I would use a join to pull this from the database.
Is there an equivalent in NoSQL (without a join), other than querying the database once per Order.UserId to get the related user? If not, how could I apply (distributed?) map-reduce in this instance to accomplish the same goal, assuming my architecture allows me to run multiple front-end and application servers?
Thanks!
A big change when moving from a relational to a NoSQL database is denormalization. Depending on how often the user name changes in your system, you can simply add the user name to the orders collection (a table, in relational terms).
So, your orders collection schema would look like:
{"userId":"abc123", "userName": "Some Name", "orderId":"someorderId","amount":153.23}
You can use simple find() queries to get data about orders and users. If the name were to change, it'd be a multi-document update, but if that does not happen often, it's not that bad. For once-in-a-blue-moon updates, denormalization is good because it benefits reads. Again, this is not a rule of thumb; it is totally up to your use case and design, considering the read:write ratio.
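A sketch of both sides of that trade-off with pymongo (the collection and field names are assumptions, matching the example document above):

from pymongo import MongoClient

db = MongoClient()["shop"]  # assumed database name

# Reads stay join-free because userName is duplicated onto every order:
orders = list(db.orders.find({"userId": "abc123"}))

# The rare rename refreshes every duplicated copy in one statement:
db.orders.update_many({"userId": "abc123"}, {"$set": {"userName": "New Name"}})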
If the user name does change very often and you do not wish to denormalize, then you can always cache the userId-to-userName map with an appropriate TTL and look up ID -> name in your application layer, instead of using the database to impose business constraints.
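For that cache, something like the cachetools package would do (a sketch; the TTL, size, and lookup are assumptions):

from cachetools import TTLCache
from pymongo import MongoClient

db = MongoClient()["shop"]  # assumed database name
user_names = TTLCache(maxsize=10_000, ttl=300)  # entries expire after 5 minutes

def get_user_name(user_id):
    # Fall back to the users collection only on a cache miss or expiry.
    if user_id not in user_names:
        user_names[user_id] = db.users.find_one({"_id": user_id})["name"]
    return user_names[user_id]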
You won't need map-reduce just to pull orders and users - unless you are doing massive aggregation of data.