What is the best way to access relational data in DynamoDB? Is it best to store information redundantly and update it as needed, or to query every time you need access to relational data?
For example say I have a Team table that has a Members field that stores an array of member ids that also include the member's name.
Team
Members: [
{member1: "John"},
{member2: "Sam"},
{member3: "Pam"}
]
Whenever my application needs access to a team's members, all I have to do is call get on the team's Members field. Then if I need additional team member data I can further call get on the entire member record.
One of the concerns I have with this method is having to update the names inside of the members array each time a person updates their name. Not to mention all of the other places the user's name could also be stored.
My main question then is whether I should be instead querying for these records every time instead of storing this data inside of the team record.
For example should I instead query all members who have team 123?
Would querying be more expensive to continually do, or would having to update all of this dependent data be more expensive?
I know that the query route is less of a headache than trying to prevent data anomalies, but querying may also require multiple calls to get the same data.
DynamoDB write operations are more expensive than read operations, and they become more expensive still when you add secondary indexes to the table.
Also, synchronizing values across different items may cost extra if the sync operation must run in a transaction to ensure atomicity.
It mainly depends on your application's needs and your budget: estimate roughly how many requests each use case will generate over a period (for example, a month), then optimize the cost of the use cases that generate most of your traffic, regardless of their type (read or write).
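The per-use-case estimate described above can be sketched as a small calculation. This is only an illustration: the unit prices, traffic figures, and the assumption that a member belongs to three teams are hypothetical placeholders, not actual DynamoDB pricing.

```python
from dataclasses import dataclass

# Hypothetical per-request prices; substitute the real prices for your
# region and capacity mode.
READ_PRICE = 0.25e-6
WRITE_PRICE = 1.25e-6


@dataclass
class UseCase:
    name: str
    calls_per_month: int
    reads_per_call: int
    writes_per_call: int

    def monthly_cost(self) -> float:
        reads = self.calls_per_month * self.reads_per_call
        writes = self.calls_per_month * self.writes_per_call
        return reads * READ_PRICE + writes * WRITE_PRICE


def total_cost(use_cases) -> float:
    return sum(u.monthly_cost() for u in use_cases)


# Design A: member names are copied into the team item.
denormalized = [
    UseCase("view team", 1_000_000, reads_per_call=1, writes_per_call=0),
    # A rename rewrites the member item plus each team item holding a copy
    # (assumed here: 3 teams per member).
    UseCase("rename member", 10_000, reads_per_call=0, writes_per_call=1 + 3),
]

# Design B: the team view reads the team, then queries its members.
normalized = [
    UseCase("view team", 1_000_000, reads_per_call=2, writes_per_call=0),
    UseCase("rename member", 10_000, reads_per_call=0, writes_per_call=1),
]

for label, design in (("denormalized", denormalized), ("normalized", normalized)):
    print(label, round(total_cost(design), 4))
```

With these made-up numbers the read-heavy workload favors the redundant copy; a workload dominated by renames would tip the other way, which is why the traffic mix matters more than the price of a single operation.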
For some read use cases, you may be able to accept eventual consistency in the read model and use a cache (e.g. S3 files); that is still an option.
I'm new to MongoDB and trying to wrap my head around managing duplicate data. The Extended Reference Pattern (link) is a good example. When you have two related collections (e.g., Customers and Orders), it can make sense for performance reasons to duplicate some information that would otherwise just live in the referenced collection. So for instance, the Order collection might duplicate the customer's name to avoid unnecessary joins with some queries.
I totally get that. And I totally get that you should be careful about what data you duplicate ("it works best if [duplicated fields] don't frequently change"), as updating those records can be expensive. What I don't understand is how you're supposed to keep track of where all that data is housed. Suppose you do need to update a customer's name. If that information is duplicated in multiple orders within the Order collection, plus maybe one or two other collections, tracking down every place the customer name lives (and the mechanics of changing it) sounds like a logistical nightmare!
Is there some sort of Mongo voodoo magic that can help with these sorts of updates, or is that just a necessarily messy process?
You have to manage all of those changes in your application, so take care when selecting one pattern or another; none of them is a silver bullet.
Also remember that not all data needs to be updated; it depends on the situation, the data, and the context of your app.
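One way to manage those changes in the application is to keep an explicit registry of where each duplicated field is stored and route updates through it. The sketch below is a hypothetical illustration using plain dictionaries in place of a database; the collection and field names are invented for the example.

```python
# Registry: for each source field, every place a copy of it is stored.
# Each entry is (collection, field holding the customer id, field holding the copy).
DUPLICATES = {
    ("customers", "name"): [
        ("orders", "customer_id", "customer_name"),
        ("invoices", "customer_id", "customer_name"),
    ],
}


def update_with_fanout(db, collection, doc_id, field, value):
    """Update the source document, then every registered copy.

    Returns the number of copies that were rewritten.
    """
    db[collection][doc_id][field] = value
    touched = 0
    for target, ref_field, copy_field in DUPLICATES.get((collection, field), []):
        # With a real MongoDB driver this inner loop would typically be a
        # single update_many({ref_field: doc_id}, {"$set": {copy_field: value}}).
        for doc in db[target].values():
            if doc.get(ref_field) == doc_id:
                doc[copy_field] = value
                touched += 1
    return touched


# In-memory stand-in for a database: {collection: {doc_id: document}}.
db = {
    "customers": {"c1": {"name": "Ann"}},
    "orders": {
        "o1": {"customer_id": "c1", "customer_name": "Ann"},
        "o2": {"customer_id": "c2", "customer_name": "Bob"},
    },
    "invoices": {"i1": {"customer_id": "c1", "customer_name": "Ann"}},
}

touched = update_with_fanout(db, "customers", "c1", "name", "Anna")
```

Keeping the registry in one place means adding a new duplicate is a one-line change, and fields that never need to be refreshed can simply be left out of it.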
I want to restrict access to sensitive attributes of my Users documents to a smaller set of clients. My current understanding is that there are two ways to split the data, so that we can make security rules for each part:
Create a Users collection and a top level SensitiveUserData collection that both use the same document ID, and only retrieve the SensitiveUserData for a user when needed and allowed.
Create a SensitiveUserData subcollection within the User document. This collection will always contain just a single document, but the ID won't matter.
Which of these (or a third) is preferred in general?
Neither of these approaches is inherently better than the other, and both have valid use-cases. In the end it's a combination of personal preference and a (typically evolving) insight into the use-cases of your app.
In many scenarios, using subcollections is preferred as it allows the data to be better spread out over the physical storage, which in turn helps throughput. But in this case I doubt that makes a difference, as you're likely to use the user ID as keys in both SensitiveUserData and Users collections, so they'll be similarly distributed anyway.
For me personally, I often end up with a top-level collection. But that may well be related to my long history of modeling data in the Firebase Realtime Database, where access permission is inherited, so you can't hide a subcollection there.
Currently, our system is not entirely normalized, and we use meteor-publish-composite to obtain the normalized data in mongodb. Some models have very few dependencies, but others have arrays of objects (i.e. sub-documents) with few foreign keys that we are subscribing to when fetching each model.
An example would be a Post containing a list of Comment sub-documents, where each comment has a userId field.
My question is, while I know it would be faster to use collection hooks and update the collection with data denormalization, how does Meteor handle multiple subscriptions on the same collection?
Do a hundred subscriptions on the same collection affect the application speed (significantly)? What about a thousand?
This may not fully answer your question, however after spending countless hours tuning the performance of a large meteor app, I thought I would share some of the things that I have learned.
In Meteor, when you define a publication, you are setting up a reactive query that continues to push data to subscribed clients when changes to the underlying mongo data causes the result of the query to change. In other words, it sets up a query that will continually push data to clients as the data is inserted, updated, or removed. The mechanism by which it does this is by creating an observer on the query.
When an observer is initialized (e.g. when publication is subscribed to), it will query mongodb for the initial dataset to send down and then use the oplog to detect changes going forward. Fortunately, meteor is able to re-use an existing observer for a new subscription if the query is for the same collection, same selectors, and same options.
This means that you could create hundreds of subscriptions against many different publications, but if they are hitting the same collection and using the same query selectors and options, then you effectively have only one observer in play. For more details, I highly recommend reading this article from kadira.io (from which I acquired the information I used in this answer).
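The reuse rule described above can be modeled as a lookup keyed on the collection, selector, and options. This is a toy model to illustrate the idea, not Meteor's actual implementation; the class and method names are made up.

```python
import json


class ObserverRegistry:
    """Toy model: one observer per distinct (collection, selector, options)."""

    def __init__(self):
        self._observers = {}  # key -> number of subscriptions sharing it

    @staticmethod
    def _key(collection, selector, options):
        # Serialize with sorted keys so that equal queries yield equal keys.
        return (
            collection,
            json.dumps(selector, sort_keys=True),
            json.dumps(options, sort_keys=True),
        )

    def subscribe(self, collection, selector, options=None):
        key = self._key(collection, selector, options or {})
        self._observers[key] = self._observers.get(key, 0) + 1
        return key

    def observer_count(self):
        return len(self._observers)


registry = ObserverRegistry()
# A hundred subscriptions with an identical query share one observer.
for _ in range(100):
    registry.subscribe("users", {"teamId": "t1"}, {"fields": {"name": 1}})
# A different selector needs its own observer.
registry.subscribe("users", {"teamId": "t2"}, {"fields": {"name": 1}})
```

In this model the cost grows with the number of distinct queries rather than the number of subscriptions, which is the property the answer relies on.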
In addition to this, Meteor is also able to deal with multiple publications publishing the same document, and when this occurs, the documents will be merged into one. See this for more detail.
Lastly, because of Meteor's MergeBox component, it will minimize the data being sent over the wire across all your subscriptions by keeping track of what data changed vs. what is already on the client.
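The change tracking described above can be pictured as a per-client record of what has already been sent. The following is a simplified sketch of that idea only; it is not how MergeBox is implemented, and the names are hypothetical.

```python
_MISSING = object()  # sentinel for "client has never seen this field"


class ClientView:
    """Toy model: remember what a client has, send only what differs."""

    def __init__(self):
        self.sent = {}  # doc_id -> fields the client already holds

    def diff(self, doc_id, fields):
        known = self.sent.setdefault(doc_id, {})
        changed = {
            k: v for k, v in fields.items() if known.get(k, _MISSING) != v
        }
        known.update(changed)
        return changed


view = ClientView()
first = view.diff("p1", {"title": "Hi", "votes": 1})   # everything is new
second = view.diff("p1", {"title": "Hi", "votes": 2})  # only votes changed
# A second publication sending an already-known field adds nothing.
third = view.diff("p1", {"title": "Hi"})
```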
Therefore, in your specific example, it sounds like you will be running several different subscriptions on effectively the same query (since you are just trying to de-normalize your data) and dataset. Because of all the optimizations that I described above, I would guess that you won't be plagued by performance issues by taking this approach.
I have done similar things in one of my apps and have never had an issue.
I have this schema for support of in-site messaging:
When I send a message to another member, the message is saved to Message table; a record is added to MessageSent table and a record per recipient is added to MessageInbox table. MessageCount is being used to keep track of number of messages in the inbox/send folders and is filled using insert/delete triggers on MessageInbox/MessageSent - this way I can always know how many messages a member has without making an expensive "select count(*)" query.
Also, when I query member's messages, I join to Member table to get member's FirstName/LastName.
Now, I will be moving the application to MongoDB, and I'm not quite sure what the collection schema should be. Because there are no joins available in MongoDB, I have to completely denormalize it, so I would have MessageInbox, MessageDraft and MessageSent collections with full message information, right?
Then I'm not sure about following:
What if a user changes his First/LastName? It will be stored denormalized as sender in some messages, as a part of Recipients in other messages - how do I update it in optimal ways?
How do I get message counts? There will be tons of requests at the same time, so it has to be performing well.
Any ideas, comments and suggestions are highly appreciated!
I can offer you some insight as to what I have done to simulate JOINs in MongoDB.
In cases like this, I store the ID of a corresponding user (or multiple users) in a given object, such as your message object in the messages collection.
(I'm not suggesting this be your schema, just using it as an example of my approach)
{
_id: "msg1234",
from: "user1234",
to: "user5678",
subject: "This is the subject",
body: "This is the body"
}
I would query the database to get all the messages I need then in my application I would iterate the results and build an array of user IDs. I would filter this array to be unique and then query the database a second time using the $in operator to find any user in the given array.
Then in my application, I would join the results back to the object.
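The two-query approach above might look roughly like this. It is a minimal sketch: `fetch_users` is a hypothetical stand-in for the second database query, and the sample data is invented so the example runs without a database.

```python
def join_users(messages, fetch_users):
    """Attach sender and recipient documents using one extra lookup."""
    # Collect the unique user ids referenced by the messages.
    user_ids = {m["from"] for m in messages} | {m["to"] for m in messages}
    # One batched lookup; with pymongo this could be
    # users.find({"_id": {"$in": list(user_ids)}}).
    users_by_id = {u["_id"]: u for u in fetch_users(sorted(user_ids))}
    return [
        {
            **m,
            "from_user": users_by_id.get(m["from"]),
            "to_user": users_by_id.get(m["to"]),
        }
        for m in messages
    ]


# In-memory stand-in for the users collection (hypothetical data).
USERS = [
    {"_id": "user1234", "name": "John"},
    {"_id": "user5678", "name": "Sam"},
]


def fetch_users(ids):
    wanted = set(ids)
    return [u for u in USERS if u["_id"] in wanted]


messages = [
    {"_id": "msg1234", "from": "user1234", "to": "user5678",
     "subject": "This is the subject", "body": "This is the body"},
]
joined = join_users(messages, fetch_users)
```

Because user data is fetched at read time, a name change only ever touches the user document.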
It requires two queries to the database (or potentially more if you want to join other collections) but this illustrates something that many people have been advocating for a long time: Do your JOINs in your application layer. Let the database spend its time querying data, not processing it. You can probably scale your application servers quicker and cheaper than your database anyway.
I am using this pattern to create real-time activity feeds in my application and it works flawlessly and fast. I prefer this to denormalizing things that could change, like user information, because when writing to the database, MongoDB may need to rewrite the entire object if the new data doesn't fit in the old data's place. If I needed to rewrite hundreds (or thousands) of activity items in my database, it would be a disaster.
Additionally, writes on MongoDB are blocking, so if a scenario like the one I've just described were to happen, all reads and writes would be blocked until the write operation is complete. I believe this is scheduled to be addressed in some capacity for the 2.x series, but it's still not going to be perfect.
Indexed queries, on the other hand, are super fast, even if you need to do two of them to get the data.
I have two collections with a many-to-many relationship. I want to store an array of linked ObjectIds in both documents so that I can take Document A and retrieve all linked Document B's quickly, and vice versa.
Creating this link is a two step process
Add Document A's ObjectId to Document B
Add Document B's ObjectId to Document A
After watching a MongoDB video I found this to be the recommended way of storing a many-to-many relationship between two collections
I need to be sure that both updates are made. What is the recommended way of robustly dealing with this crucial two step process without a transaction?
I could condense this relationship into a single link collection, the advantage being a single update with no chance of Document B missing the link to Document A. The disadvantage being that I'm not really using MongoDB as intended. But, because there is only a single update, it seems more robust to have a link collection that defines the many-to-many relationship.
Should I use safe mode and manually check the data went in afterwards and try again on failure? Or should I represent the many-to-many relationship in just one of the collections and rely on an index to make sure I can still quickly get the linked documents?
Any recommendations? Thanks
@Gareth, you have multiple legitimate ways to do this. So the key concern is how you plan to query for the data (i.e. which queries need to be fast).
Here are a couple of methods.
Method #1: the "links" collection
You could build a collection that simply contains mappings between the collections.
Pros:
Supports atomic updates so that data is not lost
Cons:
Extra query when trying to move between collections
Method #2: store copies of smaller mappings in larger collection
For example: you have millions of Products, but only a hundred Categories. Then you would store the Categories as an array inside each Product.
Pros:
Smallest footprint
Only need one update
Cons:
Extra query if you go the "wrong way"
Method #3: store copies of all mappings in both collections
(what you're suggesting)
Pros:
Single query access to move between either collection
Cons:
Potentially large indexes
Needs transactions (?)
Let's talk about "needs transactions". There are several ways to do transactions and it really depends on what type of safety you require.
Should I use safe mode and manually check the data went in afterwards and try again on failure?
You can definitely do this. You'll have to ask yourself, what's the worst that happens if only one of the saves fails?
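If you go the check-and-retry route, it helps to make each half of the link idempotent so that repeating it after a partial failure does no harm. The sketch below uses plain dictionaries and a hypothetical `write` callback in place of real database calls; `add_to_set` is only analogous in spirit to MongoDB's `$addToSet`.

```python
def add_to_set(doc, field, value):
    """Idempotent append: adding the same value twice has no extra effect."""
    values = doc.setdefault(field, [])
    if value not in values:
        values.append(value)


def link(doc_a, doc_b, write=add_to_set, max_attempts=3):
    """Apply both halves of the link, retrying the pair on failure.

    Repeating the first half is safe because each write is idempotent.
    """
    for _ in range(max_attempts):
        try:
            write(doc_a, "b_ids", doc_b["_id"])
            write(doc_b, "a_ids", doc_a["_id"])
            return True
        except OSError:
            continue  # simulated transient failure; try the pair again
    return False


a = {"_id": "A"}
b = {"_id": "B"}
linked = link(a, b)
```

If `link` returns False after all attempts, the caller still has to decide what to do with a half-applied link, which is the "worst that happens" question above.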
Method #4: queue the change
I don't know if you've ever worked with queues, but if you have some leeway you can build a simple queue and have different jobs that update their respective collections.
This is a much more advanced solution. I would tend to go with #2 or #3.
Why don't you create a dedicated collection holding the relations between A and B as dedicated rows/documents, as one would in an RDBMS? You can then modify the relation with a single operation, which is of course atomic.
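A dedicated relation collection can be pictured as a set of (A id, B id) pairs, where creating or removing a link is one write. This is an in-memory sketch with hypothetical names; in MongoDB the equivalent would be one small document per pair, typically with indexes on both id fields.

```python
class LinkCollection:
    """Toy many-to-many store: each link is a single (a_id, b_id) record."""

    def __init__(self):
        self._links = set()

    def link(self, a_id, b_id):
        # One write creates the relationship in both directions.
        self._links.add((a_id, b_id))

    def unlink(self, a_id, b_id):
        self._links.discard((a_id, b_id))

    def bs_for(self, a_id):
        return sorted(b for a, b in self._links if a == a_id)

    def as_for(self, b_id):
        return sorted(a for a, b in self._links if b == b_id)


links = LinkCollection()
links.link("A1", "B1")
links.link("A1", "B2")
links.link("A2", "B1")
```

The trade-off is the extra lookup: moving from an A to its B documents means reading the link records first and then fetching the B documents themselves.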
Should I use safe mode and manually check the data went in afterwards and try again on failure?
Yes, this is one approach, but there is another: you can implement an optimistic transaction. It has some overhead and limitations, but it guarantees data consistency. I wrote an example and some explanation on a GitHub page.