DynamoDB - How to handle updates using adjacency list pattern? - nosql

So, in DynamoDB the reccomended approach to a many-to-many relationship is using Adjacency List Pattern.
Now, it works great for when you need to read the Data because you can easily read several items with one request.
But what if I need to update/delete the Data? These operations happen on a specific item instead of a query result.
So if I have thousands of replicated data to facilitate a GET operation, how am I going to update all of these replicas?
The easiest way I can think of is instead of duplicating the data, I only store an immutable ID, but that's pretty much emulating a relational database and will take at least 2 requests.

Simple answer: You just update the duplicated items :) AFAIK redundant data is preferred in NoSQL databases and there are no shortcuts to updating data.
This of course works best when read/write ratio of the data is heavily on the read side. And in most everyday apps that is the case (my gut feeling that could be wrong), so updates to data are rare compared to queries.
DynamoDB has a couple of utils that might be applicable here. Both have their shortcomings though
BatchWriteItem allows to put or delete multiple items in one or more tables. Unfortunately, it does not allow updates, so probably not applicable to your case. The number of operations is also limited to 25.
TransactWriteItems allows to perform an atomic operation that groups up to 10 action requests in one or more tables. Again the number of operations is limited for your case
My understanding is that both of these should be used with caution and consideration, since they might cause performance bottlenecks for example. The simple way of updating each item separately is usually just fine. And since the data is redundant, you can use async operations to make multiple updates in parallel.

Related

Storing two way relational data in Redis

Over the last few days I've been working on a very simple web service for myself (and a few others) that allows me to keep track of books that I've read and when I've read them. Whilst storing users and books (titles + authors + maybe more data in the future) is relatively simple because they can just be stored as hashes with keys user:username and book:uniqueID respectively storing which users read which books and when is proving to be a bit more challenge.
My original plan was to have a sorted set for a user (user:username:readbooks) that used the timestamp as a score (for when the user read the book) and each book's unique ID as the value. The problem with this approach is that I can't store that a user has read a book twice (as you can't have duplicate values in a set). It also means that in order to track readers of a book I have to add them to a second set readersof:bookID.
My current approach that is rather than directly storing book IDs in the set user:username:readbooksto instead store a value in the form uniqueReadingEventId.bookId, however the problem with this is that if I delete a book (rather than the unique reading event) I have to iterate through every user in the set readersof:bookID, iterate through every value in user:username:readbooks and deleting values that match x.bookId, which seems a little inefficient. Furthermore, I may want to find users that have read two or more books in common.
My question is therefore two fold: is there a simpler way to structure my data in Redis or is my data better structured to a different NoSQL system? I would really like to continue working with Redis because I like its API, however because it is a personal project it doesn't really matter what I use.
Unless you need really high throughput here for some reason, it doesn't sound like Redis is the right choice. It sounds like you want to store a lot of document level information, and neither high-throughput nor data structures are a huge concern for you. To me that screams for just using SQL. Your data is very schematic-- and from what you've said, there's really no reason SQL wouldn't best and most simply fit your use case. If you're married to the idea of using NoSQL, one of the more general use-case databases like Mongo would also serve well.
Redis as a persistent database is specialized for cases where you need high throughput, data structures are useful, and you don't mind paying the extra cost of keeping everything in memory instead of much less expensive HD space. There are lots of scenarios where Redis fits perfectly, but yours isn't one of them.

Why would using a nosql/document/MongoDB as a relational database be inferior?

I have recently been introduced to MongoDB and I've come to like a lot (compared to MySQL i used for all projects).
However in some certain situations, storing my data with documents "linking" to each other with simple IDs makes more sense (to reduce duplicated data).
For example, I may have Country and User documents, where a user's location is actually an ID to a Country (since a Country document includes more data, hence duplicating Country data in each user makes no sense).
What I am curious about is.. why would MongoDB be inferior compared to using a proper relationship database?
Is it because I can save transactions by doing joins (as opposed to doing two transactions with MongoDB)?
Thats a good question..!!
I would say there is definitely nothing wrong in using nosql db for the type of data you have described. For simple usecases it will work perfectly well.
The only point is that relational databases have been designed long time back to serve the purpose of storing and querying WELL STRUCTURED DATA.. with proper relations defined. Hence for a large amount of well structured data the performance and the features provided will be a lot more than that provided by a nosql database. Since they are more matured.. its their ball game..!!
On the other hand nosql databases have been designed to handle very large amount of unstructured data and has out of the box support for distributed environment scaling. So its a completely different ball game now..
They basically treat data differently and hence have different strategies / execution plans to fetch a given data..
MongoDB was designed from the ground up to be scalable over multiple servers. When a MongoDB database gets too slow or too big for a single server, you can add additional servers by making the larger collections "sharded". That means that the collection is divided between different servers and each one is responsible for managing a different part of the collection.
The reason why MongoDB doesn't do JOINs is that it is impossible to have JOINs perform well when one or both collections are sharded over multiple nodes. A JOIN requires to compare each entry of table/collection A with each one of table/collection B. There are shortcuts for this when all the data is on one server. But when the data is distributed over multiple servers, large amounts of data need to be compared and synchronized between them. This would require a lot of network traffic and make the operation very slow and expensive.
Is it correct that you have only two tables, country and user. If so, it seems to me the only data duplicated is a foreign key, which is not a big deal. If there is more duplicated, then I question the DB design itself.
In concept, you can do it in NOSQL but why? Just because NOSQL is new? OK, then do it to learn but remember, "if it ain't broke, don't fix it." Apparently the application is already running on relational. If the data is stored in separate documents in MongoDB and you want to interrelate them, you will need to use a link, which will be more work than a join and be slower. You will have to store a link, which would be no better than storing the foreign key. Alternatively, you can embed one document in another in MongoDB, which might even increase duplication.
If it is currently running on MySQL then it is not running on distributed servers, so Mongo's use of distributed servers is irrelevant. You would have to add servers to take advantage of that. If the tables are properly indexed in relational, it will not have to search through large amounts of data.
However, this is not a complex application and you can use either. If the data is stored on an MPP environment with relational, it will run very well and will not need to search to large amounts of data at all. There are two requirements, however, in choosing a partitioning key in MPP: 1. pick one that will achieve an even distribution of data; and 2. pick a key that can allow collocation of data. I recommend you use the same key as the partitioning key (shard key) in both files.
As much as I love MongoDB, I don't see the value in moving your app.

Is it better to use multiple databases when you are managing independent sets of things in MongoDB?

If, as an example, you have a blogging website done with MongoDB to store data
Is it better to have a database per blogger? given that their blogs and comments are completely independent from other bloggers. Or just lump everything together? or it doesn't make too much difference?
I'm imagining the same web app (not independent webs/urls per blogger) is used by all bloggers. So when someone logs in / accesses the blog the code would find the right database to use and haul data out it.
Does this have any downsides? is this normal for handling these kinds of things?
I am making plenty of assumptions about your needs. But, generally, there are 3 paths to multi-tenant apps in MongoDB:
Single collection per customer; never, ever do this.
Single database per customer. Good. You will trade off free space if your product is on the freemium model. Either way, you will want to run with "smallfiles" option. As stated, you will build the routing system for your environment. Thus, you will want to connect to the proper database for the proper customer.
customer_id key per document + path slug. Good. The trade off here is recovery of free space. Traditionally, MongoDB does not recover space used by deleted documents. Thus customers creating and deleting blog posts would create unused space. By using 'usePowerOf2Sizes' collections, you will recover disk space of deleted documents. However, 'usePowerOf2Sizes' creates bloated padding space.
To get over the disk space padding, take a look at the compression used here: http://blog.appsignal.com/blog/2013/07/30/taming-mongodb-disk-usage.html
Recap, I would recommend using customer_id plus the compression. It gives you the best of both worlds.
As stated in the comments under the original question, there's really no performance benefit to splitting up your MongoDB store into separate databases per blogger, due to the overhead of having each database and minimum storage.
On the flipside: You are going to make some cross-user analysis more difficult for yourself. As a very simple example, based on your blogging example: Imagine you want to look at average post count per user. This is pretty simple if your users (and posts) are in the same database (typically in the same collections), and you can likely use the aggregation framework for this task. This task will not be so straightforward with an unbounded number of databases, where you'll need to first enumerate all databases, then perform your aggregations/averaging once per database. This could end up being a slower operation than within a single-database architecture.
Having said all that: You still might have some reason to split data across databases. Maybe you have to separate data due to legal reasons, or to ensure customers that their sensitive data won't be commingled with other companies' data. Maybe your customer needs full read/write access to their database, and so you use per-database configuration as a security boundary. I'm sure there are other reasons as well...
It is perfectly normal to allocate 100's of databases if that is all you will see.
Database separation can have many benefits. They can be sharded independantly, since sharding occurs on database level. Databases also have the upside of being completely isolated instances (including locks) of the data within them (good example: space allocation occurs on database level).
This means they can be moved around the network as users data is accessed more and since a single users data might not be that big it would be easier than moving all of your users data to a more powerful node.
However, you must consider the problematic sides in the application of managing the connections to each database. There will be over head on it and you will need to have far more complex coding than what is considered standard.
Considering space, you will not see a drastic usage of space. The most problematic part of using separate databases is the journal allocation. Every collection you use in separate databases will also, of course, pre-allocate itself but this is actually considered one of the upsides to using database separation (movement of databases between nodes, isolation).
So the space problem is really only a problem if your scenario makes it one.
is this normal for handling these kinds of things?
For a normal blogger site, no, and I do not know enough about the complexities of your scenario to say any different. Normal operation would be to lump everything together, since you could see into the region of 1,000's maybe 1,000,000's of users and database separation just won't scale over that very well.

cqrs query performance

I'd like to know when you should consider using multiple table in your query store.
For example, consider the problem where a product has it's description changed. This change could potentially have a massive impact on the synchronisation of the read only query store if you had many aggregates that included the product description.
At which point should you consider a slight normalization of the data to avoid lengthy synchronisation issues? Is this a no-no or an acceptable compromise?
Thanks,
CQRS is not about using table-per-view, rather table-per-view is an aspect of a system that CQRS makes easier.
It's up to you and depends on your specific context and needs. I would look at it this way, what is the cost of the eventual consistency of that query vs. the need for high query performance. You may want to consider the following two characteristics of your system:
1) The avg. consistency of that command, i.e., how long it takes to update all of the read models affected by the command (also consider whether an optimized stored-proc for the change would outperform say using an ORM or other abstraction to update your database in this way).
My guess is unless you are talking millions, upon millions of records the consistency here is sufficient to meet your requirements and user expectations for consistency, maybe a few seconds.
2) The importance of query performance. How many queries are you getting per second? Can you handle doing a SQL join every time?
In most practical scenarios the optimization of either of these things is moot. You can probably do the update, regardless of records, using a good SP in seconds which is more than enough consistency for a UI refresh (keep in mind the UI that issued the command can be consistent as soon as they know the command succeeded).
And you usually don't need so much query scaling in a system that a single join will hurt you. What you may not want is the added internal complexity of performing these joins in your code and stored procs.
As with all things in CQRS, you don't need to use and optimize every aspect of it from day one. You can optimize these things incrementally. Use joins today, and fully denormalize tomorrow, or vice-versa.

Too much data duplication in mongodb?

I'm new to this whole NOSQL stuff and have recently been intrigued with mongoDB. I'm creating a new website from scratch and decided to go with MONGODB/NORM (for C#) as my only database. I've been reading up a lot about how to properly design your document model database and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I read, this is expected in the document model, and for performance it makes sense. I.E. you stick embedded objects into your document so it's fast to read - no joins; but of course you can't always embed, so mongodb has this concept of a DbReference which is basically analogous to a foreign key in relational DBs.
So here's an example: I have Users and Events; both get their own document, Users attend events, Events have users attendees. I decided to embed a list of Events with limited data into the User objects. I embedded a list of Users also into the Event objects as their "attendees". The problem here is now I have to keep the Users in sync with the list of Users that is also embedded in the Event object. As I read it, this seems to be the preferred approach, and the NOSQL way to do things. Retrieval is fast, but the fall-back is when I update the main User document, I need to also go into the Event objects, possibly find all references to that user and update that as well.
So the question I have is, is this a pretty common problem people need to deal with? How much does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?
Well that is the trade off with document stores. You can store in a normalized fashion like any standard RDMS, and you should strive for normalization as much as possible. It's only where its a performance hit that you should break normalization and flatten your data structures. The trade off is read efficiency vs update cost.
Mongo has really efficient indexes which can make normalizing easier like a traditional RDMS (most document stores do not give you this for free which is why Mongo is more of a hybrid instead of a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a surrogate table in a tabular data store. Index the event and user fields and it should be pretty quick and will help you normalize your data better.
I like to plot the efficiency of flatting a structure vs keeping it normalized when it comes to the time it takes me to update a records data vs reading out what I need in a query. You can do it in terms of big O notation but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much works is required.
Basically what I do is first try to predict the probability of how many updates a record will have vs. how often it's read. Then I try to predict what the cost of an update is vs. a read when it's both normalized or flattened (or maybe partially combination of the two I can conceive... lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I plotted all the variables, if the savings of keeping it flat saves me a bunch, then I will keep it flat.
A few tips:
If you require fast lookups to be quick and atomic (perfectly up to date) you may want a favor a solution where you favor flattening over normalization and taking the hit on the update.
If you require update to be quick, and access immediately then favor normalization.
If you require fast lookups but don't require perfectly up to date data, consider building out your normalized data in batch jobs (using map/reduce possibly).
If your queries need to be fast, and updates are rare, and do not necessarily require your update to be accessible immediately or require transaction level locking that it went through 100% of the time (to guarantee your update was written to disk), you can consider writing your updates to a queue processing them in the background. (In this model, you will probably have to deal with conflict resolution and reconciliation later).
Profile different models. Build out a data query abstraction layer (like an ORM in a way) in your code so you can refactor your data store structure later.
There are lot of other ideas that you can employ. There a lot of great blogs on line that go into it like highscalabilty.org and make sure you understand CAP theorem.
Also consider a caching layer, like Redis or memcache. I will put one of those products in front my data layer. When I query mongo (which is storing everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I will invalidate any data in the cache that references what I'm updating. (Although you have to take the time it takes to invalidate data and tracking data in the cache that is getting updated into consideration of your scaling factors). Someone once said "The two hardest things in Computer Science are naming things and cache invalidation."
Try adding an IList of type UserEvent property to your User object. You didn't specify much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics
for examples.