Maintaining consistency of MongoDB data

What are the best practices, tradeoffs, and effectiveness of the two options below for maintaining consistency of data in MongoDB?
1. Manual caching with cron jobs (i.e. storing redundant data and using a script to periodically propagate changes)
2. Dynamically loading data every time, but with a cache layer (or relying on MongoDB's built-in cache)
For example, let's say there are comments and users. With option 1, each comment would contain:
{
  user_id:
  user_displayname:
  user_gravatar:
  [comment fields]
}
If the user decided to change his or her displayname, the user object would change, and a script would also run the required MongoDB commands to update all of the user's comments to reflect the change.
With option 2, each comment would contain:
{
  user_id:
  [comment fields]
}
If the user decided to change his or her displayname, it would only be changed in the user object itself. When a comment is accessed without hitting the cache, the user object would be associated with the comment object in the cache. That way, if this comment is accessed again while it is still in the cache, both the user and comment queries are skipped. (Am I basically describing the built-in MongoDB cache?)
Is the data redundancy described in option 1 worth doing at all? Or is MongoDB smart enough that additional but equivalent queries are already cached? Or is it worth using something else, such as Redis, to build a cache layer myself?
Thanks!

There is no "cache" in MongoDB itself. MongoDB uses memory-mapped files, and its performance depends very much on whether it can keep the most frequently used documents, your application's "working set", mapped in main memory rather than having to page each document in from disk prior to accessing it.
You are describing a denormalized database design, where each document contains attributes that would not be there in a normalized form. This can make sense, and it is in fact a very common technique with MongoDB, if it allows you to fetch all the data you need in a single operation, rather than having to do multiple queries.
The downside, as you point out, is that it requires more expensive updates, since you need to update all the documents into which a particular attribute has been denormalized. The downside is also that if your documents are larger, it may be more difficult to keep the working set in memory.
The answer therefore depends on your data access patterns. Generally, if your application is read-heavy, and it tends to need all of these denormalized attributes together, then the denormalizing approach is a good choice. If the application is write-heavy, and especially if it makes frequent updates to those particular attributes, then denormalization is not a good choice.
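To make the update cost concrete, here is a minimal sketch of the fan-out update option 1 implies, written with the official MongoDB Go driver; the collection names (users, comments) and field names (user_id, user_displayname, displayname) are assumptions taken from the example in the question, not anything MongoDB prescribes.

package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// propagateDisplayName updates the canonical user document, then fans the
// change out to every comment that has denormalized the display name.
// With many comments per user, the second update is the expensive part.
func propagateDisplayName(ctx context.Context, db *mongo.Database, userID interface{}, newName string) error {
	if _, err := db.Collection("users").UpdateOne(ctx,
		bson.M{"_id": userID},
		bson.M{"$set": bson.M{"displayname": newName}},
	); err != nil {
		return err
	}
	_, err := db.Collection("comments").UpdateMany(ctx,
		bson.M{"user_id": userID},
		bson.M{"$set": bson.M{"user_displayname": newName}},
	)
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	if err := propagateDisplayName(ctx, client.Database("blog"), "some-user-id", "New Name"); err != nil {
		log.Fatal(err)
	}
}

Whether that UpdateMany is acceptable is exactly the read-heavy versus write-heavy trade-off described above.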

If you are talking about a caching mechanism for hundreds of GB of data, you are talking about a serious trade-off. For anything less than 5 GB of data, the trade-offs do not matter. Between 5 GB and 100 GB there is a grey area.
The worst case scenario for your data is this:
200 GB of data. 4,000 reads per second. A user with 9,000 comments changes his / her name. Your application also indexes comments on this name value. Your application must then update 9,000 comments and 9,000 index keys. This will cause serious drag in your application for a while.
Then, we must also pose the question for something as simple as names on comments: "Do you have to update the names on old comments?"
When you follow a new person on Twitter, your past timeline does not inherit the person's past tweets; only your new timeline does. The same goes for comments: why should you update the person's name on past comments?
So, I would add a #3 to your list: "Do not update users' names"


Should all data from the database be mapped to the model?

I am doing homework on a REST API using Go and MongoDB, but I'm still wondering:
Should I create a dictionary to store data at the model level? It would help me retrieve data much faster without accessing MongoDB, but the big problem is keeping the data in MongoDB and in the dictionary I created in sync.
In the file models/account.go I have a struct Account, and in MongoDB I also have a collection Account that stores all account information for the website. Should I create an AccountList to hold all the data from the database in order to increase performance?
Source as below:
var AccountList map[int]*Account

type Account struct {
	ID       int
	UserName string
	Password string
	Email    string
	Role     string
}
As with many things in software, "It Depends".
There's not enough information about the systems involved, how often the data is being queried, mutated, and so on to give a concrete answer. But because this is for homework, we can give scenarios.
The root of your question is this: should you cache results from the database?
Is it really needed?
Academically, it's OK to over-optimize. You get to play with technologies and understand how they work. In the real world, we should understand where the need for something is before implementing it. The more complex a solution is, the more important making a correct trade-off becomes.
Caching is best when you're going to use the results more often than they're going to change, and fetching from storage is expensive.
"Expensive" can vary. One operation measured in seconds can be expensive. But so can tens, hundreds, or thousands of operations close together measured in 100ms.
How should you do it?
You called out a couple drawbacks. Most importantly:
But the big problem here is to synchronize the data under MongoDB and in the dictionary that I created.
Synchronization is the most important thing for any distributed system.
It doesn't matter how you cache values if you have one server instance. But once you start adding instances, things get complex.
A common pattern for caching is to use a distributed key-value store. They allow you to store results which can be shared across applications — and invalidate them.
1. The application checks whether the key exists in the store.
2. If so, use it.
3. If not, fetch from the origin and update the cache for next time.
4. Separately, invalidate the key any time the data is updated.
There are a bunch of products to use. Redis is popular; memcached works. But since you're using Go, check out groupcache: https://github.com/mailgun/groupcache. The original was written at Google to simplify dl.google.com, and this fork by Mailgun adds TTL support.
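To make those steps concrete, here is a minimal cache-aside sketch using the go-redis client and the MongoDB Go driver rather than groupcache; the Account struct is the one from the question, while the key scheme ("account:<id>"), the "accounts" collection, and the five-minute TTL are assumptions for illustration.

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// getAccount follows steps 1-3: check the cache, use a hit, or fall back to
// MongoDB on a miss and repopulate the cache for next time.
func getAccount(ctx context.Context, rdb *redis.Client, accounts *mongo.Collection, id int) (*Account, error) {
	key := fmt.Sprintf("account:%d", id)

	cached, err := rdb.Get(ctx, key).Result()
	if err == nil {
		var acc Account
		if jsonErr := json.Unmarshal([]byte(cached), &acc); jsonErr == nil {
			return &acc, nil // cache hit
		}
	} else if err != redis.Nil {
		return nil, err // a real Redis error, not just a miss
	}

	// Cache miss: fetch from the origin (MongoDB)...
	var acc Account
	if err := accounts.FindOne(ctx, bson.M{"id": id}).Decode(&acc); err != nil {
		return nil, err
	}
	// ...and cache it with a TTL so stale entries eventually expire on their own.
	// Best effort: a failed cache write is not treated as an error.
	if data, marshalErr := json.Marshal(acc); marshalErr == nil {
		rdb.Set(ctx, key, data, 5*time.Minute)
	}
	return &acc, nil
}

// invalidateAccount is step 4: drop the key whenever the account is updated.
func invalidateAccount(ctx context.Context, rdb *redis.Client, id int) error {
	return rdb.Del(ctx, fmt.Sprintf("account:%d", id)).Err()
}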

When's the time to create dedicated collections in MongoDB to avoid difficult queries?

I am asking a question that I assume does not have a simple black-and-white answer, but the principle I am asking about is clear.
Sample situation:
Let's say I have a collection of 1 million books, and I consistently want to pull the top 100 rated.
Let's assume that I need to perform an aggregation every time I run this query, which makes it a little expensive.
It seems reasonable that, instead of running the query for every request (100-1,000 a second), I would create a dedicated collection that stores only the top 100 books and gets updated every minute or so. Instead of running a difficult query 100 times every second, I would run it only once a minute, and requests would pull from a small collection that holds just the 100 books and requires no real query (just fetch everything).
That is the principle I am questioning.
Should I create a dedicated collection for EVERY query that is often used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
Are there any guidelines for best practice in those types of situations?
Is there a point where, if a query runs so often and the data doesn't change very often, I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one does have to differentiate between "real-time" data and data that does not require immediate, real-time presentation. The rules for "real-time" systems are obviously much different.
Now to your example, starting from the end: the cache of query results. The answer is not specific to MongoDB. Data architects often use Redis or memcached (or other cache systems) to hold all types of information. This, though, is obviously a function of how much memory is available to your system and to the DB. You do not want to cripple the DB by giving your cache too much of the available memory, and you do not want your cache to be useless by giving it too little.
In the book case, with the top 100, since it is certainly not a real-time endeavor, it would make sense to cache the query results and serve that cache to requests. You could refresh the cache based on a cron job, or based on an update flag (which you create to inform your program that the top 100 have changed), and then have the system run the aggregation in the background.
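For the dedicated-collection variant of the same idea, a minimal sketch in Go could look like the following; the books collection, the rating field, the top_books output collection, and the one-minute interval are all assumptions, not anything the question specifies.

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// refreshTopBooks materializes the 100 highest-rated books into a small
// dedicated collection, so reads never have to run the expensive aggregation.
func refreshTopBooks(ctx context.Context, db *mongo.Database) error {
	pipeline := mongo.Pipeline{
		{{Key: "$sort", Value: bson.D{{Key: "rating", Value: -1}}}},
		{{Key: "$limit", Value: 100}},
		{{Key: "$out", Value: "top_books"}}, // overwrite the dedicated collection
	}
	cur, err := db.Collection("books").Aggregate(ctx, pipeline)
	if err != nil {
		return err
	}
	return cur.Close(ctx)
}

// refreshLoop reruns the aggregation once a minute; a cron job calling
// refreshTopBooks would work just as well.
func refreshLoop(ctx context.Context, db *mongo.Database) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := refreshTopBooks(ctx, db); err != nil {
				log.Println("top_books refresh failed:", err)
			}
		}
	}
}

Requests then simply read everything from top_books (or from a cache in front of it), which is the cheap part.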
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data that has to be scanned to aggregate your response. Again, it also depends on your memory limitations and, for that matter, on the whole server setup in terms of speed, cores, and memory. IMHO a cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I don't think anyone can really give a black-and-white answer to that question for your system. Is a complicated query just an $aggregate? Or is it an $unwind followed by a whole slew of $group and other stages? This really depends on the dataset and how much information must actually be read, sifted, and manipulated. It will affect your IO and, yes, again, the memory.
Is there a point where, if a query runs so often and the data doesn't change very often, I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See the answers above; this is directly connected to your other questions.
Finally:
Are there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the IO, study actual reads and writes on the collections.
Hope this helps.
Use a cache to store objects. For example, in Redis, use Redis Lists:
Redis Lists are simply lists of strings, sorted by insertion order
Then set the expiry to either a timeout or a specific time.
Now, whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Also, since the cache resides in memory, your fetches will be extremely fast compared to dedicated collections in MongoDB.
In addition, you don't have to have a dedicated machine; just deploy it within your application machine.
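As a sketch of that suggestion with go-redis, assuming the cached objects are the top-100 book titles and a one-minute expiry (both assumptions, matching the earlier example):

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// cacheTopBooks rebuilds the "top_books" Redis list and sets its expiry.
// The list keeps insertion order, so the ranking is preserved for free.
func cacheTopBooks(ctx context.Context, rdb *redis.Client, titles []string) error {
	members := make([]interface{}, len(titles))
	for i, t := range titles {
		members[i] = t
	}
	pipe := rdb.TxPipeline()
	pipe.Del(ctx, "top_books")                 // start from a clean list
	pipe.RPush(ctx, "top_books", members...)   // push in rank order
	pipe.Expire(ctx, "top_books", time.Minute) // force a rebuild after a minute
	_, err := pipe.Exec(ctx)
	return err
}

// topBooks returns the cached list; an empty result means a miss, at which
// point the caller runs the MongoDB query and calls cacheTopBooks again.
func topBooks(ctx context.Context, rdb *redis.Client) ([]string, error) {
	return rdb.LRange(ctx, "top_books", 0, -1).Result()
}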

Storing two way relational data in Redis

Over the last few days I've been working on a very simple web service for myself (and a few others) that allows me to keep track of books that I've read and when I've read them. Whilst storing users and books (titles + authors + maybe more data in the future) is relatively simple, because they can just be stored as hashes with keys user:username and book:uniqueID respectively, storing which users read which books and when is proving to be a bit more of a challenge.
My original plan was to have a sorted set for a user (user:username:readbooks) that used the timestamp as a score (for when the user read the book) and each book's unique ID as the value. The problem with this approach is that I can't store that a user has read a book twice (as you can't have duplicate values in a set). It also means that in order to track readers of a book I have to add them to a second set readersof:bookID.
My current approach is, rather than directly storing book IDs in the set user:username:readbooks, to instead store a value of the form uniqueReadingEventId.bookId. However, the problem with this is that if I delete a book (rather than a unique reading event), I have to iterate through every user in the set readersof:bookID, iterate through every value in user:username:readbooks, and delete the values that match x.bookId, which seems a little inefficient. Furthermore, I may want to find users that have read two or more books in common.
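For concreteness, here is a minimal sketch of that layout using go-redis; the key names follow the question, while the function shape, the timestamp score, and the "readers of" set are illustrative assumptions.

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// recordReading stores one reading event using the scheme described above:
// a sorted set per user, scored by timestamp, with "uniqueReadingEventId.bookId"
// members so the same book can appear more than once.
func recordReading(ctx context.Context, rdb *redis.Client, username, eventID, bookID string, readAt time.Time) error {
	member := eventID + "." + bookID
	if err := rdb.ZAdd(ctx, "user:"+username+":readbooks", redis.Z{
		Score:  float64(readAt.Unix()),
		Member: member,
	}).Err(); err != nil {
		return err
	}
	// Second set, so the readers of a book can be found without scanning all users.
	return rdb.SAdd(ctx, "readersof:"+bookID, username).Err()
}

Deleting a book then means walking readersof:bookID and scanning each user's sorted set for members ending in .bookId, which is exactly the inefficiency described above.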
My question is therefore twofold: is there a simpler way to structure my data in Redis, or is my data better suited to a different NoSQL system? I would really like to continue working with Redis because I like its API, but because this is a personal project it doesn't really matter much what I use.
Unless you need really high throughput here for some reason, it doesn't sound like Redis is the right choice. It sounds like you want to store a lot of document-level information, and neither high throughput nor data structures are a huge concern for you. To me that screams for just using SQL. Your data is very schematic, and from what you've said, there's really no reason SQL wouldn't fit your use case best and most simply. If you're married to the idea of using NoSQL, one of the more general use-case databases like Mongo would also serve you well.
Redis as a persistent database is specialized for cases where you need high throughput, data structures are useful, and you don't mind paying the extra cost of keeping everything in memory instead of much less expensive HD space. There are lots of scenarios where Redis fits perfectly, but yours isn't one of them.

Is it better to use multiple databases when you are managing independent sets of things in MongoDB?

Say, as an example, you have a blogging website that uses MongoDB to store its data.
Is it better to have a database per blogger, given that their blogs and comments are completely independent of other bloggers'? Or to just lump everything together? Or does it not make much difference?
I'm imagining the same web app (not independent sites/URLs per blogger) is used by all bloggers. So when someone logs in or accesses a blog, the code would find the right database to use and haul data out of it.
Does this have any downsides? Is this normal for handling these kinds of things?
I am making plenty of assumptions about your needs. But, generally, there are 3 paths to multi-tenant apps in MongoDB:
1. A single collection per customer; never, ever do this.
2. A single database per customer. Good. You will trade off free space if your product is on the freemium model. Either way, you will want to run with the "smallfiles" option. As stated, you will build the routing system for your environment, so you will want to connect to the proper database for the proper customer.
3. A customer_id key per document, plus a path slug. Good. The trade-off here is recovery of free space. Traditionally, MongoDB does not recover space used by deleted documents, so customers creating and deleting blog posts would create unused space. By using 'usePowerOf2Sizes' collections, you will reclaim the disk space of deleted documents. However, 'usePowerOf2Sizes' creates bloated padding space.
To get over the disk space padding, take a look at the compression used here: http://blog.appsignal.com/blog/2013/07/30/taming-mongodb-disk-usage.html
Recap, I would recommend using customer_id plus the compression. It gives you the best of both worlds.
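A minimal sketch of how the two "good" options differ in application code, using the MongoDB Go driver; the database prefix, collection name, and field names are assumptions for illustration.

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// Option 2: one database per customer. The routing layer picks the database,
// e.g. "blog_<customer>", and everything inside it belongs to that customer.
func postsPerCustomerDatabase(ctx context.Context, client *mongo.Client, customer string) (*mongo.Cursor, error) {
	return client.Database("blog_" + customer).Collection("posts").Find(ctx, bson.M{})
}

// Option 3: one shared database with a customer_id key on every document.
// Every query filters on customer_id, backed by an index on that field.
func postsPerCustomerKey(ctx context.Context, db *mongo.Database, customerID string) (*mongo.Cursor, error) {
	return db.Collection("posts").Find(ctx, bson.M{"customer_id": customerID})
}

// The shared-collection approach needs a supporting index, created once.
func ensureCustomerIndex(ctx context.Context, db *mongo.Database) error {
	_, err := db.Collection("posts").Indexes().CreateOne(ctx, mongo.IndexModel{
		Keys: bson.D{{Key: "customer_id", Value: 1}},
	})
	return err
}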
As stated in the comments under the original question, there's really no performance benefit to splitting up your MongoDB store into separate databases per blogger, due to the overhead of having each database and minimum storage.
On the flipside: You are going to make some cross-user analysis more difficult for yourself. As a very simple example, based on your blogging example: Imagine you want to look at average post count per user. This is pretty simple if your users (and posts) are in the same database (typically in the same collections), and you can likely use the aggregation framework for this task. This task will not be so straightforward with an unbounded number of databases, where you'll need to first enumerate all databases, then perform your aggregations/averaging once per database. This could end up being a slower operation than within a single-database architecture.
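To illustrate, with a single shared database that per-user average is one aggregation call (a sketch; the posts collection and user_id field are assumptions):

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// avgPostsPerUser counts posts per user, then averages those counts, entirely
// inside MongoDB's aggregation framework.
func avgPostsPerUser(ctx context.Context, posts *mongo.Collection) (float64, error) {
	pipeline := mongo.Pipeline{
		{{Key: "$group", Value: bson.D{
			{Key: "_id", Value: "$user_id"},
			{Key: "count", Value: bson.D{{Key: "$sum", Value: 1}}},
		}}},
		{{Key: "$group", Value: bson.D{
			{Key: "_id", Value: nil},
			{Key: "avgPosts", Value: bson.D{{Key: "$avg", Value: "$count"}}},
		}}},
	}
	cur, err := posts.Aggregate(ctx, pipeline)
	if err != nil {
		return 0, err
	}
	defer cur.Close(ctx)

	var out []struct {
		AvgPosts float64 `bson:"avgPosts"`
	}
	if err := cur.All(ctx, &out); err != nil {
		return 0, err
	}
	if len(out) == 0 {
		return 0, nil
	}
	return out[0].AvgPosts, nil
}

With a database per blogger, the same number requires enumerating every database and aggregating each one separately, which is the extra work described above.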
Having said all that: You still might have some reason to split data across databases. Maybe you have to separate data due to legal reasons, or to ensure customers that their sensitive data won't be commingled with other companies' data. Maybe your customer needs full read/write access to their database, and so you use per-database configuration as a security boundary. I'm sure there are other reasons as well...
It is perfectly normal to allocate hundreds of databases if that is all you will ever see.
Database separation can have many benefits. Databases can be sharded independently, since sharding occurs at the database level. They also have the upside of being completely isolated instances (including locks) of the data within them (a good example: space allocation occurs at the database level).
This means they can be moved around the network as a user's data is accessed more, and since a single user's data might not be that big, that would be easier than moving all of your users' data to a more powerful node.
However, you must consider the problematic side of managing the connections to each database in your application. There will be overhead, and you will need far more complex code than what is considered standard.
Considering space, you will not see drastic usage. The most problematic part of using separate databases is the journal allocation. Every collection you use in a separate database will also, of course, pre-allocate itself, but this is actually considered one of the upsides of database separation (movement of databases between nodes, isolation).
So the space problem is really only a problem if your scenario makes it one.
is this normal for handling these kinds of things?
For a normal blogging site, no, and I do not know enough about the complexities of your scenario to say any different. Normal practice would be to lump everything together, since you could see in the region of thousands, maybe millions, of users, and database separation just won't scale that well beyond a certain point.

How to deal with relationships while using MongoDB

I know: think the "denormalized way", or the "NoSQL way".
But tell me about this simple use case.
db.users
db.comments
Some user posts a comment, and I want to fetch some user data while fetching the comment.
Say I want to show dynamic data, like "userlevel", and static data, like "username".
With the static data I will never have problems, but what about the dynamic data?
userlevel lives in the users collection; I need the denormalized data duplicated into comments to achieve read performance, but I also need the userlevel to stay up to date.
Is this achievable in some way?
EDIT:
I just found an answer from Brendan McAdams, a guy from 10gen who is obviously far more authoritative than me, and he recommends embedding documents.
older text:
The first option is to manually include in each comment the ObjectID of the user it belongs to:
comment: {
  text: "...",
  date: "...",
  user: ObjectId("4b866f08234ae01d21d89604"),
  votes: 7
}
The second, and more clever, way is to use DBRefs.
We add extra I/O to our disk, losing performance, am I right? (I'm not sure how this works internally.) Therefore we need to avoid linking if possible, right?
Yes, there would be one more query, but the driver will do it for you; you can think of it as a kind of syntactic sugar. Does it affect performance? Actually, that depends too :) One of the reasons Mongo is so freaking fast is that it uses memory-mapped files
and tries its best to keep the whole working set (plus indexes) directly in RAM. Every 60 seconds (by default) it syncs the RAM snapshot with the disk-based files.
When I say working set, I mean the things you are working with: you can have three collections, foo, bar, and baz, but if you are currently working only with foo and bar, they ought to be loaded into RAM, while baz stays on disk, abandoned. Moreover, memory-mapped files allow loading only part of a collection. So if you're building something like Engadget or TechCrunch, there is a high probability that the working set would be the comments from the last few days, and old pages would be revisited far less frequently (their comments would be paged into memory on demand), so it doesn't affect performance significantly.
To recap: as long as you keep the working set in memory (you can think of it as read/write caching), fetching those things is super fast and one more query won't be a problem. If you work with slices of data that don't fit into memory, there will be speed degradation, but I don't know your circumstances; it could be acceptable. So in both cases I tend to choose linking.
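For completeness, a minimal sketch of that "one more query" with the MongoDB Go driver; the struct fields and collection names are assumptions based on the example above, not a prescribed schema.

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/bson/primitive"
	"go.mongodb.org/mongo-driver/mongo"
)

// Comment stores a manual reference to its author, as in the first option above.
type Comment struct {
	Text   string             `bson:"text"`
	Date   time.Time          `bson:"date"`
	UserID primitive.ObjectID `bson:"user"`
	Votes  int                `bson:"votes"`
}

type User struct {
	ID        primitive.ObjectID `bson:"_id"`
	Username  string             `bson:"username"`
	UserLevel int                `bson:"userlevel"`
}

// loadCommentWithUser fetches the comment, then follows the stored ObjectID to
// fetch the user, so dynamic fields like userlevel are always read fresh.
func loadCommentWithUser(ctx context.Context, db *mongo.Database, commentID primitive.ObjectID) (*Comment, *User, error) {
	var c Comment
	if err := db.Collection("comments").FindOne(ctx, bson.M{"_id": commentID}).Decode(&c); err != nil {
		return nil, nil, err
	}
	var u User
	if err := db.Collection("users").FindOne(ctx, bson.M{"_id": c.UserID}).Decode(&u); err != nil {
		return nil, nil, err
	}
	return &c, &u, nil
}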