How to deal with relationships while using MongoDB

I know: "think in a denormalized way", the "NoSQL way".
But tell me about this simple use case:
db.users
db.comments
A user posts a comment, and I want to fetch some user data while fetching the comment.
Say I want to show dynamic data, like "userlevel", and static data, like "username".
With the static data I will never have problems, but what about the dynamic data?
userlevel lives in the users collection. I need it duplicated into comments to achieve read performance, but I also need userlevel to stay updated.
Is this achievable in some way?

EDIT:
Just found an answer from Brendan McAdams, a guy from 10gen who is obviously far more authoritative than me, and he recommends embedding documents.
older text:
The first one is to manually include in each comment the ObjectId of the user it belongs to.
    comment: {
        text: "...",
        date: "...",
        user: ObjectId("4b866f08234ae01d21d89604"),
        votes: 7
    }
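If you go this route, resolving the reference is just a second query that you issue yourself. Here is a minimal sketch in Go with the official mongo-driver, using the field names from the example above; the database name, connection string, and lookup filter are assumptions for illustration:

    package main

    import (
        "context"
        "fmt"
        "log"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/bson/primitive"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    func main() {
        ctx := context.Background()
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
        if err != nil {
            log.Fatal(err)
        }
        defer client.Disconnect(ctx)

        db := client.Database("blog") // assumed database name
        comments := db.Collection("comments")
        users := db.Collection("users")

        // First query: load the comment itself.
        var comment bson.M
        if err := comments.FindOne(ctx, bson.M{"votes": 7}).Decode(&comment); err != nil {
            log.Fatal(err)
        }

        // Second query: follow the stored ObjectId to the user document,
        // which always carries the current "userlevel".
        userID := comment["user"].(primitive.ObjectID)
        var user bson.M
        if err := users.FindOne(ctx, bson.M{"_id": userID}).Decode(&user); err != nil {
            log.Fatal(err)
        }
        fmt.Println(user["username"], user["userlevel"])
    }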
The second, more clever way is to use DBRefs.
But don't we add extra I/O to our disk, losing performance? (I'm not sure how this works internally.) Therefore we need to avoid linking if possible, right?
Yes - there would be one more query, but the driver will do it for you; you can think of it as a kind of syntactic sugar. Does it affect performance? Actually, that depends too :) One of the reasons Mongo is so freaking fast is that it uses memory-mapped files
and tries its best to keep the whole working set (plus indexes) directly in RAM. Every 60 seconds (by default) it syncs the RAM snapshot with the disk-based file.
When I say working set, I mean the things you are working with: you can have three collections (foo, bar, baz), but if you are currently working only with foo and bar, they ought to be loaded into RAM, while baz stays on disk, abandoned. Moreover, memory-mapped files allow us to load only part of a collection. So if you're building something like Engadget or TechCrunch, there is a high probability that the working set would be the comments from the last few days, and old pages would be revisited far less frequently (their comments would be swapped into memory on demand), so it doesn't affect performance significantly.
To recap: as long as you keep the working set in memory (you can think of it as read/write caching), fetching those things is super fast, and one more query won't be a problem. If you're working with slices of data that don't fit into memory, there will be speed degradation, but I don't know your circumstances; it could be acceptable. So in both cases I tend to choose to use linking.

Related

Should all data from the database be mapped to the model?

I am doing homework on a REST API using Go and MongoDB. But I'm still wondering:
Should I create a dictionary to store data at the model level? It would help me retrieve data much faster without accessing MongoDB. But the big problem here is to synchronize the data in MongoDB and in the dictionary that I created.
In the file models/account.go I have a struct Account, and in MongoDB I also have a collection Account that saves all the account information of the website. Should I create an AccountList to store all of the data from the database to increase performance?
The source is below:
    // AccountList is an in-memory cache of accounts, keyed by account ID.
    var AccountList map[int]*Account

    type Account struct {
        ID       int
        UserName string
        Password string
        Email    string
        Role     string
    }
As with many things in software, "It Depends".
There's not enough information about the systems involved, how often the data is being queried, mutated, and so on to give a concrete answer. But because this is for homework, we can give scenarios.
The root of your question is this: should you cache results from the database?
Is it really needed?
Academically, it's OK to over-optimize. You get to play with technologies and understand how they work. In the real world, we should understand where the need for something is before implementing it. The more complex a solution is, the more important making a correct trade-off becomes.
Caching is best when you're going to use the results more often than they're going to change, and fetching from storage is expensive.
"Expensive" can vary. One operation measured in seconds can be expensive. But so can tens, hundreds, or thousands of operations close together measured in 100ms.
How should you do it?
You called out a couple drawbacks. Most importantly:
But the big problem here is to synchronize the data in MongoDB and in the dictionary that I created.
Synchronization is the most important thing for any distributed system.
It doesn't matter how you cache values if you have one server instance. But once you start adding instances, things get complex.
A common pattern for caching is to use a distributed key-value store. They allow you to store results which can be shared across applications — and invalidate them.
Application checks to see if the key exists in the store.
If so, use it.
If not, fetch from origin and update the cache for next time.
Separately, invalidate the key any time data needs updating.
There are a bunch of products to use. Redis is popular, and memcached works. But since you're using Go, check out groupcache: https://github.com/mailgun/groupcache. It was written by Google to simplify dl.google.com, and extended by Mailgun to support TTLs.
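For illustration, here is a minimal cache-aside sketch in Go. It assumes the go-redis and official mongo-driver packages, reuses the Account struct from the question, and makes up the key scheme and the ten-minute TTL:

    package models

    import (
        "context"
        "encoding/json"
        "fmt"
        "time"

        "github.com/go-redis/redis/v8"
        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
    )

    // GetAccount is a cache-aside lookup: try Redis first, fall back to
    // MongoDB on a miss, and repopulate the cache for next time.
    func GetAccount(ctx context.Context, rdb *redis.Client,
        accounts *mongo.Collection, id int) (*Account, error) {

        key := fmt.Sprintf("account:%d", id) // hypothetical key scheme

        // 1. Check whether the key exists in the store.
        if data, err := rdb.Get(ctx, key).Bytes(); err == nil {
            var acc Account
            if json.Unmarshal(data, &acc) == nil {
                return &acc, nil // cache hit
            }
        }

        // 2. Miss: fetch from origin (MongoDB).
        var acc Account
        if err := accounts.FindOne(ctx, bson.M{"id": id}).Decode(&acc); err != nil {
            return nil, err
        }

        // 3. Update the cache, with a TTL as a safety net against staleness.
        if data, err := json.Marshal(&acc); err == nil {
            rdb.Set(ctx, key, data, 10*time.Minute)
        }
        return &acc, nil
    }

Invalidation is then a rdb.Del(ctx, key) in whatever code path updates the account.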

When's the time to create dedicated collections in MongoDB to avoid difficult queries?

I am asking a question that I assume does not have a simple black and white answer, but the principle behind it is clear.
Sample situation:
Let's say I have a collection of 1 million books, and I consistently want to pull the top 100 rated.
Let's assume that I need to perform an aggregation every time I run this query, which makes it a little expensive.
It seems reasonable that, instead of running the query for every request (100-1000 a second), I would create a dedicated collection that stores only the top 100 books and gets updated every minute or so. Instead of running a difficult query 100 times a second, I run it once a minute, and pull from a small collection that holds just those 100 books and requires no real query (just get everything).
That is the principle I am questioning.
Should I create a dedicated collection for EVERY query that is often used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
Are there any guidelines for best practice in those types of situations?
Is there a point where, if a query runs so often and the data doesn't change very often, I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one has to differentiate between "real-time" data and data which does not require immediate, real-time presentation. The rules for "real-time" systems are obviously much different.
Now to your example, starting from the end: the cache of query results. The answer is not only for MongoDB. Data architects often use Redis, memcached, or other cache systems to hold all types of information. This, though, is obviously a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of the available memory, and you do not want your cache to be useless by giving it too little.
In the book case of the top 100, since it is certainly not a real-time endeavor, it would make sense to cache the query results and feed that cache out to requests. You could update the cache based on a cron job, or based on an update flag (which you create to inform your program that the top 100 have changed), and then the system would run the $aggregate in the background.
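As a sketch of that background job in Go: a single aggregation can rewrite a small dedicated collection in one pass, since $out atomically replaces the target collection on completion. The collection and field names here ("library", "books", "rating", "top_books") are assumptions:

    package main

    import (
        "context"
        "log"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    func main() {
        ctx := context.Background()
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
        if err != nil {
            log.Fatal(err)
        }
        defer client.Disconnect(ctx)

        // Recompute the top 100 rated books into a small dedicated
        // collection; $out replaces "top_books" when the pipeline finishes.
        pipeline := mongo.Pipeline{
            {{Key: "$sort", Value: bson.D{{Key: "rating", Value: -1}}}},
            {{Key: "$limit", Value: 100}},
            {{Key: "$out", Value: "top_books"}},
        }
        books := client.Database("library").Collection("books") // assumed names
        cursor, err := books.Aggregate(ctx, pipeline)
        if err != nil {
            log.Fatal(err)
        }
        cursor.Close(ctx)
    }

Run once a minute from a cron job, this turns the hot path into a plain find() against a 100-document collection.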
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data which has to be searched to $aggregate your response. And again, it also depends upon your memory limitations, and, let me add, the whole server setup in terms of speed, cores, and memory. IMHO, a cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I don't think anyone can really give a black and white answer to that question for your system. Is a complicated query just an $aggregate? Or is it an $unwind followed by a whole slew of $group etc. stages? This is really up to the dataset and how much information must actually be read, sifted, and manipulated. It will affect your IO and, yes, again, the memory.
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See the answers above; this is directly connected to your other questions.
Finally:
Is there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the IO, and study the actual reads and writes on the collections.
Hope this helps.
Use a cache to store objects. For example, in Redis, use Redis Lists:
Redis Lists are simply lists of strings, sorted by insertion order
Then set the expiry to either a timeout or a specific time.
Now whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Also, since the cache resides in memory, your fetches will be extremely fast compared to dedicated collections in MongoDB.
In addition to that, you don't have to have a dedicated machine; you can just deploy it within your application machine.
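A minimal sketch of that flow with go-redis; the key name, the TTL, and the placeholder Mongo query are all made up for illustration:

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/go-redis/redis/v8"
    )

    func main() {
        ctx := context.Background()
        rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

        const key = "top:books" // hypothetical cache key

        // On a miss (empty or missing list), run the MongoDB query,
        // re-populate the list, and give it an expiry.
        titles, err := rdb.LRange(ctx, key, 0, -1).Result()
        if err != nil || len(titles) == 0 {
            titles = queryTopBooksFromMongo() // placeholder for the real query
            for _, t := range titles {
                rdb.RPush(ctx, key, t)
            }
            rdb.Expire(ctx, key, 10*time.Minute)
        }
        fmt.Println(titles)
    }

    // queryTopBooksFromMongo stands in for the expensive aggregation.
    func queryTopBooksFromMongo() []string {
        return []string{"Book A", "Book B"}
    }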

How to handle a large MongoDB collection

We have a collection that is potentially going to be very large. This collection is used to store bill-related data, so it is often used for reporting/analytics purposes.
Please let me know the best approach to handle this large collection.
1) Can I split and archive the old data (say, a 12-month period)? But the old data is still required for analytic reports; I want to query it to show the sales comparison for the past 2 years.
2) Can I have a new collection for the old data (12 months)? Then for every 12 months I would have to create a new collection, and for report generation I would have to access all of these documents to query. Will this cause a performance problem?
3) Can I go for sharding?
There are many variables to account for, the clearest being what hardware you use, how the data is structured, and how it is queried. A distributed network ought to be able to chew through your data faster than a single machine, but before diving into that solution I recommend generating an absurd amount of mock data comparable to what you are expecting, and then testing various approaches. Seriously. Create a bunch of data, and try to break things. It's fun! Soon enough you'll know more about what your problem requires than any website could tell you.
As for direct responses:
Perhaps, before archiving the data, appropriate stats summaries can be generated (or updated). Those summaries/simplifications can be used for sale comparisons without reloading all of the archived data they represent.
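Such a summary job could be a single aggregation run before archiving. A sketch in Go with the official driver; the collection and field names ("billing", "bills", "date", "amount", "monthly_sales") are assumptions, and $merge needs MongoDB 4.2+:

    package main

    import (
        "context"
        "log"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    func main() {
        ctx := context.Background()
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
        if err != nil {
            log.Fatal(err)
        }
        defer client.Disconnect(ctx)
        bills := client.Database("billing").Collection("bills") // assumed names

        // Group bills into per-month totals and upsert them into a
        // "monthly_sales" summary collection for later sale comparisons.
        pipeline := mongo.Pipeline{
            {{Key: "$group", Value: bson.D{
                {Key: "_id", Value: bson.D{
                    {Key: "year", Value: bson.D{{Key: "$year", Value: "$date"}}},
                    {Key: "month", Value: bson.D{{Key: "$month", Value: "$date"}}},
                }},
                {Key: "totalSales", Value: bson.D{{Key: "$sum", Value: "$amount"}}},
            }}},
            {{Key: "$merge", Value: "monthly_sales"}},
        }
        cursor, err := bills.Aggregate(ctx, pipeline)
        if err != nil {
            log.Fatal(err)
        }
        cursor.Close(ctx)
    }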
This strikes me as sensible. By splitting up the sales data, you have more control over how much data needs to be accessed. After all, a user won't always wish to see 3 years of data, they may only wish to see last week's.
Move to sharding when you actually need it. As is stated on the MongoDB site:
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
You'll know it's time when your memory-map approaches the server's RAM limit. MongoDB supports reading and writing to databases too large to keep in memory, but I'm sure you already know that is SLOW.
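One rough way to watch for that point is to poll dbStats and compare dataSize plus indexSize against the server's RAM. A hedged sketch (the database name is an assumption, and this is only an upper bound; the true working set is usually smaller):

    package main

    import (
        "context"
        "fmt"
        "log"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    func main() {
        ctx := context.Background()
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
        if err != nil {
            log.Fatal(err)
        }
        defer client.Disconnect(ctx)

        // dataSize + indexSize approaching the server's RAM is a signal
        // to start planning for sharding.
        var stats bson.M
        cmd := bson.D{{Key: "dbStats", Value: 1}}
        if err := client.Database("billing").RunCommand(ctx, cmd).Decode(&stats); err != nil {
            log.Fatal(err)
        }
        fmt.Println("dataSize:", stats["dataSize"], "indexSize:", stats["indexSize"])
    }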

Storing two way relational data in Redis

Over the last few days I've been working on a very simple web service for myself (and a few others) that allows me to keep track of books that I've read and when I've read them. Storing users and books (titles + authors + maybe more data in the future) is relatively simple, because they can just be stored as hashes with keys user:username and book:uniqueID respectively. Storing which users read which books, and when, is proving to be a bit more of a challenge.
My original plan was to have a sorted set per user (user:username:readbooks) that used the timestamp as the score (for when the user read the book) and each book's unique ID as the value. The problem with this approach is that I can't record that a user has read a book twice (as you can't have duplicate values in a set). It also means that in order to track the readers of a book I have to add them to a second set, readersof:bookID.
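For reference, a minimal go-redis sketch of that original plan, using the key names from the question (note how a repeat reading silently overwrites the old score):

    package main

    import (
        "context"
        "time"

        "github.com/go-redis/redis/v8"
    )

    func main() {
        ctx := context.Background()
        rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

        // Record that user "alice" read book 42 now. Because members of a
        // sorted set are unique, a second ZAdd for book 42 only updates
        // the score: the earlier reading is lost.
        rdb.ZAdd(ctx, "user:alice:readbooks", &redis.Z{
            Score:  float64(time.Now().Unix()),
            Member: "42",
        })

        // The reverse index has to be maintained by hand.
        rdb.SAdd(ctx, "readersof:42", "alice")
    }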
My current approach is, rather than directly storing book IDs in the set user:username:readbooks, to instead store values of the form uniqueReadingEventId.bookId. The problem with this is that if I delete a book (rather than a unique reading event) I have to iterate through every user in the set readersof:bookID, iterate through every value in user:username:readbooks, and delete the values that match x.bookId, which seems a little inefficient. Furthermore, I may want to find users that have read two or more books in common.
My question is therefore twofold: is there a simpler way to structure my data in Redis, or would my data be better structured in a different NoSQL system? I would really like to continue working with Redis because I like its API; however, because it is a personal project it doesn't really matter what I use.
Unless you need really high throughput here for some reason, it doesn't sound like Redis is the right choice. It sounds like you want to store a lot of document level information, and neither high-throughput nor data structures are a huge concern for you. To me that screams for just using SQL. Your data is very schematic-- and from what you've said, there's really no reason SQL wouldn't best and most simply fit your use case. If you're married to the idea of using NoSQL, one of the more general use-case databases like Mongo would also serve well.
Redis as a persistent database is specialized for cases where you need high throughput, data structures are useful, and you don't mind paying the extra cost of keeping everything in memory instead of much less expensive HD space. There are lots of scenarios where Redis fits perfectly, but yours isn't one of them.

Maintaining consistency of mongodb data

What are the best practices, tradeoffs, and effectiveness of the two options below for maintaining consistency of data in MongoDB?
Manual caching with cron jobs (aka storing redundant data and using a script to periodically propagate changes)
Dynamically load data every time but have a cache layer (or utilize the built in mongodb cache)
For example, let's say there are comments and users. With option 1, each comment would contain:
    {
        user_id:
        user_displayname:
        user_gravatar:
        [comment fields]
    }
If the user decided to change his or her displayname, the user object would change, but a script would also run the required MongoDB commands to update all of the user's comments to reflect the change.
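With the Go driver, that propagation script could be a single UpdateMany. A sketch using the field names from the example document (the database name and the user id are made up):

    package main

    import (
        "context"
        "log"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    func main() {
        ctx := context.Background()
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
        if err != nil {
            log.Fatal(err)
        }
        defer client.Disconnect(ctx)
        comments := client.Database("app").Collection("comments") // assumed names

        // Propagate a displayname change to every comment the user wrote.
        res, err := comments.UpdateMany(ctx,
            bson.M{"user_id": 123}, // hypothetical user id
            bson.M{"$set": bson.M{"user_displayname": "newName"}},
        )
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("updated %d comments", res.ModifiedCount)
    }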
With option 2, each comment would contain:
    {
        user_id:
        [comment fields]
    }
If the user decided to change his or her displayname, it would only be changed in the user object itself. When a comment is accessed without hitting the cache, the user object gets associated with the comment object in the cache. That way, if this comment is accessed again while it is still in the cache, both the user and comment queries are skipped. (Am I basically describing the built-in MongoDB cache?)
Is it worth doing the data redundancy described in option 1 at all? Or is MongoDB smart enough that additional but equivalent queries are already cached? Or is it worth using something else, such as Redis, to build a cache layer myself?
Thanks!
There is no "cache" in MongoDB itself. MongoDB uses memory-mapped files, and its performance depends very much on whether it can keep the most frequently used documents, your application's "working set", mapped in main memory rather than having to page each document in from disk prior to accessing it.
You are describing a denormalized database design, where each document contains attributes that would not be there in a normalized form. This can make sense, and it is in fact a very common technique with MongoDB, if it allows you to fetch all the data you need in a single operation, rather than having to do multiple queries.
The downside, as you point out, is that it requires more expensive updates, since you need to update all the documents into which a particular attribute has been denormalized. Another downside is that the resulting larger documents may make it more difficult to keep the working set in memory.
The answer therefore depends on your data access patterns. Generally, if your application is read-heavy, and it tends to need all of these denormalized attributes together, then the denormalizing approach is a good choice. If the application is write-heavy, and especially if it makes frequent updates to those particular attributes, then denormalization is not a good choice.
If you are talking about a caching mechanism for 100s of GB of data, you are talking about a serious trade-off. With anything less than 5 GB of data, the tradeoffs do not matter. Between 5 GB and 100 GB there is a grey area.
The worst case scenario for your data is this:
200 GB of data. 4,000 reads per second. A user with 9,000 comments changes his / her name. Your application also indexes comments on this name value. Your application must then update 9,000 comments and 9,000 index keys. This will cause serious drag in your application for a while.
Then, we must also pose the question for something as simple as names on comments: "Do you have to update the names on old comments?"
When you follow a new person on Twitter, your past timeline does not inherit that person's past tweets; only your new timeline gets them. It's the same with comments: why should you update the person's name on past comments?
So, I would add a #3 to your list: "Do not update users' names"