I am doing homework on a REST API using Go and MongoDB, but I'm still wondering about one thing:
I'm considering creating a dictionary at the model level to store data. It would let me retrieve data much faster without hitting MongoDB, but the big problem is keeping the data in MongoDB and the data in my dictionary synchronized.
In the file models/account.go I have a struct Account, and in MongoDB I have a collection Account that stores all the account information for the website. Should I create an AccountList to hold all the data from the database in memory to increase performance?
The source is below:
var AccountList map[int]*Account

type Account struct {
    ID       int
    UserName string
    Password string
    Email    string
    Role     string
}
As with many things in software, "It Depends".
There's not enough information about the systems involved (how often the data is queried, how often it is mutated, and so on) to give a concrete answer. But because this is homework, we can walk through the scenarios.
The root of your question is this: should you cache results from the database?
Is it really needed?
Academically, it's OK to over-optimize. You get to play with technologies and understand how they work. In the real world, we should understand where the need for something is before implementing it. The more complex a solution is, the more important making a correct trade-off becomes.
Caching is best when you're going to use the results more often than they're going to change, and fetching from storage is expensive.
"Expensive" can vary. One operation measured in seconds can be expensive. But so can tens, hundreds, or thousands of operations close together measured in 100ms.
How should you do it?
You called out a couple drawbacks. Most importantly:
But the big problem here is to synchronize the data under MongoDB and in the dictionary that I created.
Synchronization is the most important thing for any distributed system.
It doesn't matter how you cache values if you have one server instance. But once you start adding instances, things get complex.
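For the single-instance case, the in-process cache can be as simple as the map from the question plus a mutex. A minimal sketch, reusing the Account struct from the question (the type and method names here are just illustrative):

import "sync"

// AccountCache wraps the map from the question with a read/write mutex so
// concurrent HTTP handlers can use it safely. Keeping it consistent with
// MongoDB is still entirely up to the code that writes to the database.
type AccountCache struct {
    mu       sync.RWMutex
    accounts map[int]*Account
}

func NewAccountCache() *AccountCache {
    return &AccountCache{accounts: make(map[int]*Account)}
}

// Get returns the cached account, if present.
func (c *AccountCache) Get(id int) (*Account, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    a, ok := c.accounts[id]
    return a, ok
}

// Put stores or replaces an account; call it after every successful write to
// MongoDB so the in-memory copy does not drift from the database.
func (c *AccountCache) Put(a *Account) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.accounts[a.ID] = a
}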
A common pattern for caching is to use a distributed key-value store. They allow you to store results which can be shared across applications — and invalidate them.
1. The application checks whether the key exists in the store.
2. If so, use the cached value.
3. If not, fetch from the origin and update the cache for next time.
4. Separately, invalidate the key any time the data is updated.
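Here is a minimal sketch of that flow in Go. The Store and AccountDB interfaces are assumptions standing in for whichever key-value store and data-access layer you pick, not any real product's API; the Account type is the one from the question.

import (
    "context"
    "encoding/json"
    "strconv"
    "time"
)

// Store and AccountDB are stand-ins for whichever key-value store and
// data-access layer you end up using; both interfaces are assumptions made
// for this sketch.
type Store interface {
    Get(ctx context.Context, key string) ([]byte, error)
    Set(ctx context.Context, key string, value []byte, ttl time.Duration) error
    Delete(ctx context.Context, key string) error
}

type AccountDB interface {
    FindAccountByID(ctx context.Context, id int) (*Account, error)
    SaveAccount(ctx context.Context, a *Account) error
}

// GetAccount is the cache-aside read path: check the store, fall back to
// MongoDB on a miss, then populate the store for the next reader.
func GetAccount(ctx context.Context, store Store, db AccountDB, id int) (*Account, error) {
    key := "account:" + strconv.Itoa(id)

    if b, err := store.Get(ctx, key); err == nil {
        var a Account
        if err := json.Unmarshal(b, &a); err == nil {
            return &a, nil // cache hit
        }
    }

    a, err := db.FindAccountByID(ctx, id) // miss (or unreadable entry): go to the origin
    if err != nil {
        return nil, err
    }
    if b, err := json.Marshal(a); err == nil {
        _ = store.Set(ctx, key, b, 5*time.Minute) // best effort; a TTL bounds staleness
    }
    return a, nil
}

// UpdateAccount is the invalidation path: write to the database first, then
// drop the cached copy so other instances stop serving stale data.
func UpdateAccount(ctx context.Context, store Store, db AccountDB, a *Account) error {
    if err := db.SaveAccount(ctx, a); err != nil {
        return err
    }
    return store.Delete(ctx, "account:"+strconv.Itoa(a.ID))
}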
There are a bunch of products to choose from. Redis is popular, and memcached works. But since you're using Go, check out groupcache: https://github.com/mailgun/groupcache. It was written by Google to simplify dl.google.com, and extended by Mailgun to support TTLs.
Related
We have a collection that is potentially going to be very large. It stores bill-related data and is mostly used for reporting/analytics purposes.
Please let me know the best approach to handling this large collection:
1) Can I split off and archive the old data (say, anything older than 12 months)? The catch is that the old data is still required for analytics; I want to query it to show sales comparisons for the past 2 years.
2) Can I move the old data (12 months' worth) into a new collection? Then every 12 months I'd have to create a new collection, and for report generation I'd have to query across all of these collections. Will that cause performance problems?
3) Should I go for sharding?
There are many variables to account for, the clearest being what hardware you use, how the data is structured, and how it is queried. A distributed network ought to be able to chew through your data faster than a single machine, but before diving into that solution I recommend generating an absurd amount of mock data comparable to what you are expecting, and then testing various approaches. Seriously. Create a bunch of data, and try to break things. It's fun! Soon enough you'll know more about what your problem requires than any website could tell you.
As for direct responses:
1) Perhaps, before archiving the data, appropriate stats summaries can be generated (or updated). Those summaries/simplifications can be used for sales comparisons without reloading all of the archived data they represent.
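As an illustration, such a summary could be pre-computed with an aggregation before archiving. The sketch below uses the official MongoDB Go driver; the collection and field names (bills, billSummaries, createdAt, amount) are assumptions, and $merge needs MongoDB 4.2 or newer.

import (
    "context"
    "time"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
)

// summarizeOldBills pre-aggregates per-month totals from the bills collection
// into a small summary collection before the raw documents are archived.
func summarizeOldBills(ctx context.Context, db *mongo.Database, before time.Time) error {
    pipeline := mongo.Pipeline{
        {{Key: "$match", Value: bson.D{{Key: "createdAt", Value: bson.D{{Key: "$lt", Value: before}}}}}},
        {{Key: "$group", Value: bson.D{
            {Key: "_id", Value: bson.D{
                {Key: "year", Value: bson.D{{Key: "$year", Value: "$createdAt"}}},
                {Key: "month", Value: bson.D{{Key: "$month", Value: "$createdAt"}}},
            }},
            {Key: "total", Value: bson.D{{Key: "$sum", Value: "$amount"}}},
            {Key: "count", Value: bson.D{{Key: "$sum", Value: 1}}},
        }}},
        {{Key: "$merge", Value: bson.D{{Key: "into", Value: "billSummaries"}}}},
    }

    // $merge writes the results into billSummaries; the returned cursor is empty.
    cur, err := db.Collection("bills").Aggregate(ctx, pipeline)
    if err != nil {
        return err
    }
    return cur.Close(ctx)
}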
2) This strikes me as sensible. By splitting up the sales data, you have more control over how much data needs to be accessed. After all, a user won't always wish to see 3 years of data; they may only wish to see last week's.
3) Move to sharding when you actually need it. As is stated on the MongoDB site:
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
You'll know it's time when your memory-map approaches the server's RAM limit. MongoDB supports reading and writing to databases too large to keep in memory, but I'm sure you already know that is SLOW.
Over the last few days I've been working on a very simple web service for myself (and a few others) that lets me keep track of books that I've read and when I read them. Storing users and books (titles + authors + maybe more data in the future) is relatively simple, because they can just be stored as hashes with keys user:username and book:uniqueID respectively. Storing which users read which books, and when, is proving to be a bit more of a challenge.
My original plan was to have a sorted set for a user (user:username:readbooks) that used the timestamp as a score (for when the user read the book) and each book's unique ID as the value. The problem with this approach is that I can't store that a user has read a book twice (as you can't have duplicate values in a set). It also means that in order to track readers of a book I have to add them to a second set readersof:bookID.
My current approach is, rather than directly storing book IDs in the set user:username:readbooks, to instead store a value in the form uniqueReadingEventId.bookId. However, the problem with this is that if I delete a book (rather than the unique reading event) I have to iterate through every user in the set readersof:bookID, then iterate through every value in user:username:readbooks and delete the values that match x.bookId, which seems a little inefficient. Furthermore, I may want to find users that have read two or more books in common.
My question is therefore twofold: is there a simpler way to structure my data in Redis, or is my data better suited to a different NoSQL system? I would really like to continue working with Redis because I like its API, but because this is a personal project it doesn't really matter what I use.
Unless you need really high throughput here for some reason, it doesn't sound like Redis is the right choice. It sounds like you want to store a lot of document-level information, and neither high throughput nor specialized data structures are a huge concern for you. To me that screams for just using SQL. Your data is very schema-shaped, and from what you've said, there's really no reason SQL wouldn't fit your use case best and most simply. If you're married to the idea of using NoSQL, one of the more general-purpose databases like Mongo would also serve well.
Redis as a persistent database is specialized for cases where you need high throughput, data structures are useful, and you don't mind paying the extra cost of keeping everything in memory instead of much less expensive HD space. There are lots of scenarios where Redis fits perfectly, but yours isn't one of them.
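For what it's worth, if you did move to MongoDB, each reading event would most naturally become its own document rather than an encoded member of a set. A rough sketch with the official Go driver (struct, collection, and field names are assumptions):

import (
    "context"
    "time"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/bson/primitive"
    "go.mongodb.org/mongo-driver/mongo"
)

// ReadingEvent is one "user read this book at this time" record. Storing it
// as its own document means re-reads are just additional documents.
type ReadingEvent struct {
    ID     primitive.ObjectID `bson:"_id,omitempty"`
    UserID string             `bson:"userId"`
    BookID string             `bson:"bookId"`
    ReadAt time.Time          `bson:"readAt"`
}

// recordReading inserts one reading event.
func recordReading(ctx context.Context, events *mongo.Collection, userID, bookID string) error {
    _, err := events.InsertOne(ctx, ReadingEvent{UserID: userID, BookID: bookID, ReadAt: time.Now()})
    return err
}

// deleteBook removes every reading event for a book in one query, which was
// the awkward case in the Redis layout described above.
func deleteBook(ctx context.Context, events *mongo.Collection, bookID string) error {
    _, err := events.DeleteMany(ctx, bson.D{{Key: "bookId", Value: bookID}})
    return err
}

With this shape, "users who have read two or more books in common" also becomes an ordinary query or aggregation over the same collection instead of set intersections you have to maintain by hand.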
What are the best practices, trade-offs, and relative effectiveness of the two options below for maintaining consistency of data in MongoDB?
Option 1: Manual caching with cron jobs (i.e., storing redundant data and using a script to periodically propagate changes).
Option 2: Dynamically loading data every time, but with a cache layer (or relying on MongoDB's built-in caching).
For example, let's say there are comments and users. With option 1, each comment would contain:
{
    user_id:
    user_displayname:
    user_gravatar:
    [comment fields]
}
If the user decided to change his or her displayname, the user object would change, and a script would also run the required MongoDB commands to update all of the user's comments to reflect the change.
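That propagation script boils down to one multi-document update. A hedged sketch with the official MongoDB Go driver, using the field names from the example document above:

import (
    "context"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
)

// propagateDisplayName copies the user's new display name onto every comment
// that embeds it (option 1). A cron job or background worker would call this
// after the user document itself has been updated.
func propagateDisplayName(ctx context.Context, comments *mongo.Collection, userID, newName string) (int64, error) {
    res, err := comments.UpdateMany(ctx,
        bson.D{{Key: "user_id", Value: userID}},
        bson.D{{Key: "$set", Value: bson.D{{Key: "user_displayname", Value: newName}}}},
    )
    if err != nil {
        return 0, err
    }
    return res.ModifiedCount, nil
}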
With option 2, each comment would contain:
{
    user_id:
    [comment fields]
}
If the user decided to change his or her displayname, it would only be changed in the user object itself. When a comment is accessed without hitting the cache, the user object would be associated with the comment object in the cache. That way, if this comment is accessed again while it is still in the cache, both the user and comment queries are skipped. (Am I basically describing MongoDB's built-in caching?)
Is it worth doing the data redundancy described in option 1 at all? Or is MongoDB smart enough that additional but equivalent queries are effectively already cached? Or is it worth using something else, such as Redis, to build a cache layer myself?
Thanks!
There is no "cache" in MongoDB itself. MongoDB uses memory-mapped files, and its performance depends very much on whether it can keep the most frequently used documents, your application's "working set", mapped in main memory rather than having to page each document in from disk prior to accessing it.
You are describing a denormalized database design, where each document contains attributes that would not be there in a normalized form. This can make sense, and it is in fact a very common technique with MongoDB, if it allows you to fetch all the data you need in a single operation, rather than having to do multiple queries.
The downside, as you point out, is that it requires more expensive updates, since you need to update all the documents into which a particular attribute has been denormalized. The downside is also that if your documents are larger, it may be more difficult to keep the working set in memory.
The answer therefore depends on your data access patterns. Generally, if your application is read-heavy, and it tends to need all of these denormalized attributes together, then the denormalizing approach is a good choice. If the application is write-heavy, and especially if it makes frequent updates to those particular attributes, then denormalization is not a good choice.
If you are talking about a caching mechanism for hundreds of GB of data, you are talking about a serious trade-off. For anything less than 5 GB of data, the trade-offs do not matter. Between 5 GB and 100 GB there is a grey area.
The worst case scenario for your data is this:
200 GB of data. 4,000 reads per second. A user with 9,000 comments changes his / her name. Your application also indexes comments on this name value. Your application must then update 9,000 comments and 9,000 index keys. This will cause serious drag in your application for a while.
Then, we must also pose the question for something as simple as names on comments: "Do you have to update the names on old comments?"
When you follow a new person on Twitter, your past timeline does not inherit the person's past tweets; only your new timeline does. It's the same with comments: why should you update the person's name on past comments?
So, I would add a #3 to your list: "Do not update users' names"
I am a newbie to caching and have no idea how data is stored in a cache. I have tried to read a few examples online, but everybody provides code snippets for storing and getting data rather than explaining how data is cached using memcache. I have read that it stores data in key-value pairs, but I am unable to understand where those key-value pairs are stored.
Also, could someone explain why data going into a cache is hashed or encrypted? I am a little confused about the difference between serialising data and hashing data.
A couple of quotes from the Memcached page on Wikipedia:
Memcached's APIs provide a giant hash table distributed across multiple machines. When the table is full, subsequent inserts cause older data to be purged in least recently used (LRU) order.
And
The servers keep the values in RAM; if a server runs out of RAM, it discards the oldest values. Therefore, clients must treat Memcached as a transitory cache; they cannot assume that data stored in Memcached is still there when they need it.
The rest of the page on Wikipedia is pretty informative, and it might help you get started.
The key-value pairs are stored in memory on the server. That way, if you use the same keys often and you know the values won't change for a while, you can keep them in memory for faster access.
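To make that concrete, here is a small sketch using a commonly used Go memcached client (github.com/bradfitz/gomemcache); the server address, key, and value are placeholders:

package main

import (
    "fmt"
    "log"

    "github.com/bradfitz/gomemcache/memcache"
)

func main() {
    // Connect to one (or more) memcached servers; the address is a placeholder.
    mc := memcache.New("127.0.0.1:11211")

    // Store a value under a key. Expiration is in seconds; 0 means "no explicit
    // expiry", but the server may still evict it under memory pressure (LRU).
    err := mc.Set(&memcache.Item{Key: "greeting", Value: []byte("hello"), Expiration: 60})
    if err != nil {
        log.Fatal(err)
    }

    // Read it back. A miss is a normal outcome, not a failure of your logic:
    // the item may have expired or been evicted, so be ready to rebuild it.
    it, err := mc.Get("greeting")
    if err == memcache.ErrCacheMiss {
        fmt.Println("not in cache, fetch from the real data source")
        return
    }
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(it.Value))
}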
I'm not deeply familiar with memcached, so take what I have to say with a grain of salt :-)
Memcached is a separate process (or set of processes) that maintains an in-memory key-value store so values can be easily accessed later. In a sense, it provides another global scope that can be shared by different parts of your program, enabling a value to be calculated once and used in many distinct and separate areas of your program. In another sense, it provides a fast, forgetful database that can be used to store transient data. The data is not stored permanently, but in general it will be stored beyond the life of a particular request. (It is possible for Memcached to never store your data, so every read will be a miss, but that's generally an indication that you do not have it set up correctly for your use case.)
The data going into cache does not have to be hashed or encrypted (but both things can happen to the data, depending on the caching mechanism.)
Serializing data actually has nothing to do with either concept. It is the process of changing data from one format (generally one suited to in-memory use) into another (generally one suitable for storage in a persistent medium or for sending over the wire). The round trip is also called marshalling and unmarshalling.
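A tiny Go sketch using only the standard library may help separate the ideas; no hashing or encryption is involved, the struct is simply turned into bytes and back:

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"
)

type User struct {
    ID   int
    Name string
}

func main() {
    // Serialize (marshal): in-memory struct -> bytes you could hand to a cache or a file.
    var buf bytes.Buffer
    if err := gob.NewEncoder(&buf).Encode(User{ID: 1, Name: "Ada"}); err != nil {
        panic(err)
    }

    // Deserialize (unmarshal): bytes -> an in-memory struct again.
    var u User
    if err := gob.NewDecoder(bytes.NewReader(buf.Bytes())).Decode(&u); err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", u) // {ID:1 Name:Ada}
}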
I'm new to this whole NoSQL stuff and have recently been intrigued by MongoDB. I'm creating a new website from scratch and decided to go with MongoDB/NoRM (for C#) as my only database. I've been reading up a lot on how to properly design a document-model database, and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site, and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I've read, this is expected in the document model, and for performance it makes sense: you stick embedded objects into your document so it's fast to read, with no joins. But of course you can't always embed, so MongoDB has the concept of a DbReference, which is basically analogous to a foreign key in relational DBs.
So here's an example: I have Users and Events; both get their own documents. Users attend events, and Events have user attendees. I decided to embed a list of Events with limited data into the User objects, and I embedded a list of Users into the Event objects as their "attendees". The problem is that now I have to keep the Users in sync with the list of Users that is also embedded in the Event object. As I read it, this seems to be the preferred approach and the NoSQL way of doing things. Retrieval is fast, but the drawback is that when I update the main User document, I also need to go into the Event objects, find all references to that user, and update those as well.
So the question I have is, is this a pretty common problem people need to deal with? How much does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?
Well, that is the trade-off with document stores. You can store data in a normalized fashion like any standard RDBMS, and you should strive for normalization as much as possible. It's only where it's a performance hit that you should break normalization and flatten your data structures. The trade-off is read efficiency vs. update cost.
Mongo has really efficient indexes, which can make normalizing easier, like a traditional RDBMS (most document stores do not give you this for free, which is why Mongo is more of a hybrid than a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a surrogate (join) table in a tabular data store. Index the event and user fields, and it should be pretty quick, and it will help you normalize your data better.
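To illustrate, a relation collection is just documents of the shape {userId, eventId} with an index on each field. A rough sketch (in Go, with the official MongoDB driver; all names are assumptions):

import (
    "context"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
)

// Attendance is one row of the relation collection: one user attending one event.
type Attendance struct {
    UserID  string `bson:"userId"`
    EventID string `bson:"eventId"`
}

// ensureAttendanceIndexes creates the two indexes that make lookups in either
// direction (events for a user, attendees for an event) cheap.
func ensureAttendanceIndexes(ctx context.Context, attendance *mongo.Collection) error {
    _, err := attendance.Indexes().CreateMany(ctx, []mongo.IndexModel{
        {Keys: bson.D{{Key: "userId", Value: 1}}},
        {Keys: bson.D{{Key: "eventId", Value: 1}}},
    })
    return err
}

// attendeesOf returns the user IDs attending an event by querying the
// relation collection instead of an embedded attendee list.
func attendeesOf(ctx context.Context, attendance *mongo.Collection, eventID string) ([]string, error) {
    cur, err := attendance.Find(ctx, bson.D{{Key: "eventId", Value: eventID}})
    if err != nil {
        return nil, err
    }
    defer cur.Close(ctx)

    var ids []string
    for cur.Next(ctx) {
        var a Attendance
        if err := cur.Decode(&a); err != nil {
            return nil, err
        }
        ids = append(ids, a.UserID)
    }
    return ids, cur.Err()
}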
I like to weigh the efficiency of flattening a structure vs. keeping it normalized in terms of the time it takes me to update a record's data vs. the time to read out what I need in a query. You can do it in terms of big O notation, but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much work is required.
Basically, what I do is first try to predict how many updates a record will get vs. how often it's read. Then I try to predict the cost of an update vs. the cost of a read, both when the data is normalized and when it is flattened (or maybe some partial combination of the two; there are lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I've plotted all the variables, if keeping it flat saves me a bunch, then I will keep it flat.
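A back-of-envelope version of that arithmetic can literally be a few lines; every number below is made up and only meant to show the shape of the comparison:

package main

import "fmt"

func main() {
    // Made-up workload numbers: replace them with your own measurements.
    readsPerDay := 100_000.0
    updatesPerDay := 500.0
    fanout := 50.0 // documents touched per update when the data is flattened

    readCostFlat := 1.0       // one fetch, everything embedded
    readCostNormalized := 3.0 // extra queries to assemble the same view
    writeCost := 1.0          // cost of touching one document

    flat := readsPerDay*readCostFlat + updatesPerDay*fanout*writeCost
    normalized := readsPerDay*readCostNormalized + updatesPerDay*writeCost

    fmt.Printf("flat: %.0f units/day, normalized: %.0f units/day\n", flat, normalized)
}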
A few tips:
If you require lookups to be quick and atomic (perfectly up to date), you may want to favor flattening over normalization and take the hit on the update.
If you require updates to be quick and visible immediately, then favor normalization.
If you require fast lookups but don't require perfectly up-to-date data, consider building out the flattened data from your normalized sources in batch jobs (possibly using map/reduce).
If your queries need to be fast, updates are rare, and the update does not necessarily need to be visible immediately or guaranteed written to disk 100% of the time (i.e. you don't need transaction-level locking), you can consider writing your updates to a queue and processing them in the background (a rough sketch of that follows after these tips). In this model, you will probably have to deal with conflict resolution and reconciliation later.
Profile different models. Build out a data query abstraction layer (like an ORM in a way) in your code so you can refactor your data store structure later.
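As referenced in the queue tip above, here is a minimal sketch of deferring updates to a background worker (in Go; the update type and the apply callback are assumptions for illustration):

import (
    "context"
    "log"
)

// update is a hypothetical unit of deferred work, e.g. "re-flatten the
// documents that embed user X's name".
type update struct {
    UserID  string
    NewName string
}

// startUpdateWorker drains queued updates in the background so the request
// path only has to enqueue. Conflict resolution and reconciliation are out of
// scope for this sketch, as the tip above warns.
func startUpdateWorker(ctx context.Context, queue <-chan update, apply func(context.Context, update) error) {
    go func() {
        for {
            select {
            case <-ctx.Done():
                return
            case u := <-queue:
                if err := apply(ctx, u); err != nil {
                    log.Printf("deferred update failed (will need reconciliation): %v", err)
                }
            }
        }
    }()
}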
There are a lot of other ideas that you can employ. There are a lot of great blogs online that go into this, like highscalabilty.org, and make sure you understand the CAP theorem.
Also consider a caching layer, like Redis or memcached. I would put one of those products in front of my data layer. When I query Mongo (which stores everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I invalidate any data in the cache that references what I'm updating. (Although you have to factor the time it takes to invalidate data, and to track which cached data is being updated, into your scaling considerations.) Someone once said, "The two hardest things in Computer Science are naming things and cache invalidation."
Try adding a property of type IList<UserEvent> to your User object. You didn't specify much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics for examples.