MongoDB documents of calculated values for a dashboard vs re-retrieving on each web page view?

If I have a page in a web app that displays some dashboard type statistics about documents in my database (counts, docs created per hour, per day etc), is it best to pre-calculate this data and store it in a separate document (and update as needed), or assuming the collections have appropriate indexes, would it be appropriate to execute queries to retrieve these statistics on every load of the page?
The data doesn't have to be exactly up to date on every page hit/load, which is why I was thinking of maintaining the data I need to display in a separate document that can be retrieved on page hit (or even cached and only re-retrieved every 5 minutes or similar).

That's pretty broad, and I have the feeling you have already identified the key points. Generally speaking, you should consider these questions:
Do you need to allow users to apply filters? Complex filters usually make pre-aggregation impossible.
Related: Is it likely that the exact same data is ever queried again? If not, pre-aggregation might need to happen on different levels of granularity (e.g. by creating day / week / month totals and summing these, instead of individual events).
What is the relation of reads vs. writes on the data? If the number of writes is small, it might be OK to keep counters in real-time, instead of using read-caching.
What are your performance requirements for cached and uncached queries? Getting fast cached queries is trivial, but comes at the cost of stale data. Making uncached queries faster is more tricky and usually requires something like the multi-level approach discussed before - it often doesn't help if old data comes super fast, but new queries take minutes.
Caching works especially well if the data can't be changed later (or is seldom changed), and the same queries have a reasonable chance of recurring. A nice example is Facebook's profiles, where past years are apparently cached for every visitor-profile combination. First accesses are slow, however...
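To make the pre-aggregation idea concrete, here is a minimal pymongo sketch (the connection string, collection and field names are made up for illustration): each page view bumps an hourly counter document with an upserted $inc, and the dashboard reads those small bucket documents instead of scanning the raw data.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["dashboard"]  # hypothetical names

def record_event(site_id):
    """Bump a pre-aggregated hourly counter; one document per site per hour."""
    hour = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)
    db.hourly_stats.update_one(
        {"site": site_id, "hour": hour},
        {"$inc": {"count": 1}},      # atomic increment, cheap to apply
        upsert=True,                 # creates the bucket on the first hit of the hour
    )

def dashboard_counts(site_id, since):
    """The dashboard page reads a handful of tiny documents, never the raw events."""
    return list(db.hourly_stats.find({"site": site_id, "hour": {"$gte": since}})
                               .sort("hour", 1))
```

The dashboard read then touches at most a few hundred small documents per site, which is cheap enough to run on every page load or to cache for a few minutes on top.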

Related

When's the time to create dedicated collections in MongoDB to avoid difficult queries?

I am asking a question that I assume does not have a simple black-and-white answer, but the principle I'm asking about is clear.
Sample situation:
Let's say I have a collection of 1 million books, and I consistently want to pull the top 100 rated.
Let's assume that I need to perform an aggregate function every time I perform this query, which makes it a little expensive.
It seems reasonable that, instead of running the query for every request (100-1000 a second), I would create a dedicated collection that stores only the top 100 books and gets updated every minute or so. Instead of running a difficult query 100 times every second, I only run it once a minute, and requests pull from a small collection that holds just the 100 books and requires no real query (just get everything).
That is the principle I am questioning.
Should I create a dedicated collection for EVERY query that is often used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
Are there any guidelines for best practice in those types of situations?
Is there a point where, if a query runs so often and the data doesn't change very often, I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one does have to differentiate between "real-time data" and data which does not require immediate, real-time presentation. The rules for real-time systems are obviously much different.
Now to your example, starting from the end: the cache of query results. The answer is not specific to MongoDB. Data architects often use Redis, memcached (or other cache systems) to hold all types of information. This, obviously, is a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of the available memory, and you do not want your cache to be useless by giving it too little.
In your book case, with the top 100, since it is certainly not a real-time endeavour, it would make sense to cache the query result and serve that cache to requests. You could refresh the cache from a cron job, or based on an update flag (which you set to inform your program that the top 100 have changed), and then have the system run the $aggregate again in the background.
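As a rough sketch of that background refresh (collection and field names are hypothetical, and it assumes an index on the sort field, or allowDiskUse, so the $sort is feasible), a cron job could materialize the result into a tiny dedicated collection via $out, so page requests never run the aggregation themselves:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["library"]  # hypothetical names

def refresh_top_books():
    """Run the expensive aggregation once (e.g. from a cron job every minute or so)
    and materialize the result into a tiny dedicated collection."""
    db.books.aggregate([
        {"$sort": {"rating": -1}},
        {"$limit": 100},
        {"$project": {"title": 1, "rating": 1}},
        {"$out": "top_books"},   # replaces the small collection with the fresh result
    ])

def get_top_books():
    """Every page request reads the 100-document collection instead of re-aggregating."""
    return list(db.top_books.find().sort("rating", -1))
```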
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data which has to be scanned to $aggregate your response. And again, it also depends on your memory limitations and, let me add, the whole server setup in terms of speed, cores and memory. IMHO, a cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I don't think anyone can really give a black-and-white answer to that question for your system. Is a complicated query just an $aggregate? Or is it an $unwind followed by a whole slew of $group and other stages? This really depends on the dataset and how much information must actually be read, sifted and manipulated. It will affect your I/O and, yes, again, the memory.
Is there a point where, if a query runs so often and the data doesn't change very often, I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See the answers above; this is directly connected to your other questions.
Finally:
Are there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the I/O, and study the actual reads and writes on the collections.
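As a small illustration of "study the actual reads and writes" (assuming pymongo and made-up collection names), you can enable the database profiler for slow operations and look at a query's plan:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["library"]  # hypothetical names

# Log operations slower than 100 ms to the system.profile collection.
db.command("profile", 1, slowms=100)

# Inspect how a query is executed: which index (if any) the planner chose.
plan = db.books.find({"rating": {"$gte": 4.5}}).explain()
print(plan["queryPlanner"]["winningPlan"])

# Review what the profiler has captured so far.
for op in db["system.profile"].find().sort("ts", -1).limit(5):
    print(op["op"], op.get("ns"), op.get("millis"), "ms")
```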
Hope this helps.
Use a cache to store objects. For example in Redis use Redis Lists
Redis Lists are simply lists of strings, sorted by insertion order
Then set an expiry on the key, either as a timeout or at a specific time.
Now whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Also, since the cache resides in memory, your fetches will be extremely fast compared to dedicated collections in MongoDB.
In addition, you don't have to have a dedicated machine for it; you can just deploy it on your application machine.
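A minimal cache-aside sketch of that pattern, assuming redis-py and pymongo with made-up key, collection and field names: the result list is kept in Redis with a TTL, and a miss triggers one MongoDB query to repopulate it.

```python
import json
import redis
from pymongo import MongoClient

# Hypothetical connection details, key name and schema, for illustration.
r = redis.Redis(host="localhost", port=6379)
db = MongoClient("mongodb://localhost:27017")["library"]

CACHE_KEY = "top_books"
CACHE_TTL = 300  # seconds; tune to how much staleness you can tolerate

def get_top_books():
    cached = r.lrange(CACHE_KEY, 0, -1)
    if cached:                                  # cache hit: MongoDB is never touched
        return [json.loads(item) for item in cached]

    # Cache miss (or expired key): run the expensive query once and repopulate.
    books = list(db.books.find({}, {"_id": 0, "title": 1, "rating": 1})
                         .sort("rating", -1).limit(100))
    if books:
        pipe = r.pipeline()
        pipe.delete(CACHE_KEY)
        pipe.rpush(CACHE_KEY, *[json.dumps(b) for b in books])
        pipe.expire(CACHE_KEY, CACHE_TTL)
        pipe.execute()
    return books
```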

How to handle large mongodb collection

We have a collection that is potentially going to be very large. This collection is used to store bill-related data, so it is often used for reporting/analytics purposes.
Please let me know the best approach to handle this large collection:
1) Can I split and archive the old data (say a 12-month period)? But the old data is still required for analytic reports; I want to query it to show the sales comparison for the past 2 years.
2) Can I move the old data (12 months) into a new collection, so that every 12 months I create a new collection? For report generation I'd have to query across all of these documents, so will this cause a performance problem?
3) Can I go for Sharding?
There are many variables to account for, the clearest being what hardware you use, how the data is structured, and how it is queried. A distributed network ought to be able to chew through your data faster than a single machine, but before diving into that solution I recommend generating an absurd amount of mock data comparable to what you are expecting, and then testing various approaches. Seriously. Create a bunch of data, and try to break things. It's fun! Soon enough you'll know more about what your problem requires than any website could tell you.
As for direct responses:
Perhaps, before archiving the data, appropriate stats summaries can be generated (or updated). Those summaries/simplifications can be used for sale comparisons without reloading all of the archived data they represent.
This strikes me as sensible. By splitting up the sales data, you have more control over how much data needs to be accessed. After all, a user won't always wish to see 3 years of data, they may only wish to see last week's.
Move to sharding when you actually need it. As is stated on the MongoDB site:
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
You'll know it's time when your memory-map approaches the server's RAM limit. MongoDB supports reading and writing to databases too large to keep in memory, but I'm sure you already know that is SLOW.
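To illustrate point 1 (summarize before archiving), here is a hedged sketch assuming a bills collection with date and amount fields: each month is rolled up into one small summary document, which is all the two-year sales comparison needs once the raw documents are archived.

```python
from datetime import datetime
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["billing"]  # hypothetical names

def summarize_month(year, month):
    """Roll one month of raw bill documents up into a single summary document,
    so the raw data can be archived without losing year-over-year comparisons."""
    start = datetime(year, month, 1)
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)

    summary = next(db.bills.aggregate([
        {"$match": {"date": {"$gte": start, "$lt": end}}},
        {"$group": {"_id": None,
                    "total": {"$sum": "$amount"},
                    "count": {"$sum": 1}}},
    ]), None)

    if summary:
        db.bill_summaries.update_one(
            {"year": year, "month": month},
            {"$set": {"total": summary["total"], "count": summary["count"]}},
            upsert=True,
        )
```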

MongoDB High Avg. Flush Time - Write Heavy

I'm using MongoDB with approximately 4 million documents and around 5-6GB database size. The machine has 10GB of RAM, and free only reports around 3.7GB in use. The database is used for a video game related ladder (rankings) website, separated by region.
It's a fairly write heavy operation, but still gets a significant number of reads as well. We use an updater which queries an outside source every hour or two. This updater then processes the records and updates documents on the database. The updater only processes one region at a time (see previous paragraph), so approximately 33% of the database is updated.
When the updater runs, and for the duration that it runs, the average flush time spikes up to around 35-40 seconds, and we experience general slowdowns with other queries. The updater is RUN on a SEPARATE MACHINE and only queries MongoDB at the end, when all the data has been retrieved and processed from the third party.
Some people have suggested slowing down the number of updates, or only updating players who have changed, but the problem comes down to rankings. Since we support ties between players, we need to pre-calculate the ranks - so if only a few users have actually changed ranks, we still need to update the rest of the users ranks accordingly. At least, that was the case with MySQL - I'm not sure if there is a good solution with MongoDB for ranking ~800K->1.2 million documents while supporting ties.
My question is: how can we improve the flush time and reduce the slowdowns we're experiencing? Why is it spiking so high? Would disabling journaling (to take some load off the I/O) help? Data loss isn't something I'm worried about, as the database is updated frequently regardless.
Server status: http://pastebin.com/w1ETfPWs
You are using the wrong tool for the job. MongoDB isn't designed for ranking large ladders in real time, at least not quickly.
Use something like Redis. Redis has something called a sorted set, designed just for this job; with it you can have 100 million entries and still fetch the 5,000,000th to 5,001,000th at sub-millisecond speed.
From the official site (Redis - Sorted sets):
Sorted sets
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given an user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
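A minimal redis-py sketch of such a ladder (key name and connection details are made up; note that Redis orders members with equal scores lexicographically, so the tie semantics from the question may still need handling on top):

```python
import redis

r = redis.Redis(host="localhost", port=6379)   # hypothetical connection
LADDER = "ladder:na"                           # one sorted set per region

def submit_score(player_id, score):
    # O(log N): adds the player or updates their existing score.
    r.zadd(LADDER, {player_id: score})

def top_players(n=100):
    # Highest scores first, returned with their scores.
    return r.zrevrange(LADDER, 0, n - 1, withscores=True)

def rank_of(player_id):
    # 0-based rank from the top, or None if the player isn't ranked.
    return r.zrevrank(LADDER, player_id)
```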
Without seeing any disk statistics, I am of the opinion that you are saturating your disks.
This can be checked with iostat -xmt 2, and checking the %util column.
Please don't disable journalling - you will only cause more issues later down the line when your machine crashes.
Separating collections will have no effect. Separating databases may, but if you're IO bound, this will do nothing to help you.
Options
If I am correct, and your disks are saturated, adding more disks in a RAID 10 configuration will vastly help performance and durability - more so if you separate the journal off to an SSD.
Assuming that this machine is a single server, you can setup a replicaset and send your read queries there. This should help you a fair bit, but not as much as the disks.
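For the replica-set suggestion, a hedged pymongo sketch (host names and set name are placeholders): with readPreference=secondaryPreferred in the connection string, reads are routed to a secondary when one is available, at the cost of possibly slightly stale results.

```python
from pymongo import MongoClient

# Hypothetical replica set members and set name.
client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com"
    "/?replicaSet=rs0&readPreference=secondaryPreferred"
)
ladder = client["game"]["players"]

# Reads go to a secondary when one is available, keeping the primary
# free for the heavy hourly update; secondaries can lag slightly.
top_100 = list(ladder.find({"region": "eu"}).sort("rank", 1).limit(100))
```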

Statistics (& money transfers) with mongoDB

1) My first question is regarding the best solution to store statistics with MongoDB.
If I want to store large amounts of statistics (let's say visitors on a specific site, down to hourly), a NoSQL DB like MongoDB seems to work very well. But how do I structure those collections to get the most out of MongoDB?
I'd increase the visitor count for that specific object id (for example SITE_MONTH_DAY_YEAR_SOMEOTHERFANCYPARAMETER) by one every time a user visits the page. But if the database gets big (>10 GB), doesn't that slow down (like it would on MySQL) because it has to search for the object_id and update it? Is the data always accurate when I update it (AFAIK MongoDB does not have any table locking)?
Wouldn't it be faster (and more accurate) to just INSERT one row for every visitor? On the other hand, reading the statistics would be much faster with my first solution, wouldn't it? (Especially in terms of "grouping" by site/date/[...].)
2) For every visitor counted I'd like to make a money transfer between two users. It is crucial that those transfers are always accurate. How would you achieve that?
I was thinking about an hourly cron that picks the number of visitors from mongoDB.statistics for the last hour and updates the users' balances. I'd prefer doing this directly/live while counting the visitor - but what happens if thousands of visitors are calling the script simultaneously? Is there any risk of getting wrong balances?
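For part 1), a hedged sketch (made-up collection and field names) of the insert-one-document-per-visitor approach being weighed: writes are cheap, contention-free inserts, and the hourly numbers are grouped at read time, which is slower to read than a pre-incremented counter but keeps every individual visit for auditing (useful for the payouts in part 2).

```python
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["analytics"]  # hypothetical names

def record_visit(site_id):
    # One small insert per visitor: no document lookup, nothing to contend on.
    db.visits.insert_one({"site": site_id, "ts": datetime.now(timezone.utc)})

def hourly_visits(site_id, since):
    # Group the raw events on demand; slower than reading pre-built counters,
    # but every individual visit is preserved.
    return list(db.visits.aggregate([
        {"$match": {"site": site_id, "ts": {"$gte": since}}},
        {"$group": {
            "_id": {"$dateToString": {"format": "%Y-%m-%d %H:00", "date": "$ts"}},
            "visits": {"$sum": 1},
        }},
        {"$sort": {"_id": 1}},
    ]))
```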

Too much data duplication in mongodb?

I'm new to this whole NoSQL stuff and have recently been intrigued with MongoDB. I'm creating a new website from scratch and decided to go with MongoDB/NoRM (for C#) as my only database. I've been reading up a lot about how to properly design your document model database and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I read, this is expected in the document model, and for performance it makes sense. I.e. you stick embedded objects into your document so it's fast to read - no joins; but of course you can't always embed, so MongoDB has this concept of a DbReference, which is basically analogous to a foreign key in relational DBs.
So here's an example: I have Users and Events; both get their own document. Users attend events, and Events have user attendees. I decided to embed a list of Events with limited data into the User objects. I also embedded a list of Users into the Event objects as their "attendees". The problem is that now I have to keep the Users in sync with the list of Users that is also embedded in the Event object. As I read it, this seems to be the preferred approach, and the NoSQL way to do things. Retrieval is fast, but the drawback is that when I update the main User document, I also need to go into the Event objects, possibly find all references to that user, and update those as well.
So the question I have is, is this a pretty common problem people need to deal with? How much does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?
Well, that is the trade-off with document stores. You can store data in a normalized fashion like any standard RDBMS, and you should strive for normalization as much as possible. It's only where normalization hurts performance that you should break it and flatten your data structures. The trade-off is read efficiency vs. update cost.
Mongo has really efficient indexes which can make normalizing easier, like a traditional RDBMS (most document stores do not give you this for free, which is why Mongo is more of a hybrid than a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a join table in a tabular data store. Index the event and user fields and it should be pretty quick, and it will help you normalize your data better.
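A hedged sketch of such a relation collection (names are made up): one tiny document per attendance, indexed in both directions so either side can be resolved with an indexed lookup followed by an $in query.

```python
from pymongo import MongoClient, ASCENDING
from pymongo.errors import DuplicateKeyError

db = MongoClient("mongodb://localhost:27017")["site"]  # hypothetical names

# One small document per attendance, much like a join table in an RDBMS.
db.user_events.create_index(
    [("user_id", ASCENDING), ("event_id", ASCENDING)], unique=True)
db.user_events.create_index([("event_id", ASCENDING), ("user_id", ASCENDING)])

def attend(user_id, event_id):
    try:
        db.user_events.insert_one({"user_id": user_id, "event_id": event_id})
    except DuplicateKeyError:
        pass  # already linked; the unique index keeps the collection clean

def attendees_of(event_id):
    user_ids = [d["user_id"] for d in db.user_events.find({"event_id": event_id})]
    return list(db.users.find({"_id": {"$in": user_ids}}))
```

Reading either side costs a second round trip, but updating a user never touches an event document, so there is nothing to keep in sync.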
I like to compare the efficiency of flattening a structure vs. keeping it normalized in terms of the time it takes me to update a record's data vs. reading out what I need in a query. You can do it in terms of big-O notation, but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much work is required.
Basically, what I do is first try to predict how often a record will be updated vs. how often it's read. Then I try to predict the cost of an update vs. a read when the data is normalized or flattened (or maybe a partial combination of the two... there are lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I've plotted all the variables, if keeping it flat saves me a bunch, then I will keep it flat.
A few tips:
If you require lookups to be quick and atomic (perfectly up to date), you may want to favor flattening over normalization and take the hit on the update.
If you require updates to be quick and the data to be accessible immediately, then favor normalization.
If you require fast lookups but don't require perfectly up-to-date data, consider building out your flattened views from the normalized data in batch jobs (possibly using map/reduce).
If your queries need to be fast, updates are rare, and you do not need your update to be visible immediately or need transaction-level guarantees that it was written to disk 100% of the time, you can consider writing your updates to a queue and processing them in the background. (In this model, you will probably have to deal with conflict resolution and reconciliation later.)
Profile different models. Build out a data query abstraction layer (like an ORM in a way) in your code so you can refactor your data store structure later.
There are a lot of other ideas you can employ. There are a lot of great blogs online that go into this, like highscalabilty.org, and make sure you understand the CAP theorem.
Also consider a caching layer, like Redis or memcached. I will put one of those products in front of my data layer. When I query Mongo (which is storing everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I invalidate any data in the cache that references what I'm updating. (Although you have to take into account, in your scaling factors, the time it takes to invalidate data and to track which cached data is being updated.) Someone once said, "The two hardest things in Computer Science are naming things and cache invalidation."
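A minimal sketch of that cache-plus-invalidation pattern, assuming redis-py, pymongo and made-up key and collection names: the flattened page is built from the normalized collections on a miss, cached with a TTL, and deleted whenever the underlying user document changes.

```python
import json
import redis
from pymongo import MongoClient

# Hypothetical connections and collection layout, for illustration only.
r = redis.Redis(host="localhost", port=6379)
db = MongoClient("mongodb://localhost:27017")["site"]

def get_user_page(user_id):
    key = f"user_page:{user_id}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    # Build the flattened representation from the normalized collections.
    user = db.users.find_one({"_id": user_id}, {"_id": 0, "name": 1})
    events = list(db.events.find({"attendee_ids": user_id}, {"_id": 0, "name": 1}))
    page = {"user": user, "events": events}
    r.set(key, json.dumps(page), ex=600)   # cache for at most 10 minutes
    return page

def rename_user(user_id, new_name):
    db.users.update_one({"_id": user_id}, {"$set": {"name": new_name}})
    r.delete(f"user_page:{user_id}")       # invalidate the flattened copy
```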
Try adding an IList<UserEvent> property to your User object. You didn't specify much about how your domain model is designed. Check the NoRM group at http://groups.google.com/group/norm-mongodb/topics for examples.