I am using "value eviction" for deleting data from RAM. Is there a way to find the oldest document with both data + metadata in the bucket (in RAM)?
No, there is no way to do what you're asking, because Couchbase doesn't expose these internal parameters. However, perhaps there is another way to accomplish whatever it is you're trying to do. Could you please describe what you want to accomplish?
Edited in response to the comment below:
As a general rule, you should use the default (value eviction) unless you have a compelling reason to switch to full eviction. Even though, ostensibly, you "save" more RAM by using full eviction, you're actually trading off performance on some operations that will now have to hit the disk instead of returning a response from memory. Specifically, cache misses or existence checks are more expensive (IO-wise), as well as some types of update operations.
Some cases where you would consider using full eviction are when your dataset is much larger than your working set (i.e. the 'hot' data that's accessed frequently and should be in cache), or when you have to store a very large number of small values, such as a GUID-to-GUID mapping, where the value is actually smaller than the metadata plus key.
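To make the small-values case concrete, here is a rough back-of-the-envelope sketch. The 56-byte per-item metadata figure is an assumption based on commonly quoted Couchbase sizing guidance and varies by version; the item count and the 36-character GUID text form are illustrative only.

```python
# Rough per-item RAM comparison for a GUID-to-GUID mapping.
# ASSUMPTION: about 56 bytes of metadata per item (varies by Couchbase version).
METADATA_BYTES = 56
KEY_BYTES = 36    # a GUID in its 36-character text form
VALUE_BYTES = 36  # the mapped GUID, same form


def pinned_bytes_with_value_eviction(items: int) -> int:
    """Key + metadata stay in RAM under value eviction, even when the value is ejected."""
    return items * (METADATA_BYTES + KEY_BYTES)


def ejectable_value_bytes(items: int) -> int:
    """Only the values themselves can be ejected under value eviction."""
    return items * VALUE_BYTES


items = 100_000_000  # illustrative item count
pinned = pinned_bytes_with_value_eviction(items)
values = ejectable_value_bytes(items)

print(f"key + metadata pinned in RAM: {pinned / 1e9:.1f} GB")
print(f"values that can be ejected:   {values / 1e9:.1f} GB")
```

Under these assumptions the part that can never leave RAM with value eviction is larger than the part that can, which is the situation where full eviction starts to look attractive.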
I am asking a question that I assume does not have a simple black-and-white answer, but the principle I'm asking about is clear.
Sample situation:
Let's say I have a collection of 1 million books, and I always want to pull the top 100 rated.
Let's assume that I need to perform an aggregate function every time I perform this query which makes it a little expensive.
It seems reasonable that, instead of running the query for every request (100-1000 a second), I would create a dedicated collection that stores only the top 100 books and gets updated every minute or so. Instead of running a difficult query 100 times every second, I run it once a minute and serve requests from a small collection that holds only those 100 books and requires no query (just get everything).
That is the principle I am questioning.
Should I create a dedicated collection for EVERY query that is often used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
Are there any guidelines for best practice in those types of situations?
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one has to differentiate between "real-time" data and data that does not need to be presented immediately. The rules for "real-time" systems are obviously much different.
Now to your example, starting from the end: the cache of query results. The answer is not specific to MongoDB. Data architects often use Redis or memcached (or other cache systems) to hold all types of information. This, though, is obviously a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of the available memory, and you do not want your cache to be useless by giving it too little.
In the book case, the top 100, since it is certainly not a real-time endeavor, it would make sense to cache the query result and serve requests from that cache. You could update the cache from a cron job, or based upon an update flag (which you create to inform your program that the top 100 have changed), and the system would then run the aggregation in the background.
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data which has to be searched to aggregate your response. It also depends upon your memory limitations and, let me add, on the whole server setup in terms of speed, cores and memory. In my humble opinion a cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I don't think anyone can give a black-and-white answer to that question for your system. Is a complicated query just a single aggregation? Or is it an $unwind followed by a whole slew of $group and other stages? This is really up to the dataset and how much information must actually be read, sifted and manipulated. It will affect your IO and, yes, again, the memory.
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See the answers above; this is directly connected to your other questions.
Finally:
Are there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the IO, study actual reads and writes on the collections.
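As a minimal sketch of timing procedures in code, the helper below records wall-clock durations per label. The names `timed`, `timings` and `run_top_books_query` are hypothetical, and the query function is only a stand-in for whatever MongoDB call you actually want to measure.

```python
import time
from contextlib import contextmanager
from statistics import mean

timings = {}  # label -> list of elapsed seconds


@contextmanager
def timed(label):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(label, []).append(time.perf_counter() - start)


def run_top_books_query():
    # Hypothetical stand-in for the real (expensive) database query.
    return sorted(range(1000), reverse=True)[:100]


for _ in range(5):
    with timed("top_books"):
        result = run_top_books_query()

runs = timings["top_books"]
print(f"top_books: {len(runs)} runs, mean {mean(runs) * 1000:.3f} ms")
```

Collecting these numbers alongside memory and IO figures from your monitoring gives you something concrete to compare before and after introducing a cache or a dedicated collection.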
Hope this helps.
Use a cache to store objects. For example, in Redis, use Redis Lists:
Redis Lists are simply lists of strings, sorted by insertion order.
Then set the expiry to either a timeout or a specific time.
Now whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Also, since the cache resides in memory, your fetches will be extremely fast compared to a dedicated collection in MongoDB.
In addition to that, you don't have to keep a dedicated machine for it; just deploy it on your application machine.
We have production servers with a high volume of data in value eviction buckets. Since we are running out of memory, we have decided to change the eviction mode to full eviction. If we do this:
Is there any impact on live operations?
Is there any process running (e.g. rebalancing)?
What are the pros and cons?
Yes, there are some. Not many, but the operation requires the memcached processes to be restarted on all nodes at the same time and the caches to be warmed up again, so you will of course incur downtime. How much depends on a few factors.
Not that I can think of. It just has to restart the processes.
Pros: You have more room in RAM, as the metadata is now ejected in addition to the values. Cons: If your code does any operation that checks for the existence of an object first, it will be much slower. For example, if you do an upsert, the DB has to check whether that object exists as part of the process. With value eviction, it checks for the object's metadata in RAM, which is super quick: the object ID is either there or not. With full eviction, Couchbase now has to go to disk to look through the metadata there. As you might imagine, there is a penalty for that, which depending on some factors could be large.
IMO, running out of memory is not a good enough reason to move to full eviction. You need to have a functional reason. Without knowing more (resident ratios, RAM size, cache sizes, etc.), you are probably better off adding more servers or larger ones, your choice. Keeping the cluster properly sized is critical to a well-functioning system for most databases, and especially for Couchbase. If you have an Enterprise contract with Couchbase, their Support team can help you with this. If not, read the documentation on this REALLY carefully before you turn on this feature. Like I said, have more than "I am running out of RAM" as the reason you are changing how the DB works; otherwise you may be doing more harm than good.
Items stored in Memcached seem to disappear without reason (TTL: 86400 but sometimes gone within 60s). However there's enough free space, and stats give zero evictions.
The items that get lost seem to be the larger items. They seem to disappear after adding some other big items. Could it be that the slab for larger items is full and items are being evicted without being reported?
Memcached version 1.4.5.
Keys can get evicted before their expiration in memcached; this is a side effect of how memcached handles memory (see this answer for more details).
If the items you are storing are large enough that this is becoming a problem, memcached may be the wrong tool for the task you are trying to perform. You essentially have 2 practical options in this scenario:
break down the data you're trying to cache in smaller chunks
if this isn't feasible for any reason, you will have to use some sort of permanent storage, the nature of which will be dependent on the nature of data you're trying to store (choices would include redis, mongodb, SQL database, filesystem, etc.)
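The first option, breaking the data into smaller chunks, could look roughly like the sketch below. A plain `dict` stands in for the memcached client, and `set_chunked`, `get_chunked` and the 512 KiB chunk size are hypothetical choices, picked here only to stay well under memcached's default 1 MB item limit.

```python
CHUNK_SIZE = 512 * 1024  # ASSUMPTION: bytes per chunk, below the default 1 MB item limit


def set_chunked(cache, key, data, chunk_size=CHUNK_SIZE):
    """Store `data` as several smaller entries plus a chunk count."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for index, chunk in enumerate(chunks):
        cache[f"{key}:{index}"] = chunk
    cache[f"{key}:count"] = len(chunks)  # written last, once all chunks are in place


def get_chunked(cache, key):
    """Reassemble the value, or return None if any piece is missing."""
    count = cache.get(f"{key}:count")
    if count is None:
        return None
    parts = []
    for index in range(count):
        chunk = cache.get(f"{key}:{index}")
        if chunk is None:
            return None  # one chunk was evicted: treat the whole value as a miss
        parts.append(chunk)
    return b"".join(parts)


cache = {}  # stand-in for a memcached client
payload = bytes(range(256)) * 8192  # 2 MiB of sample data
set_chunked(cache, "report", payload)
restored = get_chunked(cache, "report")
```

Because any single chunk can still be evicted on its own, the reader has to treat a partial result as a miss and rebuild the value from its original source.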
First of all, I apologize for my potentially shallow understanding of NoSQL architecture (and databases in general) so try to bear with me.
I'm thinking of using MongoDB to store resources associated with a UUID. The resources can be things such as large image files (tens of megabytes), so it makes sense to store them as files and store just links in my database along with the associated metadata. There's also the added flexibility of decoupling the actual location of the resource files, so I can use a different third party to store the files if I need to.
Now, one document which describes a resource would be about 1 kB. At first I expect a couple hundred thousand resource documents, which would equal some hundreds of megabytes in database size, easily fitting into server memory. But in the future I might have to scale this to the order of tens of MILLIONS of documents. This would be tens of gigabytes which I can't squeeze into server memory anymore.
Only the index could still fit in memory, being around a gigabyte or two. But if I understand correctly, I'd have to read from disk every time I did a lookup on a UUID. Is there a substantial speed benefit from MongoDB over a traditional relational database in such a situation?
BONUS QUESTION: is there an existing, established way of doing what I'm trying to achieve? :)
MongoDB doesn't suddenly become slow the second the entire database no longer fits into physical memory. MongoDB currently uses a storage engine based on memory-mapped files. This means data that is accessed often will usually be in memory (OS managed, but assume an LRU scheme or something similar).
As such, it may not slow down at all at that point, or only slightly; it really depends on your data access patterns. It is a similar story with indexes: if your index is right-balanced (new keys are always added at the right-hand end, as with ever-increasing IDs or timestamps) and your use case allows it, you can have a huge index with only a fraction of it in physical memory and still get very decent performance, with the majority of index hits happening in physical memory.
Because you're talking about UUIDs, this might all be a bit hard to achieve, since there's no guarantee that the same limited group of users is generating the vast majority of throughput. In those cases sharding really is the most appropriate way to maintain quality of service.
This would be tens of gigabytes which I can't squeeze into server memory anymore.
That's why MongoDB gives you sharding to partition your data across multiple mongod instances (or replica sets).
In addition to considering sharding, or maybe even before, you should also try to use covered indexes as much as possible, especially if they fit your use cases.
This way you do not HAVE to load entire documents into memory. Your indexes can help out.
http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-CoveredIndexes
If you have to display your entire document all the time based on the id, then the general rule of thumb is to attempt to keep the working set in memory.
http://blog.boxedice.com/2010/12/13/mongodb-monitoring-keep-in-it-ram/
This is one of the resources that talks about that. There is a video on mongodb's site too that speaks about this.
If you size the RAM so that the working set fits in memory, you do not have to shard right away; you can always add sharding later, which will improve the scalability of your app over time.
Again, these are not absolute statements. They are general guidelines, and you should think through your usage patterns and make sure that they are relevant to what you are doing.
Personally, I have not had the need to fit everything in ram.
I use memcached to store the integer result of a complex calculation. I've got hundreds of integer objects that I could cache! Should I cache them under a single key in a more complex object or should I use hundreds of different keys for the objects? (the objects I'm caching do not need to be invalidated more than once a day)
I would say lots of little keys. This way you can get the exact result you want in 1 call with minimal serialization effort.
If you store it in another object (an array for example) you will have to fetch the array from cache and then fetch the item you actually want again from that array, plus you have the overhead of serializing/deserializing the whole complex object again. Depending on your language of choice this might mean manually writing a serialization/deserialization function from scratch.
I wrote a somewhat large analysis at http://dammit.lt/2008/12/25/memcached-for-small-objects/ - it outlines how to optimize memcached for small object storage, and it may shed quite some light on the issue.
It depends on your application. While memcached is very fast, it does require some request transmission and memory lookup time per request. Those numbers increase depending on whether or not the server is on the local machine (localhost), on the local network, or across a wide area. The size of your cache generally doesn't affect the lookup speed.
So, if your application is using MANY objects per processing unit (per request, method, or what-have-you), then it's generally better to define your cache in a way which lowers total number of hits to the cache while at the same time trying not to duplicate cache data. Like everything else, it's a balance.
For example, if you have a web request which pulls a list of blog posts, it would be more beneficial to cache the entire object list as one memcached key, rather than (and this is a somewhat bad example, obviously) caching an array of cache keys for that list, each of which points to an individually cached object.
The less processing you have to do of the cached values, the better. So why not just dump them into the cache individually?
I would say you should store values individually and use some kind of helper class to retrieve values with multiget and generate a complex dataobject for you.
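A minimal sketch of such a helper class is shown below. `FakeCache` is an in-memory stand-in so the example runs on its own; real memcached clients typically offer a comparable batch call (named, for example, `get_multi` or `get_many`, depending on the library). `ScoreReader`, `Scores` and the `score:` key prefix are hypothetical names.

```python
from dataclasses import dataclass


class FakeCache:
    """In-memory stand-in for a memcached client with a batch-get call."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get_multi(self, keys):
        # Like a real multiget: only keys that exist are returned.
        return {k: self._data[k] for k in keys if k in self._data}


@dataclass
class Scores:
    values: dict   # name -> cached integer
    missing: list  # names that were not in the cache


class ScoreReader:
    """Fetches many small cached integers in one round trip."""

    def __init__(self, cache, prefix="score:"):
        self._cache = cache
        self._prefix = prefix

    def load(self, names):
        keys = [self._prefix + name for name in names]
        found = self._cache.get_multi(keys)  # a single batch request
        values = {
            name: found[self._prefix + name]
            for name in names
            if self._prefix + name in found
        }
        missing = [name for name in names if name not in values]
        return Scores(values=values, missing=missing)


cache = FakeCache()
cache.set("score:alpha", 10)
cache.set("score:beta", 20)

scores = ScoreReader(cache).load(["alpha", "beta", "gamma"])
print(scores.values, scores.missing)
```

The caller gets one composed object back, while each integer stays individually addressable and can be recomputed on its own when it shows up in `missing`.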
It depends on what those numbers are. If you could, for example, group them into ranges, then you could optimize the storage. Hashing them into a map or hashtable and storing that map serialized in memcached would be good too.
Anyway, you can store many little keys; just make sure you configure the slabs to use small chunk sizes, so you do not waste memory.