Couchbase - Value Eviction to Full Eviction change for huge database - spring-data

We have production servers holding a high volume of data in Value Eviction buckets. Since we are running out of memory, we have decided to change the eviction mode to Full Eviction. If we do this:
Is there any impact for live operations ?
Is there any process that runs? (e.g., rebalancing)
What are the pros and cons ?

Yes, there is. Not much, but the operation requires the memcached processes to be restarted on all nodes at the same time and the caches to warm up again, so you will incur some downtime. How much depends on a few factors.
Not that I can think of. It just has to restart the processes.
Pros: you have more room in RAM, as the metadata is now ejected in addition to the values. Cons: if your code does any operation that first checks for the existence of an object, it will be much slower. For example, an upsert has to check whether the object exists as part of the process. Under value eviction, it checks for the metadata object in RAM, which is very quick; that object ID is either there or not. Under full eviction, Couchbase now has to go to disk and look through the metadata there. As you might imagine, there is a penalty for that, and depending on several factors it can be large.
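To make that penalty concrete, here is a minimal sketch using the Couchbase Python SDK (4.x-style imports; the connection string, credentials, bucket, and keys are placeholders, not from the original answer):

```python
# Minimal sketch with the Couchbase Python SDK (4.x-style API).
# Connection string, credentials, and names are placeholders.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster("couchbase://localhost",
                  ClusterOptions(PasswordAuthenticator("user", "password")))
collection = cluster.bucket("my-bucket").default_collection()

# exists() only needs the document's metadata. With value eviction that
# metadata is always resident in RAM; with full eviction a miss forces a
# disk lookup before the answer comes back.
print(collection.exists("order::1234").exists)

# upsert() performs the same existence check internally, so it pays the
# same full-eviction penalty when the document's metadata is not resident.
collection.upsert("order::1234", {"status": "shipped"})
```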
IMO, running out of memory is not a good enough reason to move to full eviction; you need a functional reason. Without knowing more (resident ratios, RAM size, cache sizes, etc.), you are probably better off adding more servers or larger ones, your choice. Keeping the cluster properly sized is critical to a well-functioning system for most databases, but especially for Couchbase. If you have an Enterprise contract with Couchbase, their Support team can help you with this. If not, read the documentation on this REALLY carefully before you turn the feature on. Like I said, have more than "I am running out of RAM" as the reason you are changing how the DB works, otherwise you may do more harm than good.

Related

What does "mostly-memory" mean?

ArangoDB describes itself as a "mostly-memory" database, but I am not clear on the implications. The FAQ gives very little detail:
ArangoDB is a “mostly memory” database, which means that it
appreciates RAM very much and is most performing when it is not forced
to swap data to the hard disk.
I am looking at running ArangoDB on a Raspberry Pi to serve two or three users. What are the implications of "mostly-memory" in such a context?
If it is unplugged for some reason, will I lose data?
It mostly works in memory, but it also does some work on disk.
More precisely, it works with memory-mapped files, so all operations will eventually be saved to disk (or equivalent long-term storage), but because it does not (by default) wait for persistence to disk to happen, it gains performance from this.
The implication is that with this default you get better performance than you might otherwise expect, but if something brings the database down before the save has happened (especially a sudden power failure), you could lose data or end up with a corrupt database.
If you configure a collection for immediate synchronisation you protect against this, but performance suffers.
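As an illustration, here is a hedged sketch of requesting immediate synchronisation when creating a collection via ArangoDB's documented HTTP API (the endpoint, database, collection name, and credentials are placeholders):

```python
# Sketch: create a collection with waitForSync=true via ArangoDB's HTTP API.
# Endpoint, database, collection name, and credentials are placeholders.
import requests

resp = requests.post(
    "http://localhost:8529/_db/_system/_api/collection",
    json={
        "name": "important_stuff",
        # Every write must be synced to disk before it is acknowledged;
        # safer against sudden power loss, slower than the default.
        "waitForSync": True,
    },
    auth=("root", "password"),
)
resp.raise_for_status()
```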

Key Value storage without a file system?

I am working on an application where we are writing lots and lots of key-value pairs. In production the database size will run into hundreds of terabytes, possibly multiple petabytes. The keys are 20 bytes and the values are at most 128 KB, very rarely smaller than 4 KB. Right now we are using MongoDB. The performance is not very good, because obviously there is a lot of overhead here: MongoDB writes to the file system, which writes to LVM, which in turn writes to a RAID 6 array.
Since our requirement is very basic, I think using a general-purpose database system is hurting performance. I was thinking of implementing a simple database system where we could put the documents (or 'values') directly on the raw drive (actually the RAID array), and store the keys (with a pointer to where the value lives on the raw drive) in a fast in-memory database backed by an SSD. This would also speed up reads, as there would be no fragmentation (as opposed to using a filesystem).
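As a minimal sketch of the idea (an append-only file standing in for the raw device, a plain dict standing in for the SSD-backed key index; all names are illustrative):

```python
# Minimal sketch of the proposed design: values appended to one big file
# (standing in for the raw device), keys held in an in-memory index that
# maps key -> (offset, length). Free-space management is omitted.
import os

class RawKVStore:
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}  # key -> (offset, length); would live on the SSD-backed store

    def put(self, key, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)        # values always appended, never rewritten
        self.f.flush()
        self.index[key] = (offset, len(value))

    def get(self, key) -> bytes:
        offset, length = self.index[key]
        self.f.seek(offset)        # one seek, one contiguous read: no fragmentation
        return self.f.read(length)

store = RawKVStore("/tmp/values.dat")
store.put(b"k" * 20, b"some document payload")
print(store.get(b"k" * 20))
```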
Although a document is rarely deleted, we would still have to maintain a pool of free space available on the device (something that the filesystem would have provided).
My question is: would this really provide any significant improvement? Also, are there any document storage systems that do something like this, or anything similar that we could use as a starting point?
Apache Cassandra jumps to mind. It is currently the NoSQL solution of choice where massive scaling is concerned, and it sees production usage at several large companies with massive scaling requirements. Having worked a little with it, I can say that it takes a bit of time to rethink your data model to fit how it arranges its storage engine. The famously cited article "WTF is a supercolumn" gives a sound introduction to this. Caveat: Cassandra really only makes sense when you plan on storing huge datasets and distribution with no single point of failure is a mission-critical requirement. With the way you've explained your data, it sounds like a fit.
Also, have you looked into Redis at all, at least for storing the key references? Your memory requirements far outstrip what a single instance could handle, but Redis can also be configured to shard. It isn't its primary use case, but it sees production use at both Craigslist and Groupon.
Also, have you done everything possible to optimize Mongo, especially investigating how you could improve indexing? Mongo does save out to disk, but it should be relatively performant when tuned to keep the hottest portion of the set in memory.
Is it possible to cache this data, if it's not too transient?
I would totally caution you against rolling your own here. Just a fair warning. That's not a knock at you or anyone else; it's just that I've personally had to maintain custom "data indexes" written by in-house developers who got in way over their heads. At my job we have a massive on-disk key-value store, written by a developer who has since left the company, that is a major performance bottleneck in our system. It's frustrating to be stuck with such a solution among the exciting NoSQL opportunities of today. Projects like the ones I cited above draw on the whole strength of the open-source community to prove and optimize their approach; that isn't something you will be able to match working on your own solution unless you make a massive investment of time, effort, and promotion. At the very least I'd encourage you to look at all your NoSQL options and maybe find a project you can contribute to rather than rolling your own. Writing a database server is definitely a nontrivial task that needs a big team, especially with the requirements you've given (but should you end up doing so, I wish you luck! =) )
Late answer, but for future reference: I think Spider does this.

Extremely high QPS - DynamoDB vs MongoDB vs other noSQL?

We are building a system that will need to serve loads of small requests from day one. By "loads" I mean ~5,000 queries per second. For each query we need to retrieve ~20 records from a NoSQL database. There will be two batch reads: 3-4 records at first, then 16-17 reads immediately after that (based on the result of the first read). That would be ~100,000 objects to read per second.
So far we have been thinking about using DynamoDB for this, as it's really easy to start with.
Storage is not something I would be worried about as the objects will be really tiny.
What I am worried about is the cost of reads. DynamoDB costs $0.0113 per hour per 100 eventually consistent (which is fine for us) reads per second. That is $11.30 per hour for us, provided that all objects are up to 1 KB in size, and $5,424 per month based on 16 hours/day of average usage.
So... $5,424 per month.
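For reference, here is the arithmetic behind those numbers, using the pricing figures quoted above:

```python
# Reproducing the cost estimate from the figures quoted above.
reads_per_second = 5_000 * 20          # ~100,000 object reads per second
price_per_100_reads_hour = 0.0113      # $/hour per 100 eventually consistent reads/sec

hourly = reads_per_second / 100 * price_per_100_reads_hour   # -> 11.3
monthly = hourly * 16 * 30                                   # 16 h/day, 30 days -> 5424.0
print(hourly, monthly)
```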
I would consider other options but I am worried about maintenance issues, costs etc. I have never worked with such setups before so your advice would be really valuable.
What would be the most cost-effective (yet still hassle-free) solution for such read/write intensive application?
From your description above, I'm assuming that your 5,000 queries per second are entirely read operations. This is essentially what we'd call a data warehouse use case. What are your availability requirements? Does it have to be hosted on AWS and friends, or can you buy your own hardware to run in-house? What does your data look like? What does the logic which consumes this data look like?
You might get the sense that there really isn't enough information here to answer the question definitively, but I can at least offer some advice.
First, if your data is relatively small and your queries are simple, save yourself some hassle and make sure you're querying from RAM instead of disk. Any modern RDBMS with support for in-memory caching/tablespaces will do the trick. Postgres and MySQL both have features for this. In the case of Postgres, make sure you've tuned the memory parameters appropriately, as the out-of-the-box configuration is designed to run on pretty meager hardware. If you must use a NoSQL option, depending on the structure of your data Redis is probably a good choice (it's also primarily in-memory). However, in order to say which flavor of NoSQL might be the best fit, we'd need to know more about the structure of the data that you're querying and what queries you're running.
If the queries boil down to SELECT * FROM table WHERE primary_key = {CONSTANT} - don't bother messing with NoSQL - just use an RDBMS and learn how to tune the dang thing. This is doubly true if you can run it on your own hardware. If the connection count is high, use read slaves to balance the load.
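As a sketch of that last point, reads can be fanned out across replicas while writes go to the primary (the hostnames, table, and the psycopg2 choice are assumptions for illustration, not part of the original answer):

```python
# Sketch: route primary-key lookups to read replicas, writes to the primary.
# Hostnames, credentials, and table name are placeholders.
import random
import psycopg2

PRIMARY = "host=db-primary dbname=app user=app"
REPLICAS = [
    "host=db-replica-1 dbname=app user=app",
    "host=db-replica-2 dbname=app user=app",
]

def lookup(pk):
    # Simple load balancing: pick a random replica for each read.
    with psycopg2.connect(random.choice(REPLICAS)) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM my_table WHERE id = %s", (pk,))
            return cur.fetchone()
```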
Long-after-the-fact Edit (5/7/2013):
Something I should've mentioned before: EC2 is a really, really crappy place to measure the performance of self-managed database nodes. Unless you're paying through the nose, your I/O performance will be terrible. Your choices are to pay big money for provisioned IOPS, RAID together a bunch of EBS volumes, or rely on ephemeral storage whilst syncing a WAL off to S3 or similar. All of these options are expensive and difficult to maintain, and they deliver varying degrees of performance.
I discovered this for a recent project, so I switched to Rackspace. The performance increased tremendously there, but I noticed that I was paying a lot for CPU and RAM resources when really I just need fast I/O. Now I host with Digital Ocean. All of DO's storage is SSD. Their CPU performance is kind of crappy in comparison to other offerings, but I'm incredibly I/O bound so I Just Don't Care. After dropping Postgres' random_page_cost to 2, I'm humming along quite nicely.
Moral of the story: profile, tune, repeat. Ask yourself what-if questions and constantly validate your assumptions.
Another long-after-the-fact-edit (11/23/2013): As an example of what I'm describing here, check out the following article for an example of using MySQL 5.7 with the InnoDB memcached plugin to achieve 1M QPS: http://dimitrik.free.fr/blog/archives/11-01-2013_11-30-2013.html#2013-11-22
By "loads" I mean ~5,000 queries per second.
Ah, that's not so much; even an SQL database can handle that. So you are already easily within the limits of what most modern DBs can handle. However, they can only handle this with the right:
Indexes
Queries
Server Hardware
Splitting of large data (you might require a large number of shards with relatively little data each; it depends, hence "might")
That would be ~100,000 objects to read per second.
Now that's more of a high-load scenario. Must you read these in such a fragmented manner? If so then, as I said, you may need to look into spreading the load across replicated shards.
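If the reads must keep this two-phase shape, batching each phase into a single round trip at least cuts the request count; a hedged pymongo sketch (the collection layout and field names are assumptions):

```python
# Sketch: do the two-phase read as two batched queries rather than ~20
# individual lookups. Collection layout and field names are assumptions.
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["app"]["records"]

initial_ids = ["a1", "a2", "a3"]  # the 3-4 keys known up front
first_batch = list(coll.find({"_id": {"$in": initial_ids}}))

# The first read tells us which 16-17 records to fetch next; grab them
# in one $in query instead of one query per record.
follow_up_ids = [ref for doc in first_batch for ref in doc.get("refs", [])]
second_batch = list(coll.find({"_id": {"$in": follow_up_ids}}))
```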
Storage is not something I would be worried about as the objects will be really tiny.
Mongo is aggressive with disk allocation, so even with small objects it will still pre-allocate a lot of space; this is something to bear in mind.
So... $5,424 per month.
Oh yeah, the billing thrills of Amazon :\.
I would consider other options but I am worried about maintenance issues, costs etc. I have never worked with such setups before so your advice would be really valuable.
Now you hit the snag of it all. You can set up your own cluster, but then you might end up paying that much in money and time (or way more) for the servers, people, admins, and your own maintenance time. This is one reason why DynamoDB really shines here: it takes the load, pain, and stress of server management (trust me, it is really painful; if you're a dev you might as well change your job title to server admin from now on) off of the company.
To set this up yourself, you would need:
A considerable number of EC2 instances (dependent upon data and index size, but I would say close to maybe 30?)
A server admin (maybe 2, maybe freelance?)
Both of which could set you back hundreds of thousands of pounds a year. I would personally opt for the managed approach if it fits your needs and budget. When your needs grow beyond what the managed Amazon DB can give you, then move to your own infrastructure.
Edit
I should clarify that this cost comparison was made with quite a few unknowns, for example:
I am unsure of the amount of data you have
I am unsure of your write volume
Both of these led me to assume a scenario of:
Massive writes (about as much as you read)
Massive data (lots)
Here is what I recommend in sequence.
Identify your use case and choose the right DB. We test MySQL and MongoDB regularly for all kinds of workloads (OLTP, analytics, etc.). In every case we have tested, MySQL outperforms MongoDB and is cheaper ($/TPS). MongoDB has other advantages, but that is another story; we are talking about performance here.
Try to cache your queries in RAM (by provisioning adequate RAM); see the sketch after this list.
If you are bottlenecked on RAM, then you can try an SSD caching solution that takes advantage of ephemeral SSDs. This works if your workload is cache-friendly. You can save loads of money, as ephemeral SSD storage is typically not charged for by the cloud provider.
Try PIOPS/RAID or a combination to create adequate IOPS for your application.
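As a minimal sketch of point 2, a read-through cache in front of the database (the Redis choice, key scheme, and TTL are assumptions for illustration):

```python
# Sketch: read-through cache in front of the database. Connection details,
# key scheme, and the 5-minute TTL are assumptions.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_record(record_id, fetch_from_db):
    key = f"record:{record_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # hit: served from RAM
    row = fetch_from_db(record_id)         # miss: fall through to the database
    r.setex(key, 300, json.dumps(row))     # cache the result for 5 minutes
    return row
```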

mongoDB vs relational databases when data can't fit into memory?

First of all, I apologize for my potentially shallow understanding of NoSQL architecture (and databases in general) so try to bear with me.
I'm thinking of using MongoDB to store resources associated with a UUID. The resources can be things such as large image files (tens of megabytes), so it makes sense to store them as files and keep just the links in my database, along with the associated metadata. There's also the added flexibility of decoupling the actual location of the resource files, so I can use a different third party to store the files if I need to.
Now, one document describing a resource would be about 1 kB. At first I expect a couple hundred thousand resource documents, which would equal some hundreds of megabytes in database size, easily fitting into server memory. But in the future I might have to scale this into the order of tens of MILLIONS of documents. That would be tens of gigabytes, which I can't squeeze into server memory anymore.
Only the index could still fit in memory, being around a gigabyte or two. But if I understand correctly, I'd have to read from disk every time I did a lookup on a UUID. Is there a substantial speed benefit from MongoDB over a traditional relational database in such a situation?
BONUS QUESTION: is there an existing, established way of doing what I'm trying to achieve? :)
MongoDB doesn't suddenly become slow the second the entire database no longer fits into physical memory. MongoDB currently uses a storage engine based on memory-mapped files, which means data that is accessed often will usually be in memory (OS-managed, but assume an LRU scheme or something similar).
As such, it may not slow down at all at that point, or only slightly; it really depends on your data access patterns. It's a similar story with indexes: if you balance your index appropriately and your use case allows it, you can have a huge index with only a fraction of it in physical memory and still get very decent performance, with the majority of index hits happening in physical memory.
Because you're talking about UUIDs, this might all be a bit hard to achieve, since there's no guarantee that the same limited group of users is generating the vast majority of the throughput. In those cases sharding really is the most appropriate way to maintain quality of service.
This would be tens of gigabytes which I can't squeeze into server memory anymore.
That's why MongoDB gives you sharding to partition your data across multiple mongod instances (or replica sets).
In addition to considering sharding, or maybe even before, you should also try to use covered indexes as much as possible, especially if they fit your use cases.
This way you do not HAVE to load entire documents into memory. Your indexes can help out.
http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-CoveredIndexes
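A hedged pymongo illustration of a covered query (the field names are assumptions): the index contains every field the query filters on and returns, so the documents themselves never have to be loaded.

```python
# Sketch: a covered query in pymongo. The compound index on (uuid, url)
# can answer the query by itself, so MongoDB never loads the full
# documents into memory. Field names are illustrative.
from pymongo import ASCENDING, MongoClient

coll = MongoClient("mongodb://localhost:27017")["app"]["resources"]
coll.create_index([("uuid", ASCENDING), ("url", ASCENDING)])

# Project only indexed fields and exclude _id; otherwise the query is
# no longer covered and MongoDB must fetch the documents from disk.
doc = coll.find_one({"uuid": "example-uuid"},
                    {"_id": 0, "uuid": 1, "url": 1})
```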
If you have to display your entire document every time based on the ID, then the general rule of thumb is to try to keep the working set in memory.
http://blog.boxedice.com/2010/12/13/mongodb-monitoring-keep-in-it-ram/
This is one of the resources that talks about that. There is a video on mongodb's site too that speaks about this.
If you size the RAM so that the working set fits in memory, and also plan for sharding, you will not have to shard right away; you can always add it later. This will improve the scalability of your app over time.
Again, these are not absolute statements; they are general guidelines. Think through your usage patterns and make sure they are relevant to what you are doing.
Personally, I have not had the need to fit everything in ram.

Which NoSQL database for extremely high volumes of data

I'm looking at NoSQL for extremely high volumes of data. We're storing cached versions of web page text in MySQL at the moment, but it seems like the database will get huge very quickly.
My requirements are:
Durability, must not lose data on flushes/writes
Very fast read, reasonably fast write
Fully consistent replication
Preferably, in-memory plus an eventual disk write
I'm looking at MongoDB, Redis, Riak, and Cassandra right now.
Which best fits my requirements?
I have experience with Redis and MongoDB, but would not recommend either for your use case. Redis is awesome in every regard, but since it's RAM-only and has no clustering features (yet; they are in development), it doesn't scale very well. I wouldn't ever use MongoDB again for anything that needs more than a small replica set.
Basically, MongoDB is immature and completely unsuitable for any kind of high volume, high performance requirements. It has a global write lock which is held during disk flushes, which means that performance can vary wildly depending on what you do. In practice it makes updates that grow documents impossible, and you need to be very careful with deletes, too. Speaking of deletes, they fragment the database severely, so if you do a lot of deletes your performance is going to suffer.
Sharding in 1.8.0 through 1.8.1 was a disaster. There were complete showstopper bugs that should never have made it into a stable release: configuration wasn't flushed properly, and it was very easy to get your database into a bad state where chunks never moved off the primary shard. 1.8.2 solves most of them and seems more stable, but I don't trust the sharding implementation one bit. Add to this that sharding is hard even when everything works: it's not always easy to select a natural shard key, and if you don't, sharding will cause you much grief.
MongoDB is really easy to work with and the feature set is really nice. The documentation, the drivers, and the community are all great. MongoDB works great as a replacement for MySQL, but don't use it for anything that needs to scale out.
We're currently looking at moving to Cassandra. I find the Dynamo model (e.g. no master nodes; write and read anywhere; simply add nodes to grow the cluster) compelling, and the features are more or less right for us. The data model is schemaless just like MongoDB's, although a little more limited (you can choose between one- or two-level hashes, basically). I'm sure the community is good once you get into it, but so far I find it hard to find good information on how to solve common problems, and the documentation is lacking. Most of the information you find on blogs is a year old, and a lot has happened since then (0.7 and 0.8 both seem to be really significant updates, but most things you find are about 0.6). The drivers are also not very mature or well documented, from what I've seen so far, and everyone seems to be squabbling about whether Thrift, Avro, or CQL should be used (and that has changed from 0.6 to 0.7 to 0.8).
Riak is interesting, for the same reasons as Cassandra, but for us a pure key-value-store is not enough, we need to be able to update without first doing a read. With Riak this isn't possible since the values are just blobs. This sounds like it wouldn't be an issue for you though.
HBase is another contender. It seems like a pain to set up and run because of the many different pieces, ZooKeeper, HDFS, etc. But the data model is similar to Cassandra (columnar, i.e. one level hashes), which works well for us, but may not be important for you. It seems tried and true, but as with MongoDB you have to watch out for sharding issues, you must put some thought into your keys or you get into trouble.
There is also CouchDB, Project Voldemort, and countless other possible choices. I think that if you are serious about "extremely high volumes of data" then it's between Cassandra, Riak, and HBase. Strike Riak if pure key-value storage isn't enough. Depending on what you mean by "fully consistent replication", Cassandra and Riak are out, because there is a possibility (not necessarily big, and tunable) of reading a stale value.
In the end you obviously have to try it out on your particular use case, so all you really should take home from this answer is: don't bother with MongoDB.
Store the cached versions in memcached instead of MySQL. That will eliminate most writes. Writing to MySQL is bad because it kills the query cache. When you cache the pages in memcached, you will have far fewer writes to the database, and less read pressure too. You can cache the results of complex queries, or entire pages, as you like.
Maybe it won't be as fast as Cassandra, but it will give you an enormous boost compared to your current situation with only MySQL. And you won't have to rewrite your entire application.
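As a minimal sketch of that approach using pymemcache (the server address, key scheme, and one-hour TTL are assumptions):

```python
# Sketch: cache rendered pages in memcached instead of writing them to MySQL.
# Server address, key scheme, and TTL are assumptions.
from pymemcache.client.base import Client

mc = Client(("localhost", 11211))

def get_page(url, render):
    cached = mc.get(url)
    if cached is not None:
        return cached                 # hit: the database is never touched
    page = render(url)                # miss: rebuild the page
    mc.set(url, page, expire=3600)    # keep it for an hour
    return page
```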
memcachedb - memcached protocol, BDB storage, replication, etc.
HandlerSocket - MySQL InnoDB plugin.
Oracle memcached InnoDB plugin
RavenDB can store up to 16 TB of data per node, and you can have several nodes per machine acting as one database using its built-in sharding support. That's as huge as it gets.
Durability, speed, and replication are all there, and running in memory is supported too (but not recommended if you want to scale to 16 TB per node).
For extremely high-volume data, Cassandra and Hadoop/HBase are clearly far superior to the others for this task. Cassandra has proved itself on large clusters of around 400 nodes. RDBMSs cannot scale out easily, and Mongo has some problems as node counts start to increase: http://www.nosqlbenchmarking.com/2011/05/paper-on-elasticity-and-scalability-for-acm-socc-2011/
Serdar