Heroku Postgres RAM for cache vs Memcache RAM - postgresql

I have a web app on Heroku and I'm trying to understand the difference/ trade-offs between adding a Memcached instance with 1GB of RAM and adding 1GB of RAM to my Postgres server.
If I added a Memcached instance, I would probably use Johnny Cache (for Django - http://packages.python.org/johnny-cache/).
Should I expect similar performance improvement from the two options? In general, what is the advantage of using memcache vs. increasing the size of the Postgres cache. (I understand that people often run memcache on the DB server so there must be one).
I appreciate this is probably a very naive question but I haven't been able to find anything to clear up my confusion through Google.

Postgres for best performance needs enough cache to keep most frequently used object (indexes, tables). So there is a tipping point in setting of shared_buffers. After that point,
increasing shared buffers does not help much.
It is good to leave some part of RAM for filesystem-level caching.
For much more see http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
As for memcache, it's totally different beast... It can be used directly from application to have ultra-fast non-persistent key-value storage.
All three traits make memcached differ from relational database (RDB).
ultra-fast (RDB are not)
non-persistent (RDB are)
key-value only (RDB much better)

Related

design mongodb to load entire content in memory

I am involved in a project where they get enough RAM to store the entire database in memory. According to the manager, that is what 10Gen recommended. This is counter intuitive. Is that really the way you want to use Mongodb?
It is not counter intuitive... I find it quite intuitive, actually.
In How much faster is the memory usually than the disk? you can read:
(...) memory is only about 6 times faster when you're doing sequential
access (350 Mvalues/sec for memory compared with 58 Mvalues/sec for
disk); but it's about 100,000 times faster when you're doing random
access.
So if you can fit all your data in RAM, it is quite good because you are going to be really fast reading your data.
Regarding MongoDB, from the FAQ's:
It’s certainly possible to run MongoDB on a machine with a small
amount of free RAM.
MongoDB automatically uses all free memory on the machine as its
cache. System resource monitors show that MongoDB uses a lot of
memory, but its usage is dynamic. If another process suddenly needs
half the server’s RAM, MongoDB will yield cached memory to the other
process.
Technically, the operating system’s virtual memory subsystem manages
MongoDB’s memory. This means that MongoDB will use as much free memory
as it can, swapping to disk as needed. Deployments with enough memory
to fit the application’s working data set in RAM will achieve the best
performance.
The problem is that you usually have much more data than memory available. And then you have to go to disk, and disk I/O is slow. Regarding database performance, avoiding full scan queries is key (much more important when accessing to disk). Therefore, if your data set does not fit in memory, you should aim at having indexes for the vast majority of your access patterns and try to fit those indexes in memory:
If you have created indexes for your queries and your working data set
fits in RAM, MongoDB serves all queries from memory.
It all depends on the size of your database. I am guessing that you said your database was actually quite small, otherwise I cannot see how someone at 10gen gave such advice, I mean not even #Stennie gives such advice (he is 10gen by the way).
Even if your database is small I don't see how the manager recommended that. MongoDB does not do memory management of its own as such it does not "pin" data into pages like memcached does or other memory based databases do.
This means that the paging of mongods data can be quite unpredicatable, a.k.a you will spend more time trying to keep things in RAM than paging in data. This is why it is better to just make sure your working set fits and it can loaded with speed, such things are based upon your hardware and queries.
#Stennies comment pretty much sums up the stance you should be taking with MongoDB.

understand MongoDB cache system

This is a basic question, but very important, and i am not sure to really get the point.
On the official documentation we can read
MongoDB keeps all of the most recently used data in RAM. If you have created indexes for your queries and your working data set fits in RAM, MongoDB serves all queries from memory.
The part i am not sure to understand is
If you have created indexes for your queries and your working data set fits in RAM
what does mean "indexes" here?
For example, if i update a model, then i query it, because i have updated it, it's now in RAM so it will come from the memory, but this is not very clear in my mind.
How can we be sure that datas we query will come from the memory or not? I understand that MongoDB uses the free memory to cache datas about the memory which is free on the moment, but does someone could explain further the global behavior ?
In which case could it be better to use a variable in our node server which store datas than trust the MongoDB cache system?
How do you globally advise to use MongoDB for huge traffic?
Note: This was written back in 2013 when MongoDB was still quite young, it didn't have the features it does today, while this answer still holds true for mmap, it does not for the other storage technologies MongoDB now implements, such as WiredTiger, or Percona.
A good place to start to understand exactly what is an index: http://docs.mongodb.org/manual/core/indexes/
After you have brushed up on that you will udersand why they are so good, however, skipping forward to some of the more intricate questions.
How can we be sure that datas we query will come from the memory or not?
One way is to look at the yields field on any query explain(). This will tell you how many times the reader yielded its lock because data was not in RAM.
Another more indepth way is to look on programs like mongostat and other such programs. These programs will tell you about what page faults (when data needs to be paged into RAM from disk) are happening on your mongod.
I understand that MongoDB uses the free memory to cache datas about the memory which is free on the moment, but does someone could explain further the global behavior ?
This is actually incorrect. It is easier to just say that MongoDB does this but in reality it does not. It is in fact the OS and its own paging algorithms, usually the LRU, that does this for MongoDB. MongoDB does cache index plans for a certain period of time though so that it doesn't have to constantly keep checking and testing for indexes.
In which case could it be better to use a variable in our node server which store datas than trust the MongoDB cache system?
Not sure how you expect that to work...I mean the two do quite different things and if you intend to read your data from MongoDB into your application on startup into that var then I definitely would not recommend it.
Besides OS algorithms for memory management are extremely mature and fast, so it is ok.
How do you globally advise to use MongoDB for huge traffic?
Hmm, this is such a huge question. Really I would recommend you Google a little in this subject but as the documentation states you need to ensure your working set fits into RAM for one.
Here is a good starting point: What does it mean to fit "working set" into RAM for MongoDB?
MongoDB attempts to keep entire collections in memory: it memory-maps each collection page. For everything to be in memory, both the data pages, and the indices that reference them, must be kept in memory.
If MongoDB returns a record, you can rest assured that it is now in memory (whether it was before your query or not).
MongoDB doesn't keep a "cache" of records in the same way that, say, a web browser does. When you commit a change, both the memory and the disk are updated.
Mongo is great when matched to the appropriate use cases. It is very high performance if you have sufficient server memory to cache everything, and declines rapidly past that point. Many, many high-volume websites use MongoDB: it's a good thing that memory is so cheap, now.

Using memcache infront of a mongodb server

I am trying to understand how mongo's internal cache works and if it does eliminate using memcache. Our database size is around 200G and index fits in the memory but after the index not much free memory left on the server.
One of my colleague says mongo's internal cache will be as fast as memcache so no need to introduce another level of complexity by using memcache.
The scenario in my head is when we read the data from db, it's saved in memcache and next time it's directly read from the cache instead of going back to db server. If the data is changed and needs to be saved/updated, it's done on both memcache server and database server.
I have been reading about this but couldn't convince myself yet. So I'd really appreciate if someone could shed some light on this.
First thing is that a cache storage is different to a database. So MongoDB and SQL are different in purpose and usage when compared to Memcache.
Memcache is really good at lowering working set sizes for queries. For example: imagine a huge aggregated query with subselects and CASE statements and what not in SQL (think of the most complex query you can), doing this query in realtime all the time could cause the computer(s) to "thrash" (not to mention the problems client side).
However as everyone knows you need only summarise this query to another collection/table for it to be instantly faster. The real speed of memcache comes from the fact that it is a in memory key value store. This is where MongoDB could fail in speed because it is not memory stored, it is memory mapped but not stored.
MongoDB does no self caching, providing the query is "hot" and in LRU (this is where your working set comes in) you shouldn't notice much of a difference in response times. A good way to ensure a query is "hot" is to run it. Some people have a script of their biggest queries that they run to warm up the cache.
As I said memcache is a cache layer this is why:
If the data is changed and needs to be saved/updated, it's done on both memcache server and database server.
Makes me die a little inside. Many do blur the line between the DB and the cache layer.

Which NoSQL database for extremely high volumes of data

I'm looking at NoSQL for extremely high volumes of data. We're storing cached versions of web page text in MySQL at the moment, but it seems like the database will get huge very quickly.
My requirements are:
Durability, must not lose data on flushes/writes
Very fast read, reasonably fast write
Fully consistent replication
Preferably, in-memory plus an eventual disk write
I'm looking at: MongoDB, Redis, Raik, and Cassandra right now.
Which best fits my requirements?
I have experience with Redis and MongoDB, but would not recommend either for your use case. Redis is awesome in every regard, but since it's RAM-only and has no clustering features (yet, they are in development), it doesn't scale very well. MongoDB I wouldn't ever use again for anything that needs anything but a small replica set.
Basically, MongoDB is immature and completely unsuitable for any kind of high volume, high performance requirements. It has a global write lock which is held during disk flushes, which means that performance can vary wildly depending on what you do. In practice it makes updates that grow documents impossible, and you need to be very careful with deletes, too. Speaking of deletes, they fragment the database severely, so if you do a lot of deletes your performance is going to suffer.
Sharding in 1.8.0 through 1.8.1 was a disaster. There were complete show stopper bugs that should never have made it into a stable release. Configuration wasn't flushed properly and it was very easy to get your database into a bad state so that chunks never moved off of the primary shard. 1.8.2 solves most of them and seems more stable, but I don't trust the sharding implementation one bit. Add to this that sharding is hard even when everything works, it's not always easy to select a natural shard key, and if you don't sharding will cause you much grief.
MongoDB is really easy to work with and the feature set is really nice. The documentation, the drivers and the community are all great. MongoDB works super as a replacement for MySQL, but don't use it for anything that needs to scale out.
We're currently looking at moving to Cassandra. I find the dynamo model (e.g. no master nodes; write and read anywhere; simply add nodes to grow the cluster) compelling and the features are more or less right for us. The data model is schema less just like MongoDB, although a little more limited (you can choose between one or two level hashes, basically). I'm sure the community is good once you get into it, but so far I find it hard to find good information on how to solve common problems, and the documentation is lacking. Most of the information you find on blogs is a year old, and a lot of things have happened since then (0.7 and 0.8 seem to be really significant updates both, but most things you find are about 0.6). The drivers are also not very mature or well documented, from what I've seen so far, and everyone seems to be squabbling about whether Thrift, Avro or CQL is what should be used (and that has changed from 0.6 to 0.7 to 0.8).
Riak is interesting, for the same reasons as Cassandra, but for us a pure key-value-store is not enough, we need to be able to update without first doing a read. With Riak this isn't possible since the values are just blobs. This sounds like it wouldn't be an issue for you though.
HBase is another contender. It seems like a pain to set up and run because of the many different pieces, ZooKeeper, HDFS, etc. But the data model is similar to Cassandra (columnar, i.e. one level hashes), which works well for us, but may not be important for you. It seems tried and true, but as with MongoDB you have to watch out for sharding issues, you must put some thought into your keys or you get into trouble.
There is also CouchDB, Project Voldemort and countless other possible choices. I think that if you are serious about "extremely high volumes of data" then it's between Cassandra, Riak and HBase. Strike Riak if pure key-value-storage isn't enough. Depending on what you mean by "fully consistent replication" then Cassandra and Riak are out, because there is a possibility (not necessarily big, and tunable) of reading a stale value.
In the end you obviously have to try it out on your particular use case, so all you really should take home from this answer is: don't bother with MongoDB.
Store the cached versions in MemCache instead of MySQL. It will eliminate most writes. Writing to MySQL is bad, because it kills the query cache. When you cache the pages in MemCache, you will have far less writes to the database, and you'll have less reading pressure too. You can cache the result of complex queries, or cache entire pages as you like.
Maybe it won't be as fast as Cassandra, but it will give you an enormous boost compared to your current situation with only MySQL. And you won't have to rewrite your entire application.
memcachedb - memcached protocol, BDB storage, replication etc
Handlersocket - MySql InnoDB plugin.
Oracle memcached InnoDB plugin
RavenDB can store up to 16TB of data per node, and you can have several nodes per machine acting as one database using its built-in sharding support. Thats as huge as it gets.
Durability, fastness, replication is all there, and running in memory is supported too (but not recommended if you want to scale to 16TB per node).
For extremely high volume data, it's clear that Cassandra and hadoop/hbase are far superior than all others for this task. Cassandra proved itself on large clusters like 400 nodes. rdms dbs cannot scale easily, also mongo has some problems when node counts start to increase http://www.nosqlbenchmarking.com/2011/05/paper-on-elasticity-and-scalability-for-acm-socc-2011/
Serdar

Is Memcache recommended when using MongoDB?

I would like to know if Memcache is recommended when using a NoSQL database like mongoDB.
The concept of using memcache stems from the idea that you have "extra RAM" sitting around somewhere. Both MongoDB and MySQL (and most DBs) will take every meg of RAM that they can get.
In the case of the very common MySQL / Memcache, it is very well documented that using Memcache is more about reducing query load on the server than it is about speeding up queries. A good memcache implementation basically just tries to keep the most common data in memory so that the database server can churn away on bigger stuff.
In fact, it's been my experience that use of memcache generally becomes a reliance on memcache to maintain system performance.
So back to the original question, where do you have extra RAM?
If you have extra RAM on web servers, you may be able to use Memcache. Of course, you could also run Mongo locally on the web server. Just slave the data you need from the master.
If you have extra RAM on other computers, then there's not really a point in using memcache. Just add more nodes to your MongoDB replica set or shard. This is where MongoDB actually shines. Because of sharding / replication, you can add more RAM to Mongo Horizontally to increase performance. With SQL it's very difficult to "just add more servers" because joins don't scale very well. But with Mongo, it's quite possible to simply "add more nodes" to a problem.
MongoDB stores everything in memory anyway and works in a similar vein, being a key-value based system, however I believe MongoDB is more flexible, as it allows for storing BSON objects within themselves.
(Just for clarification, MongoDB uses BSON, a specialised form of JSON, for storing all its data, which includes objects within objects.)
At first no. If you run into performance problems later add a caching layer (memcache). But you won't gain anything if you're going to use Redis for example, as Redis already stores everything in memory.
The answer would depend on your use cases.
In general, accessing RAM is orders of magnitude faster than accessing disk.
Even the fastest SSD drives are about 100 times slower to access than RAM.
Now, I don't know if Mongo has a caching system in place (most likely it does), or what the eviction policy is, but as a programmer i would prefer a cache where i can store/retrieve and delete items at will. Therefore i would prefer using a caching solution even with Mongo.
In summary, it really depends what you are using these solutions for. There is no one answer to cover all possible uses.