Memcached-like key-value store with no automatic data removal - memcached

In my Python app I need to share a key/value store among a few processes - one is updating the data, other processes are only retrieving it at random times. Persistence is not required.
My initial idea was to use memcached, but it seems to have some LRU mechanism to remove old data when it's short on RAM. I'd much prefer to get an error in such case.
Obviously, memcached was optimized to be a cache system, while what I need is simply a network-accessible hash table. I could implement something simple from scratch, but why reinvent the wheel?

Run memcached with the -M option.
-M return error on memory exhausted (rather than removing items)

Related

Should I have two seperate mongoDB instances?

My server queries the db often.
But more often than not, the query retrieves unchanged data.
Therefore I would like to create and store a cached result.
My main mongoDB is stored in a remote address, and therefore takes slightly longer to respond as compared to a local mongoDB instance. I thought it would be beneficial to have therefore an additional, smaller, more static mongoDB running on localhost.
Such that, real-time queries will run on the remote main DB, and smaller, time efficient queries will run on the cached collections in localhost for optimizing speed.
Is this something that can be done?
Is it something people recommend to avoid?
How would I set two connections, one to my main remote server and one
to my local server?
This seems wrong to me
var mongooseMain = require ('mongoose');
var mongooseLocal = require ('mongoose');
mongooseMain.connect(mainDBInfo.url);
mongooseLocal.connect(localDBInfo.url);
In principal, you have the right idea! Caching is a big part of building performant web applications.
First of all, MongoDB wants to cache everything it's using in memory and has a very well designed system of deciding what to keep in memory and what to toss out of it's cache. When an object is asked for that is not in it's cache, it has to read it from disk. When MongoDB reads from disk instead of memory it's called a page fault.
Of course, this memory cache is on a remote server so you still have network latency to deal with.
To eliminate this latency, I would recommend saving the serialized objects you read from often, but rarely write to, in Redis. This is what Redis was built to do. It's basically a dictionary (key:value) which you can easily SET and GET from. You can run redis-server easily on your local machine and even use SETEX to set your objects to the dictionary with some unique key and an expiry for when it should be evicted from the cache.
You can also manually evict objects from the cache whenever they do get updated (I would recommend re-writing them to the cache at this moment). Then, whenever you need an object, just make sure you always try to read from your cache first and fall back to MongoDB if the cache returns null for a key.
Check it out and good luck with your application!

solution to synch back cache value to database?

below is the scenario:
here is a access statistic system, just like Blogger's overviewstats function.
Statistic data is stored persistent in database(like MySQL), while using a key-value cache(now is memcache) to cache the access counts, each access only update the value in cache.
Now the question is how to synch back the latest count value to database?
A normal solution is to write back after some interval, but memcache will discard items when there is no enough spaces, some updates may lost.
so I think a better solution is if memcache can send a message(like JMS) when discarding an item, and then i can synch that item to database.
It seems that memcache does not provide this function, is there any other key-value cache can do this?
Or is there any better solutions?
Memcached is a cache, so you need to use it as one. When you update the access counts in memcached, you should also enqueue the updates so they can be written asynchronously to the database. That way, counts that fall out of the cache can be reloaded from the database.
I like the idea of memcached enqueuing items that are about to be discarded, but it's probably not going to happen in the main project due to performance considerations.

memcached like software with disk persistence

I have an application that runs on Ubuntu Linux 12.04 which needs to store and retrieve a large number of large serialized objects. Currently the store is implemented by simply saving the serialized streams as files, where the filenames equal the md5 hash of the serialized object. However I would like to speed things up replacing the file-store by one that does in-memory caching of objects that are recently read/written, and preferably does the hashing for me.
The design of my application should not get any more complicated. Hence preferably would be a storing back-end that manages a key-value database and caching in an abstracted and efficient way. I am a bit lost with all of the key/value stores that are out there, and much of the topics/information seems to be outdated. I was initially looking at something like memcached+membase, but maybe there are better solutions out there. I looked into redis, mongodb, couchdb, but it is not quite clear to me if they fit my needs.
My most important requirements:
Transparent saving to a persistent store in a way that the most recently written/read objects are quickly available by automatically caching them in memory.
Store should survive a reboot. Hence in memory objects should be saved on disk asap.
Currently I am calculating the md5 manually. It would actually be nicer if the back-end does this for me. Hence the ability to get the hash-key when an object is stored, and be able to retrieve the object later using the hashkey.
Big plus is that if there are packages available for Ubuntu 12.04, either in universe or through launchpad or whatever.
Other than this, the software should preferably be light not be more complicated than necessary (I don't need distributed map-reduce jobs, etc)
Thanks for any advice!
I would normally suggest Redis because it will be fast and in-memory with asynch persistant store. Plus you'll find you can use their different data types for other purposes so not as single-purpose as memcached. As far as auto-hashing, I don't think it does that as you define your own keys when you store objects (as in most of them).
One downside to Redis is if you're storing a TON of binary objects, you'll be limited to available memory in RAM (unless sharding) so could reach performance limitations. In that case you may store objects on file system, hash them, and store keys in Redis and match that to filename stored on file server and you'd be fine.
--
An alternate option would be to check out ElasticSearch which is like Mongo in that it stores objects native as JSON, but it includes the Lucene search engine on top with RESTful API interface. It "warms up" data in memory for fast response, but is also a persistent store and the nicest part is it auto-shards and auto-clusters using multicast to find other nodes.
--
Hope that helps and if so, share the love! ;-)
I'd look at MongoDB. It caches things efficiently using your OS to page data in and out, and is pretty simple to setup. Redis and Memcached won't be good solutions for you because they keep everything in RAM. Other, simpler solutions like LevelDB or BDB would also probably be suitable. I don't think any database going to compute hashes automatically for you. It sounds like you already have code for this though.

why memcached instead of hashmap

I am trying to understand what would be the need to go with a solution like memcached. It may seem like a silly question - but what does it bring to the table if all I need is to cache objects? Won't a simple hashmap do ?
Quoting from the memcache web site, memcache is…
Free & open source, high-performance,
distributed memory object caching
system, generic in nature, but
intended for use in speeding up
dynamic web applications by
alleviating database load.
Memcached is an in-memory key-value
store for small chunks of arbitrary
data (strings, objects) from results
of database calls, API calls, or page
rendering. Memcached is simple yet
powerful. Its simple design promotes
quick deployment, ease of development,
and solves many problems facing large
data caches. Its API is available for
most popular languages.
At heart it is a simple Key/Value
store
A key word here is distributed. In general, quoting from the memcache site again,
Memcached servers are generally
unaware of each other. There is no
crosstalk, no syncronization, no
broadcasting. The lack of
interconnections means adding more
servers will usually add more capacity
as you expect. There might be
exceptions to this rule, but they are
exceptions and carefully regarded.
I would highly recommend reading the detailed description of memcache.
Where are you going to put this hashmap? That's what it's doing for you. Any structure you implement on PHP is only there until the request ends. If you throw stuff in a persistent cache, you can fetch it back out for other requests, instead of rebuilding the data.
I know that this question is rather old, but in addition to being able to share a cache across multiple servers, there is also another aspect that is not mentioned in other answers and is the values expiration.
If you store the values in a HashMap, and that HashMap is bound to the Application context, it will keep growing in size, unless you expire items in some ways. Memcached expires object lazily for maximum performance.
When an item is added to the memcache, it can have an expiration time, for instance 600 seconds. After the object is expired it will just remain there, but if another object asks for it, it will purge it and return null.
Similarly, when memcached memory is full, it will look for the first expired item of adequate size and expire it to make room for the new item. Lastly, it can also happen that the cache is full and there isn't any item to expire, in which case it will replace the least used items.
Using a fully flagded cache system usually allow you to replicate the cache on many servers, or just scale to many server just to scale a lot of parallel requestes, all this remaining acceptable fast in term of reply.
There is an (old) article that compares different caching systems used by php:
https://www.percona.com/blog/2006/08/09/cache-performance-comparison/
Basically, file caching is faster than memcached.
So to answer the question, I believe you would have better performances using a file based cache system.
Here are the results from the tests of the article:
Cache Type Cache Gets/sec
Array Cache 365000
APC Cache 98000
File Cache 27000
Memcached Cache (TCP/IP) 12200
MySQL Query Cache (TCP/IP) 9900
MySQL Query Cache (Unix Socket) 13500
Selecting from table (TCP/IP) 5100
Selecting from table (Unix Socket) 7400

How does memcache store data?

I am a newbie to caching and have no idea how data is stored in caching. I have tried to read a few examples online, but everybody is providing code snippets of storing and getting data, rather than explaining how data is cached using memcache. I have read that it stores data in key, value pairs , but I am unable to understand where are those key-value pairs stored?
Also could someone explain why is data going into cache is hashed or encrypted? I am a little confused between serialising data and hashing data.
A couple of quotes from the Memcache page on Wikipedia:
Memcached's APIs provide a giant hash
table distributed across multiple
machines. When the table is full,
subsequent inserts cause older data to
be purged in least recently used (LRU)
order.
And
The servers keep the values in RAM; if
a server runs out of RAM, it discards
the oldest values. Therefore, clients
must treat Memcached as a transitory
cache; they cannot assume that data
stored in Memcached is still there
when they need it.
The rest of the page on Wikipedia is pretty informative, and it might help you get started.
They are stored in memory on the server, that way if you use the same key/value often and you know they won't change for a while you can store them in memory for faster access.
I'm not deeply familiar with memcached, so take what I have to say with a grain of salt :-)
Memcached is a separate process or set of processes that store a key-value store in-memory so they can be easily accessed later. In a sense, they provide another global scope that can be shared by different aspects of your program, enabling a value to be calculated once, and used in many distinct and separate areas of your program. In another sense, they provide a fast, forgetful database that can be used to store transient data. The data is not stored permanently, but in general it will be stored beyond the life of a particular request (it is possible for Memcached to never store your data, so every read will be a miss, but that's generally an indication that you do not have it set up correctly for your use case).
The data going into cache does not have to be hashed or encrypted (but both things can happen to the data, depending on the caching mechanism.)
Serializing data actually has nothing to do with either concept -- instead, it is the process of changing data from one format (generally one suited for in-memory storage) to another one (generally suitable for storage in a persistent medium.) Another term for this process is marshalling and unmarshalling.