Memcached, Locking and Race Conditions

We are trying to update memcached objects when we write to the database, to avoid having to read them from the database after inserts/updates.
For our forum post object we have a ViewCount field containing the number of times a post is viewed.
We are afraid that we are introducing a race condition by updating the memcached object, as the same post could be viewed at the same time on another server in the farm.
Any idea how to deal with this kind of issue? It would seem that some sort of locking is needed, but how do you do it reliably across servers in a farm?

If you're dealing with data that doesn't necessarily need to be updated in real time - and to me the view count is one of those - then you could add an expires field to the objects that are stored in memcache.
Once that expiration happens, it'll go back to the database and read the new value, but until then it will leave it alone.
Of course for new posts you may want this updated more often, but you can code for this.
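For illustration, a minimal sketch of that approach with the PHP Memcached client, assuming a connected instance $m and a hypothetical loadPostFromDb() helper:

$post = $m->get('post:' . $postId);
if ($post === false) {
    // Expired or never cached: re-read from the database,
    // picking up the current ViewCount along the way.
    $post = loadPostFromDb($postId);
    $m->set('post:' . $postId, $post, 60); // refresh at most once a minute
}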
Memcache only stores one copy of your object in one of its instances, not in many of them, so I wouldn't worry about object locking or anything. That is for the database to handle, not your cache.
Edit:
Memcache offers no guarantee that when you're getting and setting from varied servers that your data won't get clobbered.
From memcache docs:
A series of commands is not atomic. If you issue a 'get' against an item, operate on the data, then wish to 'set' it back into memcached, you are not guaranteed to be the only process working on that value. In parallel, you could end up overwriting a value set by something else.
Race conditions and stale data
One thing to keep in mind as you design your application to cache data, is how to deal with race conditions and occasional stale data.
Say you cache the latest five comments for display on a sidebar in your application. You decide that the data only needs to be refreshed once per minute. However, you neglect to remember that this sidebar display is rendered 50 times per second! Thus, once 60 seconds rolls around and the cache expires, suddenly 10+ processes are running the same SQL query to repopulate that cache. Every time the cache expires, a sudden burst of SQL traffic will result.
Worse yet, you have multiple processes updating the same data, and the wrong one ends up updating the cache. Then you have stale, outdated data floating about.
Be mindful of possible issues in populating or repopulating your cache. Remember that the process of checking memcached, fetching SQL, and storing into memcached, is not atomic at all!

I'm thinking - could a solution be to store the view count separately from the Post object, and then do an INCR on it? Of course this would require reading 2 separate values from memcached when displaying the information.

Memcached operations are atomic. The server process will queue the requests and serve each one completely before going on to the next, so there's no need for locking.
Edit: memcached has an increment command, which is atomic. You just have to store the counter as a separate value in the cache.
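A minimal sketch with the PHP Memcached client (server address and key name are illustrative):

$m = new Memcached();
$m->addServer('127.0.0.1', 11211);

// Seed the counter once; add() does nothing if the key already exists.
$m->add('post:123:views', 0);

// The increment happens on the server, so concurrent web servers
// in the farm won't clobber each other's updates.
$views = $m->increment('post:123:views');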

We encountered this in our system. We modified get() so that:
If the value is unset, it sets it with a flag ('g') and an 8-second TTL, and returns false so the calling function generates it.
If the value is not flagged (!== 'g') then unserialize and return it.
If the value is flagged (==='g') then wait 1 second and try again until it's not flagged. It will eventually be set by the other process, or expired by the TTL.
Our database load dropped by a factor of 100 when we implemented this.
function get($key) {
    global $m; // the shared Memcached instance

    $value = $m->get($key);
    if ($value === false) {
        // Cache miss: set the 'g' flag so other processes wait,
        // and return false so the caller regenerates the value.
        $m->set($key, 'g', 8); // 8-second TTL
    } else {
        while ($value === 'g') {
            sleep(1); // another process is generating the value; poll
            $value = $m->get($key);
        }
    }
    return $value;
}
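A hedged usage sketch, assuming the same Memcached instance $m and a hypothetical computeValue() that regenerates the data (e.g. runs the SQL query):

$value = get('expensive:key');
if ($value === false) {
    // We won the 'g' flag: rebuild the value and cache it for others.
    $value = computeValue();
    $m->set('expensive:key', $value, 300);
}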

Related

solution to sync cache value back to database?

Below is the scenario:
We have an access-statistics system, much like Blogger's overview stats feature.
Statistics are stored persistently in a database (e.g. MySQL), while a key-value cache (currently memcache) holds the access counts; each access only updates the value in the cache.
Now the question is: how do we sync the latest count value back to the database?
A normal solution is to write back after some interval, but memcache will discard items when it runs out of space, so some updates may be lost.
So I think a better solution would be for memcache to send a message (like JMS) when discarding an item, and then I could sync that item to the database.
It seems that memcache does not provide this feature. Is there another key-value cache that can do this?
Or are there any better solutions?
Memcached is a cache, so you need to use it as one. When you update the access counts in memcached, you should also enqueue the updates so they can be written asynchronously to the database. That way, counts that fall out of the cache can be reloaded from the database.
I like the idea of memcached enqueuing items that are about to be discarded, but it's probably not going to happen in the main project due to performance considerations.
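For illustration, here is a minimal sketch of that write-behind idea, assuming a php-memcached client, a PDO connection, and a hypothetical pending_view_counts table that a background job periodically sums and flushes into the real statistics table:

function recordView(Memcached $m, PDO $db, int $postId): void {
    // Atomic increment in the cache; seed the key on first sight.
    if ($m->increment("views:$postId") === false) {
        $m->add("views:$postId", 1); // a lost race here would need a retry
    }
    // Queue the update so the database catches up asynchronously.
    $stmt = $db->prepare(
        'INSERT INTO pending_view_counts (post_id, delta) VALUES (?, 1)');
    $stmt->execute([$postId]);
}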

CacheManager GetData performance issue

We are using Enterprise Library 5 and the CacheManager that it provides in our web application. Everything seems to be working fine up to the point where we start a heavy load test on the application.
We are caching records from the database using a key based on their ID. We don't always request a single item from the cache; sometimes we need to get a list of items. For this we have a LINQ query that does a Select(e => CacheManager.GetData(id_from_list)) and returns the list of items from the cache.

Most of the time this works fine, but under heavy load the GetData method becomes a bottleneck due to the locking that the cache manager performs on both read and write operations. Basically, only one thread can read data from the cache at a time. We did create several cache managers based on the type of the items - this allows several threads to get data from different cache managers - but under heavy load the issue remains (one bottleneck per cache manager). It improved the application up to a point, but not enough.
Did someone else encountered the same problem and did you find a way to overcome this?
NOTE: We tried caching lists of items, composing the key from the IDs of the items in the list. This actually solved the problem, and CacheManager.GetData is no longer a bottleneck ... BUT ... obviously this is not a good solution, as we could end up caching each item thousands of times across many lists.
You may consider adapting the CacheManager to use a read/write lock (which I think is much more suitable for this situation) instead of the exclusive locking that it uses now.
http://msdn.microsoft.com/en-us/library/system.threading.readerwriterlock.aspx
Basically, a read/write lock is appropriate when multiple reader threads need simultaneous access to the data, and only the occurrence of a write will cause incoming readers to block.
These have other problems when put under load, however, such as writer starvation. Depending on the read/write lock implementation, a write may always wait for all reads to finish first; with a constant stream of reads, a write may never get a chance to happen.

is memcached just instantiating another virtual operating system?

I have read a few tutorials on memcached, and I have a few questions about using it to ease the load of requests on the default database.
What is being instantiated to allow memcached to operate?
Is it virtual operating systems with, say, MySQL installed, or is the database stored in its entirety in RAM?
My other question: say I have a blog that uses memcache, and a user's browser requests data; the request first checks memcache for the data, sees that the data exists, and displays it to that user.
What if the requested data doesn't match what is in the original database because I updated it myself? How will the cache know that I changed it?
Is it always checking to see whether the data in the DB is the same as what is cached?
From the memcached front-page:
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
Although memcached is frequently used with MySQL, it has no particular ties to MySQL or any other database. It is just a simple key-value store providing constant time (O(1)) access to data cached by key. The data is stored in memory by the memcached process. (Much of this is explained on the FAQ).
Regarding your second question, it is really your application's responsibility to ensure that memcached is notified of any changes. You can do this via reasonable expiration periods on your cached data, or by using a script or the command-line interface to manually purge stale entries. Some frameworks will handle notifying memcached of changes, provided the change is made through the framework. Ultimately, if you need to ensure that users always have access to the latest data in real time, then caching is not a good solution for your problem. Caching works on the principle that it's OK to occasionally serve up stale data - you should construct your application so that it caches data that can be stale, but always uses look-ups to authoritative sources for data that must be fresh.
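A minimal sketch of the notify-on-change approach, assuming a php-memcached client $m, a PDO connection $db, and illustrative key/table names:

// Update the authoritative row, then invalidate the cached copy so the
// next read repopulates the cache from the fresh data.
$db->prepare('UPDATE posts SET title = ? WHERE id = ?')
   ->execute([$title, $postId]);
$m->delete('post:' . $postId);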
1
You start a memcached server on every machine you need, assigning an amount of memory to dedicate to memcached.
Then, through the memcached client library, you use the memory of all of those servers as a single pool.
N.B. There is no way to know which server a given object will be stored on.
2
The mechanism for outdated data is simple: you can set a timeout on the object, and when the timeout elapses memcached will delete it.
To store an object, you assign it a key, typically a hash, because you don't want two objects to end up with the same key.
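A minimal sketch of both points with the PHP Memcached client (server addresses and key names are illustrative):

$m = new Memcached();
// The client hashes each key to pick one of the pooled servers.
$m->addServer('10.0.0.1', 11211);
$m->addServer('10.0.0.2', 11211);

$m->set('post:123', $postData, 60); // expires 60 seconds from now
$post = $m->get('post:123');        // false after expiry: reload from the DB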

What's the advantage of deleting a record over updating a record in Memcached?

If the value of a key already is cached and then gets updated in database, we should either invalidate the key in the cache, or update the record in the cache.
Comparing delete with update: Update will have the advantage of saving a potential DB hit in the future.
Then what would be the major advantage of delete over update?
By deleting the item, you aren't forcing a potentially unnecessary load. Say for example that the record gets updated several times in a row before being read from the cache. You will have updated the cache several times for no reason; If you removed the item from the cache after the first update, all of the later updates would execute (not requiring the cache to be populated each time), and then only when the item is actually needed is it loaded from the database and put into the cache.
The major advantage of delete over update is that it is simpler. To update you may need to do the following:
Handle failures of the cas command (probably by falling back to delete)
Keep a copy of the original cas unique value for use in the cas command
Be able to create a complete new value to put in the cache
These requirements restrict the way you can structure your code in significant ways. If your consistency requirements are especially weak, you could just slam values into the cache with the set command, which would be similarly simple to implement.
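For illustration, a hedged sketch of the update-with-CAS path, falling back to delete on conflict; this assumes the php-memcached 3.x API, where Memcached::GET_EXTENDED returns the CAS token, and the function name is illustrative:

function updateOrInvalidate(Memcached $m, string $key, $newValue): void {
    $res = $m->get($key, null, Memcached::GET_EXTENDED);
    if ($res === false) {
        return; // not cached, so nothing to update
    }
    // Replace the cached copy only if nobody changed it in the meantime.
    if (!$m->cas($res['cas'], $key, $newValue)) {
        $m->delete($key); // CAS failed: fall back to invalidation
    }
}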

How should I implement "get objects changed since" pattern with MongoDB?

I have a collection of objects, let's say they are "posts," and those objects can be modified. I'd like to display a list on the client side that updates dynamically. So on the client side, if doing this via polling, the client would invoke an API like:
getPostsChangedSince(serial)
where serial could be a monotonically increasing number, probably a timestamp. The client gets back a list of posts that have changed since that time, stores a new latest-serial, and next time the client polls it requests changes since that latest serial.
I think the basic idea is the same in this question (which is about ASP.NET): How to implement "get latests changed items" with ADO.NET Data Services?
I'm trying to find the best way to implement this in MongoDB.
I like the idea of using the time for the serial, since it automatically works at least mostly correctly even if there are multiple app servers. The serial would be stored in each post object, and updated whenever the object is modified.
The timestamp-based serial could be implemented as:
a Date (I think this is stored as a 64-bit count of milliseconds since the epoch?)
a Timestamp http://www.mongodb.org/display/DOCS/Timestamp+Data+Type
something "by hand" e.g. store milliseconds as a number
Some nice features to have in a solution would include:
ensure that creating and then immediately updating an object within the OS timer resolution still increments the serial, even though the clock reads the same time
even better would be a guaranteed monotonic increase globally for all objects, not just a guarantee that changing a given object will bump that object's serial (absent this, getPostsChangedSince() calls probably need to fuzz backward in time to avoid missing changes - at the price of seeing some changes twice)
mongodb-side timestamps might be nice because getting the time in the app creates a gap between when you get the time, and when the new object is saved and available in queries
update using findAndModify() with a query including the old serial, so "conflicts" (two changes at once) will throw an error allowing the app to retry
I realize some of the corner cases here are a little bit "academic" and can likely be fudged around in real life.
My approach so far is:
use the Date type for the serial
when modifying an object, get the current time, and if it matches the object's old serial, add 1 millisecond (yes this breaks if you make two modifications quickly without re-fetching from mongodb, but that seems OK)
use findAndModify(), but based on https://jira.mongodb.org/browse/JAVA-276 there may not be a way to detect when it finds nothing to modify (i.e. the second change is silently ignored in case of a conflict)
Questions:
I feel like I should use Timestamp instead; true? Any downsides?
if you had a mongo cluster, might time in milliseconds be more unique and correct than Timestamp's seconds-plus-counter, while with a single mongod Timestamp is more unique?
is there a way to detect whether findAndModify() updated anything?
any general advice / experiences with this problem? how would you do it?
Have you considered "externalizing" the serial-number generator? Time at MongoDB precision is good, but it can become difficult to synchronize across multiple machines. One choice is to use memcached or something similar that is memory-based, extremely fast, and can serialize access (memcached has a CAS operation).
So what you would do is store a "seed" in memcached under a key, say counter.
Every time an app needs to do an insert, it gets the next number from memcached and increments the counter.
On second thought, you can even do away with memcached and just use a single-row (sorry, single-document) collection that holds only the counter. Getting the counter and incrementing it will be an extremely fast operation, mimicking memcached.
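For illustration, a minimal sketch of that single-document counter using the mongodb/mongodb PHP library (database, collection, and field names are assumptions):

$counters = (new MongoDB\Client('mongodb://localhost:27017'))->blog->counters;

// Atomically increment the counter and read back the new value.
$doc = $counters->findOneAndUpdate(
    ['_id' => 'postSerial'],
    ['$inc' => ['seq' => 1]],
    ['upsert' => true,
     'returnDocument' => MongoDB\Operation\FindOneAndUpdate::RETURN_DOCUMENT_AFTER]
);
$serial = $doc->seq; // monotonic across all app servers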
And then, naturally, you can index the data appropriately. However, I suspect this would result in a very imbalanced (right-side-heavy) index. Depending on the situation, it might be worthwhile to explore a capped collection: when you insert data into your main collection, also insert it into the capped collection, and read data from that collection.
You could continue to use your regular collection, as you do now, and after each update additionally insert the ID of the post into a special TTL collection. See http://docs.mongodb.org/manual/tutorial/expire-data/ for more info on using such a collection. Mongo will take care of all timing issues, you don't need to worry about serial numbers, and you can very quickly access time based lists of objects by their IDs.
Caveat:
use the blocking form of findAndModify, to ensure the changes have really been processed:
Blocking/Safe Writes
Unless you specify the "new" parameter as true, the write operation will not block and will not return an error (if there is one). If you do want the "new" document returned, then the operation will wait until the write is done in order to return the new document, or an error.
For a "safe" (blocking) write operation you must call getLastError (if not using "new").