What's the advantage of deleting a record over updating a record in Memcached?

If the value of a key is already cached and the underlying record is then updated in the database, we should either invalidate the key in the cache or update the cached record.
Comparing delete with update: update has the advantage of saving a potential DB hit in the future.
So what would be the major advantage of delete over update?

By deleting the item, you aren't forcing a potentially unnecessary load. Say, for example, that the record gets updated several times in a row before being read from the cache. You would have updated the cache several times for no reason. If you instead removed the item after the first update, the later updates would proceed without repopulating the cache each time, and only when the item is actually needed is it loaded from the database and put into the cache.

The major advantage of delete over update is that it is simpler. To update you may need to do the following:
Handle failures of the cas command (probably by falling back to delete)
Keep a copy of the original cas unique value for use in the cas command
Be able to create a complete new value to put in the cache
These requirements restrict the way you can structure your code in significant ways. If your consistency requirements are especially weak, you could just slam values into the cache with the set command, which would be similarly simple to implement.
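To make the comparison concrete, here is a minimal sketch of the two strategies, assuming a Python pymemcache client and a hypothetical render_value() helper that rebuilds the full cached value:

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def on_db_update_delete(key):
    # Simple and always safe: the next reader repopulates the cache.
    cache.delete(key)

def on_db_update_cas(key):
    # Update in place: needs the cas token, a complete new value, and
    # a fallback for when another writer gets there first.
    value, cas = cache.gets(key)
    if value is None:
        return                               # nothing cached, nothing to do
    new_value = render_value(key)            # hypothetical: rebuild the value
    if not cache.cas(key, new_value, cas):   # lost the race with another writer
        cache.delete(key)                    # fall back to delete

The delete path is one line; the cas path carries all three of the requirements listed above.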

Related

Stable pagination using Postgres

I want to implement stable pagination using Postgres database as a backend. By stable, I mean if I re-read a page using some pagination token, the results should be identical.
Using insertion timestamps will not work, because clock synchronization errors can make pagination unstable.
I was considering using pg_export_snapshot() as a pagination token. That way, I can reuse it on every read, and the database would guarantee me the same results since I am always using the same snapshot. But the documentation says that
"The snapshot is available for import only until the end of the transaction that exported it."
(https://www.postgresql.org/docs/9.4/functions-admin.html)
Is there any workaround for this? Is there an alternate way to export the snapshot even after the transaction is closed?
You wouldn't need to export snapshots; all you need is a REPEATABLE READ READ ONLY transaction so that the same snapshot is used for the whole transaction. But, as you say, that is a bad idea, because long transactions are quite problematic.
Using insert timestamps I see no real problem for insert-only tables, but rows that get deleted or updated will certainly vanish or move unless you use “soft delete and update” and leave the old values in the table (which gives you the problem of how to get rid of the values eventually). That would be re-implementing PostgreSQL's multiversioning on the application level and doesn't look very appealing.
Perhaps you could use a scrollable WITH HOLD cursor. Then the database server will materialize the result set when the selecting transaction is committed, and you can fetch forward and backward at your leisure. Sure, that will hog server resources, but you will have to pay somewhere. Just don't forget to close the cursor when you are done.
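As a rough illustration of the cursor approach, a sketch assuming psycopg2 and an illustrative items table:

import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# A scrollable WITH HOLD cursor: on COMMIT the server materializes the
# result set, so it survives the end of the transaction.
cur.execute("DECLARE page_cur SCROLL CURSOR WITH HOLD FOR "
            "SELECT id, title FROM items ORDER BY id")
conn.commit()

cur.execute("FETCH FORWARD 20 FROM page_cur")   # first page
first_page = cur.fetchall()

cur.execute("MOVE ABSOLUTE 0 IN page_cur")      # rewind
cur.execute("FETCH FORWARD 20 FROM page_cur")   # re-read the same page
assert cur.fetchall() == first_page             # stable: the set is materialized

cur.execute("CLOSE page_cur")                   # release server resources
conn.commit()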
If you prefer to conserve server resources, the obvious alternative would be to fetch the whole result set to the client and implement pagination on the client side alone.

Solution to sync cache values back to the database?

Below is the scenario:
We have an access-statistics system, much like Blogger's overview-stats function.
Statistics are stored persistently in a database (like MySQL), while a key-value cache (currently memcache) holds the access counts; each access only updates the value in the cache.
Now the question is: how do we sync the latest count values back to the database?
A normal solution is to write back at some interval, but memcache will discard items when there is not enough space, so some updates may be lost.
So I think a better solution would be for memcache to send a message (like JMS) when discarding an item, so that I could sync that item to the database.
It seems that memcache does not provide this function. Is there any other key-value cache that can do this?
Or is there a better solution?
Memcached is a cache, so you need to use it as one. When you update the access counts in memcached, you should also enqueue the updates so they can be written asynchronously to the database. That way, counts that fall out of the cache can be reloaded from the database.
I like the idea of memcached enqueuing items that are about to be discarded, but it's probably not going to happen in the main project due to performance considerations.
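One hedged sketch of that pattern, assuming pymemcache plus hypothetical load_count_from_db() and persist_count() helpers on the MySQL side:

import queue
import threading
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
writeback = queue.Queue()

def record_view(post_id):
    key = "views:%d" % post_id
    if cache.incr(key, 1) is None:      # not cached: reseed from the database
        cache.set(key, str(load_count_from_db(post_id) + 1))
    writeback.put(post_id)              # queue the asynchronous write-behind

def flusher():
    # Drains the queue and persists counts, so a value evicted from the
    # cache can always be reloaded from the database.
    while True:
        post_id = writeback.get()
        count = cache.get("views:%d" % post_id)
        if count is not None:
            persist_count(post_id, int(count))

threading.Thread(target=flusher, daemon=True).start()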

Update or Delete: which is faster?

I am using MongoDB for an application. This application requires a high frequency of reads, writes, and updates.
I am just concerned about the update and delete functions: which of the two is faster? I am indexing the collection on one attribute. Update and delete both fulfill my purpose, but I am not sure which one is preferable and has better performance.
I would suggest that rather than deciding on whether you use Update or Delete for your solution, you look more on the SafeMode attribute.
SafeMode.True indicates that you are expecting a response from the server that will contain, among other things, a confirmation of whether the command succeeded or failed. This option blocks execution until you receive a response from the server.
SafeMode.False will not expect any response; it is basically an optimistic command. You expect it to work but have no way to confirm it. Since there is no response to wait for, execution is not blocked, so you gain performance because all you need to do is send the request.
Now you need to consider that deletes will free up space on the server, but you will lose history and traceability of the data. Updates will allow you to keep historic entries, but you will need to make sure your queries exclude the 'marked for deletion' entries.
It is obviously up to you to decide whether a delete or an update is better, but I think the focus should be on whether you use SafeMode true or false to improve performance.
A rather odd question, but here are the things you can base your decision on:
Deleting will keep the collection at an optimum size. Updating (I assume you mean something like setting a deleted flag to true) will result in an ever-growing collection, which will eventually make things slower.
In-place updates (updates that do not result in the document having to be moved due to an increase in size) are always faster than updates or deletes that require documents to be (re)moved.
Safe = false writes will significantly improve the throughput of updates and deletes, at the expense of not being able to check whether the update/remove was successful.
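SafeMode comes from the older C# driver; in today's drivers the same knob is called the write concern. A hedged pymongo sketch (database, collection, and field names are illustrative):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

coll = MongoClient()["forum"]["posts"]

# Acknowledged write (SafeMode.True analogue): blocks until the server
# confirms success or failure.
coll.delete_one({"_id": 42})

# Unacknowledged write (SafeMode.False analogue): fire-and-forget,
# higher throughput, no confirmation.
fast = coll.with_options(write_concern=WriteConcern(w=0))
fast.update_one({"_id": 43}, {"$set": {"deleted": True}})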

Memcached, Locking and Race Conditions

We are trying to update memcached objects when we write to the database to avoid having to read them from database after inserts/updates.
For our forum post object we have a ViewCount field containing the number of times a post is viewed.
We are afraid that we are introducing a race condition by updating the memcached object, as the same post could be viewed at the same time on another server in the farm.
Any idea how to deal with these kinds of issues? It would seem that some sort of locking is needed, but how do you do it reliably across servers in a farm?
If you're dealing with data that doesn't necessarily need to be updated in real time, and to me the view count is one of them, then you could add an expires field to the objects that are stored in memcache.
Once that expiration happens, it'll go back to the database and read the new value, but until then it will leave it alone.
Of course for new posts you may want this updated more often, but you can code for this.
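A small sketch of that expiry-based approach, assuming pymemcache and a hypothetical load_count_from_db() helper:

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def get_view_count(post_id):
    key = "views:%d" % post_id
    count = cache.get(key)
    if count is None:
        # Expired or never cached: go back to the database, then cache
        # the fresh value for another minute.
        count = load_count_from_db(post_id)
        cache.set(key, str(count), expire=60)
    return int(count)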
Memcache only stores one copy of your object in one of its instances, not in many of them, so I wouldn't worry about object locking or anything. That is for the database to handle, not your cache.
Edit:
Memcache offers no guarantee that when you're getting and setting from varied servers that your data won't get clobbered.
From memcache docs:
A series of commands is not atomic. If you issue a 'get' against an item, operate on the data, then wish to 'set' it back into memcached, you are not guaranteed to be the only process working on that value. In parallel, you could end up overwriting a value set by something else.
Race conditions and stale data
One thing to keep in mind as you design your application to cache data, is how to deal with race conditions and occasional stale data.
Say you cache the latest five comments for display on a sidebar in your application. You decide that the data only needs to be refreshed once per minute. However, you neglect to remember that this sidebar is rendered 50 times per second! Thus, once 60 seconds roll around and the cache expires, suddenly 10+ processes are running the same SQL query to repopulate that cache. Every time the cache expires, a sudden burst of SQL traffic will result.
Worse yet, you have multiple processes updating the same data, and the wrong one ends up updating the cache. Then you have stale, outdated data floating about.
Be mindful of the possible issues in populating or repopulating your cache. Remember that the process of checking memcached, fetching from SQL, and storing into memcached is not atomic at all!
I'm thinking: could a solution be to store the view count separately from the Post object, and then do an INCR on it? Of course this would require reading two separate values from memcached when displaying the information.
memcached operations are atomic. The server process will queue the requests and serve each one completely before going to the next, so there is no need for locking.
Edit: memcached has an increment command, which is atomic. You just have to store the counter as a separate value in the cache.
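A tiny sketch of that separate-counter idea (pymemcache; the key name is illustrative):

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
cache.add("post:42:views", "0")          # seed only if the key is absent
cache.incr("post:42:views", 1)           # atomic on the server: no get/set race
views = int(cache.get("post:42:views"))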
We encountered this in our system. We modified get so that:
If the value is unset, it sets it with a flag ('g') and an 8-second TTL, and returns false so the calling function generates it.
If the value is not flagged (!== 'g'), it unserializes and returns it.
If the value is flagged (=== 'g'), it waits 1 second and tries again until it is no longer flagged; it will eventually be set by the other process or expired by the TTL.
Our database load dropped by a factor of 100 when we implemented this.
function get($key) {
    global $m; // Memcached client instance
    $value = $m->get($key);
    if ($value === false) {
        // Cache miss: plant the 'g' flag with an 8-second TTL and return
        // false so the caller regenerates the value.
        $m->set($key, 'g', 8);
        return false;
    }
    while ($value === 'g') {
        // Another process is generating the value; poll until it is set,
        // or until the TTL expires the flag.
        sleep(1);
        $value = $m->get($key);
    }
    return $value;
}

Syncing objects between two disparate systems, best approach?

I am working on syncing two business objects between an iPhone and a Web site using an XML-based payload and would love to solicit some ideas for an optimal routine.
The nature of this question is fairly generic though and I can see it being applicable to a variety of different systems that need to sync business objects between a web entity and a client (desktop, mobile phone, etc.)
The business objects can be edited, deleted, and updated on both sides. Both sides can store the object locally but the sync is only initiated on the iPhone side for disconnected viewing. All objects have an updated_at and created_at timestamp and are backed by an RDBMS on both sides (SQLite on the iPhone side and MySQL on the web... again I don't think this matters much) and the phone does record the last time a sync was attempted. Otherwise, no other data is stored (at the moment).
What algorithm would you use to minimize network chatter between the systems for syncing? How would you handle deletes if "soft-deletes" are not an option? What data-model changes would you add to facilitate this?
The simplest approach: when syncing, transfer all records where updated_at >= #last_sync_at. Downside: this approach doesn't tolerate clock skew very well at all.
It is probably safer to keep a version number column that is incremented each time a row is updated (so that clock skew doesn't foul your sync process) and a last-synced version number (so that potentially conflicting changes can be identified). To make this bandwidth-efficient, keep a cache in each database of the last version sent to each replication peer so that only modified rows need to be transmitted. If this is going to be a star topology, the leaves can use a simplified schema where the last synced version is stored in each table.
Some form of soft-deletes are required in order to support sync of deletes, however this can be in the form of a "tombstone" record which contains only the key of the deleted row. Tombstones can only be safely deleted once you are sure that all replicas have processed them, otherwise it is possible for a straggling replica to resurrect a record you thought was deleted.
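A sketch of the version-number and tombstone scheme in SQLite terms (sqlite3 from the standard library; all table and column names are illustrative, and version is assumed to come from a single global counter):

import sqlite3

db = sqlite3.connect("app.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS items (
    id      INTEGER PRIMARY KEY,
    payload TEXT,
    version INTEGER NOT NULL,           -- bumped from a global counter on every write
    deleted INTEGER NOT NULL DEFAULT 0  -- tombstone: only the key matters once set
);
CREATE TABLE IF NOT EXISTS sync_state (
    peer              TEXT PRIMARY KEY,
    last_sent_version INTEGER NOT NULL  -- high-water mark per replication peer
);
""")

def changes_for(peer):
    row = db.execute("SELECT last_sent_version FROM sync_state WHERE peer = ?",
                     (peer,)).fetchone()
    last = row[0] if row else 0
    # Only rows modified since the last sync travel over the wire,
    # tombstones included.
    return db.execute("SELECT id, payload, version, deleted FROM items "
                      "WHERE version > ?", (last,)).fetchall()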
So, in summary, I think your questions relate to disconnected synchronization. Here is what I think should happen:
Initial sync: You retrieve the data and any information associated with it (row versions, file checksums, etc.). It is important that you store this information and leave it pristine until the next successful sync. Changes should be made on a COPY of this data.
Tracking changes: If you are dealing with database rows, the idea is that you basically have to track insert, update, and delete operations. If you are dealing with text files like XML, it is slightly more complicated: if multiple users are likely to edit the file at the same time, you will need a diff tool so that conflicts can be detected at a more granular level (instead of the whole file).
Checking for conflicts: Again, if you are just dealing with database rows, conflicts are easy to detect. You can have another column that increments whenever the row is updated (I think MSSQL has this built in; I am not sure about MySQL). If the copy you have has a different number than what is on the server, you have a conflict. For files or strings, a checksum will do the job. I suppose you could also use the modified date, but make sure you have a very precise and accurate measurement to prevent misses. For example: say I retrieve a file and you save it 1 millisecond after I retrieved it. I then make changes to the file and try to save it. If the recorded last-modified time is accurate only to 10 milliseconds, there is a good chance that the file I retrieved will have the same modified date as the one you saved, so the program thinks there is no conflict and overwrites your changes. So I generally avoid this method, just to be on the safe side. On the other hand, the chances of a checksum/hash collision after a minor modification are close to none.
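For the checksum route, a minimal sketch using only the standard library (the resolve_conflict() hook is hypothetical):

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

pristine = b"the file as originally retrieved"
current = b"the file as it is on the server now"

if checksum(current) != checksum(pristine):
    # Someone changed it since we fetched it: a conflict, no matter how
    # close together the timestamps were.
    resolve_conflict(pristine, current)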
Resolving conflicts: Now this is the tricky part. If this is an automated process, you have to assess the situation and decide whether to overwrite the changes, lose your own changes, or retrieve the data from the server again and attempt to redo the changes. Luckily for you, it seems there will be human interaction; but it is still a lot of pain to code. If you are dealing with database rows, you can check each individual column, compare it against the data on the server, and present the differences to the user. The idea is to present conflicts in a very granular way so as not to overwhelm the user: most conflicts are very small differences in many different places, so present them one small difference at a time. For text files it is almost the same, but a hundred times more complicated. You would have to create or use a diff tool (text comparison is a whole different subject and too broad to cover here) that reports small changes and where they are, in a fashion similar to the database case: where text was inserted, deleted, or edited. Then present that to the user the same way: for each small conflict, the user chooses whether to discard their changes, overwrite the changes on the server, or perform a manual edit before sending to the server.
If you have done things right, the user should be given a list of conflicts, if there are any, granular enough to decide on quickly. For example, if the conflict is a single spelling change, it is easier for the user to choose between the two spellings than to be handed the whole paragraph, told that something changed, and left to hunt for the small difference.
Other considerations:
Data validation: keep in mind that you have to perform validation after resolving conflicts, since the data might have changed.
Text comparison: like I said, this is a big subject, so google it!
Disconnected synchronization: I think there are a few articles out there.
Source: https://softwareengineering.stackexchange.com/questions/94634/synchronization-web-service-methodologies-or-papers