How does the lazy expiration mechanism in memcached operate?

(First of all, my English is not very good, so please bear with me.)
As we know, memcached provides lazy expiration and "replaces" LRU data in its slabs; however, I'm not clear on how it does this. For example, if a slab is full but some data in this slab has expired, what happens when data is added to the slab?
Does memcached find some expired data and replace them with the added data, or
does it replace the LRU data, or
does it do something else?
As far as I know, the lazy expiration is such that memcached is not actively removing expired data from each slab, but instead only removing expired entries when the key of the expired entry is referenced. This is a waste of resources, isn't it?

When an item is requested (a get request) Memcached checks
the expiration time to see if the item is still valid before returning
it to the client.
Similarly, when adding a new item to the cache, if the cache is full,
it will look for expired items to replace before replacing the
least recently used items in the cache.
So expired items are only purged when a get request is
sent for the expired item or the expired item is cleared because the
storage is needed.
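
To make the two purge paths concrete, here is a minimal sketch in Python. It is not memcached's code: the class name, the injectable clock, and the single fixed-capacity "slab" are all assumptions made for illustration, and real memcached's slab and LRU bookkeeping is more involved.

```python
import time
from collections import OrderedDict


class LazyExpiringCache:
    """Toy cache (hypothetical): lazy expiry on read, expired-first eviction on write."""

    def __init__(self, capacity, clock=time.monotonic):
        self.capacity = capacity
        self.clock = clock            # injectable so the behaviour can be tested
        self.items = OrderedDict()    # key -> (value, expires_at); least recently used first

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at <= self.clock():
            del self.items[key]       # expired: purged only now, because it was requested
            return None
        self.items.move_to_end(key)   # mark as most recently used
        return value

    def set(self, key, value, ttl):
        if key not in self.items and len(self.items) >= self.capacity:
            self._make_room()
        self.items[key] = (value, self.clock() + ttl)
        self.items.move_to_end(key)

    def _make_room(self):
        now = self.clock()
        # Prefer reclaiming the slot of an item that has already expired.
        expired = next((k for k, (_, exp) in self.items.items() if exp <= now), None)
        if expired is not None:
            del self.items[expired]
        else:
            self.items.popitem(last=False)  # otherwise evict the least recently used
```

Note that an expired item that is never read and never needed for space simply stays in memory, which is the trade-off the question is asking about: no background scanning work, at the cost of holding some dead entries until the space is wanted.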

Related

Why is the memcached set operation not idempotent?

On page 2 of Facebook's paper "Scaling Memcache at Facebook", they say: "For write requests, the webserver issues SQL statements to the database and then sends a delete request to memcache that invalidates any stale data. We choose to delete cached data instead of updating it because deletes are idempotent."
Why is update/set not an idempotent operation?
Paper can be found here: https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf
If you call delete twice in a row, the second delete has no effect. Update/set behaves differently: while it won't change the value associated with the key, it will update the key's last access time, which changes when the key will be evicted. In this sense, deleting a key is idempotent while updating a key's value is not.
E.g. in the paper they don't want to keep a key in the cache if no one ever tried to read it (even if there are lots of writes for that key in the database).
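
A small sketch of the distinction drawn in this answer, using a Python OrderedDict as a stand-in for an LRU-ordered cache. The helper names are made up, and the assumption that a set refreshes recency is the answer's own premise, modelled here rather than taken from memcached's source.

```python
from collections import OrderedDict


def delete(cache, key):
    """Deleting a missing key is a no-op, so repeating a delete changes nothing."""
    cache.pop(key, None)


def set_value(cache, key, value):
    """Assumed behaviour: a write moves the key to the most-recently-used end."""
    cache[key] = value
    cache.move_to_end(key)


def eviction_candidate(cache):
    """The least recently used key, i.e. the one that would be evicted next."""
    return next(iter(cache), None)
```

With a cache holding "a" then "b", calling delete(cache, "a") once or twice leaves the same state. Calling set_value(cache, "a", 1) with the value it already has leaves the data unchanged but moves the eviction candidate from "a" to "b".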

Getting a Persistent States Store to Handle Expiration

Is there any way with a persistent state store to allow for keys in a KeyValueStore to expire? I know there is a retention period in the persistentSessionStore, but it looks like that isn't KeyValue based.
There is no expiration mechanism at the moment. There is a Jira feature request for this, though: https://issues.apache.org/jira/browse/KAFKA-4212
What you can do, though, is register a punctuation and delete data manually. You would need to store the timestamp as part of the value and scan the whole store to find old keys.
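
The scan itself can be sketched as follows. This is plain Python with a dict standing in for the KeyValueStore; it does not use the Kafka Streams API, and the function names and the (timestamp, value) layout are assumptions for illustration. In a real topology the purge would run from the punctuation callback.

```python
def put(store, key, value, now):
    """Store the write timestamp alongside the value, as the answer suggests."""
    store[key] = (now, value)


def purge_expired(store, now, ttl):
    """Full scan of the store; deletes entries older than ttl and returns their keys."""
    expired = [key for key, (written_at, _) in store.items() if now - written_at >= ttl]
    for key in expired:
        del store[key]
    return expired
```

The cost is the full scan on every punctuation, so the punctuation interval is a trade-off between how promptly keys disappear and how much work each pass does.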

spray-cache: Return old value if the future fails

We are using spray-cache (can't move to akka-http yet) to cache results from a downstream service we are calling. The effect we want is, if the data is more than 15 minutes old, do the call, otherwise return the cached data.
Our problem is that, if the service call fails, spray-cache will remove the entry from the cache. What we need is to return the old cached data (even if it's stale), and retry the downstream request when the next request comes in.
It looks like Spray does not ship with a default cache implementation that does what you want. According to the spray-caching docs there are two implementations to the Cache trait: SimpleLruCache and ExpiringLruCache.
What you want is a Cache that distinguishes entry expiration (removal of the entry from the cache) from entry refresh (fetching or calculating a more recent copy of the entry).
Since both default implementations merge these two concepts into a single timeout value, I think your best bet will be to write a new Cache implementation that distinguishes refresh from expiration.
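
Here is a rough sketch of that idea in Python rather than Scala, and synchronous rather than Future-based, so it is not a drop-in spray Cache implementation. The class name, the refresh_after parameter, and the load callback are hypothetical.

```python
import time


class StaleOnErrorCache:
    """Entries never expire on their own; they are only refreshed when old enough."""

    def __init__(self, refresh_after, clock=time.monotonic):
        self.refresh_after = refresh_after   # seconds before a refresh is attempted
        self.clock = clock
        self.entries = {}                    # key -> (value, fetched_at)

    def get(self, key, load):
        entry = self.entries.get(key)
        now = self.clock()
        if entry is not None and now - entry[1] < self.refresh_after:
            return entry[0]                  # fresh enough, no downstream call
        try:
            value = load()                   # refresh from the downstream service
        except Exception:
            if entry is not None:
                return entry[0]              # refresh failed: serve the stale value
            raise                            # nothing cached to fall back on
        self.entries[key] = (value, now)
        return value
```

Because a failed refresh leaves the old timestamp in place, the next request for the same key retries the downstream call, which matches the behaviour asked for in the question.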

Avoid duplicate POSTs with REST

I have been using POST in a REST API to create objects. Every once in a while, the server will create the object, but the client will be disconnected before it receives the 201 Created response. The client only sees a failed POST request, and tries again later, and the server happily creates a duplicate object...
Others must have had this problem, right? But I googled around, and everyone just seems to ignore it.
I have 2 solutions:
A) Use PUT instead, and create the (GU)ID on the client.
B) Add a GUID to all objects created on the client, and have the server enforce their UNIQUE-ness.
A doesn't match existing frameworks very well, and B feels like a hack. How do other people solve this in the real world?
Edit:
With Backbone.js, you can set a GUID as the id when you create an object on the client. When it is saved, Backbone will do a PUT request. Make your REST backend handle PUT to non-existing id's, and you're set.
Another solution that's been proposed for this is POST Once Exactly (POE), in which the server generates single-use POST URIs that, when used more than once, will cause the server to return a 405 response.
The downsides are that 1) the POE draft was allowed to expire without any further progress on standardization, and thus 2) implementing it requires changes to clients to make use of the new POE headers, and extra work by servers to implement the POE semantics.
By googling you can find a few APIs that are using it though.
Another idea I had for solving this problem is that of a conditional POST, which I described and asked for feedback on here.
There seems to be no consensus on the best way to prevent duplicate resource creation in cases where the client cannot generate the unique URI itself for a PUT, and hence POST is needed.
I always use B -- detection of dups due to whatever problem belongs on the server side.
Detection of duplicates is a kludge, and can get very complicated. Genuine distinct but similar requests can arrive at the same time, perhaps because a network connection is restored. And repeat requests can arrive hours or days apart if a network connection drops out.
All of the discussion of identifiers in the other answers is with the goal of giving an error in response to duplicate requests, but this will normally just incite a client to get or generate a new id and try again.
A simple and robust pattern to solve this problem is as follows: Server applications should store all responses to unsafe requests, then, if they see a duplicate request, they can repeat the previous response and do nothing else. Do this for all unsafe requests and you will solve a bunch of thorny problems. Repeat DELETE requests will get the original confirmation, not a 404 error. Repeat POSTS do not create duplicates. Repeated updates do not overwrite subsequent changes etc. etc.
"Duplicate" is determined by an application-level id (that serves just to identify the action, not the underlying resource). This can be either a client-generated GUID or a server-generated sequence number. In this second case, a request-response should be dedicated just to exchanging the id. I like this solution because the dedicated step makes clients think they're getting something precious that they need to look after. If they can generate their own identifiers, they're more likely to put this line inside the loop and every bloody request will have a new id.
Using this scheme, all POSTs are empty, and POST is used only for retrieving an action identifier. All PUTs and DELETEs are fully idempotent: successive requests get the same (stored and replayed) response and cause nothing further to happen. The nicest thing about this pattern is its Kung-Fu (Panda) quality. It takes a weakness: the propensity for clients to repeat a request any time they get an unexpected response, and turns it into a force :-)
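
The store-and-replay pattern described above can be sketched roughly like this, using the server-generated sequence number variant. The class, method names, and response shape are invented for illustration; a real server would persist the stored responses and reject action ids it never issued.

```python
class ReplayingServer:
    """Keeps the response to each action id and replays it on repeats."""

    def __init__(self):
        self.last_action_id = 0
        self.responses = {}   # action id -> response already sent for it
        self.objects = []

    def new_action_id(self):
        """The empty POST: hands out an id that identifies one action."""
        self.last_action_id += 1
        return self.last_action_id

    def create(self, action_id, payload):
        """The PUT: performs the action once, then replays the stored response."""
        if action_id in self.responses:
            return self.responses[action_id]      # duplicate: do nothing else
        self.objects.append(payload)
        response = {"status": 201, "object_id": len(self.objects)}
        self.responses[action_id] = response
        return response
```

A client that loses the first response can resend the same request with the same action id and gets the original 201 back, with no second object created.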
I have a little google doc here if any-one cares.
You could try a two step approach. You request an object to be created, which returns a token. Then in a second request, ask for a status using the token. Until the status is requested using the token, you leave it in a "staged" state.
If the client disconnects after the first request, they won't have the token and the object stays "staged" indefinitely or until you remove it with another process.
If the first request succeeds, you have a valid token and you can grab the created object as many times as you want without it recreating anything.
There's no reason why the token can't be the ID of the object in the data store. You can create the object during the first request. The second request really just updates the "staged" field.
Server-issued Identifiers
If you are dealing with the case where it is the server that issues the identifiers, create the object in a temporary, staged state. (This is an inherently non-idempotent operation, so it should be done with POST.) The client then has to do a further operation on it to transfer it from the staged state into the active/preserved state (which might be a PUT of a property of the resource, or a suitable POST to the resource).
Each client ought to be able to GET a list of their resources in the staged state somehow (maybe mixed with other resources) and ought to be able to DELETE resources they've created if they're still just staged. You can also periodically delete staged resources that have been inactive for some time.
You do not need to reveal one client's staged resources to any other client; they need exist globally only after the confirmatory step.
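
A minimal sketch of the staged life cycle, with an in-memory dict in place of a real data store. All names here are hypothetical, and the caller passes in the current time to keep the example deterministic.

```python
import uuid


class StagedResources:
    def __init__(self):
        self.resources = {}   # id -> {"owner", "data", "staged", "created_at"}

    def create(self, owner, data, now):
        """Non-idempotent step (the POST): every call stages a new resource."""
        resource_id = uuid.uuid4().hex
        self.resources[resource_id] = {
            "owner": owner, "data": data, "staged": True, "created_at": now,
        }
        return resource_id

    def confirm(self, resource_id):
        """Idempotent step (e.g. a PUT): repeating it changes nothing further."""
        self.resources[resource_id]["staged"] = False

    def staged_for(self, owner):
        """What a client would GET to find its own unconfirmed resources."""
        return [rid for rid, r in self.resources.items()
                if r["staged"] and r["owner"] == owner]

    def purge_stale(self, now, max_age):
        """Periodic cleanup of staged resources that were never confirmed."""
        stale = [rid for rid, r in self.resources.items()
                 if r["staged"] and now - r["created_at"] >= max_age]
        for rid in stale:
            del self.resources[rid]
        return stale
```

If the response to create is lost, the retry stages a second resource, but only the one the client actually confirms becomes active; the orphan is removed by purge_stale.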
Client-issued Identifiers
The alternative is for the client to issue the identifiers. This is mainly useful where you are modeling something like a filestore, as the names of files are typically significant to user code. In this case, you can use PUT to do the creation of the resource as you can do it all idempotently.
The down-side of this is that clients are able to create IDs, and so you have no control at all over what IDs they use.
There is another variation of this problem. Having a client generate a unique id means that we are asking the customer to solve this problem for us. Consider an environment where we have publicly exposed APIs and hundreds of clients integrating with them. In practice, we have no control over the client code or the correctness of its uniqueness implementation. Hence, it would probably be better for the server to be able to recognise whether a request is a duplicate. One simple approach would be to calculate and store a checksum of every request based on attributes from the user input, define some time threshold (x mins), and compare every new request from the same client against the ones received in the past x mins. If the checksum matches, the request could be a duplicate, and some challenge mechanism can be added for the client to resolve this.
If a client is making two different requests with the same parameters within x mins, it might be worth confirming that this is intentional even if the second request comes with a unique request id.
This approach may not be suitable for every use case; however, I think it will be useful in cases where the business impact of executing the second call is high and can potentially cost a customer. Consider a payment processing engine where an intermediate layer ends up retrying a failed request, OR a customer double-clicked, resulting in the client layer submitting two requests.
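
A sketch of the checksum comparison, assuming the request parameters are JSON-serialisable. The class name, the choice of SHA-256, and the in-memory dict are illustrative assumptions; the answer does not prescribe any of them, and the challenge step is left to the caller.

```python
import hashlib
import json


class DuplicateDetector:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.seen = {}   # (client_id, checksum) -> time the request was last seen

    def is_probable_duplicate(self, client_id, params, now):
        """True if the same client sent identical parameters within the window."""
        canonical = json.dumps(params, sort_keys=True).encode("utf-8")
        checksum = hashlib.sha256(canonical).hexdigest()
        key = (client_id, checksum)
        last_seen = self.seen.get(key)
        self.seen[key] = now
        return last_seen is not None and now - last_seen <= self.window
```

A True result only means "looks like a repeat"; as the answer says, the server would then challenge the client rather than silently drop the request, since the repeat may be intentional.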
Design
Automatic (without the need to maintain a manual black list)
Memory optimized
Disk optimized
Algorithm [solution 1]
REST arrives with UUID
Web server checks if UUID is in Memory cache black list table (if yes, answer 409)
Server writes the request to DB (if it was not filtered by ETS)
DB checks if the UUID is repeated before writing
If yes, answer 409 for the server, and blacklist to Memory Cache and Disk
If not repeated write to DB and answer 200
Algorithm [solution 2]
REST arrives with UUID
Save the UUID in the Memory Cache table (expire for 30 days)
Web server checks if UUID is in Memory Cache black list table [return HTTP 409]
Server writes the request to DB [return HTTP 200]
In solution 2, the Memory Cache blacklist exists ONLY in memory, so the DB will never be checked for duplicates. The definition of 'duplication' is "any request that arrives again within a given period of time". We also replicate the Memory Cache table on disk, so we can fill it before starting up the server.
In solution 1, there will never be a duplicate, because we always check the disk ONLY once before writing, and if the request is a duplicate, subsequent round trips are handled by the Memory Cache. This solution is better for Big Query, because requests there are not idempotent, but it's also less optimized.
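
Solution 1 can be sketched as follows, with a set standing in for the Memory Cache blacklist and a dict standing in for a DB table with a uniqueness constraint on the UUID. The class, its fields, and the db_checks counter are hypothetical and exist only to make the flow visible.

```python
class DedupServer:
    def __init__(self):
        self.blacklist = set()   # in-memory UUIDs already known to be duplicates
        self.db = {}             # uuid -> payload; stands in for the DB table
        self.db_checks = 0       # how often the slower DB lookup was needed

    def handle(self, request_uuid, payload):
        if request_uuid in self.blacklist:
            return 409                        # filtered in memory, DB untouched
        self.db_checks += 1
        if request_uuid in self.db:
            self.blacklist.add(request_uuid)  # remember it for the next round trip
            return 409
        self.db[request_uuid] = payload
        return 200
```

The first repeat of a UUID costs one DB lookup; every later repeat is answered from memory.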
HTTP response code for POST when resource already exists

Which data-store to use, to store meta data corresponding to the keys in memcache?

I have a memcache backend and I want to add Redis to store meta data about the keys in memcache.
Meta data is as follows:
Miss_count: The number of times the data was not present in the memcache.
Hash_value: The hash value of the data corresponding to the key in the memcache.
Data in memcache : key1 ::: Data
Meta data (miss count) : key1_miss ::: 10
Meta data (hash value) : key1_hash ::: hash(Data)
Please advise which data store is preferable. When I store the meta data in memcache itself, it is removed well before its expiry time, because the meta data is small and the slab allocator assigns it a small memory chunk.
As the meta data will grow over time, the Redis hash approach will fail. Therefore, client logic would be needed to make sure the max_zipped limit is satisfied.
If I understand your use case correctly I suspect Redis might be a good choice. Assuming you'll be periodically updating the meta data miss counts associated with the various hashes over time, you'd probably want to use Redis sorted sets. For example, if you wanted the miss counts stored in a sorted set called "misscounts", the Redis command to add/update those counts would be one and the same:
zadd misscounts misscount key1
... because zadd adds the entry if one doesn't already exist or overwrites an existing entry if it does. If you have a hook into the process that fires each time a miss occurs, you could instead use:
zincrby misscounts 1 key1
Similar to the zadd command behavior, zincrby will create a new entry (using the increment value as the count) if one doesn't exist, or increment the existing count by the increment value you pass if an entry does exist.
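
To illustrate the add-or-overwrite and create-or-increment semantics without needing a running server, here is a tiny in-memory stand-in written in Python. It is not a Redis client and models only one sorted set; the class and its top helper are made up for this example.

```python
class SortedSetStandIn:
    """Mimics the zadd / zincrby behaviour described above for one sorted set."""

    def __init__(self):
        self.scores = {}   # member -> score

    def zadd(self, score, member):
        """Adds the member, or overwrites its score if it already exists."""
        self.scores[member] = float(score)

    def zincrby(self, increment, member):
        """Creates the member with the increment as its score, or adds to it."""
        self.scores[member] = self.scores.get(member, 0.0) + increment
        return self.scores[member]

    def top(self, n):
        """Members with the highest miss counts first."""
        ranked = sorted(self.scores.items(), key=lambda item: item[1], reverse=True)
        return ranked[:n]
```

So a miss hook only ever needs the increment call: the first miss for a key creates its entry, and later misses bump the count.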
Complete documentation of Redis commands can be found here. The different types of storage options in Redis are described here.
Oh, and a final note. In my experience, Redis is THE SHIT. Sorry to curse (in caps), but there's simply no other way to do Redis justice. We call our Redis server "honey badger", because when load starts increasing and our other servers start auto-scaling, honey badger just don't give a shit.