Why memcached set operation is not idempotent? - memcached

On the page 2 of the Facebook's paper "Scaling Memcache at Facebook" they said "For write requests,the webserver issues SQL statements to the database and then sends a delete request to memcache that invalidates any stale data. We choose to delete cached data instead of updating it because deletes are idempotent."
Why update/set is not idempotent operation?
Paper can be found here: https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf

If you call one delete after another two times second delete won't make any effect. Update/set here will act differently: while it won't change value associated with the key, it will update the last access time for the key changing the logic of when the key will be evicted. In this sense delete key operation is idempotent while update key's value is not.
E.g. in the paper they don't want to keep a key in the cache if no one ever tried to read it (even if there are lots of writes for that key in the database).

Related

The correct HTTP method for resetting data

I want to "reset" certain data in my database tables using a RESTful method but I'm not sure which one I should be using. UPDATE because I'm deleting records and updating the record where the ID is referenced (but never removing the record where ID lives itself), or DELETE because my main action is to delete associated records and updating is tracking those changes?
I suppose this action can be ran multiple times but the result will be the same, however the ID will always be found when "resetting".
I think you want the DELETE method

What should DELETE /collection do if some items can't be deleted?

I have a collection of items and some of them may or may not be deleted, depending on some preconditions. If a user wants to delete a resource (DELETE /collection/1) and there are external dependencies on that resource, the server will return an error. But what should happen if the user wants to delete the entire collection (DELETE /collection)?
Should all the resources which can be deleted be deleted and the server return a 2xx, or should the server leave everything intact and return a 4xx? Which would be the expected behavior?
As a REST API consumer, I'd expect the operation to be atomic and maybe get back a 409 Conflict with details if one of the deletes fails. Plus the DELETE method is theoretically idempotent as #jbarrueta pointed out.
Now if undeletable resources is a normal event in your use case and happens frequently, you may want to stray from the norm a little bit, delete all that can be deleted and return something like a 206 Partial Content (don't know if that's legal for DELETE though) with details about undeleted resources.
However, if you need to manage error cases finely, you might be better off sending separate DELETE commands.
I think the proper result is 204 no content by success and 409 conflict by failure because of the dependencies (as the others pointed out). I support atomicity as well.
I think you are thinking about REST as SOAP/RPC, which it is clearly not. Your REST service MUST fulfill the uniform interface constraint, which includes the HATEOAS interface constraint, so you MUST send hyperlinks to the client.
If we are talking about a simple link, like DELETE /collection, then you must send the link to the client, only if the resource state transition it represents, is available from the current resource state. So if you cannot delete the collection because of the dependencies, then you don't send a link about this transition, because it is not possible.
If it is a templated link, then you have to attach the "removable" property to the items, and set the checkboxes to disabled if it is false.
This way conflict happens only when the client got the link from a representation of a previous (stale) resource state, and in that case you must update the client state by querying the server again with GET.
Another possible solution (ofc. in combination with the previous ones) to show the link and automatically remove dependencies.
I guess you can use PATCH for bulk updates, which includes bulk removal, so that can be another solution as well.
I'd say it depends on your domain (although I'd rather use DELETE /collection/all instead of DELETE/collection/),
When you have the situation where you use delete all but some items can't be deleted, it depends on your domain where if you are doing the delete all to free up resources where if not your business process suffers, then it's better to delete what can be deleted and put other into a retry queue makes sense. in that case response should be OK.
Also situations could arise where there could be two operations
Clean Up - only delete unused
Delete All - delete all
In either situation I'd rather use specific method rather than using DELETE on the root URL,
for Clean Up - DELETE /collection/unused
for Delete ALL - DELETE /collection/all

Should the natural or surrogate key be returned in an API?

First time I think about it...
Until now, I always used the natural key in my API. For example, a REST API allowing to deal with entities, the URL would be like /entities/{id} where id is a natural key known to the user (the ID is passed to the POST request that creates the entity). After the entity is created, the user can use multiple commands (GET, DELETE, PUT...) to manipulate the entity. The entity also has a surrogate key generated by the database.
Now, think about the following sequence:
A user creates entity with id 1. (POST /entities with body containing id 1)
Another user deletes the entity (DELETE /entities/1)
The same other user creates the entity again (POST /entities with body containing id 1)
The first user decides to modify the entity (PUT /entities/1 with body)
Before step 4 is executed, there is still an entity with id 1 in the database, but it is not the same entity created during step 1. The problem is that step 4 identifies the entity to modify based on the natural key which is the same for the deleted and new entity (while the surrogate key is different). Therefore, step 4 will succeed and the user will never know it is working on a new entity.
I generally also use optimistic locking in my applications, but I don't think it helps here. After step 1, the entity's version field is 0. After step 3, the new entity's version field is also 0. Therefore, the version check won't help. Is the right case to use timestamp field for optimistic locking?
Is the "good" solution to return surrogate key to the user? This way, the user always provides the surrogate key to the server which can use it to ensure it works on the same entity and not on a new one?
Which approach do you recommend?
It depends on how you want your users to user your api.
REST APIs should try to be discoverable. So if there is benefit in exposing natural keys in your API because it will allow users to modify the URI directly and get to a new state, then do it.
A good example is categories or tags. We could have these following URIs;
GET /some-resource?tag=1 // returns all resources tagged with 'blue'
GET /some-resource?tag=2 // returns all resources tagged with 'red'
or
GET /some-resource?tag=blue // returns all resources tagged with 'blue'
GET /some-resource?tag=red // returns all resources tagged with 'red'
There is clearly more value to a user in the second group, as they can see that the tag is a real word. This then allows them to type ANY word in there to see whats returned, whereas the first group does not allow this: it limits discoverability
A different example would be orders
GET /orders/1 // returns order 1
or
GET /orders/some-verbose-name-that-adds-no-meaning // returns order 1
In this case there is little value in adding some verbose name to the order to allow it to be discoverable. A user is more likely to want to view all orders first (or a subset) and filter by date or price etc, and then choose an order to view
GET /orders?orderBy={date}&order=asc
Additional
After our discussion over chat, your issue seems to be with versioning and how to manage resource locking.
If you allow resources to be modified by multiple users, you need to send a version number with every request and response. The version number is incremented when any changes are made. If a request sends an older version number when trying to modify a resource, throw an error.
In the case where you allow the same URIs to be reused, there is a potential for conflict as the version number always begins from 0. In this case, you will also need to send over a GUID (surrogate key) and a version number. Or don't use natural URIs (see original answer above to decided when to do this or not).
There is another option which is to disallow reuse of URIs. This really depends on the use case and your business requirements. It may be fine to reuse a URI as conceptually it means the same thing. Example would be if you had a folder on your computer. Deleting the folder and recreating it, is the same as emptying the folder. Conceptually the folder is the same 'thing' but with different properties.
User account is probably an area where reusing URIs is not a good idea. If you delete an account /accounts/u1, that URI should be marked as deleted, and no other user should be able to create an account with username u1. Conceptually, a new user using the same URI is not the same as when the previous user was using it.
Its interesting to see people trying to rediscover solutions to known problems. This issue is not specific to a REST API - it applies to any indexed storage. The only solution I have ever seen implemented is don't re-use surrogate keys.
If you are generating your surrogate key at the client, use UUIDs or split sequences, but for preference do it serverside.
Also, you should never use surrogate keys to de-reference data if a simple natural key exists in the data. Indeed, even if the natural key is a compound entity, you should consider very carefully whether to expose a surrogate key in the API.
You mentioned the possibility of using a timestamp as your optimistic locking.
Depending how strictly you're following a RESTful principle, the Entity returned by the POST will contain an "edit self" link; this is the URI to which a DELETE or UPDATE can be performed.
Taking your steps above as an example:
Step 1
User A does a POST of Entity 1. The returned Entity object will contain a "self" link indicating where updates should occur, like:
/entities/1/timestamp/312547124138
Step 2
User B gets the existing Entity 1, with the above "self" link, and performs a DELETE to that timestamp versioned URI.
Step 3
User B does a POST of a new Entity 1, which returns an object with a different "self" link, e.g.:
/entities/1/timestamp/312547999999
Step 4
User A, with the original Entity that they obtained in Step 1, tries doing a PUT to the "self" link on their object, which was:
/entities/1/timestamp/312547124138
...your service will recognise that although Entity 1 does exist; User A is trying a PUT against a version which has since become stale.
The service can then perform the appropriate action. Depending how sophisticated your algorithm is, you could either merge the changes or reject the PUT.
I can't remember the appropriate HTTP status code that you should return, following a PUT to a stale version... It's not something that I've implemented in the Rest framework that I work on, although I have planned to enable it in future. It might be that you return a 410 ("Gone").
Step 5
I know you don't have a step 5, but..! User A, upon finding their PUT has failed, might re-retrieve Entity 1. This could be a GET to their (stale) version, i.e. a GET to:
/entities/1/timestamp/312547124138
...and your service would return a redirect to GET from either a generic URI for that object, e.g.:
/entities/1
...or to the specific latest version, i.e.:
/entities/1/timestamp/312547999999
They can then make the changes intended in Step 4, subject to any application-level merge logic.
Hope that helps.
Your problem can be solved either using ETags for versioning (a record can only modified if the current ETag is supplied) or by soft deletes (so the deleted record still exists but with a trashed bool which is reset by a PUT).
Sounds like you might also benefit from a batch end point and using transactions.

Avoid duplicate POSTs with REST

I have been using POST in a REST API to create objects. Every once in a while, the server will create the object, but the client will be disconnected before it receives the 201 Created response. The client only sees a failed POST request, and tries again later, and the server happily creates a duplicate object...
Others must have had this problem, right? But I google around, and everyone just seems to ignore it.
I have 2 solutions:
A) Use PUT instead, and create the (GU)ID on the client.
B) Add a GUID to all objects created on the client, and have the server enforce their UNIQUE-ness.
A doesn't match existing frameworks very well, and B feels like a hack. How does other people solve this, in the real world?
Edit:
With Backbone.js, you can set a GUID as the id when you create an object on the client. When it is saved, Backbone will do a PUT request. Make your REST backend handle PUT to non-existing id's, and you're set.
Another solution that's been proposed for this is POST Once Exactly (POE), in which the server generates single-use POST URIs that, when used more than once, will cause the server to return a 405 response.
The downsides are that 1) the POE draft was allowed to expire without any further progress on standardization, and thus 2) implementing it requires changes to clients to make use of the new POE headers, and extra work by servers to implement the POE semantics.
By googling you can find a few APIs that are using it though.
Another idea I had for solving this problem is that of a conditional POST, which I described and asked for feedback on here.
There seems to be no consensus on the best way to prevent duplicate resource creation in cases where the unique URI generation is unable to be PUT on the client and hence POST is needed.
I always use B -- detection of dups due to whatever problem belongs on the server side.
Detection of duplicates is a kludge, and can get very complicated. Genuine distinct but similar requests can arrive at the same time, perhaps because a network connection is restored. And repeat requests can arrive hours or days apart if a network connection drops out.
All of the discussion of identifiers in the other anwsers is with the goal of giving an error in response to duplicate requests, but this will normally just incite a client to get or generate a new id and try again.
A simple and robust pattern to solve this problem is as follows: Server applications should store all responses to unsafe requests, then, if they see a duplicate request, they can repeat the previous response and do nothing else. Do this for all unsafe requests and you will solve a bunch of thorny problems. Repeat DELETE requests will get the original confirmation, not a 404 error. Repeat POSTS do not create duplicates. Repeated updates do not overwrite subsequent changes etc. etc.
"Duplicate" is determined by an application-level id (that serves just to identify the action, not the underlying resource). This can be either a client-generated GUID or a server-generated sequence number. In this second case, a request-response should be dedicated just to exchanging the id. I like this solution because the dedicated step makes clients think they're getting something precious that they need to look after. If they can generate their own identifiers, they're more likely to put this line inside the loop and every bloody request will have a new id.
Using this scheme, all POSTs are empty, and POST is used only for retrieving an action identifier. All PUTs and DELETEs are fully idempotent: successive requests get the same (stored and replayed) response and cause nothing further to happen. The nicest thing about this pattern is its Kung-Fu (Panda) quality. It takes a weakness: the propensity for clients to repeat a request any time they get an unexpected response, and turns it into a force :-)
I have a little google doc here if any-one cares.
You could try a two step approach. You request an object to be created, which returns a token. Then in a second request, ask for a status using the token. Until the status is requested using the token, you leave it in a "staged" state.
If the client disconnects after the first request, they won't have the token and the object stays "staged" indefinitely or until you remove it with another process.
If the first request succeeds, you have a valid token and you can grab the created object as many times as you want without it recreating anything.
There's no reason why the token can't be the ID of the object in the data store. You can create the object during the first request. The second request really just updates the "staged" field.
Server-issued Identifiers
If you are dealing with the case where it is the server that issues the identifiers, create the object in a temporary, staged state. (This is an inherently non-idempotent operation, so it should be done with POST.) The client then has to do a further operation on it to transfer it from the staged state into the active/preserved state (which might be a PUT of a property of the resource, or a suitable POST to the resource).
Each client ought to be able to GET a list of their resources in the staged state somehow (maybe mixed with other resources) and ought to be able to DELETE resources they've created if they're still just staged. You can also periodically delete staged resources that have been inactive for some time.
You do not need to reveal one client's staged resources to any other client; they need exist globally only after the confirmatory step.
Client-issued Identifiers
The alternative is for the client to issue the identifiers. This is mainly useful where you are modeling something like a filestore, as the names of files are typically significant to user code. In this case, you can use PUT to do the creation of the resource as you can do it all idempotently.
The down-side of this is that clients are able to create IDs, and so you have no control at all over what IDs they use.
There is another variation of this problem. Having a client generate a unique id indicates that we are asking a customer to solve this problem for us. Consider an environment where we have a publicly exposed APIs and have 100s of clients integrating with these APIs. Practically, we have no control over the client code and the correctness of his implementation of uniqueness. Hence, it would probably be better to have intelligence in understanding if a request is a duplicate. One simple approach here would be to calculate and store check-sum of every request based on attributes from a user input, define some time threshold (x mins) and compare every new request from the same client against the ones received in past x mins. If the checksum matches, it could be a duplicate request and add some challenge mechanism for a client to resolve this.
If a client is making two different requests with same parameters within x mins, it might be worth to ensure that this is intentional even if it's coming with a unique request id.
This approach may not be suitable for every use case, however, I think this will be useful for cases where the business impact of executing the second call is high and can potentially cost a customer. Consider a situation of payment processing engine where an intermediate layer ends up in retrying a failed requests OR a customer double clicked resulting in submitting two requests by client layer.
Design
Automatic (without the need to maintain a manual black list)
Memory optimized
Disk optimized
Algorithm [solution 1]
REST arrives with UUID
Web server checks if UUID is in Memory cache black list table (if yes, answer 409)
Server writes the request to DB (if was not filtered by ETS)
DB checks if the UUID is repeated before writing
If yes, answer 409 for the server, and blacklist to Memory Cache and Disk
If not repeated write to DB and answer 200
Algorithm [solution 2]
REST arrives with UUID
Save the UUID in the Memory Cache table (expire for 30 days)
Web server checks if UUID is in Memory Cache black list table [return HTTP 409]
Server writes the request to DB [return HTTP 200]
In solution 2, the threshold to create the Memory Cache blacklist is created ONLY in memory, so DB will never be checked for duplicates. The definition of 'duplication' is "any request that comes into a period of time". We also replicate the Memory Cache table on the disk, so we fill it before starting up the server.
In solution 1, there will be never a duplicate, because we always check in the disk ONLY once before writing, and if it's duplicated, the next roundtrips will be treated by the Memory Cache. This solution is better for Big Query, because requests there are not imdepotents, but it's also less optmized.
HTTP response code for POST when resource already exists

Can Primary-Keys be re-used once deleted?

0x80040237 Cannot insert duplicate key.
I'm trying to write an import routine for MSCRM4.0 through the CrmService.
This has been successful up until this point. Initially I was just letting CRM generate the primary keys of the records. But my client wanted the ability to set the key of a our custom entity to predefined values. Potentially this enables us to know what data was created by our installer, and what data was created post-install.
I tested to ensure that the Guids can be set when calling the CrmService.Update() method and the results indicated that records were created with our desired values. I ran my import and everything seemed successful. In modifying my validation code of the import files, I deleted the data (through the crm browser interface) and tried to re-import. Unfortunately now it throws and a duplicate key error.
Why is this error being thrown? Does the Crm interface delete the record, or does it still exist but hidden from user's eyes? Is there a way to ensure that a deleted record is permanently deleted and the Guid becomes free? In a live environment, these Guids would never have existed, but during my development I need these imports to be successful.
By the way, considering I'm having this issue, does this imply that statically setting Guids is not a recommended practice?
As far I can tell entities are soft-deleted so it would not be possible to reuse that Guid unless you (or the deletion service) deleted the entity out of the database.
For example in the LeadBase table you will find a field called DeletionStateCode, a value of 0 implies the record has not been deleted.
A value of 2 marks the record for deletion. There's a deletion service that runs every 2(?) hours to physically delete those records from the table.
I think Zahir is right, try running the deletion service and try again. There's some info here: http://blogs.msdn.com/crm/archive/2006/10/24/purging-old-instances-of-workflow-in-microsoft-crm.aspx
Zahir is correct.
After you import and delete the records, you can kick off the deletion service at a time you choose with this tool. That will make it easier to test imports and reimports.