ETags and collections


Many REST APIs provide the ability to search for resources.
For example, resources of type A may be fetched using the following HTTP request:
GET /A?prop1={value1}&prop2={value2}
I'm using optimistic locking and therefore would like to return a version for every returned resource of type A. Until now, I used the ETag header when fetching only one resource using its ID.
Is there an HTTP way of returning a version for each of multiple resources in the same response? If not, should I include the versions in the body?
Thanks,
Mickael
EDIT: I found on the web that the ETag is often generated by computing a hash of part of the reply. This approach fits well with my case since a hash of the returned collection will be computed. However, if the client decides to update one of the elements in the collection, which ETag should he put in the If-Match header? I'm thinking that including the ETags of the individual elements is the only solution...

I would adopt one of these options:
Make ETags weak by default and generate them from the resource's current state rather than from its representation in the HTTP response payload. That way I can return a valid ETag for each resource in the body of the collection query response, in addition to the ETag for the whole collection in the response header.
Forget If-Match and ETags for this case and use If-Unmodified-Since with a Last-Modified supplied as a property of each resource. By doing that I can preserve the strong ETags, but clients can still make updates to one item based on the collection response without the need for another request to the resource itself.
Allow updates via PATCH on the collection resource itself, using the If-Match header with the ETag for the whole collection. This probably won't work very well if there's a lot of concurrent changes, but it's a reasonable approach.
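As a rough illustration of the first option, here is a minimal Python sketch. The item layout (id, version, data), the field name "etag" in the body and the helper names are assumptions made for the example, not part of any framework. Each item gets a weak ETag derived from its stored version, and the collection gets one derived from all (id, version) pairs.

```python
import hashlib
import json

def item_etag(version: int) -> str:
    # Weak validator derived from the stored version, not from the payload bytes.
    return f'W/"{version}"'

def collection_etag(items: list) -> str:
    # Hash of the (id, version) pairs, so a change to any member changes the tag.
    pairs = json.dumps([(i["id"], i["version"]) for i in items])
    return 'W/"' + hashlib.sha256(pairs.encode()).hexdigest()[:16] + '"'

def search_response(items: list) -> tuple:
    # Collection ETag goes in the header, per-item ETags go in the body.
    headers = {"ETag": collection_etag(items), "Content-Type": "application/json"}
    body = json.dumps(
        [{"id": i["id"], "etag": item_etag(i["version"]), "data": i["data"]}
         for i in items]
    )
    return headers, body

# Hypothetical result of GET /A?prop1=...&prop2=...
items = [
    {"id": 1, "version": 3, "data": {"prop1": "x"}},
    {"id": 2, "version": 7, "data": {"prop1": "y"}},
]
headers, body = search_response(items)
```

One caveat: RFC 7232 specifies strong comparison for If-Match, so whether a weak tag is accepted on an update depends on how strictly the server follows the spec.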

I think it depends a little on the number of resources, the amount of data and the number of requests you accept in order to reduce bandwidth. But a solution could be to separate the resources into sub-requests.
Assume that the group call of GET /images?car=mustang&viewangle=front returns 5 images. Now you could include all images as binary data and the GET-request itself has a unique ETag:
GET /images?car=mustang&viewangle=front
...
HTTP/1.1 200 OK
ETag: "aaaaaa"
data:image/png;base64,a123456....
data:image/png;base64,b123456....
data:image/png;base64,c123456....
data:image/png;base64,d123456....
data:image/png;base64,e123456....
The problem now is that one added image changes the ETag of the group call, and you need to transfer the complete set again although only one image has changed:
GET /images?car=mustang&viewangle=front
If-None-Match: "aaaaaa"
...
HTTP/1.1 200 OK
ETag: "bbbbbb"
data:image/png;base64,a123456....
data:image/png;base64,b123456....
data:image/png;base64,c123456....
data:image/png;base64,d123456....
data:image/png;base64,e123456....
data:image/png;base64,f123456....
In this case the best solution would be to separate the resource data from the group call, so that the response includes only the information needed for the sub-requests:
GET /images?car=mustang&viewangle=front
...
HTTP/1.1 200 OK
ETag: "aaaaaa"
a.jpg
b.jpg
c.jpg
d.jpg
e.jpg
That way every sub-request can be cached separately:
GET /image/?src=a.jpg
If-None-Match: "Akj5odjr"
...
HTTP/1.1 304 Not Modified
Statistics
- First request = 6x 200 OK
- Future requests if group unchanged = 1x 304 Not Modified
- Future requests if one new resource has been added = 2x 200 OK, 5x 304 Not Modified
Now I would tune the API documentation. This means the requester must check if a cache of a sub-request is available before making a call to it. This could be done by providing the ETags (or other hash) in the group request:
GET /images?car=mustang&viewangle=front
...
HTTP/1.1 200 OK
...
ETag: "aaaaaa"
a.jpg;AfewrKJD
b.jpg;Bgnweidk
c.jpg;Ckirewof
d.jpg;Dt34gsd0
e.jpg;Egk29dds
f.jpg;F498wdn4
Now the requester checks its cache and finds that the ETag of a.jpg has changed (the cached value was Akj5odjr, the group response now lists AfewrKJD) and that f.jpg;F498wdn4 is a new entry. That way future requests are reduced:
Statistics
- First request = 6x 200 OK
- Future requests if group unchanged = 1x 304 Not Modified
- Future requests if one new resource has been added = 2x 200 OK
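The client-side check described above could look roughly like this in Python. It is only a sketch: the "name;etag" line format is taken from the example group response, while the function names and the in-memory cache dict are assumptions.

```python
def parse_manifest(body: str) -> dict:
    # Each non-empty line is "<name>;<etag>", as in the group response.
    entries = {}
    for line in body.splitlines():
        line = line.strip()
        if line:
            name, etag = line.split(";", 1)
            entries[name] = etag
    return entries

def stale_entries(manifest: dict, cache: dict) -> list:
    # A sub-request is needed when the name is unknown or its ETag differs.
    return [name for name, etag in manifest.items() if cache.get(name) != etag]

# ETags the client stored from earlier sub-requests.
cache = {
    "a.jpg": "Akj5odjr",
    "b.jpg": "Bgnweidk",
    "c.jpg": "Ckirewof",
    "d.jpg": "Dt34gsd0",
    "e.jpg": "Egk29dds",
}

group_body = """a.jpg;AfewrKJD
b.jpg;Bgnweidk
c.jpg;Ckirewof
d.jpg;Dt34gsd0
e.jpg;Egk29dds
f.jpg;F498wdn4
"""

to_fetch = stale_entries(parse_manifest(group_body), cache)
print(to_fetch)  # ['a.jpg', 'f.jpg']
```

Only the names in to_fetch need a sub-request; everything else is served from the local cache without touching the network.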
Conclusion
Finally, you need to consider whether your resources are big enough to justify sub-requests and how often a requester repeats its group request (so that the cache is actually used). If not, you should include them in the group call, and there is no room for optimization.
P.S. you need to monitor all requesters to be sure all of them use caches. A possible solution would be to ban requesters calling an API URL two or more times without sending an ETag.

Related

RESTful - How to update a subresource and what ETag/payload to return concerning optimistic locking?

As an example I have an order where the invoicing address can be modified. The change might trigger various additional actions (i.e. create a cancellation invoice and a new invoice with the updated address).
As recommended by various sources (see below) I don't want to have a PATCH on the order resource, because it has many other properties, but want to expose a dedicated endpoint, also called "intent" resource or subresource according to the web links below:
/orders/{orderId}/invoicing-address
Should I use a POST or a PATCH against this subresource?
The invoicing address itself has no ID. In the domain layer it is represented as a value object that is part of the order entity.
What ETag should be used for the subresource?
The address is part of the order and together with the items they form an aggregate in the domain layer. When the aggregate is updated it gets a new version number in the database. That version number is used as an ETag for optimistic locking.
Should a GET on invoicing-address respond with the order aggregate version number or a hash value of the address DTO in the ETag header?
What payload should be returned after updating the address?
Since the resource is the invoicing address it seems natural to return the updated address object (maybe with server side added fields). Should the body also include the ID/URI and the ETag of the order resource?
None of the examples I found with subresources showed any server responses or considered optimistic locking.
https://rclayton.silvrback.com/case-against-generic-use-of-patch-and-put
https://www.thoughtworks.com/insights/blog/rest-api-design-resource-modeling
https://softwareengineering.stackexchange.com/questions/371273/design-update-properties-on-an-entity-in-a-restful-resource-based-api (see provided answer)
https://www.youtube.com/watch?v=aQVSzMV8DWc&t=188s (Jim Webber at about 31 mins)

As far as REST is concerned, "subresources" aren't a thing. /orders/12345/invoicing-address identifies a resource. The fact that this resource has a relationship with another resource identified by /orders/12345 is irrelevant.
Thus, the invoicing-address resource should understand HTTP methods exactly the same way as every other resource on the web.
Should I use a POST or a PATCH against this subresource?
Use PUT/PATCH if you are proposing a direct change to the representation of the resource. For example, these are the HTTP methods we would use if we were trying to fix a spelling error in an HTML document (PUT if we were sending a complete copy of the HTML document; PATCH if we were sending a diff).
PUT /orders/12345/invoicing-address
Content-Type: text/plain
1060 W Addison St.
Chicago, IL
60613
On the other hand, if you are proposing an indirect change to the representation of the resource (the request shows some information to the server, and the server is expected to compute a new representation itself)... well, we don't have a standardized method that means exactly that; therefore, we use POST.
POST serves many useful purposes in HTTP, including the general purpose of “this action isn’t worth standardizing.” -- Fielding, 2009
What ETag should be used for the subresource?
You should first give some thought to whether you want to use a strong validator or a weak validator.
A strong validator is representation metadata that changes value whenever a change occurs to the representation data that would be observable in the content of a 200 (OK) response to GET.
...
In contrast, a weak validator is representation metadata that might not change for every change to the representation data.
...
a weak entity-tag ought to change whenever the origin server wants caches to invalidate old responses.
I might use a weak validator if the representation included volatile but insignificant information; I don't need clients to refresh their copy of a document because it doesn't have the latest timestamp metadata. But I probably wouldn't use an "aggregate version number" if I expected the aggregate to be changing more frequently than the invoicing-address itself changes.
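To make the trade-off concrete, here is a small Python sketch of the two candidate validators from the question. The function names and the address fields are made up for illustration; the point is only that the aggregate version changes when any part of the order changes, while a hash of the address changes only when the address does.

```python
import hashlib
import json

def aggregate_etag(order_version: int) -> str:
    # Weak tag: changes on every update of the order aggregate,
    # including updates that leave the invoicing address untouched.
    return f'W/"order-v{order_version}"'

def address_etag(address: dict) -> str:
    # Strong tag: hash of a canonical serialization of the address itself.
    canonical = json.dumps(address, sort_keys=True, separators=(",", ":"))
    return '"' + hashlib.sha256(canonical.encode()).hexdigest()[:16] + '"'

address = {"street": "1060 W Addison St.", "city": "Chicago, IL", "zip": "60613"}

# An item is added to the order: the aggregate version goes from 4 to 5,
# but the address representation is byte-for-byte the same.
before = (aggregate_etag(4), address_etag(address))
after = (aggregate_etag(5), address_etag(address))
```

With the aggregate version, a client editing the address gets a precondition failure whenever anything else in the order changed in the meantime; with the address hash it does not. Which behaviour is wanted is a domain decision.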
What payload should be returned after updating the address?
See 200 OK.
In the case of a POST request, sending the current representation of the resource (after changes have been made to it) is nice because the response is cacheable (assuming you include the appropriate metadata signals in the response headers).
Responses to PATCH have similar rules to POST (see RFC 5789).
PUT is the odd man out, here
Responses to the PUT method are not cacheable.
Should the body also include the ID/URI and the ETag of the order resource?
Entirely up to you - HTTP components aren't going to be paying attention to the representation, so you can design that representation as makes sense to you. On the web, it's perfectly normal to return HTML documents with links to other HTML documents.

Response body when PATCHing a collection

In my REST API I have a very large collection; it contains millions of items. The path for this collection is /mycollection
Because this collection is so large it is not good practice to GET the whole collection so the API supports paging. Paging will be the primary way of getting the collection
GET /mycollection?page=1&page-size=100 HTTP/1.1
Say the original collection contains 1,000,000 items and I want to update 5,000, delete 3,000 and add 2,000 items. I could write my API to support updating the collection via either the PUT method or the PATCH method. While either method would require very different request bodies I believe both methods would require the exact same response body, i.e. the response body would have to contain the current representation of the entire updated resource, i.e. all 999,000 items in the collection.
As I mentioned earlier GETting the entire collection is just not realistic; it's too big. For the same reason I don't want PUTting or PATCHing to return the entire collection. Adding query parameters to a PUT or PATCH request wouldn't work either because neither PUT nor PATCH are safe methods.
So what would be the proper response in this large collection scenario?
I could respond with
HTTP/1.1 202 Accepted
Location: /mycollection?page=1&page-size=100
The 202 Accepted response code doesn't feel correct because the update would have been done synchronously. The Location header doesn't quite feel right either. I could maybe go with a Link header, but still it doesn't feel right.
Again I ask what would be the proper response in this large collection scenario?

This question is based on a misconception:
While either method would require very different request bodies I believe both methods would require the exact same response body, i.e. the response body would have to contain the current representation of the entire updated resource
Either can just return 204 No Content or 200 OK and no response body. There's no requirement that they include the full representation in the response.
You could optionally support this (perhaps along with the Prefer: return=representation header, or perhaps Content-Location header), but without this header I would say it's not even a convention that the current representation is returned. Generic clients shouldn't assume that the response body is the new representation unless these headers are used.
So, just return a 2xx and you're good to go.
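A minimal sketch of that decision in Python, assuming a handler that receives the request headers as a plain dict (real frameworks treat header names case-insensitively, which this sketch ignores). The function name and the tuple return shape are made up; Prefer and Preference-Applied come from RFC 7240.

```python
from typing import Optional

def patch_response(request_headers: dict, new_representation: Optional[str] = None):
    # Parse the comma-separated preferences from the Prefer header, if any.
    prefer = request_headers.get("Prefer", "")
    preferences = {p.strip().lower() for p in prefer.split(",") if p.strip()}

    # Only send the body when the client asked for it AND the server is
    # willing to produce it; for a huge collection it can simply decline.
    if "return=representation" in preferences and new_representation is not None:
        headers = {"Preference-Applied": "return=representation"}
        return 200, headers, new_representation

    # Default: success without a body.
    return 204, {}, ""

# Large collection: the server never supplies a representation.
status, headers, body = patch_response({"Prefer": "return=representation"})
```

Here status is 204 even though the client asked for the representation, which is allowed because Prefer is only a hint.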

So what would be the proper response in this large collection scenario?
Short version: you should probably treat a successful PUT as though it were a successful POST.
the intended meaning of the payload can be summarized as:
a representation of the status of, or results obtained from, the action
So the response could be as simple as
200 OK
Content-Type: text/plain
It worked!
Longer answer:
While either method would require very different request bodies I believe both methods would require the exact same response body, i.e. the response body would have to contain the current representation of the entire updated resource
This isn't right - If you review RFC 7231, you'll see that the response to PUT has this description
a representation of the status of the action
Returning the new representation of the resource is an edge case, not the default (see the specification of the Content-Location header).
For a state-changing request like PUT (Section 4.3.4) or POST (Section 4.3.3), it implies that the server's response contains the new representation of that resource, thereby distinguishing it from representations that might only report about the action (e.g., "It worked!"). This allows authoring applications to update their local copies without the need for a subsequent GET request.
That said, I'd suggest a review of your choice of method token. Both PUT and PATCH support remote authoring semantics - messages that ask a server to make its copy of a document look like your local copy. That's why, for example, the PUT specification has a bunch of constraints about adding validator header fields to the response. General purpose components are allowed to assume that they know what's going on, because all resources are supposed to understand these methods the same way.
But in your case, you can't really be said to be remote authoring the collection, because the client (and the general purpose components) don't have a representation of the collection, but instead only representations of pages of the collection.
If you were going to be consistent with the uniform interface, then you would either
allow remote authoring of the pages, or
abandon the method tokens that imply remote authoring
It is okay to use POST when the semantics of your request don't quite align with the standardized meanings
POST serves many useful purposes in HTTP, including the general purpose of “this action isn’t worth standardizing.”

REST PUT create or update

I want to create or update an item in one go with the following request:
PUT /items/{id}
Should I return 201 Created if the item has been created and 204 No Content when it has been updated? Or should I return the same status for both actions?

The answer is clearly stated in RFC2616, section 9.6:
If a new resource is created, the origin server MUST inform the user agent via the 201 (Created) response. If an existing resource is modified, either the 200 (OK) or 204 (No Content) response codes SHOULD be sent to indicate successful completion of the request.
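In code, that rule comes down to checking whether the item existed before the write. A minimal Python sketch, with an in-memory dict standing in for the real storage and a made-up function name:

```python
def put_item(store: dict, item_id: str, representation: dict) -> tuple:
    # Decide the status code before overwriting, based on prior existence.
    created = item_id not in store
    store[item_id] = representation
    if created:
        # 201 Created, pointing at the newly created resource.
        return 201, {"Location": f"/items/{item_id}"}
    # Existing resource replaced; 200 with a body would be equally valid.
    return 204, {}

store = {}
first = put_item(store, "42", {"name": "widget"})
second = put_item(store, "42", {"name": "gadget"})
```

The first call yields 201 with a Location header and the second yields 204, because the item already exists by then.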

TL;DR
Use 200 OK in both cases if the distinction isn't important to clients.
Use 201 Created upon creation, and 200 OK upon update, if the distinction is important.
Details
With RESTful design, there's no single, correct way of doing things, but I've often found that the original HTTP status code specification provides good guidance.
It's always fine to return 200 OK if you don't expect the client to take separate action based on the response, but you can provide more information if you think the client is going to need it. Just be aware that clients may come to expect that of your API, so once you start providing more details, you can't easily change your mind.
That said, if you want to return 201 Created, then according to the specification:
The response SHOULD include an entity containing a list of resource characteristics and location(s) from which the user or user agent can choose the one most appropriate.
As an example, you can put the location in the Location header, but you can also put it in the body of the response (but then you must document the structure and content-type of that body).
When it comes to 204 No Content responses, they create dead ends for link-following clients, so if you're creating a true level 3 REST API then you should avoid that response type as a courtesy to clients.

Why would ETags set to a MUST requirement if you already have the resource?

Why would you set ETags to a "MUST requirement level"?
You obtain the resource before the ETag is returned...
I'm working on a project where I am the client that sends HTTP requests to a server, which returns a Cache-Control header and an ETag so that responses can be cached (on each additional request the ETag is compared with the If-None-Match header to determine whether the data is stale and a new response has to be sent). In my current project, the conditional GET with ETags is specified at the MUST requirement level as defined in RFC 2119.
MUST This word, or the terms "REQUIRED" or "SHALL", mean that the definition is an absolute requirement of the specification.
I don't understand the intent of using a conditional GET with the MUST requirement level. From my understanding the MUST requirement is there to limit (is that right?) the resources provided to the client that makes the request; however, the client (me in this case) already has the resources from the first request. I can keep obtaining the same resource (or a fresher one if it gets updated) as often as I want, with or without the If-None-Match and ETag header fields.
What would be the purpose of setting it to the MUST requirement level in this case if it's not limiting the resources returned, aside from being able to cache and limiting the number of requests to the server? (I'm asking from the client's point of view. Yes, I know I can cache it, but why the MUST requirement?) Isn't this only used for limiting resources?
So basically, doesn't that make this MUST requirement not a requirement, if I can obtain the resources with or without it? Am I missing something here?
My question is not asking what the ETag, Cache-Control, or If-None-Match headers are or how they work.
Thanks in advance, cheers!

Why would ETags set to a MUST requirement if you already have the resource?
A client MUST use a conditional GET to reduce the data traffic.
Aside from being able to cache and limiting the amount of requests to the server
The number of requests stays the same, but the total amount of data transferred changes.
Using ETags in If-None-Match GET requests (conditional GET)
When you make an API call, the response header includes an ETag with a value that is the hash of the data returned in the API call. You store this ETag value for use in the next request.
The next time you make the same API call, you include the If-None-Match request header with the ETag value stored from the first step.
If the data has not changed, the response status code will be 304 Not Modified and no data is returned.
If the data has changed since the last query, the data is returned as usual with a new ETag. The game starts again: you store the new ETag value and use it for subsequent requests.
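The steps above can be sketched in Python without any network code. Here server_get stands in for the real API (it hashes the current data to build the ETag, as described), and CachingClient is a made-up name for the client-side bookkeeping.

```python
import hashlib
from typing import Optional

def server_get(resource: bytes, if_none_match: Optional[str] = None):
    # The server derives the ETag from the current data.
    etag = '"' + hashlib.sha256(resource).hexdigest()[:16] + '"'
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""   # unchanged: no body is sent
    return 200, {"ETag": etag}, resource  # changed or first request

class CachingClient:
    def __init__(self):
        self.etag = None
        self.body = None

    def fetch(self, resource: bytes):
        # Send the stored ETag (if any) as If-None-Match.
        status, headers, body = server_get(resource, self.etag)
        if status == 200:
            # New data: remember it together with its ETag.
            self.etag, self.body = headers["ETag"], body
        # On 304 the cached body is still valid and is reused.
        return status, self.body

client = CachingClient()
r1 = client.fetch(b"version 1")  # first call: full response
r2 = client.fetch(b"version 1")  # data unchanged on the server
r3 = client.fetch(b"version 2")  # data changed on the server
```

The three calls give 200, 304 and 200. The request count is the same as without ETags, but the second call transfers no body.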
Why?
The main reason for using conditional GET requests is to reduce data traffic.
Isn't this only used for limiting resources?
No...
You can ask an API for multiple resources in one request.
(OK, that's also limiting resources by saving the other requests.)
You can prevent a method (e.g. PUT) from modifying an existing resource, when the client believes that the resource does not exist (replace protection).
I can obtain the resources with or without it?
When you ignore the "MUST use conditional GET" then (a) the traffic will increase and (b) you lose the "resource has changed" indication coming from server-side. You would have to implement the comparison handling on client side: is the resource of the second request newer than the one from the first request.

I found that my question wasn't asking the "right question" because I was mixing in my understanding of other headers (thanks to #dcerecedo's comment for pointing me in the right direction), which affected my understanding of why MUST was being used.
The MUST was more relevant to other headers, in my case private, max-age=3600 and must-revalidate.
Where
Cache-Control: private restricts proxy servers from caching it, this helps you keep your data off a server you dont trust and prevents a proxy from caching user specific data that’s not relevant to everyone (like a user profile).
Cache-Control: max-age=3600, must-revalidate tells both client caches and proxy caches that once the content is stale (older than 3600 seconds) they must revalidate at the origin server before they can serve the content. This should be the default behavior of caching systems, but the must-revalidate directive makes this requirement unambiguous.
After the max-age expires, the client should revalidate. It might revalidate using the If-None-Match header with an ETag, or the If-Modified-Since header with a date. So, after expiration the browser will check with the server whether the file has been updated. If not, the server will respond with 304 Not Modified and nothing is downloaded.
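A simplified Python sketch of the freshness check a client cache performs before deciding to revalidate. The function name is made up, and it only looks at max-age; real caches also consider Expires, the Age header and heuristic freshness, which are left out here.

```python
def needs_revalidation(cache_control: str, age_seconds: int) -> bool:
    # Extract max-age from a Cache-Control header value.
    max_age = None
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            max_age = int(value)
    # Assumption for this sketch: no max-age means always revalidate.
    if max_age is None:
        return True
    # Fresh while the age is below max-age, stale from then on.
    return age_seconds >= max_age

header = "private, max-age=3600, must-revalidate"
fresh = needs_revalidation(header, 120)    # False: serve from cache
stale = needs_revalidation(header, 4000)   # True: send a conditional GET
```

Only when the function returns True does the client go back to the server, and then it sends If-None-Match so that an unchanged resource costs a 304 rather than a full download.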

RESTful design: using ETag and If-None-Match for fetching new items in a collection?

I'm designing a RESTful web service, and trying to come up with a good way to handle caching and synchronization of collections of items. I've read about the use of Etag and If-None-Match headers for optimizing the caching of individual resources, and I'm wondering if they can be (or should be) used for collections.
I'm interested in your feedback on the following approach:
1. A collection of items is exposed at the URI http://foobar.com/items
2. A client making an initial GET request on that URI will not include an If-None-Match header. In this case, the server returns all of the items (or some amount at the server's discretion, say, the most recent N items). The response contains an Etag header indicating a "tick stamp" that represents the current pseudo-time of the server's pseudo-clock (e.g. some counter that increases each time there is a change to the data).
3. The client caches the returned items.
4. On subsequent GET requests, the client includes the previously received Etag value in its If-None-Match header. The server checks to see if it has any items newer than the If-None-Match header, and if so, returns ONLY the newer items. Otherwise, it returns a 304 status ("not modified") and no items.
Question - am I subverting the semantic meaning of a GET by only returning the newer items in #4, rather than the whole collection (which would include items already cached on the client)? Or does this seem like a reasonable approach? Can you suggest alternative approaches that would be better?
Thanks in advance.

Usually, it is up to the client to do the ETag comparison after a HEAD request and, if necessary, follow up with a more specific request (a range request or a query for entries newer than a certain timestamp). The server should simply serve up the requested resources.
The idea is that intermediary caching proxies can be inserted in the chain of communication without either the client or server having to change its code.
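One way to keep GET semantics intact is to put the "newer than" filter into the URI, so that each filtered view is its own resource with its own ETag. A rough Python sketch, assuming a hypothetical since query parameter (as in GET /items?since=2), a per-item tick counter as in the question, and items that are only added or updated, never deleted:

```python
from typing import Optional

def get_items(items: list, since: int = 0, if_none_match: Optional[str] = None):
    # The resource is "items with tick > since"; the filter is part of the URI.
    selected = [i for i in items if i["tick"] > since]
    latest = max((i["tick"] for i in selected), default=since)

    # ETag for this filtered view: changes when a newer item appears.
    etag = f'"{since}-{latest}"'
    if if_none_match == etag:
        return 304, {"ETag": etag}, []
    return 200, {"ETag": etag}, selected

items = [
    {"id": 1, "tick": 1},
    {"id": 2, "tick": 2},
    {"id": 3, "tick": 5},
]

first = get_items(items, since=2)
again = get_items(items, since=2, if_none_match=first[1]["ETag"])
```

Because the same URI always returns the same kind of answer, an intermediary cache can store and revalidate it without knowing anything about the client's state.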
