Are single-use download links in accordance with the HTTP spec?

In the context of a RESTful web service, is it acceptable to have side effects for GET methods?
Single-use download links, for example:
GET /downloads/664d92b3-b373-4dac-a4fb-7a41d015109a
will return 200 and "the thing" on the first request, and 404 on the next.
The HTTP spec says GET methods should be safe, and according to https://www.rfc-editor.org/rfc/rfc7231#section-4.2.1:
Request methods are considered "safe" if their defined semantics are
essentially read-only; i.e., the client does not request, and does
not expect, any state change on the origin server as a result of
applying a safe method to a target resource.
and
This definition of safe methods does not prevent an implementation
from including behavior that is potentially harmful, that is not
entirely read-only, or that causes side effects while invoking a safe
method. What is important, however, is that the client did not
request that additional behavior and cannot be held accountable for
it.
Several clarifying examples are provided which make me think safe methods are not allowed to purposefully remove the resource.
For example, most servers append request information to access
log files at the completion of every response, regardless of the
method, and that is considered safe even though the log storage might
become full and crash the server.
And
Likewise, a safe request initiated
by selecting an advertisement on the Web will often have the side
effect of charging an advertising account.
And
For example, it is
common for Web-based content editing software to use actions within
query parameters, such as "page?do=delete". If the purpose of such a
resource is to perform an unsafe action, then the resource owner MUST
disable or disallow that action when it is accessed using a safe
request method.
Single-use links are obviously a reality. I just wonder whether they're abusing the spec or whether I just don't get it.
Having an opinion is fine but having worked on these specs and understanding their subtleties would be most convincing.

What you're suggesting is acceptable in some situations, and not necessarily an abuse of the spec.
Firstly, RFC 2616 says regarding safe methods that they:
SHOULD NOT have the significance of taking an action other than
retrieval
And the phrase "SHOULD NOT" is defined as follows (emphasis added):
This phrase, or the phrase "NOT RECOMMENDED" mean that there may
exist valid reasons in particular circumstances when the particular
behavior is acceptable or even useful, but the full implications
should be understood and the case carefully weighed before
implementing any behavior described with this label.
The new version you linked to (which I think supersedes 2616) doesn't use the term "SHOULD NOT" - but they haven't replaced it with "MUST NOT" either. They also acknowledge that side effects are not ruled out as long as the client is not held responsible. So I think the idea of safe methods is the same.
So since the spec acknowledges that there are situations where it's OK, how do we know if yours is such a situation - and more importantly, how do we stay generally within the "spirit" of the spec, i.e. make sure we're not abusing it?
I'd refer to this quote from 7231:
The purpose of distinguishing between safe and unsafe methods is to
allow automated retrieval processes (spiders) and cache performance
optimization (pre-fetching) to work without fear of causing harm.
If your app is a private intranet app and you're not concerned with the issues mentioned here, your approach is ok. Put another way: taking into consideration all the possible ways that a GET could happen, are you ok with this side effect?
Working outside RESTful guidelines is not always bad. It's just important to make sure you understand the effect it has.
With all that said, if you are looking for a way to implement reliable, consistent one-time delivery of a resource over HTTP, it's well worth reading Bill de hÓra's HTTPLR spec (http://www.dehora.net/doc/httplr/draft-httplr-01.html). This approach relies on the client acknowledging receipt of the message. You might be able to use something like this to allow user agents that are unaware of the one-use policy (spiders etc.) to GET the resource without causing side effects, while still allowing participating clients to cause the resource to become unavailable after one GET.
A transactional approach like this has the added benefit of allowing the client to re-try the download as often as they need to. This is important because otherwise the server cannot know whether the client successfully received the message or not.
If you really need to enforce the once-only policy from the server side for any possible user agent, then your original approach might be best, but bear in mind it's really an "at most once" policy.
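For reference, a minimal sketch of that original "at most once" approach (Flask assumed; the token is the one from the question): the side effect of a successful GET is that the resource is consumed.

```python
from flask import Flask, abort

app = Flask(__name__)

# One-time download tokens mapped to their payloads (in-memory, purely for the sketch).
downloads = {"664d92b3-b373-4dac-a4fb-7a41d015109a": b"the thing"}

@app.route("/downloads/<token>")
def download(token):
    payload = downloads.pop(token, None)  # the GET's side effect: the resource is removed
    if payload is None:
        abort(404)                        # second and later requests get 404
    return payload
```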

Sometimes breaking the spec is the only way. An example is web-page visit counters that use a hidden image: it is requested with GET but updates a counter.
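A rough sketch of that pattern, assuming Flask and a 1x1 pixel.gif sitting next to the app:

```python
from flask import Flask, send_file

app = Flask(__name__)
hits = 0  # in-memory visit counter, just for illustration

@app.route("/counter.gif")
def counter():
    global hits
    hits += 1  # the side effect: a plain GET updates server state
    # The page embeds this as <img src="/counter.gif">, so every visit triggers the GET.
    return send_file("pixel.gif", mimetype="image/gif")
```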
However, some things can go wrong. Applications that follow the spec are allowed to presume that making a GET request won't have any side effects. So it is perfectly valid, for example, for an antivirus-enabled email server to follow the links found in an email to make sure everything is safe. If you send a "download-once" link in an email, the recipient might never get to see the content. For the same reason, a yes/no answer presented as two different links in an email is hard to deploy. The same applies to web pages: I recall Google crawling the links of a unique-per-user page, known to Google only because there was an analytics script on it, and because the page contained these infamous links with side effects, Google was actually changing the answers of the people who visited it...
Fake hits are not really a problem in the case of the hidden-image counter (such counters are not considered very reliable anyway), but in the case of the "download-once" link they could be problematic.

Related

Modelling an endpoint for a REST API that needs to save data for every request

A while ago I took part in an interview that included a question about REST modelling and the best way to implement it. The question was:
You have a REST API where you expose a method to consult the distance between two points; however, you must save each request to this method in order to expose the request history.
I was asked which HTTP method should be used in this case; for me, the logical answer at that moment was the GET method (to execute both actions). The interviewer then asked me why, arguing that since we are also storing the request, this endpoint is not idempotent anymore, and I wasn't able to reply. Since this is still on my mind, I decided to ask here and see other opinions about which method should be used in this case (or how many, GET and POST for example).
You have a REST API where you expose a method to consult the distance between two points; however, you must save each request to this method in order to expose the request history.
How would you do this on the web? You'd probably have a web page with a form, and that form would have input controls to collect the start and end point. When you submit the form, the browser would use the data in the controls, as well as the form metadata and standard HTML processing rules to create a request that would be sent to the server.
Technically, you could use POST as the method of the form. It's completely legal to do that. BUT, as the semantics of the request are "effectively read only", a better choice would be to use GET.
More precisely, this would mean having a family of similar resources, the representation of which includes information about the two points described in the query string.
That family of similar resources would probably be implemented on your origin server as a single operation/route, with a parser extracting the two points from the query string and passing them along to the function as arguments.
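As a sketch of that single route (Flask assumed; the "from"/"to" parameter names are invented for illustration), with the request-history requirement handled as server-side bookkeeping:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
request_history = []  # stands in for whatever store backs the request-history requirement

def compute_distance(start, end):
    return 0.0  # placeholder for the real distance calculation

@app.route("/distance")          # e.g. GET /distance?from=Paris&to=Lyon
def distance():
    start, end = request.args["from"], request.args["to"]
    request_history.append((start, end))  # bookkeeping; the client never asked for this
    return jsonify({"from": start, "to": end, "distance": compute_distance(start, end)})
```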
the interviewer asked me why, because since we are also storing the request, this endpoint is not idempotent anymore
This is probably the wrong objection - the semantics of GET requests are safe (effectively read only). So the interviewer might argue that saving the request history is not read only. However, this objection is invalid, because the semantic constraints apply to the request message, not the implementation.
For instance, you may have noticed that HTTP servers commonly add an entry to their access log for each request. Clearly that's not "read only" - but it is merely an implementation detail; the client's request did not say "and also log this".
GET is still fine here, even though the server is writing things down.
One possible objection would be that, if we use GET, then sometimes a cache will return a previous response rather than passing the request all the way through to the origin server to get logged. Which is GREAT - caches are a big part of the reason that the web can be web scale.
But if you don't want caching, the correct way to handle that is to add metadata to the response to inhibit caching, not to change the HTTP method.
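As a sketch, keeping the same GET route but telling caches not to store the response (header semantics per RFC 7234):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/distance")
def distance():
    resp = jsonify({"distance": 0.0})            # payload shortened for the sketch
    resp.headers["Cache-Control"] = "no-store"   # caches must not store this response
    return resp
```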
Another possibility, which is more consistent with the interviewer's "idempotent" remark, is that they wanted this "request history" to be a resource that the client could edit, and that looking up distances would be a side effect of that editing process.
For instance, we might have some sort of an "itinerary" resource with one or more legs provided by the client. Each time the client modifies the itinerary (for example, by adding another leg), the distance lookup method is called automatically.
In this kind of a problem, where the client is (logically) editing a resource, the requests are no longer "effectively read only". So GET is off the table as an option, and we have to look into the other possibilities.
The TL;DR version is that POST would always be acceptable (and this is how we would do it on the web), but you might prefer an API style where the client edits the representation of the resource locally, in which case you would let the client choose between PUT and PATCH.
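Purely as an illustration of that last option, with an invented itinerary resource and JSON Patch as the edit format (URL, field names and media type are assumptions, not anything the interviewer specified):

```python
import requests

# Add a leg to a hypothetical itinerary; the server recomputes distances as a side
# effect of processing the edit.
patch = [{"op": "add", "path": "/legs/-", "value": {"from": "Paris", "to": "Lyon"}}]
resp = requests.patch(
    "https://api.example.com/itineraries/42",
    json=patch,
    headers={"Content-Type": "application/json-patch+json"},
)
print(resp.status_code)
```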

Using the GET verb to update in a REST API?

I know the use of HTTP verbs is based on the standard specification. But my question is: if I use "GET" for update operations and write code logic to update, does it create issues in any scenario? Apart from the standard, what other reason could there be to use these verbs only for their specific purpose?
my question is: if I use "GET" for update operations and write code logic to update, does it create issues in any scenario?
Yes.
A simple example - suppose the network between the client and the server is unreliable; specifically, for a time, HTTP responses are being lost. A general purpose component (like a web proxy) might time out, and then, noticing that the method token of the request is GET, resend the request a second/third/fourth time, with your server performing its update on every GET request.
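To make that concrete: general-purpose HTTP machinery retries methods it believes are idempotent. urllib3's Retry, for example, retries GET (but not POST) by default, so a GET that performs an update can run more than once even though the caller only "asked" once. A small sketch, with a hypothetical endpoint:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=0.5)   # GET is in the default allowed-methods set
session.mount("https://", HTTPAdapter(max_retries=retries))

# If this hypothetical endpoint updates state on GET, a flaky network can trigger the
# update several times before the client ever sees a response.
session.get("https://api.example.com/reports/refresh", timeout=2)
```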
Let us further assume that these multiple update operations lead to an undesirable outcome; where do we properly affix blame?
Second example: you send someone a copy of the link to the update operation, so that they can send you a request at the appropriate time. But suppose you send that link to them in an email, and the email client recognizes the uri and (as a performance optimization) pre-fetches the link, triggering your update operation too early. Where do we properly affix the blame?
HTTP does not attempt to require the results of a GET to be safe. What it does is require that the semantics of the operation be safe, and therefore it is a fault of the implementation, not the interface or the user of that interface, if anything happens as a result that causes loss of property -- Fielding, 2002
In these, and other examples, blame is correctly affixed to your server, because GET has a standardized meaning which include the constraint that the semantics of the request are safe.
That's not to say that you can't have side effects when handling a GET request; "hit counters" are almost as old as the web itself. You have a lot of freedom in your implementation; so long as you respect the uniform interface, there won't be too much trouble.
Experience report: one of our internal tools uses GET requests to trigger scheduling; in our carefully controlled context (which is not web scale), we get away with it, and have for a very long time.
To borrow your language, there are certainly scenarios that would give us problems; but given our controls we manage to avoid them.
I wouldn't like our chances, though, if requests started coming in from outside of our carefully controlled context.
I think it's a decent question. You're asking a hypothetical: is there any value in doing the right thing other than the fact that we've agreed to use GET for fetching? That is, is there value beyond it being 'semantically nice'? A similar question in HTML might be: "Is it OK to use a <div> with an onclick instead of a <button>?" (the answer is no).
There certainly is. Clients, servers and intermediates all change their behavior depending on what method is used. Even if your server can process GET for updates, and you build a client that uses this, your browser might still get confused.
If you are interested in this subject, don't ask on a forum; read the spec. The HTTP specification tells you what clients, servers and proxies should do when they encounter certain methods, statuses and headers.
Start at RFC 7231.

How to manage HATEOAS links when the server is the client?

I'm learning about HATEOAS. The backend server I'm working on will use a third-party REST API that uses HATEOAS. That API has an endpoint to return the URL for each resource, and it also returns the related resource links with regular requests.
But I'm wondering what's a good way to manage these links on the server to avoid hardcoding them. For example, if the third party changes the URL of a resource, how will the server detect that change? Are there any standard practices for managing HATEOAS resource links?
Possible ways I can think of:
When the server starts, get all the resource URLs and cache them. Whenever the third-party API needs to be called, reuse these cached URLs. Whenever there is a 404 or related error, update the resource URL, or update the URLs periodically at intervals.
Get the resource URL each time before calling the endpoint. This is the simplest approach, but it essentially doubles the number of requests.
But neither sounds like a robust approach.
While discovery is generally a good thing and should allow a HATEOAS system to introduce changes in ways that 'hardcoded URLs' don't, if URLs start breaking arbitrarily I would still consider this a major issue.
You should be able to store URLs / links on your side and have some expectation that they keep working.
There are some mechanisms that deal with changes though:
The server should return 301 / 308 redirects if a resource moved. If this were the case, you should also update your references.
The server can emit Sunset or Deprecated headers. See: https://www.rfc-editor.org/rfc/rfc8594
Those are more general answers, but ultimately the existence of best practices does not mean that vendors will abide by them. With that in mind I think your best bet is to try and find out what the deprecation policy is of your vendor and see what they recommend.
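A sketch of reacting to the first mechanism (301/308), assuming the requests library; the vendor URL and relation name are invented:

```python
import requests

def fetch_and_update(link_store, rel):
    """GET a stored link; if the vendor redirected permanently, remember the new URL."""
    resp = requests.get(link_store[rel])             # requests follows redirects itself
    if any(r.status_code in (301, 308) for r in resp.history):
        link_store[rel] = resp.url                   # persist the resource's new location
    return resp

links = {"orders": "https://vendor.example.com/api/orders"}   # hypothetical vendor link
orders = fetch_and_update(links, "orders")
```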
Use a cached resource if it is valid, request a refresh when you don't have a local valid copy.
RFC 7234 defines the caching semantics of HTTP.
Ideally, you don't implement the caching rules yourself, but instead you use a general purpose cache.
In its ideal form, your bespoke implementation is talking to a headless browser, and the headless browser worries about the caching rules for you.
In theory, you need the initial URL to start the process, and everything else comes from that.
Each resource you get from the server should include links to other edges on the graph of service for that resource.
So, once you get the initial resource, all of the rest come automatically.
That said, it's not untoward to have "well known" entry points that are, ideally, unchanging URLs. But in the end, those are just "bookmarks", and not necessarily guaranteed end points.
Consider a shopping site such as Amazon. Outside of amazon.com, you don't know any of their URLs. They're all provided on the various forms and pages, and the human simply navigates the site. Those URLs can be changing all the time, and no one would know. With HATEOAS, it's up to the machine to follow the links, rather than a human. But the process of navigation is the same.
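A minimal sketch of that navigation style, assuming (purely for illustration) a HAL-style "_links" map in the responses; the actual link format depends on the vendor's API:

```python
import requests

ENTRY_POINT = "https://vendor.example.com/api"   # the one well-known "bookmark"

# Fetch the entry point and follow links by relation name, never by hardcoded path.
root = requests.get(ENTRY_POINT).json()
orders_url = root["_links"]["orders"]["href"]
orders = requests.get(orders_url).json()
```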
As others have mentioned, the idea of caching a root resource has merit. Then you rely on the caching headers to tell you how often you have to refresh the links.
But that said, operationally, there's no difference between following a normal link and following a cached link. Underneath, the cached resource loads faster, but you still need to "follow the link", because that's where the caching behavior kicks in. This is different from assuming the link is good, i.e. assuming you already know the result of a resource lookup. Your application follows the link. Always. The underlying infrastructure is responsible for making it efficient.
So, your code should not, say, load up a root resource, stuff a map full of links, and then assume they're good. Rather, the code should request the root resource, perhaps as a Map of links (datatypes for the win), and let the next layer handle the details. Because it all depends on the type of caching involved. Some have coded durations where no follow-up is necessary. With others, you make the request anyway, and the server tier responds back "nothing changed", so you can use your local copy, but you're still required to ask in the first place.
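A sketch of that second flavour, assuming the server hands out ETag validators: you still ask every time, but a 304 lets you keep your local copy.

```python
import requests

cache = {"url": "https://vendor.example.com/api/orders", "etag": None, "body": None}

headers = {"If-None-Match": cache["etag"]} if cache["etag"] else {}
resp = requests.get(cache["url"], headers=headers)

if resp.status_code == 304:
    body = cache["body"]                      # server says "nothing changed": reuse local copy
else:
    body = resp.json()
    cache["etag"] = resp.headers.get("ETag")  # remember the validator for the next request
    cache["body"] = body
```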
Those are implementation details that the SERVER mandates (not the client). It's a server contract. If they want you pinging them each and every time, so be it. That's the contract they're presenting to you, and if you want to be a Good Citizen, then you should honor that contract.
Ideally, the server makes good decisions on these kinds of issues for the sake of efficiency, but in the end it's really up to them.
The client has to go along. The client in a HATEOAS system cedes a lot to the server. They're simply not decisions for the client to make.

Why does the GitHub API use a PUT request for starring a repository?

I am developing my own REST API and was looking for other well-established APIs to see what they do when they have a need to expose some sort of action that can be made on a resource. One that came up was the ability to star and unstar a repository/gist on GitHub. According to their docs, you can star with PUT /gists/{gist_id}/star, and unstar with DELETE /gists/{gist_id}/star.
This is what the docs say about the HTTP verbs:
PUT Used for replacing resources or collections. For PUT requests with no body attribute, be sure to set the Content-Length header to zero.
DELETE Used for deleting resources.
DELETE makes sense to me, but why would PUT be used? Since you can GET /gists/{gist_id}/star, it seems "star" is some sort of functional resource. So I guess I'm just wondering why PUT instead of POST?
So I guess I'm just wondering why PUT instead of POST?
PUT, in HTTP, has somewhat tighter constraints on its semantics than POST does -- for instance, the semantics of PUT are idempotent, which is a convenient thing to know when you are sending requests over an unreliable network; PUT tells you that there won't be a problem if the server receives more than one copy of the request message.
(This is a lot like asking about why GET instead of POST, except that the differences are smaller)
Which is why, when you have a simple remote authoring use case, like uploading a document to a document store, PUT is the superior choice -- because it allows general purpose clients (like browsers) to do useful things without needing extra out-of-band information.
https://docs.github.com/en/rest/overview/resources-in-the-rest-api#http-verbs does NOT define the semantics of the HTTP methods. The standardized definitions of GET, PUT, POST and so on are in RFC 7231.
If you review the RFC, you'll discover that the semantics of HTTP PUT also cover "create":
The PUT method requests that the state of the target resource be created or replaced with the state defined by the representation enclosed in the request message payload.
"Make your copy of this document look like my copy" is a perfectly reasonable way to communicate information to the server, and the meaning of that message doesn't really depend at all on whether or not the server already knows about the document when the request is processed.
I am developing my own REST API
Do make sure you review Jim Webber's 2011 talk, which I think goes a long way toward clarifying how web based API are "supposed" to work.
One reason I could see is that the star “exists”, and these routes are just toggling some sort of “active” property on the star. So you wouldn’t use POST because you’re not creating the star, just toggling its active property.
Edit: this is also just a guess based on how I’d implement something like this, as their docs are kind of sparse.

Should this be a GET or a PATCH request?

If I make a request to get some data from a database without sending any updates, but I mark the record in the database to say the data has been fetched, does that make it a PATCH request or a GET?
Short answer: No, it is still a GET.
RFC 7231 defines safe
Request methods are considered "safe" if their defined semantics are essentially read-only....
This definition of safe methods does not prevent an implementation
from including behavior that is potentially harmful, that is not
entirely read-only, or that causes side effects while invoking a safe
method. What is important, however, is that the client did not
request that additional behavior and cannot be held accountable for
it. For example, most servers append request information to access
log files at the completion of every response, regardless of the
method, and that is considered safe even though the log storage might
become full and crash the server. Likewise, a safe request initiated
by selecting an advertisement on the Web will often have the side
effect of charging an advertising account.
So if the client is trying to retrieve a current representation of the resource,
the fact that your implementation happens to do a bit of bookkeeping on the side doesn't change the semantics of the request.
Part of the point of an HTTP front end is that clients are completely insulated from the underlying implementation details of the server -- everything looks like a dumb web site from the outside.
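A sketch of the situation in the question (Flask and SQLite assumed; the table and column names are invented): the handler returns the record and, as server-side bookkeeping, flags it as fetched.

```python
import sqlite3
from flask import Flask, jsonify, abort

app = Flask(__name__)

@app.route("/records/<int:record_id>")
def get_record(record_id):
    db = sqlite3.connect("app.db")
    row = db.execute("SELECT id, body FROM records WHERE id = ?", (record_id,)).fetchone()
    if row is None:
        abort(404)
    # The client only asked to read; this flag is our bookkeeping, not requested behaviour.
    db.execute("UPDATE records SET fetched = 1 WHERE id = ?", (record_id,))
    db.commit()
    return jsonify({"id": row[0], "body": row[1]})
```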
The HTTP spec is rather clear on that if you read through the definition of safe:
Request methods are considered "safe" if their defined semantics are essentially read-only; i.e., the client does not request, and does not expect, any state change on the origin server as a result of applying a safe method to a target resource. Likewise, reasonable use of a safe method is not expected to cause any harm, loss of property, or unusual burden on the origin server.
This definition of safe methods does not prevent an implementation from including behavior that is potentially harmful, that is not entirely read-only, or that causes side effects while invoking a safe method. What is important, however, is that the client did not request that additional behavior and cannot be held accountable for it. For example, most servers append request information to access log files at the completion of every response, regardless of the method, and that is considered safe even though the log storage might become full and crash the server. Likewise, a safe request initiated by selecting an advertisement on the Web will often have the side effect of charging an advertising account.
...
So a state change through a GET-triggered download is fine, as long as the client did not ask for that state change.
In certain situations, though, exposing a state change via GET may be risky. Just think of a crawler that invokes a couple of URIs that order some pizza or the like. According to the spec this is fine and the crawler must not be held accountable for that order; the spec is simply telling you that it was your fault.
With that being said, you can always use POST if you feel uncomfortable with certain HTTP operations, as POST literally allows you to process the request according to the resource's own semantics.
Which leads me to the next point: re-thinking your design. Returning a document that includes its own state is somewhat strange in my opinion. Usually such information is metadata about the document, not part of the resource itself. Here you could either use HTTP headers to communicate such information to the client, or design the state of that resource as a further resource, hinting the client about it by providing a link it can look up if it is interested.
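A small sketch of that second alternative (route names like /records/<id>/state are invented for illustration): the record stays a plain read, and the fetch state lives at its own URI.

```python
from flask import Flask, jsonify

app = Flask(__name__)
fetched = {}                        # record id -> bool; stands in for real persistence
records = {1: "some payload"}

@app.route("/records/<int:record_id>")
def get_record(record_id):
    fetched[record_id] = True       # bookkeeping on the side
    return jsonify({"id": record_id, "body": records.get(record_id)})

@app.route("/records/<int:record_id>/state")
def get_record_state(record_id):
    # The state is a resource of its own that interested clients can look up.
    return jsonify({"fetched": fetched.get(record_id, False)})
```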
Anyway, while not elegant, performing a state change on retrieving a resource via GET is not forbidden. I would, though, invest a couple more thoughts in whether you want to include the state within the resource itself or expose it via its own resource.