REST API pagination: make page-size a parameter (configurable from outside)

We have a search/list resource:
http://xxxx/users/?page=1
Internally the page size is static and returns 20 items. The user can move forward by increasing the page number. To be more flexible, we are now also thinking of exposing the page size:
http://xxxx/users/?page=1&size=20
As such this is flexible, as the client can now trade off the number of network calls against the size of each response when searching. Of course this has the drawback that the server could be hit hard, either by accident or maliciously on purpose:
http://xxxx/users/?page=1&size=1000000
For robustness, the solution could be to configure an upper limit on the page size (e.g. 100) and, when it is exceeded, return either an error response or an HTTP redirect to the URL with the highest possible page-size parameter.
What do you think?

Personally, I would simply document a maximum page size, and anything larger than that is simply treated as the maximum.
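A minimal sketch of that clamping approach, assuming the default of 20 and the example limit of 100 from the question. The function name and the decision to fall back to the default on invalid input are assumptions for illustration, not part of any framework:

```python
DEFAULT_PAGE_SIZE = 20   # the static page size mentioned in the question
MAX_PAGE_SIZE = 100      # the example upper limit mentioned in the question


def effective_page_size(raw_size):
    """Turn the raw ?size= query value into the page size actually used."""
    if raw_size is None or raw_size == "":
        return DEFAULT_PAGE_SIZE
    try:
        size = int(raw_size)
    except ValueError:
        # Assumption: unparsable values fall back to the default instead of erroring.
        return DEFAULT_PAGE_SIZE
    if size < 1:
        return DEFAULT_PAGE_SIZE
    # Anything larger than the documented maximum is treated as the maximum.
    return min(size, MAX_PAGE_SIZE)
```

With this, a request for size=1000000 is served exactly like a request for size=100.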

Managing access to resources is always a good idea aka protecting outside interfaces: in other words, put a sensible limit.
The redirect might be a good idea when it comes to development time i.e. when the user of the API gets acquainted with the service but outside of this situation, I doubt there is value.
Make sure the parameters are well documented either way.

Have you tested this to see if it's even a concern? If a user asks for a page size of a million does that really cause all your other requests to stop/slow? If so I might look at your underlying architecture first. But if, in the end, this is an issue I don't think setting a hard limit on page size is bad.
Question: When a user GETs the URI http://xxx/user?page=1, does the response have a link in it to the next page? To the previous page? If not, then it's not really RESTful.
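As a rough illustration of what such links could look like, here is a sketch that builds them for a 1-based page number. The function name, the link names ("self", "prev", "next") and the idea of returning them as a dictionary are assumptions; the page and size parameters are the ones from the question:

```python
from urllib.parse import urlencode


def page_links(base_url, page, size, total_items):
    """Return 'self', 'prev' and 'next' links for a 1-based page number."""
    last_page = max(1, -(-total_items // size))  # ceiling division, at least one page

    def url_for(page_number):
        return f"{base_url}?{urlencode({'page': page_number, 'size': size})}"

    links = {"self": url_for(page)}
    if page > 1:
        links["prev"] = url_for(page - 1)   # omitted on the first page
    if page < last_page:
        links["next"] = url_for(page + 1)   # omitted on the last page
    return links
```

A client that only follows these links never has to construct a page number itself.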

Related

How to manage HATEOAS links when the server is the client?

I'm learning about HATEOAS. The backend server I'm working on will use a third party REST API that uses HATEOAS. That API has an end point to return the url for each resource and also returns the related resource links with regular requests.
But I'm wondering what's a good way to manage these links on the server to avoid hardcoding them. For example if the third party changes the url of the resource, how will the server detect that change? Are there any standard practices for managing HATEOAS resource links?
Possible ways I can think of
When the server starts, get all the resources urls and cache them. Whenever the third party API needs to be called, reuse these cached urls. Whenever there is a 404 or related error, update the resource url. Or update the url periodically in intervals.
Get the resource url each time before calling the end point. Simplest but essentially doubles the number of requests.
But neither sounds like a robust approach.
While discovery is generally a good thing and should allow a HATEOAS system to introduce changes in ways that 'hardcoded urls' don't, if urls start breaking arbitrarily I would still consider this a major issue.
You should be able to store urls / links on your side and have some expectation that those keep working.
There are some mechanisms that deal with changes though:
The server should return 301 / 308 redirects if a resource moved. If this were the case, you should also update your references.
The server can emit Sunset or Deprecation headers. For Sunset, see: https://www.rfc-editor.org/rfc/rfc8594
Those are more general answers, but ultimately the existence of best practices does not mean that vendors will abide by them. With that in mind I think your best bet is to try and find out what the deprecation policy is of your vendor and see what they recommend.
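The redirect mechanism mentioned above could be handled roughly like this. It is only a sketch: the function name and the dictionary of stored links are hypothetical, and it assumes the Location value is an absolute URL:

```python
PERMANENT_REDIRECTS = {301, 308}


def refresh_stored_link(stored_links, rel, status, location=None):
    """Update stored_links[rel] when the vendor answers with a permanent redirect.

    Returns True if the stored URL was replaced, False otherwise.
    """
    if status in PERMANENT_REDIRECTS and location:
        # A permanent redirect means the resource moved: remember the new URL.
        stored_links[rel] = location
        return True
    # Temporary redirects (302, 307) and everything else leave the stored link alone.
    return False
```

Whether the vendor actually sends permanent redirects when URLs change is something only their documentation or deprecation policy can confirm.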
Use a cached resource if it is valid, request a refresh when you don't have a local valid copy.
RFC 7234 (since obsoleted by RFC 9111) defines the caching semantics of HTTP.
Ideally, you don't implement the caching rules yourself, but instead you use a general purpose cache.
In its ideal form, your bespoke implementation is talking to a headless browser, and the headless browser worries about the caching rules for you.
In theory, you need the initial URL to start the process, and everything else comes from that.
Each resource you get from the server should include links to other edges on the graph of service for that resource.
So, once you get the initial resource, all of the rest come automatically.
That said, it's not untoward to have "well known" entry points that are, ideally, unchanging URLs. But in the end, those are just "bookmarks", and not necessarily guaranteed end points.
Consider a shopping site such as Amazon. Outside of amazon.com, you don't know any of their URLs. They're all provided on the various forms and pages, and the human simply navigates the site. Those URLs can be changing all the time, and no one would know. With HATEOAS, it's up to the machine to follow the links, rather than a human. But the process of navigation is the same.
As others have mentioned, the idea of caching a root resource has merit. You then rely on the caching headers to tell you how often you have to refresh the links.
But that said, operationally, there's no difference between following a normal link, and following a cached link. Underneath, the cached resource loads faster, but you still need to "follow the link". Because that's where the caching behavior kicks in. This is different from assuming the link is good, assuming you know the result of a resource lookup. Your application follows the link. Always. The underlying infrastructure is responsible for making it efficient.
So, your code should not, say, load up a root resource, and then stuff a map filled with links, and then assume they're good. Rather, the code should request the root resource, perhaps as a Map of links (datatypes for the win), and let the next layer handle the details. Because it all depends on the type of caching involved. Some have coded durations where no followup is necessary. Others, you make the request anyway, and the server tier responds back "nothing changed", so you can use your local copy, but you're still required to ask in the first place.
Those are implementation details that the SERVER mandates (not the client). It's a server contract. If they want you pinging them each and every time, so be it. That's the contract they're presenting to you and if you want to be a Good Citizen, then you should honor that contract.
Ideally, the server makes good decisions on these kinds of issues for the sake of efficiency, but in the end it's really up to them.
The client has to go along. The client in a HATEOAS system cedes a lot to the server. They're simply not decisions for the client to make.
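To make the "ask anyway, server says nothing changed" case concrete, here is a small in-memory sketch. It only models revalidation with an entity tag (a 304 answer); it does not model freshness lifetimes, and in practice a general purpose HTTP cache should do this work. The fake_fetch function, the RevalidatingCache class and the link names are all hypothetical:

```python
calls = []  # records every request that reaches the fake server


def fake_fetch(url, if_none_match=None):
    """Hypothetical stand-in for an HTTP GET; returns (status, etag, body)."""
    calls.append(url)
    current_etag = '"v1"'
    if if_none_match == current_etag:
        return 304, current_etag, None  # "nothing changed"
    return 200, current_etag, {"users": "/users", "orders": "/orders"}


class RevalidatingCache:
    """The layer below the application: it always asks, but reuses the local copy on 304."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._entries = {}  # url -> (etag, body)

    def get(self, url):
        cached = self._entries.get(url)
        etag = cached[0] if cached else None
        status, new_etag, body = self._fetch(url, etag)
        if status == 304 and cached:
            return cached[1]  # server confirmed the local copy is still good
        if status != 200:
            raise RuntimeError(f"unexpected status {status} for {url}")
        self._entries[url] = (new_etag, body)
        return body
```

The application code just calls cache.get(url) every time it wants the root resource; whether that costs a full transfer or only a revalidation is decided by the server's answer.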

Paginated REST API : How to select amount of data returned?

I know that in most of today's REST APIs, responses have to be paginated.
But I can't find any insight on the web on how to select the ideal size of a batch returned by an API call: should it be 10, 100, 1000? In short: on what factors should we base the size of an API response?
Some people state that it should be based on the number of elements displayed by the UI. I don't agree with this, as not all APIs are directly linked to a UI, and, in any case, modern REST APIs allow choosing the number of items in the output batch with a configurable parameter, up to a certain amount.
So, how could we define the value for this "maximum number of elements returned by an HTTP request"?
Should it be based on the payload size? Internal architecture of the API? Should it be based on performance measurement?
Any insight on this? I'm not really looking for an explicit figure, but more some techniques that could help to find the answer. The best for me would be the process followed by some successful APIs. But, I cannot find any.
My general approach to this is to:
Try to avoid paging
Make REST clients resilient against paging changes, by using next and previous links, instead of using a ?page= attribute. A REST client knows another page is available strictly by the appearance of the link.
I don't use hard rules or measurements to figure out when paging is needed. Paging is generally painful, so my first approach would be to try to figure out what requirements drives the need for paging, and if that requirement can be removed.
Once I've determined it's not possible to remove this requirement in another way, I would set the cut-off of a page as large as reasonable, to reduce the likelihood that clients need to make additional requests.
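A client that relies only on the appearance of a next link could look roughly like this. The response shape (an "items" list plus a "links" dictionary) and the fetch callable are assumptions made for the sketch, not a standard format:

```python
def iter_items(first_url, fetch):
    """Yield every item of a paged collection by following 'next' links."""
    url = first_url
    while url is not None:
        page = fetch(url)
        yield from page["items"]
        # No 'next' link means there is no further page; the client never counts pages.
        url = page.get("links", {}).get("next")


# Hypothetical server responses, keyed by URL. Note the uneven page sizes:
# the server is free to change them without breaking the client.
FAKE_PAGES = {
    "/users": {"items": [1, 2, 3], "links": {"next": "/users?cursor=abc"}},
    "/users?cursor=abc": {"items": [4], "links": {"next": "/users?cursor=def"}},
    "/users?cursor=def": {"items": [5], "links": {}},
}
```

Calling list(iter_items("/users", FAKE_PAGES.__getitem__)) walks all three pages without the client knowing anything about page numbers or sizes.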
If it's a closed API (used only by clients you control), pick whatever the UI wants. It's trivial to change. If clients can choose from several options, you can include a pageSize parameter. Or, better..
If it's an open API (used by clients you don't control), then let clients control what size paging they want. Rather than support a pageNumber parameter, support offset and limit. Offset is how many records to skip before starting to return records, and limit is the maximum number of records to return. If the client is not happy with how their request performs, they can adjust the parameters to suit their needs. Any upper limit your API has should be driven by what your service can handle. It's neither possible nor desirable for you to try to figure out the Magic Maximum Page Size that makes all clients happy and no clients sad.
Also, please note that none of this has anything to do with ReST, which is silent when it comes to paging.
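The offset and limit semantics described above can be sketched over an in-memory list. The function name, the default limit and the MAX_LIMIT value are assumptions; in a real service the upper limit would come from what the service can handle:

```python
MAX_LIMIT = 100  # hypothetical upper limit, driven by what the service can handle


def slice_records(records, offset=0, limit=20):
    """Skip `offset` records, then return at most `limit` records."""
    offset = max(0, offset)
    limit = max(0, min(limit, MAX_LIMIT))
    return records[offset:offset + limit]
```

A client that finds its requests too slow can simply ask for a smaller limit; one that wants fewer round trips can ask for a larger one, up to the cap.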
I usually make a rough performance measurement by hand. I want as few requests as necessary, but do not want to risk timeouts.

Are single use download links in accordance with HTTP spec?

In the context of a restful web service, is it acceptable to have side effects for GET methods?
Single use download links for example
GET /downloads/664d92b3-b373-4dac-a4fb-7a41d015109a
will return 200 and "the thing" and 404 on next request.
The HTTP spec says GET methods should be safe, and according to https://www.rfc-editor.org/rfc/rfc7231#section-4.2.1:
Request methods are considered "safe" if their defined semantics are
essentially read-only; i.e., the client does not request, and does
not expect, any state change on the origin server as a result of
applying a safe method to a target resource.
and
This definition of safe methods does not prevent an implementation
from including behavior that is potentially harmful, that is not
entirely read-only, or that causes side effects while invoking a safe
method. What is important, however, is that the client did not
request that additional behavior and cannot be held accountable for
it.
Several clarifying examples are provided which make me think safe methods are not allowed to purposefully remove the resource.
For example, most servers append request information to access
log files at the completion of every response, regardless of the
method, and that is considered safe even though the log storage might
become full and crash the server.
And
Likewise, a safe request initiated
by selecting an advertisement on the Web will often have the side
effect of charging an advertising account.
And
For example, it is
common for Web-based content editing software to use actions within
query parameters, such as "page?do=delete". If the purpose of such a
resource is to perform an unsafe action, then the resource owner MUST
disable or disallow that action when it is accessed using a safe
request method.
Single use links are obviously a reality. I just wonder whether they're abusing the spec or I just don't get it.
Having an opinion is fine but having worked on these specs and understanding their subtleties would be most convincing.
What you're suggesting is acceptable in some situations, and not necessarily an abuse of the spec.
Firstly, RFC 2616 says regarding safe methods that they:
SHOULD NOT have the significance of taking an action other than
retrieval
And the phrase "SHOULD NOT" is defined as follows (emphasis added):
This phrase, or the phrase "NOT RECOMMENDED" mean that there may
exist valid reasons in particular circumstances when the particular
behavior is acceptable or even useful, but the full implications
should be understood and the case carefully weighed before
implementing any behavior described with this label.
The new version you linked to (which I think supersedes 2616) doesn't use the term "SHOULD NOT" - but they haven't replaced it with "MUST NOT" either. They also acknowledge that side effects are not ruled out as long as the client is not held responsible. So I think the idea of safe methods is the same.
So since the spec acknowledges that there are situations where it's ok, how do we know if yours is such a situation - and more importantly, how do we stay generally within the "spirit" of the spec i.e. make sure we're not abusing it?
I'd refer to this quote from 7231:
The purpose of distinguishing between safe and unsafe methods is to
allow automated retrieval processes (spiders) and cache performance
optimization (pre-fetching) to work without fear of causing harm.
If your app is a private intranet app and you're not concerned with the issues mentioned here, your approach is ok. Put another way: taking into consideration all the possible ways that a GET could happen, are you ok with this side effect?
Working outside RESTful guidelines is not always bad. It's just important to make sure you understand the effect it has.
With all that said, if you are looking for a way to implement reliable, consistent one-time delivery of a resource over HTTP, it's well worth reading Bill de hÓra's HTTPLR spec (http://www.dehora.net/doc/httplr/draft-httplr-01.html). This approach relies on the client acknowledging receipt of the message. You might be able to use something like this to allow user agents that are unaware of the one-use policy (spiders etc.) to GET the resource without causing side effects, while still allowing participating clients to cause the resource to become unavailable after one GET.
A transactional approach like this has the added benefit of allowing the client to re-try the download as often as they need to. This is important because otherwise the server cannot know whether the client successfully received the message or not.
If you really need to enforce the once-only policy from the server side for any possible user agent, then your original approach might be best, but bear in mind it's really an "at most once" policy.
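The acknowledge idea can be illustrated with a small in-memory sketch. This is not HTTPLR itself, only a simplified picture of "GET stays safe, a separate unsafe request removes the resource"; the class, its method names and the choice of status codes are assumptions:

```python
class OneTimeDownloads:
    """Downloads that disappear only after the client acknowledges receipt."""

    def __init__(self):
        self._payloads = {}  # token -> payload

    def publish(self, token, payload):
        self._payloads[token] = payload

    def get(self, token):
        """Models GET: returns (status, body) and never changes server state."""
        if token in self._payloads:
            return 200, self._payloads[token]
        return 404, None

    def acknowledge(self, token):
        """Models an unsafe request (e.g. DELETE or POST) sent after a successful download."""
        if token in self._payloads:
            del self._payloads[token]
            return 204
        return 404
```

A spider or pre-fetcher that only issues GETs cannot consume the download, and a client whose transfer failed can retry until it is ready to acknowledge.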
Sometimes breaking the spec is the only way. An example is web-page visit counters that use a hidden image: it is requested with GET but updates a counter.
However, some things can go wrong. Applications that follow the spec are allowed to presume that making a GET request won't have any side effects. So it is perfectly valid, for example, for some kind of antivirus-enabled email server to follow the links found in an email to make sure everything is safe. If you send this "download-once" link in an email, the recipient might never get to see the content. For the same reason, a yes/no answer with two different links in an email is also hard to deploy. The same applies to web pages: I recall Google crawling the links of a unique-per-user page, known to Google only because there was an analytics script inside, and because the page contained these infamous links with side effects, Google was actually changing the answers of the people who visited it...
Fake hits are not really a problem in the case of the hidden-image counter, since such counters are not considered very reliable anyway, but in the case of the "download-once" link they could be problematic.

Hierarchical RESTful urls still preferred - in terms of added overhead - over flat urls?

Let's say I have a website where users can upload and show their pictures. A RESTful url to a single picture of that user would look like:
http://api.gallery.com/users/{user-id}/images/{image-id}
But the image-id itself is already unique, so this url would already be good enough to get it:
http://api.gallery.com/images/{image-id}
From the REST point of view, the first one would be favorable, but then I would have to verify that this image really belongs to this user, because somebody could alter the URL, changing the user-id to somebody else's. In the latter case I wouldn't need to add this check, meaning less processing time.
Is being RESTful in such a case still preferred?
In short, both are preferred. Both may return the same "thing", but the "context" is different.
Let's take a look at your URLs:
/users: all users
/users/1: user #1
/users/1/images: all user #1's images
/users/1/images/1: user #1's image #1
All of the above URLs revolve around the "user" resource. It's "all users", "a user", "a user's images", etc.
/images: all images
/images/1: image #1
All of the above URLs revolve around the "image" resource. It's "all images" or "an image".
Now, on the surface, that distinction may seem relatively minor, but when building an API, the difference can have a large impact on how data is consumed.
For example, let's say that you want to get a list of all of user #1's images, which is preferred?
/users/1/images
or
/images?where=user.id eq 1
The first represents exactly what we want, is more constrained, and is easier to understand. However, that doesn't mean we shouldn't also support the second form, as the ability to query can be quite useful.
Now, what about if you want to get a list of images, with their associated user?
/users/???
or
/images?include=user
In this scenario, the first URL doesn't make much sense at all, as we are trying to get a list of images, not users, while, the second URL represents exactly what we want.
Now, with regard to security, that should ideally be done in a manner that is completely transparent to the consumer. A consumer should be able to say "I want all images." and only ever receive the images that they have access to. If they try to access a specific resource that they do not have access to, an appropriate HTTP error code should be returned.
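A rough sketch of that transparent filtering, using made-up in-memory data. The "public" flag, the function names and the choice of 403 versus 404 are assumptions for illustration; some APIs prefer 404 in both cases to avoid revealing that a resource exists:

```python
# Hypothetical data: image-id -> owner and visibility.
IMAGES = {
    "1": {"owner": "1", "public": False},
    "2": {"owner": "2", "public": False},
    "3": {"owner": "2", "public": True},
}


def can_view(requester_id, image):
    return image["public"] or image["owner"] == requester_id


def list_images(requester_id):
    """'I want all images' returns only the ids the requester may see."""
    return [image_id for image_id, image in IMAGES.items() if can_view(requester_id, image)]


def get_image(requester_id, image_id):
    """Return (status, image) for a single image request."""
    image = IMAGES.get(image_id)
    if image is None:
        return 404, None
    if not can_view(requester_id, image):
        return 403, None
    return 200, image
```

The same check applies whichever URL shape the request came in on, so the access rule lives in one place.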
I think that the second is more RESTful for the reason you state. The URL is a hierarchy. The user-id is not really part of the identification of the image, so why make it part of the identifier?
Make a /users/{user-id}/images resource that returns a list of URLs in the form /images/{image-id} to list the images that the user has uploaded, and you have the best of both worlds.
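In code, that listing resource could be as small as the following sketch. The ownership table and the function name are hypothetical:

```python
# Hypothetical ownership table: image-id -> user-id.
IMAGE_OWNERS = {"10": "1", "11": "1", "12": "2"}


def user_image_links(user_id):
    """Body of GET /users/{user-id}/images: flat image URLs, not nested ones."""
    return [
        f"/images/{image_id}"
        for image_id, owner in IMAGE_OWNERS.items()
        if owner == user_id
    ]
```

The user-id appears only in the listing URL, while each image keeps a single canonical identifier.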
I think it all depends on semantics and intentions. It seems like you're talking about a protected resource, not a publicly available one. In this case your communication is more explicit and has the least amount of surprise when a more verbose format is used:
http://api.gallery.com/users/{user-id}/images/{image-id}
If it's a public resource then it can be identified by image-id only and then the shorter format would be more logical:
http://api.gallery.com/images/{image-id}

Connectedness & HATEOAS

It is said that in a well-defined RESTful system, the clients only need to know the root URI or a few well-known URIs, and the client shall discover all other links through these initial URIs. I do understand the benefits (decoupled clients) of this approach, but the downside for me is that the client needs to discover the links each time it tries to access something, i.e. given the following hierarchy of resources:
/collection1
collection1
|-sub1
   |-sub1sub1
      |-sub1sub1sub1
         |-sub1sub1sub1sub1
   |-sub1sub2
|-sub2
   |-sub2sub1
   |-sub2sub2
|-sub3
   |-sub3sub1
   |-sub3sub2
If we follow the "client only needs to know the root URI" approach, then a client shall only be aware of the root URI, i.e. /collection1 above, and the rest of the URIs should be discovered by the client through hypermedia links. I find this cumbersome because each time a client needs to do a GET, say on sub1sub1sub1sub1, should the client first do a GET on /collection1, then follow the link defined in the returned representation, and then do several more GETs on sub-resources to reach the desired resource? Or is my understanding of connectedness completely wrong?
Best regards,
Suresh
You will run into this mismatch when you try and build a REST api that does not match the flow of the user agent that is consuming the API.
Consider when you run a client application, the user is always presented with some initial screen. If you match the content and options on this screen with the root representation then the available links and desired transitions will match nicely. As the user selects options on the screen, you can transition to other representations and the client UI should be updated to reflect the new representation.
If you try and model your REST API as some kind of linked data repository and your client UI as an independent set of transitions then you will find HATEOAS quite painful.
Yes, it's right that the client application should traverse the links, but once it's discovered a resource, there's nothing wrong with keeping a reference to that resource and using it for a longer time than one request. If your client has the possibility of remembering things permanently, it can do so.
Consider how a web browser keeps its bookmarks. You probably have maybe ten or a hundred bookmarks in the browser, and you probably found some of these deep in a hierarchy of pages, but the browser dutifully remembers them without needing to remember the path it took to find them.
A richer client application could remember the URI of sub1sub1sub1sub1 and reuse it if it still works. It's likely that it still represents the same thing (it ought to). If it no longer exists or fails for any other client-error reason (4xx), you could retrace your steps to see if you can find a suitable replacement.
And of course what Darrel Miller said :-)
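The bookmark-then-retrace idea could be sketched like this. Everything here is hypothetical: the fetch callable returning (status, document), the "links" dictionary inside each document, and the fake site used to exercise it:

```python
def follow(fetch, root_url, rels):
    """Walk from the root, following one link relation per step; return the final URL."""
    url = root_url
    for rel in rels:
        status, doc = fetch(url)
        if status != 200:
            raise LookupError(f"cannot traverse {url}: status {status}")
        url = doc["links"][rel]
    return url


def get_with_bookmark(fetch, bookmarks, name, root_url, rels):
    """Try the remembered URI first; on a 4xx, rediscover it from the root."""
    url = bookmarks.get(name)
    if url is not None:
        status, doc = fetch(url)
        if status == 200:
            return doc
        if not 400 <= status < 500:
            raise RuntimeError(f"unexpected status {status} for {url}")
    url = follow(fetch, root_url, rels)  # retrace the steps
    bookmarks[name] = url                # remember the rediscovered URI
    status, doc = fetch(url)
    if status != 200:
        raise LookupError(f"rediscovered {url} but got status {status}")
    return doc


# Hypothetical server; the URLs are opaque to the client.
SITE = {
    "/collection1": {"links": {"sub1": "/c1/a"}},
    "/c1/a": {"links": {"sub1sub1": "/c1/a/b"}},
    "/c1/a/b": {"links": {}, "name": "sub1sub1"},
}


def fake_fetch(url):
    doc = SITE.get(url)
    return (200, doc) if doc is not None else (404, None)
```

As long as the bookmark works, reaching the deep resource costs one request; the walk from the root only happens when the remembered URI stops resolving.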
I don't think that that's the strict requirement. From how I understand it, it is legal for a client to access resources directly and start from there. The important thing is that you do not do this for state transitions, i.e. do not automatically proceed with /foo2 after operating on /foo1 and so forth. Retrieving /products/1234 initially to edit it seems perfectly fine. The server could always return, say, a redirect to /shop/products/1234 to remain backwards compatible (which is desirable for search engines, bookmarks and external links as well).