How to avoid leaks when paging in a RESTful web service?

We have a RESTful web service that returns a collection of tickets. Because it's possible for the collection returned to be too large to be processed in a single gulp, we've added offset and limit query parameters. The idea is that we run the query, then skip the first offset records, then return the next limit records.
The problem is that this can leak tickets.
Suppose, for example, that there are eight tickets that need work, at the time the client first queries:
ID STATUS
00 needs work
01 needs work
02 needs work
03 needs work
04 needs work
05 needs work
06 needs work
07 needs work
If the client requests the tickets that need work, with an offset of 0 and a limit of 4, we'll return:
ID STATUS
00 needs work
01 needs work
02 needs work
03 needs work
If someone else then does some work, changing some tickets to:
01 doesn't need work
02 doesn't need work
If the client then requests the tickets that need work, with an offset of 4 and a limit of 4, the results of the query will be:
ID STATUS
00 needs work
03 needs work
04 needs work
05 needs work
06 needs work
07 needs work
And after we skip the first four records, we'll return:
06 needs work
07 needs work
And tickets 04 and 05 will have been skipped.
If we go back to the ticket table on every subsequent paging request, we'll leak tickets whenever tickets on earlier pages have been changed so that they fall out of the query results.
Part of me is wondering how important this is.
The client is going to request the needs work tickets on some sort of schedule. When there are more tickets than the limit, it will then page through the rest in multiple calls, incrementing offset on each call. If we do nothing, we will sometimes leak needs work tickets, but they will be picked up the next time the client requests new needs work tickets.
That is, the leaked tickets will only be leaked on this pass, they'll show up on the next.
But if it is important that we not leak tickets, I don't see any way of resolving it other than saving the identifiers of all of the needs work tickets during the first call, and then paging through the collection of identifiers, rather than through the tickets themselves.
We could, for example, when the client requests needs work tickets with an offset of zero, populate a second table with the ids of all of the tickets that need work, then return the first limit tickets that are in the second table. The next call, we use offset and limit against the second table, to determine which tickets to return.
The problem with this is we need to deal with multiple clients running simultaneously. So we need a primary key on the second table that we can match against a specific client, based on what is in the request.
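Roughly what I have in mind, as a minimal sketch (the token and the in-memory stand-in for the second table are illustrative, not our actual schema):

import uuid

# Hypothetical in-memory stand-in for the "second table": token -> frozen list of ticket IDs.
snapshots = {}

def start_paging(needs_work_ids, limit):
    """First call (offset 0): capture the IDs that need work and return the first page."""
    token = str(uuid.uuid4())                # identifies this client's snapshot
    snapshots[token] = list(needs_work_ids)  # frozen at the time of the first request
    return token, snapshots[token][:limit]

def next_page(token, offset, limit):
    """Later calls page through the frozen ID list, not the live ticket table."""
    return snapshots[token][offset:offset + limit]

# The first call snapshots IDs 00-07; later edits to the ticket table can no
# longer shift which IDs fall on which page, so nothing is leaked.
token, page1 = start_paging(["00", "01", "02", "03", "04", "05", "06", "07"], limit=4)
page2 = next_page(token, offset=4, limit=4)  # ["04", "05", "06", "07"]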
I'd like to be able to manage this without putting additional burden on the client programmers. But I don't see how.
Is there any way for me to tell, by examining a request and its headers, that it came from the same client as an earlier request? I've not been able to find one.
We're currently returning paging information in the response headers:
Paging-offset: 0
Paging-limit: 4
Paging-returnedCount: 4
Paging-totalRecordCount: 54
What I'm thinking is that we might return a Paging-collection value when we're paging, which would provide a key into the second table. We could then require the client to provide the collection value when they make a request with offset != 0.
Does this seem reasonable? Do you think that this would put too great a burden on the client programmers?
How have other people solved this problem? Or do they just ignore it?

Is there any way for me to tell, by examining a request and its headers, that it came from the same client as an earlier request? I've not been able to find one.
You're not supposed to be able to - stateless protocol. In particular, if you are trying to do REST, you want the request to have all of the necessary information so that a new server can answer the request when the original server is busy.
But possibilities include giving each client its own resource to work with. There are a number of different ways you can match the request to the unique resource.
The problem with this is we need to deal with multiple clients running simultaneously.
As a rule, REST works much better if you can provide multiple clients with a common understanding of resources, rather than trying to tailor your representations to each.
Consider: Alice queries for a pile of work, Bob changes something, then Charlie queries for a pile of work. Can you live with it if Charlie gets a representation of the pile that was cached by Alice's query (ie, before Bob's change)? Cuz that's kind of how the web is designed to work....
(It doesn't have to - you can have each response set a bunch of no-cache headers. But it's something you should be thinking about, because it may be trying to tell you that the REST architectural constraints are not a great fit for your problem.)
How have other people solved this problem? Or do they just ignore it?
Well, it's a concurrent modification that's invalidating your iterator, right? Maybe you just pitch a fit and force the client to start over....
You might look into AtomSyndication, and how some services use it.
For your case, I'd probably look at turning the problem around; instead of asking the server for N tickets that have some property, I'd look into asking the server for all tickets in some range that have the property. The client can just keep navigating through ranges until it fills its own bucket.
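To make that concrete, here's a rough sketch of navigating by range rather than by offset (the names and the ID ordering are assumptions for illustration):

# Hypothetical ticket store: id -> status.
tickets = {f"{i:02d}": "needs work" for i in range(8)}

def needs_work_after(last_seen_id, limit):
    """Up to `limit` needs-work tickets with IDs greater than the last one the client saw."""
    matching = sorted(tid for tid, status in tickets.items()
                      if status == "needs work"
                      and (last_seen_id is None or tid > last_seen_id))
    return matching[:limit]

page1 = needs_work_after(None, 4)        # ["00", "01", "02", "03"]
tickets["01"] = tickets["02"] = "done"   # someone else works two tickets in the meantime
page2 = needs_work_after(page1[-1], 4)   # ["04", "05", "06", "07"] - nothing skipped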
Another way of describing your problem is that you are trying to page through a mutable collection.
If you drop the paging constraint -- each request always fetches N unworked tickets starting from the first one -- that's pretty straightforward.
If you drop the mutable constraint -- paging through an immutable list is straightforward. Making a mutable list immutable may be easy -- instead of asking for the latest version of the list, you ask for the version of the list as of some particular point in time. This is a very happy problem to have when using event sourcing.
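For illustration, a toy sketch of the "as of some particular point in time" idea, with a hypothetical status-history structure standing in for however you'd actually record changes (an event-sourced system would rebuild it from its event log):

import bisect

# Hypothetical status history: ticket id -> list of (timestamp, status), oldest first.
history = {
    "01": [(100, "needs work"), (140, "doesn't need work")],
    "02": [(100, "needs work"), (141, "doesn't need work")],
    "04": [(100, "needs work")],
}

def status_as_of(ticket_id, as_of):
    """The status the ticket had at time `as_of` (None if it didn't exist yet)."""
    changes = history[ticket_id]
    idx = bisect.bisect_right([t for t, _ in changes], as_of) - 1
    return changes[idx][1] if idx >= 0 else None

def needs_work_as_of(as_of):
    """The needs-work list as it looked at `as_of`; it never changes, so paging it is safe."""
    return sorted(tid for tid in history if status_as_of(tid, as_of) == "needs work")

print(needs_work_as_of(120))  # ['01', '02', '04'] - unaffected by the later updates
print(needs_work_as_of(150))  # ['04']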
One thing we've discussed is having one query return a list of ticket IDs, which would be small enough to return all of them, all of the time, then have a second query to return a single ticket given its ID.
Another good answer; that's fundamentally the way web pages work -- a (relatively) small payload of HTML, with hyperlinks to the JavaScript, media, and so on that extends the representation.
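A rough sketch of that split, with an illustrative URL template:

# First call returns every needs-work ID plus a link template; the list is small,
# so it is never paged and nothing can fall between pages.
def list_needs_work_ids(tickets):
    return {
        "ids": [tid for tid, status in sorted(tickets.items()) if status == "needs work"],
        "href_template": "/tickets/{id}",  # illustrative; the client follows these one by one
    }

tickets = {"00": "needs work", "01": "doesn't need work", "02": "needs work"}
listing = list_needs_work_ids(tickets)
detail_urls = [listing["href_template"].format(id=tid) for tid in listing["ids"]]
print(detail_urls)  # ['/tickets/00', '/tickets/02']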

Related

How to design a REST API to fetch a large (ephemeral) data stream?

Imagine a request that starts a long running process whose output is a large set of records.
We could start the process with a POST request:
POST /api/v1/long-computation
The output consists of a large sequence of numbered records that must be sent to the client. Since the output is large, the server does not store everything, and instead maintains a window of records with an upper limit on its size. Let's say that it stores up to 1000 records (and pauses computation whenever this many records are available). When the client fetches records, the server may subsequently delete those records and so continue generating more records (as more slots in the 1000-record window are freed).
Let's say we fetch records with:
GET /api/v1/long-computation?ack=213
We can take this to mean that the server should return records starting from index 214. When the server receives this request, it can assume that the (well-behaved) client is acknowledging that records up to number 213 are received by the client and so it deletes them, and then returns records starting from number 214 to whatever is available at that time.
Next if the client requests:
GET /api/v1/long-computation?ack=214
the server would delete record 214 and return records starting from 215.
This seems like a reasonable design until it is noticed that GET requests need to be safe and idempotent (see section 9.1 in the HTTP RFC).
Questions:
Is there a better way to design this API?
Is it OK to keep it as GET even though it appears to violate the standard?
Would it be reasonable to make it a POST request such as:
POST /api/v1/long-computation/truncate-and-fetch?ack=213
One question I always feel needs to be asked is: are you sure that REST is the right approach for this problem? I'm a big fan and proponent of REST, but I try to apply it only to situations where it's applicable.
That being said, I don't think there's anything necessarily wrong with expiring resources after they have been used, but I think it's bad design to re-use the same url over and over again.
Instead, when I call the first set of results (maybe with):
GET /api/v1/long-computation
I'd expect that resource to give me a next link with the next set of results.
Although that particular url design does sort of tell me there's only 1 long-computation on the entire system going on at the same time. If this is not the case, I would also expect a bit more uniqueness in the url design.
The best solution here is to buy a bigger hard drive. I'm assuming you've pushed back and that's not in the cards.
I would consider your operation to be "unsafe" as defined by RFC 7231, so I would suggest not using GET. I would also strongly advise you to not delete records from the server without the client explicitly requesting it. One of the principles REST is built around is that the web is unreliable. Under your design, what happens if a response doesn't make it to the client for whatever reason? If they make another request, any records from the lost response will be destroyed.
I'm going to second Evert's suggestion that if you absolutely must keep this design, you instead pick a technology that's built around reliable delivery of information, such as a message queue. If you're going to stick with REST, you need to allow clients to tell you when it's safe to delete records.
For instance, is it possible to page records? You could do something like:
POST /long-running-operations?recordsPerPage=10
202 Accepted
Location: "/long-running-operations/12"
{
"status": "building next page",
"retry-after-seconds": 120
}
GET /long-running-operations/12
200 OK
{
"status": "next page available",
"current-page": "/pages/123"
}
-- or --
GET /long-running-operations/12
200 OK
{
"status": "building next page",
"retry-after-seconds": 120
}
-- or --
GET /long-running-operations/12
200 OK
{
"status": "complete"
}
GET /pages/123
{
// a page of records
}
DELETE /pages/123
// remove this page so new records can be made
You'll need to cap out page size at the number of records you support. If the client request is smaller than that limit, you can background more records while they process the first page.
That's just spitballing, but maybe you can start there. No promises on quality - this is totally off the top of my head. This approach is a little chatty, but it saves you from returning a 404 if the new page isn't ready yet.
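To make the flow concrete, here's a rough sketch of the client's side of that conversation, with the HTTP layer faked out by plain functions (endpoint names follow the example above):

import time

def fetch_all_pages(get_status, get_page, delete_page, operation_url):
    """Poll the operation, fetch each ready page, and only then ask the server to drop it."""
    records = []
    while True:
        status = get_status(operation_url)           # e.g. GET /long-running-operations/12
        if status["status"] == "complete":
            return records
        if status["status"] == "building next page":
            time.sleep(status["retry-after-seconds"])
            continue
        page_url = status["current-page"]            # e.g. /pages/123
        records.extend(get_page(page_url))           # GET /pages/123
        delete_page(page_url)                        # DELETE /pages/123 - explicit ack

# Fake in-memory "server" so the sketch runs; a real client would issue HTTP calls instead.
pages = {"/pages/123": [1, 2, 3], "/pages/124": [4, 5]}
pending = ["/pages/123", "/pages/124"]

def get_status(url):
    if not pending:
        return {"status": "complete"}
    return {"status": "next page available", "current-page": pending[0]}

def get_page(url):
    return pages[url]

def delete_page(url):
    pending.remove(url)   # a record is only destroyed after the client has it

print(fetch_all_pages(get_status, get_page, delete_page, "/long-running-operations/12"))  # [1, 2, 3, 4, 5]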

Avoid duplicate POSTs with REST

I have been using POST in a REST API to create objects. Every once in a while, the server will create the object, but the client will be disconnected before it receives the 201 Created response. The client only sees a failed POST request, and tries again later, and the server happily creates a duplicate object...
Others must have had this problem, right? But I google around, and everyone just seems to ignore it.
I have 2 solutions:
A) Use PUT instead, and create the (GU)ID on the client.
B) Add a GUID to all objects created on the client, and have the server enforce their UNIQUE-ness.
A doesn't match existing frameworks very well, and B feels like a hack. How does other people solve this, in the real world?
Edit:
With Backbone.js, you can set a GUID as the id when you create an object on the client. When it is saved, Backbone will do a PUT request. Make your REST backend handle PUT to non-existing id's, and you're set.
Another solution that's been proposed for this is POST Once Exactly (POE), in which the server generates single-use POST URIs that, when used more than once, will cause the server to return a 405 response.
The downsides are that 1) the POE draft was allowed to expire without any further progress on standardization, and thus 2) implementing it requires changes to clients to make use of the new POE headers, and extra work by servers to implement the POE semantics.
By googling you can find a few APIs that are using it though.
Another idea I had for solving this problem is that of a conditional POST, which I described and asked for feedback on here.
There seems to be no consensus on the best way to prevent duplicate resource creation in cases where unique URI generation can't be pushed onto the client (so PUT can't be used) and hence POST is needed.
I always use B -- detection of dups due to whatever problem belongs on the server side.
Detection of duplicates is a kludge, and can get very complicated. Genuine distinct but similar requests can arrive at the same time, perhaps because a network connection is restored. And repeat requests can arrive hours or days apart if a network connection drops out.
All of the discussion of identifiers in the other answers has the goal of giving an error in response to duplicate requests, but this will normally just incite a client to get or generate a new id and try again.
A simple and robust pattern to solve this problem is as follows: Server applications should store all responses to unsafe requests, then, if they see a duplicate request, they can repeat the previous response and do nothing else. Do this for all unsafe requests and you will solve a bunch of thorny problems. Repeat DELETE requests will get the original confirmation, not a 404 error. Repeat POSTS do not create duplicates. Repeated updates do not overwrite subsequent changes etc. etc.
"Duplicate" is determined by an application-level id (that serves just to identify the action, not the underlying resource). This can be either a client-generated GUID or a server-generated sequence number. In this second case, a request-response should be dedicated just to exchanging the id. I like this solution because the dedicated step makes clients think they're getting something precious that they need to look after. If they can generate their own identifiers, they're more likely to put this line inside the loop and every bloody request will have a new id.
Using this scheme, all POSTs are empty, and POST is used only for retrieving an action identifier. All PUTs and DELETEs are fully idempotent: successive requests get the same (stored and replayed) response and cause nothing further to happen. The nicest thing about this pattern is its Kung-Fu (Panda) quality. It takes a weakness: the propensity for clients to repeat a request any time they get an unexpected response, and turns it into a force :-)
I have a little google doc here if anyone cares.
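A minimal sketch of the pattern, with an in-memory store standing in for whatever persistence you'd actually use:

stored_responses = {}   # action id -> the response we already sent for it

def handle_unsafe_request(action_id, perform):
    """Replay the stored response for a repeated action id; otherwise do the work once."""
    if action_id in stored_responses:
        return stored_responses[action_id]   # repeat request: same answer, no side effects
    response = perform()                      # first time only: actually mutate state
    stored_responses[action_id] = response
    return response

# Usage: the client got action id "a-42" from a dedicated POST, then retries after a timeout.
created = []
first = handle_unsafe_request("a-42", lambda: created.append("ticket") or {"status": 201})
retry = handle_unsafe_request("a-42", lambda: created.append("ticket") or {"status": 201})
assert first == retry and len(created) == 1   # no duplicate was created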
You could try a two step approach. You request an object to be created, which returns a token. Then in a second request, ask for a status using the token. Until the status is requested using the token, you leave it in a "staged" state.
If the client disconnects after the first request, they won't have the token and the object stays "staged" indefinitely or until you remove it with another process.
If the first request succeeds, you have a valid token and you can grab the created object as many times as you want without it recreating anything.
There's no reason why the token can't be the ID of the object in the data store. You can create the object during the first request. The second request really just updates the "staged" field.
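Roughly, as a sketch (the token and state names are just placeholders):

import uuid

objects = {}  # token -> {"data": ..., "state": "staged" | "active"}

def create_staged(data):
    """Step 1: create the object in a staged state and hand back its token."""
    token = str(uuid.uuid4())
    objects[token] = {"data": data, "state": "staged"}
    return token

def fetch_status(token):
    """Step 2: asking for the status flips it to active; asking again re-creates nothing."""
    objects[token]["state"] = "active"
    return objects[token]

token = create_staged({"name": "widget"})
print(fetch_status(token))
print(fetch_status(token))  # same object both times
# Staged objects whose token was lost in transit can be swept up later by another process.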
Server-issued Identifiers
If you are dealing with the case where it is the server that issues the identifiers, create the object in a temporary, staged state. (This is an inherently non-idempotent operation, so it should be done with POST.) The client then has to do a further operation on it to transfer it from the staged state into the active/preserved state (which might be a PUT of a property of the resource, or a suitable POST to the resource).
Each client ought to be able to GET a list of their resources in the staged state somehow (maybe mixed with other resources) and ought to be able to DELETE resources they've created if they're still just staged. You can also periodically delete staged resources that have been inactive for some time.
You do not need to reveal one client's staged resources to any other client; they need exist globally only after the confirmatory step.
Client-issued Identifiers
The alternative is for the client to issue the identifiers. This is mainly useful where you are modeling something like a filestore, as the names of files are typically significant to user code. In this case, you can use PUT to do the creation of the resource as you can do it all idempotently.
The down-side of this is that clients are able to create IDs, and so you have no control at all over what IDs they use.
There is another variation of this problem. Having the client generate a unique id means we are asking the customer to solve this problem for us. Consider an environment with publicly exposed APIs and hundreds of clients integrating with them. Practically, we have no control over the client code or the correctness of its uniqueness implementation. Hence, it would probably be better to have server-side intelligence for deciding whether a request is a duplicate. One simple approach would be to calculate and store a checksum of every request based on the attributes of the user input, define a time threshold (x minutes), and compare every new request from the same client against the ones received in the past x minutes. If the checksum matches, it could be a duplicate request, and we can add some challenge mechanism for the client to resolve this.
If a client makes two different requests with the same parameters within x minutes, it might be worth ensuring that this is intentional, even if each comes with a unique request id.
This approach may not be suitable for every use case; however, I think it will be useful for cases where the business impact of executing the second call is high and can potentially cost the customer. Consider a payment processing engine where an intermediate layer ends up retrying a failed request, or a customer double-clicks, resulting in the client layer submitting two requests.
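A rough sketch of that server-side check, assuming the relevant request attributes arrive as a simple dict (the window length and the challenge step are placeholders):

import hashlib
import json
import time

recent = {}   # (client_id, checksum) -> time last seen
WINDOW_SECONDS = 5 * 60

def looks_like_duplicate(client_id, attributes, now=None):
    """True if this client sent a request with identical attributes within the window."""
    now = time.time() if now is None else now
    checksum = hashlib.sha256(json.dumps(attributes, sort_keys=True).encode()).hexdigest()
    key = (client_id, checksum)
    seen_at = recent.get(key)
    recent[key] = now
    return seen_at is not None and now - seen_at < WINDOW_SECONDS

payment = {"amount": "19.99", "currency": "USD", "card": "tok_abc"}
assert not looks_like_duplicate("client-1", payment, now=1000)
assert looks_like_duplicate("client-1", payment, now=1030)   # challenge the client here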
Design
Automatic (without the need to maintain a manual black list)
Memory optimized
Disk optimized
Algorithm [solution 1]
REST arrives with UUID
Web server checks if UUID is in Memory cache black list table (if yes, answer 409)
Server writes the request to DB (if it was not filtered by ETS)
DB checks if the UUID is repeated before writing
If yes, answer 409 to the server, and add the UUID to the Memory Cache and disk blacklists
If not repeated, write to DB and answer 200
Algorithm [solution 2]
REST arrives with UUID
Save the UUID in the Memory Cache table (expire for 30 days)
Web server checks if UUID is in Memory Cache black list table [return HTTP 409]
Server writes the request to DB [return HTTP 200]
In solution 2, the Memory Cache blacklist is maintained ONLY in memory, so the DB is never checked for duplicates. The definition of 'duplicate' is "any request that arrives within a given period of time". We also replicate the Memory Cache table to disk, so we can refill it before starting up the server.
In solution 1, there will never be a duplicate, because we always check the disk ONLY once before writing, and if it's a duplicate, subsequent round trips are handled by the Memory Cache. This solution is better for BigQuery, because requests there are not idempotent, but it's also less optimized.
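To illustrate solution 2, here's a stripped-down sketch with a plain dict standing in for the Memory Cache (the disk replication step is omitted):

import time

THIRTY_DAYS = 30 * 24 * 3600
seen_uuids = {}   # uuid -> time it was first accepted

def handle_request(request_uuid, write_to_db, now=None):
    """Return an HTTP-ish status: 409 for a UUID seen within 30 days, else 200 after writing."""
    now = time.time() if now is None else now
    first_seen = seen_uuids.get(request_uuid)
    if first_seen is not None and now - first_seen < THIRTY_DAYS:
        return 409                      # duplicate: never hits the database
    seen_uuids[request_uuid] = now      # blacklist it for subsequent retries
    write_to_db()
    return 200

writes = []
assert handle_request("uuid-1", lambda: writes.append(1), now=0) == 200
assert handle_request("uuid-1", lambda: writes.append(1), now=60) == 409
assert len(writes) == 1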

REST Webservices - GET but for multiple objects

I have already gone through this
How best to design a REST API with multiple filters?
This does help when you have, say, 3 or 4 filtering criteria and you can accommodate them in the query string.
However let's take this example
You want to get call details about 20 telephone numbers, between a certain startdate and enddate.
Now, I do agree that ideally one should make individual queries for each number and then collate all the data on the client side.
However, for certain live systems that would mean 20 rounds of queries against the switches or CDR databases. That is 20 request-response cycles, plus the client having to collate and re-order them by time. At the database level it would have been a simple single query that returns ordered data, to be transformed into a REST XML response that the client can embed in their system.
If we are to use GET, the query string will get really confusing, and it has a length limit as well.
Any suggestions to get around this issue?
Of course we could send a POST request with an XML body containing all the numbers, but using POST for a fetch goes against REST principles.
In the case of GET, use OData queries. For example, when your start and end dates are represented as numbers (Unix time), the URI could look like:
GET http://operatorcalls.com/Calls/Details?$filter=Date le 1342699200 and Date gt 1342526400
What you seem to be missing is an important concept of REST, caching. This can be done, as an example, in the browser, for a single client. Or it can be done as a shared cache between all the clients and the live production system (whatever it may be). Thus reducing queries against a live production system, or in your example, actual switches.
You should really take some time to read Fielding's thesis, and understand that REST is an architectural style.
I found a solution here: Handling multiple parameters in a URI (RESTfully) in Java,
but I'm not quite happy with it.
So in effect we will end up using /cdr?numbers=number1,number2,number3 ...
However, I'm not too pleased with it, as there is a limit on query string length in the URL, and it doesn't really seem like an elegant solution. Has anyone found a solution to this in their own implementation?
Basically: not using POST for this kind of fetch request, and also not using cumbersome and lengthy query strings.
We are using Jersey but also open to using CXF or Spring REST
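For what it's worth, here's a rough sketch of parsing such a /cdr request, independent of Jersey, CXF, or Spring (the parameter names are illustrative):

from urllib.parse import urlparse, parse_qs

def parse_cdr_query(url):
    """Pull the number list and date range out of /cdr?numbers=...&start=...&end=..."""
    params = parse_qs(urlparse(url).query)
    return {
        "numbers": params["numbers"][0].split(","),
        "start": params["start"][0],
        "end": params["end"][0],
    }

query = parse_cdr_query("/cdr?numbers=5551001,5551002,5551003&start=1342526400&end=1342699200")
print(query)
# {'numbers': ['5551001', '5551002', '5551003'], 'start': '1342526400', 'end': '1342699200'}
# One round trip and one database query - but the URL length limit the question worries about remains.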

How to implement robust pagination with a RESTful API when the resultset can change?

I'm implementing a RESTful API which exposes Orders as a resource and supports pagination through the resultset:
GET /orders?start=1&end=30
where the orders to paginate are sorted by ordered_at timestamp, descending. This is basically approach #1 from the SO question Pagination in a REST web application.
If the user requests the second page of orders (GET /orders?start=31&end=60), the server simply re-queries the orders table, sorts by ordered_at DESC again and returns the records in positions 31 to 60.
The problem I have is: what happens if the resultset changes (e.g. a new order is added) while the user is viewing the records? In the case of a new order being added, the user would see the old order #30 in first position on the second page of results (because the same order is now #31). Worse, in the case of a deletion, the user sees the old order #32 in first position on the second page (#31) and wouldn't see the old order #31 (now #30) at all.
I can't see a solution to this without somehow making the RESTful server stateful (urg) or building some pagination intelligence into each client... What are some established techniques for dealing with this?
For completeness: my back-end is implemented in Scala/Spray/Squeryl/Postgres; I'm building two front-end clients, one in backbone.js and the other in Python Django.
The way I'd do it is to make the indices run from old to new, so they never change. Then, when querying without any start parameter, return the newest page. The response should also contain an index indicating which elements are contained, so you can calculate the indices you need to request for the next, older page. While this is not exactly what you want, it seems like the easiest and cleanest solution to me.
Initial request: GET /orders?count=30 returns:
{
"start": 1039,
"count": 30,
... // data
}
From this the consumer calculates that he wants to request:
Next requests: GET /orders?start=1009&count=30 which then returns:
{
"start": 1009,
"count": 30,
... // data
}
Instead of raw indices you could also return a link to the next page:
{
"next": "/orders?start=1009&count=30"
}
This approach breaks if items get inserted or deleted in the middle. In that case you should use some auto incrementing persistent value instead of an index.
The sad truth is that all the sites I see have pagination "broken" in that sense, so there must not be an easy way to achieve that.
A quick workaround could be reversing the ordering, so the position of the items is absolute and unchanging with new additions. From your front page you can give the latest indices to ensure consistent navigation from up there.
Pros: same url gives the same results
Cons: there's no evident way to get the latest elements... Maybe you could use negative indices and redirect the result page to the absolute indices.
With a RESTful API, application state should be in the client. Here the application state should be some sort of timestamp or version number telling when you started looking at the data. On the server side, you will need some form of audit trail, which is properly server data, as it does not depend on whether there have been clients and what they have done. At the very least, it should know when the data last changed. No contradiction with REST here.
You could add a version parameter to your GET. When the client first requires a page, it normally does not send a version. The server's reply contains one. For instance, if there are links in the reply to next/other pages, those links contain &version=... The client should send the version when requiring another page.
When the server receives a request with a version, it should at least know whether the data have changed since the client started looking and, depending on what sort of audit trail you have, how they have changed. If they have not, it answers normally, transmitting the same version number. If they have, it may at least tell the client. And depending on how much it knows about how the data have changed, it may tailor the reply accordingly.
Just as an example, suppose you get a request with start, end, version, and that you know that since version was up to date, 3 rows coming before start have been deleted. You might send a redirect with start-3, end-3, new version.
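A toy illustration of that bookkeeping, where the "audit trail" is reduced to a list of deleted positions per version (everything here is an assumption about what the server might track):

# Hypothetical audit trail: version -> positions deleted since that version was issued.
deletions_since = {7: [2, 15, 40]}
CURRENT_VERSION = 8

def adjust_page(start, end, version):
    """Shift the requested window by the number of earlier rows deleted since `version`."""
    removed_before_start = sum(1 for pos in deletions_since.get(version, []) if pos < start)
    return {
        "start": start - removed_before_start,
        "end": end - removed_before_start,
        "version": CURRENT_VERSION,   # the client sends this back on its next request
    }

print(adjust_page(31, 60, version=7))   # {'start': 29, 'end': 58, 'version': 8}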
WebSockets can do this. You can use something like pusher.com to catch realtime changes to your database and pass the changes to the client. You can then bind different pusher events to work with models and collections.
Just Going to throw it out there. Please feel free to tell me if it's completely wrong and why so.
This approach tries to use a left_off variable to page through the results without using offsets.
Consider that you need your results ordered by the timestamp order_at, descending.
So when I ask for the first result set, it's:
SELECT * FROM Orders ORDER BY order_at DESC LIMIT 25;
right?
This is the case when you ask for the first page (in terms of the URL, probably the request that doesn't have a left_off parameter yet). Requests for later pages take the form:
yoursomething.com/orders?limit=25&left_off=$timestamp
Then, when receiving your data set, just grab the timestamp of the last viewed item, e.g. 2015-12-21 13:00:49.
Now, to request the next 25 items, go to: yoursomething.com/orders?limit=25&left_off=2015-12-21 13:00:49 (the last viewed timestamp).
In SQL you would just make the same query, with a condition that the timestamp is less than $left_off:
SELECT * FROM Orders
WHERE order_at < '2015-12-21 13:00:49'
ORDER BY order_at DESC LIMIT 25;
You should get the next 25 items after the last seen item.
For those who see this answer: please comment on whether this approach is relevant or even possible in the first place. Thank you.
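To show the idea end to end, here's a small SQLite sketch of the same keyset approach (table and column names follow the example above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, order_at TEXT)")
conn.executemany("INSERT INTO Orders (order_at) VALUES (?)",
                 [(f"2015-12-21 13:00:{s:02d}",) for s in range(50)])

def page(left_off, limit=25):
    """Keyset pagination: everything strictly older than the last timestamp the client saw."""
    if left_off is None:
        rows = conn.execute(
            "SELECT order_at FROM Orders ORDER BY order_at DESC LIMIT ?", (limit,))
    else:
        rows = conn.execute(
            "SELECT order_at FROM Orders WHERE order_at < ? ORDER BY order_at DESC LIMIT ?",
            (left_off, limit))
    return [r[0] for r in rows]

first = page(None)
conn.execute("INSERT INTO Orders (order_at) VALUES ('2015-12-21 13:01:00')")  # a new order arrives
second = page(first[-1])
assert set(first).isdisjoint(second)   # the new row shifts nothing; no order is repeated or skipped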

Transient REST Representations

Let's say I have a RESTful, hypertext-driven service that models an ice cream store. To help manage my store better, I want to be able to display a daily report listing quantity and dollar value of each kind of ice cream sold.
It seems like this reporting capability could be exposed as a resource called DailyReport. A DailyReport can be generated quickly, and there doesn't seem to be any advantage to actually storing reports on the server. I only want a DailyReport for some days; other days I don't care about getting one. Furthermore, storing DailyReports on the server would complicate client implementations, which would need to remember to delete reports they no longer need.
A DailyReport is transient; its representation can be retrieved only once. One way to implement this would be to offer a link "/daily-reports", a POST to which will return a response containing a DailyReport representation listing the information for that day's sales.
Edit: Let's also say that I really do want to do a POST request. A DailyReport has many different options for creating a view such as sorting ice cream types alphabetically, by dollar value - or including an hourly breakdown - or optionally including the temperature for that day - or filtering out certain ice cream types (as a list). Rather than using query parameters with a GET, I'd rather POST a DailyReport representation with the appropriate options (using a well-defined custom media type to document each option). The representation I get back would display my options along with the report itself.
Is this the correct way to think about the problem, or should some other approach be used instead? If correct, what special considerations might be important when implementing the DailyReport resource? (For example, it probably wouldn't be appropriate to set the Location header when returning after a POST request).
If you want to make daily reports for past days available, you could implement it as a GET to /daily_reports/2009/08/20. I agree with John Millikin that a POST is unnecessary here - there's no need for something like this to be a user-creatable resource.
The advantage of making the report for each day available as its own URI is cacheability.
EDIT: A good solution might be to merge the two answers, making daily_report/ a no-cache representation of the current day's data and daily_reports/yyyy/mm/dd a cacheable representation of a full day's data.
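As a rough sketch of that merged layout (the paths follow the answers above; the cache-control values and the sales placeholder are my guesses, not something from the original posts):

from datetime import date

def daily_report_response(path, sales=lambda day: {"vanilla": 12, "chocolate": 7}):
    """Dispatch /daily_report vs /daily_reports/yyyy/mm/dd and pick cache headers to match."""
    # `sales` is a placeholder for whatever actually builds the report for a given day.
    if path == "/daily_report":
        # Today's report is still accumulating sales, so tell caches not to hold on to it.
        return {"headers": {"Cache-Control": "no-cache"}, "body": sales(date.today())}
    _, _, y, m, d = path.split("/")   # e.g. /daily_reports/2009/08/20
    # A finished day never changes, so that representation can be cached aggressively.
    return {"headers": {"Cache-Control": "public, max-age=86400"},
            "body": sales(date(int(y), int(m), int(d)))}

print(daily_report_response("/daily_report"))
print(daily_report_response("/daily_reports/2009/08/20"))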
There's no need to use a POST for this, since requesting the report doesn't change the state of the server. I would use a resource like this:
GET /daily-report/
200 OK
Pragma: no-cache
<daily-report for="2009-04-20" generated-at="2009-04-20T12:13:14Z">
<!-- contents of the report here -->
</daily-report>
Responding to your edit: if you are POSTing a description of the report to a URL, and retrieving a temporary data set as a result, that's not REST at all. It's RPC, in the same vein as SOAP. RPC is not an inherently bad thing, but please, please don't call it RESTful.
Sometimes it is desirable to keep a record of requests for reports, in those cases it is not unreasonable to POST to a collection resource. It is also useful for long running reports where you want handle the execution asynchronously. How long the server holds onto those report requests is up to you.
I would do something like
POST /DailyReportRequests
which would return a representation of the request, including options, and when the report is completed, a link to the report results.
Another alternative which is good when you have a set of pre-canned reports is to create a DailyReports resource that contains a list of preconfigured report links. The OpenSearchDescription spec allows you to do something similar to this using the Query tag.
I think Greg's approach is the correct one. To expound upon it, I don't think you should provide a /daily-report resource that changes daily, because running the report on Tuesday at 11:59 would yield different results than running it Wednesday at 00:01, which can be A) confusing for clients expecting the resource to be the same, and B) doesn't allow clients to retrieve a previous day's data after the day has passed. You should provide a unique resource identifier for each daily report that's available, that way clients can access the information they need at any time.