How to implement robust pagination with a RESTful API when the resultset can change? - scala

I'm implementing a RESTful API which exposes Orders as a resource and supports pagination through the resultset:
GET /orders?start=1&end=30
where the orders to paginate are sorted by ordered_at timestamp, descending. This is basically approach #1 from the SO question Pagination in a REST web application.
If the user requests the second page of orders (GET /orders?start=31&end=60), the server simply re-queries the orders table, sorts by ordered_at DESC again and returns the records in positions 31 to 60.
The problem I have is: what happens if the resultset changes (e.g. a new order is added) while the user is viewing the records? In the case of a new order being added, the user would see the old order #30 in first position on the second page of results (because the same order is now #31). Worse, in the case of a deletion, the user sees the old order #32 in first position on the second page (#31) and wouldn't see the old order #31 (now #30) at all.
I can't see a solution to this without somehow making the RESTful server stateful (urg) or building some pagination intelligence into each client... What are some established techniques for dealing with this?
For completeness: my back-end is implemented in Scala/Spray/Squeryl/Postgres; I'm building two front-end clients, one in backbone.js and the other in Python Django.

The way I'd do it, is to make the indices from old to new. So they never change. And then when querying without any start parameter, return the newest page. Also the response should contain an index indicating what elements are contained, so you can calculate the indices you need to request for the next older page. While this is not exactly what you want, it seems like the easiest and cleanest solution to me.
Initial request: GET /orders?count=30 returns:
{
"start"=1039;
"count"=30;
...//data
}
From this the consumer calculates that he wants to request:
Next requests: GET /orders?start=1009&count=30 which then returns:
{
"start"=1009;
"count"=30;
...//data
}
Instead of raw indices you could also return a link to the next page:
{
"next"="/orders?start=1009&count=30";
}
This approach breaks if items get inserted or deleted in the middle. In that case you should use some auto incrementing persistent value instead of an index.

The sad truth is that all the sites I see have pagination "broken" in that sense, so there must not be an easy way to achieve that.
A quick workaround could be reversing the ordering, so the position of the items is absolute and unchanging with new additions. From your front page you can give the latest indices to ensure consistent navigation from up there.
Pros: same url gives the same results
Cons: there's no evident way to get the latest elements... Maybe you could use negative indices and redirect the result page to the absolute indices.

With a RESTFUL API, Application state should be in the client. Here the application state should some sort of time stamp or version number telling when you started looking at the data. On the server side, you will need some form of audit trail, which is properly server data, as it does not depend on whether there have been clients and what they have done. At the very least, it should know when the data last changed. No contradiction with REST here.
You could add a version parameter to your get. When the client first requires a page, it normally does not send a version. The server replies contains one. For instance, if there are links in the reply to next/other pages, those links contains &version=... The client should send the version when requiring another page.
When the server recieves some request with a version, it should at least know whether the data have changed since the client started looking and, dependending of what sort of audit trail you have, how they have changed. If they have not, it answer normally, transmitting the same version number. If they have, it may at least tell the client. And depending how much it knows on how the data have changed, it may taylor the reply accordingly.
Just as an example, suppose you get a request with start, end, version, and that you know that since version was up to date, 3 rows coming before start have been deleted. You might send a redirect with start-3, end-3, new version.

WebSockets can do this. You can use something like pusher.com to catch realtime changes to your database and pass the changes to the client. You can then bind different pusher events to work with models and collections.

Just Going to throw it out there. Please feel free to tell me if it's completely wrong and why so.
This approach is trying to use a left_off variable to sort through without using offsets.
Consider you need to make your result Ordered by timestamp order_at DESC.
So when I ask for first result set
it's
SELECT * FROM Orders ORDER BY order_at DESC LIMIT 25;
right?
This is the case when you ask for the first page (in terms of URL probably the request that doesn't have any
yoursomething.com/orders?limit=25&left_off=$timestamp
Then When receiving your data set. just grab the timestamp of last viewed item. 2015-12-21 13:00:49
Now to Request next 25 items go to: yoursomething.com/orders?limit=25&left_off=2015-12-21 13:00:49 (to lastly viewed timestamp)
In Sql you would just make the same query and say where timestamp is equal or less than $left_off
SELECT * FROM (SELECT * FROM Orders ORDER BY order_at DESC) as a
WHERE a.order_at < '2015-12-21 13:00:49' LIMIT 25;
You should get a next 25 items from the last seen item.
For those who sees this answer. Please comment if this approach is relevant or even possible in the first place. Thank you.

Related

Flask:Confused over using PUT/POST in Rest API

Respected Elders/Programmers/Learned-people,
I have read a very popular answer here about using PUT or POST, but I couldn't decipher from it as to what is the correct way to do so. Almost every answer has comments saying this is wrong/this is right. Mighty confused here.
My requirements:
Sending 2 Json files to the server, one to be inserted in Database and other to be updated. I thought I would use PUT to update, and POST to Insert into the database. That way, on the client side itself I would decide whether to insert or update.
Confusion: Since, the client alone is responsible for its data to be created/updated on the server, even POST in my case upon being repeated would insert the same thing (insert into table values) over and over, behaving as idempotent OR would give an error (Because of primary key conflict). Finally, it would not create something new upon firing it twice.
Question: Is it correct to use PUT for updation, and POST for insertion?
PUT can also be used for creating. What's important is the url. This is generally the accepted pattern:
PUT /collection/1234 <- Update a specific item OR create it
POST /collection <- Add a new item to a collection
Which one is right for you depends on a few things. Does the server determine the url of the new item, or does the client?
If the client can figure out what the url of a new item becomes, using PUT might be better because you can more easily turn it into an idempotent request.
Remember that with a PUT request the intent is always that you are replacing the resource at the target url with a new state.
However, if the server creates the url pattern (maybe you have an auto-incrementing id), POST is better. POST doesn't have to be on the parent collection but it's common.
If you want POST and want idempotence, you need some other way to figure out something was a repeated request. You get that for free with PUT. For example, the Stripe API solves this by adding a non-standard Idempotency-Key header.

REST API Get single latest resource

I'm designing a REST api and interested if anyone can help with best practice in the following scenario.
I have...
GET Customers/{customerId}/Orders - to get all customer orders
GET Customers/{customerId}/Orders/{orderId} - to get a particular order
I need to provide the ability to get a customers most recent order. What is best practice in this scenario? Simply get all and sort by date or provide a specific method?
I need to provide the ability to get a customers most recent order.
Of course you could provide query parameters to filter, sort and slice the orders collection, but why not making it simpler and give the latest order if the client needs it?
You could use something like (returning a representation of a single order):
GET /customers/{customerId}/orders/latest
The above URL will map an order that will change over the time and it's perfectly fine.
Say there is also a case where you need last 5 orders. How would your route(s) look like?
The above approach focus on the ability to get a customers most recent order requirement. If returning the last 5 orders requirement eventually comes up after some time, I would probably introduce another mapping such as /recent that returns a representation of a collection with the recent orders and accepts a query parameter that indicates the amount of orders to be returned (5 would be the default value if the parameter is omitted).
The /latest mapping would still be valid and would return a representation of the very latest order only.
Providing query parameters to filter, sort and slice the orders collection is still a valid approach.
The key is: If you know the client who will consume the API, target it to their needs. Otherwise, make it more generic. And when modifying the API, be careful with breaking changes and versioning the API is also welcome.
I think there is no need for another route.
Pass something like &order=-created_at&limit=1 in your get request
Or &order=created_at&orderby=DESC&limit=1 (note I'm not sure about naming your params so maybe you could use &count=1 instead of &limit=1, ditto order params)
I think it also depends whether you are using pagination or not on that route, so perhaps additional params are required
Customers/{customerId}/Orders?order=-created_at&limit=1
The Github API for the similar use case is using latest, to fetch the single resource which is latest.
https://docs.github.com/en/rest/reference/repos#get-the-latest-release
So to fetch a single resource which is latest you can use.
GET /customers/{customerId}/orders/latest
However would like to know what community think about this.
IMO the resource/latest gives an impression that the response will be a list of resource sorted by latest to oldest.

How to optimize collection subscription in Meteor?

I'm working on a filtered live search module with Meteor.js.
Usecase & problem:
A user wants to do a search through all the users to find friends. But I cannot afford for each user to ask the complete users collection. The user filter the search using checkboxes. I'd like to subscribe to the matched users. What is the best way to do it ?
I guess it would be better to create the query client-side, then send it the the method to get back the desired set of users. But, I wonder : when the filtering criteria changes, does the new subscription erase all of the old one ? Because, if I do a first search which return me [usr1, usr3, usr5], and after that a search that return me [usr2, usr4], the best would be to keep the first set and simply add the new one to it on the client-side suscribed collection.
And, in addition, if then I do a third research wich should return me [usr1, usr3, usr2, usr4], the autorunned subscription would not send me anything as I already have the whole result set in my collection.
The goal is to spare processing and data transfer from the server.
I have some ideas, but I haven't coded enough of it yet to share it in a easily comprehensive way.
How would you advice me to do to be the more relevant possible in term of time and performance saving ?
Thanks you all.
David
It depends on your application, but you'll probably send a non-empty string to a publisher which uses that string to search the users collection for matching names. For example:
Meteor.publish('usersByName', function(search) {
check(search, String);
// make sure the user is logged in and that search is sufficiently long
if (!(this.userId && search.length > 2))
return [];
// search by case insensitive regular expression
var selector = {username: new RegExp(search, 'i')};
// only publish the necessary fields
var options = {fields: {username: 1}};
return Meteor.users.find(selector, options);
});
Also see common mistakes for why we limit the fields.
performance
Meteor is clever enough to keep track of the current document set that each client has for each publisher. When the publisher reruns, it knows to only send the difference between the sets. So the situation you described above is already taken care of for you.
If you were subscribed for users: 1,2,3
Then you restarted the subscription for users 2,3,4
The server would send a removed message for 1 and an added message for 4.
Note this will not happen if you stopped the subscription prior to rerunning it.
To my knowledge, there isn't a way to avoid removed messages when modifying the parameters for a single subscription. I can think of two possible (but tricky) alternatives:
Accumulate the intersection of all prior search queries and use that when subscribing. For example, if a user searched for {height: 5} and then searched for {eyes: 'blue'} you could subscribe with {height: 5, eyes: 'blue'}. This may be hard to implement on the client, but it should accomplish what you want with the minimum network traffic.
Accumulate active subscriptions. Rather than modifying the existing subscription each time the user modifies the search, start a new subscription for the new set of documents, and push the subscription handle to an array. When the template is destroyed, you'll need to iterate through all of the handles and call stop() on them. This should work, but it will consume more resources (both network and server memory + CPU).
Before attempting either of these solutions, I'd recommend benchmarking the worst case scenario without using them. My main concern is that without fairly tight controls, you could end up publishing the entire users collection after successive searches.
If you want to go easy on your server, you'll want to send as little data to the client as possible. That means every document you send to the client that is NOT a friend is waste. So let's eliminate all that waste.
Collect your filters (eg filters = {sex: 'Male', state: 'Oregon'}). Then call a method to search based on your filter (eg Users.find(filters). Additionally, you can run your own proprietary ranking algorithm to determine the % chance that a person is a friend. Maybe base it off of distance from ip address (or from phone GPS history), mutual friends, etc. This will pay dividends in efficiency in a bit. Index things like GPS coords or other highly unique attributes, maybe try out composite indexes. But remember more indexes means slower writes.
Now you've got a cursor with all possible friends, ranked from most likely to least likely.
Next, change your subscription to match those friends, but put a limit:20 on there. Also, only send over the fields you need. That way, if a user wants to skip this step, you only wasted sending 20 partial docs over the wire. Then, have an infinite scroll or 'load more' button the user can click. When they load more, it's an additive subscription, so it's not resending duplicate info. Discover Meteor describes this pattern in great detail, so I won't.
After a few clicks/scrolls, the user won't find any more friends (because you were smart & sorted them) so they will stop trying & move on to the next step. If you returned 200 possible friends & they stop trying after 60, you just saved 140 docs from going through the pipeline. There's your efficiency.

REST Design: What Http verb should be used to retrieve a dynamic resource?

I have a scenario in which I have REST API which manages a Resource which we will call Group. A Group contains members and the group resource is dynamic - whenever you retrieve it, you get the latest data (so a query must run server side to update the number of members in a group - in other words, the result of the request is to modify the data, since the results of running the query are stored).
Given a *group_id* it should return a minimal amount of information like
{
group_id: "5t7yu8i9io0op",
group_name: "That's my name",
size: 34
}
So a GET to this resource causes the resource to change, since a subsequent GET could return a new value for 'size'. This tells me it is not idempotent and so you should use POST to retrieve this resource. Am I correct in this conclusion?
If I am correct, do you think it is advisable to also provide a GET method that only returns the currently stored data for the group (eg. so the size could be out of date, even the name too). I suppose in this case I should return a last-modified date as one of the fields so that the user knows how up-to-date the resource is and can then elect to use the POST method...but then I am left wondering why would anyone do that, so why not ONLY provide the POST method and forget about GET?
Confused I am!
Thanks in advance.
[EDIT]
#Satish posted a link in his/her answer to the HTTP specs. In section 9.1.1. it ends with this sentence:
Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.
So in my scenario, the requester does not really care about the side-effect that the value for 'size' is recomputed as a direct result of making the request. They want the group information and it just so happens that to provide accurate, up-to-date group data, the size query must be run in order to update that value. Whilst making the request causes data to change implies this should be a POST, the user did not request that side-effect and so therefore a GET request would be acceptable and more intuitive, would it not? And therefore still be restful according to this sentence.
[2nd EDIT]
#Satish asks a very important question in the comments. So for others who read this I'll explain further about this problem:
Normally you would not run the group query to update its size from a REST request. As members are added or removed from a group, you would update the computed size of that group, store it and then a simple GET request would always return the correct size. However, our situation is more complicated in that a group is only stored as a query definition in ElasticSearch (kind of like a view in an RDBMS). Members do not get added/removed to and from groups. They get added to a much larger set of data (a collection in MongoDB). There are hundreds, potentially thousands, of different 'group definitions' so it is not practical to recompute size for every group when the collection changes. We cannot know when an item is added/removed to/from the collection which groups might change size - you only know by running the group definition who is in that group and what the size is. I hope that clears things up. :)
You should use GET. Even if dynamic resource is changing, you did not request for that change through your request and you are not accountable for that change. Ref: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
In your case when you do a GET you retrieve some information about the Group. You don't modify the group structure. Ok, the group can be changed by an external entity so your next GET may bring you another data. Am I right? Who modifies the structure of the group and when?
So you should use GETbecause the resource it will be modified from somewhere else and not by your call that tries to do a read operation.
EDIT
After your edited the question I just want to add that I agree about the side effects.
It matters if you sent data or a change command explicitly to the server or you just read something and you don't have to pay attention for what the server side is doing to gave you the response. More intuitively:
GET - Requests data from a specified resource
POST - Submits data to be processed to a specified resource
It is the combination of GET and POST. So you should use POST.
Refer : http://adarshdchaurasia.wordpress.com/2013/09/26/http-get-vs-post/
You should not use GET because if you use GET method then search engines may cache the responses. It may cause unintentional data update at your server side, which you do not want. GET method is meant to return content without updating anything on server. POST is meant to updated the things at server and return result against that operation.

Can response data from core reporting api be grouped?

Explanation:
I am able to query the Google Core reporting APIv3 using the client library to get data on pageviews for specific URLs of a website I am working on. I want to get data(pageviews) for each day within a specified range. So far I am simply looping through the range, sending individual request to the API. in each request I am setting the same value for the start date and the end date.
Problem:
Obviously this gets the job done, BUT it is certainly not the best way to go about it. Because, assumming I want to get data for the past 3 months for each of about 2000 URIs. Then I will need 360000 number of requests and that value is well over the limit quota defined by Google.
Potential solution: So one way I thought of solving this issue is probably to send a request setting start-date and end-date to be a week apart but the API will return a sum of the values rather than the individual values.
main question: So is there a way to insist that these values should not be added up and returned as a sum but rather returned (as associative array or something like that) separately for each.
I hope the question is clear and that there is a solution! Thank you!
Very straightforward:
Metric: ga:pageview, Dimension: ga:date, Set a filter for your pagepath, and set a start-date and end-date.
Example:
https://www.googleapis.com/analytics/v3/data/ga?ids=ga%3Axxyyzz&dimensions=ga%3Adate&metrics=ga%3Apageviews&filters=ga%3Apagepath%3D%3D%2Ffaq.html&start-date=2013-06-27&end-date=2013-07-11&max-results=50
This will return the pageviews for that the faq.html& page for each day in the time-frame.
You should check out the QueryExplorer. Great tool to find out how to structure queries.