RESTful web requests and user activity tracking websites

RESTful web requests and user activity tracking websites - rest

Someone asked me this question a couple of days ago and I didn't have an answer:
As HTTP is a stateless protocol. When we open www.google.com, can it
be called a REST call?
What I think:
When we make a search on google.com all info is passed through cookie and URL parameters. It looks like a stateless request.
But the search results are not independent of user's past request. Search results are specific to user interest and behavior. Now, it doesn't look like a stateless request.
I know this is an old question and I have read many SO answers like Why HTTP is a stateless protocol? but I am still not able to understand what happens when user activity is tracked like on google or Amazon(recommendations based on past purchases) or any other user activity based recommendation websites.
Is it RESTful or is it RESTless?
What if I want to create a web app in which I use REST architecture and still provide user-specific responses?

HTTP is stateless, however the Google Application Layer is not. The specific Cookies and their meaning is part of the Application Layer.
Consider the same with TCP/IP. IP is a stateless protocol, but TCP isn't. The existence of state in TCP embedded in IP packets does not mean that IP protocol itself has a state.
So does that make it a REST call? No.
Although HTTP is stateless & I would suspect that www.google.com when requested with cookies disabled, the results would be the same for each request, making it almost stateless (Google still probably tracks IP to limit request frequency).
But the Application Layer is not stateless. One of the principles of REST is that the system does not retain state data about about the client between requests for the purpose of modifying the responses. In the case of Google, that clearly is not happening.

It seems that the meaning of "stateless" is being (hypothetically) taken beyond its practical expression.
Consider a web system with no DB at all. You call a (RESTful) API, you always get the exactly the same results. This is perfectly stateless... But this is perfectly not a real system, either.
A real system, in practically every implementation, holds data. Moreover, that data is the "resources" that RESTful API allows us to access. Of course, data changes, due to API calls as well. So, if you get a resource's value, change its value, and then get its value again, you will get a different value than the first read; however, this clearly does not say that the reads themselves were not stateless. They are stateless in the sense that they represent the very same action (or, more exact, resource) for each call. Change has to be manually done, using another RESTful API, to change the resource value, that will then be reflected in the next call.
However, what will be the case if we have a resource that changes without a manual, standard API verb?
For example, suppose that we have a resource that counts the number of times some other resource was accessed. Or some other resource that is being populated from some other third party data. Clearly, this is still a stateless protocol.
Moreover, in some sense, almost any system -- say, any system that includes an authentication mechanism -- responds differently for the same API calls, depending, for example, on the user's privileges. And yet, clearly, RESTful systems are not forbidden to authenticate their users...
In short, stateless systems are stateless for the sake of that protocol. If Google tracks the calls so that if I call the same resource in the same session I will get different answers, then it breaks the stateless requirement. But so long as the returned response is different due to application level data, and are not session related, this requirement is not broken.
AFAIK, what Google does is not necessarily related to sessions. If the same user will run the same search under completely identical conditions (e.g., IP, geographical location, OS, browser, etc.), they will get the very same response. If a new identical search will produce different results due to what Google have "learnt" in the last call, it is still stateless, because -- again -- that second call would have produced the very same result if it was done in another session but under identical conditions.

You should probably start from Fielding's comments on cookies in his thesis, and then review Fielding's further thoughts, published on rest-discuss.
My interpretation of Fielding's thoughts, applied to this question: no, it's not REST. The search results change depending on the state of the cookie header in the request, which is to say that the representation of the resource changes depending on the cookie, which is to say that part of the resource's identifier is captured in the cookie header.
Most of the problems with cookies are due to breaking visibility,
which impacts caching and the hypertext application engine -- Fielding, 2003
As it happens, caching doesn't seem to be a big priority for Google; the representation returned to be included a cache control private header, which restricts the participation by intermediate components.

Related

Is there standard way of making multiple API calls combined into one HTTP request?

While designing rest API's I time to time have challenge to deal with batch operations (e.g. delete or update many entities at once) to reduce overhead of many tcp client connections. And in particular situation problem usually solves by adding custom api method for specific operation (e.g. POST /files/batchDelete which accepts ids at request body) which doesn't look pretty from point of view of rest api design principles but do the job.
But for me general solution for the problem still desirable. Recently I found Google Cloud Storage JSON API batching documentation which for me looks like pretty general solution. I mean similar format may be used for any http api, not just google cloud storage. So my question is - does anybody know kind of general standard (standard or it's draft, guideline, community effort or so) of making multiple API calls combined into one HTTP request?
I'm aware of capabilities of http/2 which include usage of single tcp connection for http requests but my question is addressed to application level. Which in my opinion still make sense because despite of ability to use http/2 taking that on application level seems like the only way to guarantee that for any client including http/1 which is currently the most used version of http.

TL;DR
REST nor HTTP are ideal for batch operations.
Usually caching, which is one of RESTs constraints, which is not optional but mandatory, prevents batch processing in some form.
It might be beneficial to not expose the data to update or remove in batch as own resources but as data elements within a single resource, like a data table in a HTML page. Here updating or removing all or parts of the entries should be straight forward.
If the system in general is write-intensive it is probably better to think of other solutions such as exposing the DB directly to those clients to spare a further level of indirection and complexity.
Utilization of caching may prevent a lot of workload on the server and even spare unnecessary connecctions
To start with, REST nor HTTP are ideal for batch operations. As Jim Webber pointed out the application domain of HTTP is the transfer of documents over the Web. This is what HTTP does and this is what it is good at. However, any business rules we conclude are just a side effect of the document management and we have to come up with solutions to turn this document management side effects to something useful.
As REST is just a generalization of the concepts used in the browsable Web, it is no miracle that the same concepts that apply to Web development also apply to REST development in some form. Thereby a question like how something should be done in REST usually resolves around answering how something should be done on the Web.
As mentioned before, HTTP isn't ideal in terms of batch processing actions. Sure, a GET request may retrieve multiple results, though in reality you obtain one response containing links to further resources. The creation of resources has, according to the HTTP specification, to be indicated with a Location header that points to the newly created resource. POST is defined as an all purpose method that allows to perform tasks according to server-specific semantics. So you could basically use it to create multiple resources at once. However, the HTTP spec clearly lacks support for indicating the creation of multiple resources at once as the Location header may only appear once per response as well as define only one URI in it. So how can a server indicate the creation of multiple resources to the server?
A further indication that HTTP isn't ideal for batch processing is that a URI must reference a single resource. That resource may change over time, though the URI can't ever point to multiple resources at once. The URI itself is, more or less, used as key by caches which store a cacheable response representation for that URI. As a URI may only ever reference one single resource, a cache will also only ever store the representation of one resource for that URI. A cache will invalidate a stored representation for a URI if an unsafe operation is performed on that URI. In case of a DELETE operation, which is by nature unsafe, the representation for the URI the DELETE is performed on will be removed. If you now "redirect" the DELETE operation to remove multiple backing resources at once, how should a cache take notice of that? It only operates on the URI invoked. Hence even when you delete multiple resources in one go via DELETE a cache might still serve clients with outdated information as it simply didn't take notice of the removal yet and its freshness value would still indicate a fresh-enough state. Unless you disable caching by default, which somehow violates one of REST's constraints, or reduce the time period a representation is considered fresh enough to a very low value, clients will probably get served with outdated information. You could of course perform an unsafe operation on each of these URIs then to "clear" the cache, though in that case you could have invoked the DELETE operation on each resource you wanted to batch delete itself to start with.
It gets a bit easier though if the batch of data you want to remove is not explicitly captured via their own resources but as data of a single resource. Think of a data-table on a Web page where you have certain form-elements, such as a checkbox you can click on to mark an entry as delete candidate and then after invoking the submit button send the respective selected elements to the server which performs the removal of these items. Here only the state of one resource is updated and thus a simple POST, PUT or even PATCH operation can be performed on that resource URI. This also goes well with caching as outlined before as only one resource has to be altered, which through the usage of unsafe operations on that URI will automatically lead to an invalidation of any stored representation for the given URI.
The above mentioned usage of form-elements to mark certain elements for removal depends however on the media-type issued. In the case of HTML its forms section specifies the available components and their affordances. An affordance is the knowledge what you can and should do with certain objects. I.e. a button or link may want to be pushed, a text field may expect numeric or alphanumeric input which further may be length limited and so on. Other media types, such as hal-forms, halform or ion, attempt to provide form representations and components for a JSON based notation, however, support for such media-types is still quite limited.
As one of your concerns are the number of client connections to your service, I assume you have a write-intensive scenario as in read-intensive cases caching would probably take away a good chunk of load from your server. I.e. BBC once reported that they could reduce the load on their servers drastically just by introducing a one minute caching interval for recently requested resources. This mainly affected their start page and the linked articles as people clicked on the latest news more often than on old news. On receiving a couple of thousands, if not hundred thousands, request per minute they could, as mentioned before, reduce the number of requests actually reaching the server significantly and therefore take away a huge load on their servers.
Write intensive use-cases however can't take benefit of caching as much as read-intensive cases as the cache would get invalidated quite often and the actual request being forward to the server for processing. If the API is more or less used to perform CRUD operations, as so many "REST" APIs do in reality, it is questionable if it wouldn't be preferable to expose the database directly to the clients. Almost all modern database vendors ship with sophisticated user-right management options and allow to create views that can be exposed to certain users. The "REST API" on top of it basically just adds a further level of indirection and complexity in such a case. By exposing the DB directly, performing batch updates or deletions shouldn't be an issue at all as through the respective query languages support for such operations should already be build into the DB layer.
In regards to the number of connections clients create: HTTP from 1.0 on allows the reusage of connections via the Connection: keep-alive header directive. In HTTP/1.1 persistent connections are used by default if not explicitly requested to close via the respective Connection: close header directive. HTTP/2 introduced full-duplex connections that allow many channels and therefore requests to reuse the same connections at the same time. This is more or less a fix for the connection limitation suggested in RFC 2626 which plenty of Web developers avoided by using CDN and similar stuff. Currently most implementations use a maximum limit of 100 channels and therefore simultaneous downloads via a single connections AFAIK.
Usually opening and closing a connection takes a bit of time and server resources and the more open connections a server has to deal with the more a system may suffer. Though open connections with hardly any traffic aren't a big issue for most servers. While the connection creation was usually considered to be the costly part, through the usage of persistent connections that factor moved now towards the number of requests issued, hence the request for sending out batch-requests, which HTTP is not really made for. Again, as mentioned throughout the post, through the smart utilization of caching plenty of requests may never reach the server at all, if possible. This is probably one of the best optimization strategies to reduce the number of simultaneous requests, as probably plenty of requests might never reach the server at all. Probably the best advice to give is in such a case to have a look at what kind of resources are requested frequently, which requests take up a lot of processing capacity and which ones can easily get responded with by utilizing caching options.

reduce overhead of many tcp client connections
If this is the crux of the issue, the easiest way to solve this is to switch to HTTP/2
In a way, HTTP/2 does exactly what you want. You open 1 connection, and using that collection you can send many HTTP requests in parallel. Unlike batching in a single HTTP request, it's mostly transparent for clients and response and requests can be processed out of order.
Ultimately batching multiple operations in a single HTTP request is always a network hack.
HTTP/2 is widely available. If HTTP/1.1 is still the most used version (this might be true, but gap is closing), this has more to do with servers not yet being set up for it, not clients.

What is a REST API entry point and how is it different from an endpoint?

What is a REST API entry point and how is it different from an endpoint?
I have searched for various definitions online but still can't seem to wrap my head around them (I am new to APIs in general). From what I understand, they provide means of communicating with the server but what are they exactly and how are entry points and endpoints similar or different?

Agree with Roman Vottner here and gave a thumb up. I only want to add a few more links here for anyone trying to get a clear idea.
API Endpoint
I like the answer here: https://smartbear.com/learn/performance-monitoring/api-endpoints/
Simply put, an endpoint is one end of a communication channel. When an API interacts with another system, the touchpoints of this communication are considered endpoints. For APIs, an endpoint can include a URL of a server or service. Each endpoint is the location from which APIs can access the resources they need to carry out their function.
And samples here: What is an Endpoint?
https://example.com/api/login
https://example.com/api/accounts
https://example.com/api/cart/items
API Entry Point
Look here: https://restful-api-design.readthedocs.io/en/latest/urls.html
A RESTful API needs to have one and exactly one entry point. The URL of the entry point needs to be communicated to API clients so that they can find the API. Technically speaking, the entry point can be seen as a singleton resource that exists outside any collection.
So, following the previous example, it would be:
https://example.com/api
Extra note: in GraphQL world, there is a single endpoint for the API, no entry point (https://graphql.org/learn/best-practices/#http). Usually in the form
https://example.com/graphql

Simply speaking an entry point might be something like http://api.your-company.com which a client will enter without any a-priori knowledge. The API will teach the client everything it needs to know in order to make informed choices on what things it could do next.
Regarding Endpoints Wikipedia i.e. state the following:
Endpoint, the entry point to a service, a process, or a queue or topic destination in service-oriented architecture
In a broad sense, the endpoint is just the target host invoked that should process your request (or delegate to some other machines in case of load balancing and what not). In a more narrow sense an endpoint is just the server-sided stuff invoked that is processing your request, i.e. a URI like http://api.your-company.com/users/12345 will ask for a users representation (assuming a GET request). The concrete user is the resource processed while the endpoint might be actually a Spring (or framework of your choice) based service handling all requests targeting everything http://api.your-company.com/users/* related.

Use of HTTP RESTful methods GET/POST/etc. Are they superfluous?

What is the value of RESTful “methods” (ie. GET, POST, PUT, DELETE, PATCH, etc.)?
Why not just make every client use the “GET” method w/ any/all relevant params, headers, requestbodies, JSON,etc. etc.?
On the server side, the response to each method is custom & independently coded!
For example, what difference does it make to issue a database query via GET instead of POST?
I understand that GET is for queries that don’t change the DB (or anything else?).
And POST is for calls that do make changes.
But, near as I can tell, the RESTful standard doesn’t prevent one to code up a server response to GET and issue a stored procedure call that indeed DOES change the DB.
Vice versa… the RESTful standard doesn’t prevent one to code up a server response to POST and issue a stored procedure call that indeed does NOT change the ANYTHING!
I’m not arguing that a midtier (HTTP) “RESTlike” layer is necessary. It clearly is.
Let's say I'm wrong (and I may be). Isn't it still likely that there are numerous REST servers violating the proper use of these protocols suffering ZERO repercussions?
The following do not directly address my questions but merely dance uncomfortably around it like an acidhead stoner at a Dead concert:
Different Models for RESTful GET and POST
RESTful - GET or POST - what to do?
GET vs POST in REST Web Service
PUT vs POST in REST
I just spent ~80 hours trying to communicate a PATCH to my REST server (older Android Java doesn't recognize the newer PATCH so I had to issue a stupid kluge HTTP-OVERIDE-METHOD in the header). A POST would have worked fine but the sysop wouldn't budge because he respects REST.
I just don’t understand why to bother with each individual method. They don't seem to have much impact on Idempotence. They seem to be mere guidelines. And if you "violate" these "guidelines" they give someone else a chance to point a feckless finger at you. But so what?
Aren't these guidelines more trouble than they're worth?
I'm just confused. Please excuse the stridency of my post.
Aren’t REST GET/POST/etc. methods superfluous?

What is the value of RESTful “methods” (ie. GET, POST, PUT, DELETE, PATCH, etc.)?
First, a clarification. Those aren't RESTful methods; those are HTTP methods. The web is a reference implementation (for the most part) of the REST architectural style.
Which means that the authoritative answers to your questions are documented in the HTTP specification.
But, near as I can tell, the RESTful standard doesn’t prevent one to code up a server response to GET and issue a stored procedure call that indeed DOES change the DB.
The HTTP specification designates certain methods as being safe. Casually, this designates that a method is read only; the client is not responsible for any side effects that may occur on the server.
The purpose of distinguishing between safe and unsafe methods is to allow automated retrieval processes (spiders) and cache performance optimization (pre-fetching) to work without fear of causing harm.
But you are right, the HTTP standard doesn't prevent you from changing your database in response to a GET request. In fact, it even calls out specifically a case where you may choose to do that:
a safe request initiated by selecting an advertisement on the Web will often have the side effect of charging an advertising account.
The HTTP specification also designates certain methods as being idempotent
Of the request methods defined by this specification, PUT, DELETE, and safe request methods are idempotent.
The motivation for having idempotent methods? Unreliable networks
Idempotent methods are distinguished because the request can be repeated automatically if a communication failure occurs before the client is able to read the server's response.
Note that the client here might not be the user agent, but an intermediary component (like a reverse proxy) participating in the conversation.
Thus, if I'm writing a user agent, or a component, that needs to talk to your server, and your server conforms to the definition of methods in the HTTP specification, then I don't need to know anything about your application protocol to know how to correctly handle lost messages when the method is GET, PUT, or DELETE.
On the other hand, POST doesn't tell me anything, and since the unacknowledged message may still be on its way to you, it is dangerous to send a duplicate copy of the message.
Isn't it still likely that there are numerous REST servers violating the proper use of these protocols suffering ZERO repercussions?
Absolutely -- remember, the reference implementation of hypermedia is HTML, and HTML doesn't include support PUT or DELETE. If you want to afford a hypermedia control that invokes an unsafe operation, while still conforming to the HTTP and HTML standards, the POST is your only option.
Aren't these guidelines more trouble than they're worth?
Not really? They offer real value in reliability, and the extra complexity they add to the mix is pretty minimal.
I just don’t understand why to bother with each individual method. They don't seem to have much impact on idempotence.
They don't impact it, they communicate it.
The server already knows which of its resources are idempotent receivers. It's the client and the intermediary components that need that information. The HTTP specification gives you the ability to communicate that information for free to any other compliant component.
Using the maximally appropriate method for each request means that you can deploy your solution into a topology of commodity components, and it just works.
Alternatively, you can give up reliable messaging. Or you can write a bunch of custom code in your components to tell them explicitly which of your endpoints are idempotent receivers.
POST vs PATCH
Same song, different verse. If a resource supports OPTIONS, GET, and PATCH, then I can discover everything I need to know to execute a partial update, and I can do so using the same commodity implementation I use everywhere else.
Achieving the same result with POST is a whole lot more work. For instance, you need some mechanism for communicating to the client that POST has partial update semantics, and what media-types are accepted when patching a specific resource.
What do I lose by making each call on the client GET and the server honoring such just by paying attention to the request and not the method?
Conforming user-agents are allowed to assume that GET is safe. If you have side effects (writes) on endpoints accessible via GET, then the agent is allowed to pre-fetch the endpoint as an optimization -- the side effects start firing even though nobody expects it.
If the endpoint isn't an idempotent receiver, then you have to consider that the GET calls can happen more than once.
Furthermore, the user agent and intermediary components are allowed to make assumptions about caching -- requests that you expect to get all the way through to the server don't, because conforming components along the way are permitted to server replies out of their own cache.
To ice the cake, you are introducing another additional risk; undefined behavior.
A payload within a GET request message has no defined semantics; sending a payload body on a GET request might cause some existing implementations to reject the request.
Where I believe you are coming from, though I'm not certain, is more of an RPC point of view. Client sends a message, server responds; so long as both participants in the conversation have a common understanding of the semantics of the message, does it matter if the text in the message says "GET" or "POST" or "PATCH"? Of course not.
RPC is a fantastic choice when it fits the problem you are trying to solve.
But...
RPC at web scale is hard. Can your team deliver that? can your team deliver with cost effectiveness?
On the other hand, HTTP at scale is comparatively simple; there's an enormous ecosystem of goodies, using scalable architectures, that are stable, tested, well understood, and inexpensive. The tires are well and truly kicked.
You and your team hardly have to do anything; a bit of block and tackle to comply with the HTTP standards, and from that point on you can concentrate on delivering business value while you fall into the pit of success.

HATEOAS and server state

Trying to understand REST HATEOAS:
Suppose I have a service that has state; they are: initial, ready, running. I have a client that connects to the service, obtains a page with links that allow it to mutate the service state.
It uses one of the links to change the service's state and obtains another page with new links.
As long as there is 1 client, the state the client holds is identical with the service. But if there is a second client and it changes the service's state, the first client's representation is stale.
How is this resolved in HATEOAS? From what I've read it seems that REST is not applicable and I should maybe look at something else. If so, what?
Thanks!

This is not resolved by HATEOS (entirely). As REST is stateless this is kind of a paradox use case to keep state in client and server aligned.
Assuming I understand your requirements, yes, you're state in client 1 is stale and not the same as the one on the server. But what if the client would make a periodic call to the server to see whether some other client changed it? If so, with HATEOAS you could provide a link to serve the current state and omit the link to change the state.

#Kay - Thanks for answering.
I'm going to try to answer my own question. I realized after reading your answer that the "application" in HATEOAS is really the virtual application the client experiences when it retrieves and processes the resources it gets from the server. Its states are the pages (resources) it transitions between. The server (service?) may have its own state but it's not the same as the client's.
As long as this distinction is kept in mind, it is not unRESTful to have stale links in client 1. The server simply responds with new links reflecting its own state. And the client makes new transitions based on the updated links.
Still trying to understand. If I have it wrong, I'd appreciate some help.
Thanks!

The stateless requirement of REST refers to the ability of the server to understand and process the client request independent from any previous interactions it has made with said client. In other words, the client should be able to send a request "out of the blue" to the server (I.e., without a session saved on said server) and have the request processed. Hence there isn't a concept of login and logout in a purely RESTful architecture.
That's a different constraint than HATEOAS. Basically, "hypermedia as the engine of application state" means that all state is conveyed through the media type being used and not the connection itself. The client can (and often does) keep its own state, and can request snapshots of the state of resources from the server through resource representations (a.k.a media types).
If you want to be notified when a resource changes state, REST is (probably) the wrong choice. You'd likely want to use a different application protocol than, say, HTTP.

As Fielding says: it's not REST without HATEOAS. Don't call your service REST if it's ignore HATEAOS and stateful service can not be REST. You understood HATEOA. The server provides hyperlinks for the client which should be use to change the state located at the CLIENT SIDE.
To solve your problem: omit tend server from any state information. It will easier your life. Then implement REST as using the Richardson Maturity Model while consider information's from here.

A RESTful approach to data synchronization

Assume the following scenario A web application serves up resources through a RESTful API. A number of clients consume this API. The goal is to keep the data on the clients synchronized with the web application (in both directions).
The easiest way to do this is to ask the API if any of the resources have changed since the client last synchronized with the API. This means that the client needs to ask the API for the appropriate resources accompanied by timestamp (to see if the data needs to be updated). This seems to me like the approach with the least overhead in terms of needless consumption of bandwidth.
However, I have the feeling that this approach has a few downsides in terms of design and responsibilities. For example, the API shouldn't have to deal with checking whether the resources are out of date. It seems that the only responsibility of the API should be to serve up the resources when asked without having to deal with the updating aspect. By following this second approach, the client would ask for a lot of data every time it wants to update its data to keep it synchronized with the web application. In other words, the client would check whether the data it got back is newer than the locally stored data. If this process takes place every few minutes, this might become a significant burden for the system.
Am I seeing this correctly or is there a middle road that I am overlooking?

This is a pretty common problem, and a RESTful approach can help you solve it. HTTP (the application protocol typically used to build RESTful services) supports a variety of techniques that can be used to keep API clients in sync with the data on the server side.
If the client receives a Last-Modified or E-Tag header in a HTTP response, it may use that information to make conditional GET calls in the future. This allows the server to quickly indicate with a 304 – Not Modified response that the client’s previously stored representation of the resource is still valid and accurate. This will allow the server (or even better, an intermediate proxy or cache server) to be as efficient as possible in how it responds to the client’s requests, potentially reducing costly round-trips to a back-end data store.
If a response contains a Last-Modified header and the client wishes to take advantage of the performance optimization available with it, they must include an If-Modified-Since directive in a subsequent GET call to the same URI, passing in the same timestamp value it received. This instructs the server to only GET the information from the authoritative back-end source if it knows it has changed since that time. Your server will have to be built to support this technique, of course.
A similar principle applies to E-Tag headers. An E-Tag is a simple hash code representing a specific state of the resource at a particular point in time. If the resource changes in any way, so does its E-Tag value. If the client sees an E-Tag in a response it should pass it in subsequent GET requests to the same URI, thereby allowing the server to quickly determine if the client has an up-to-date representation of the resource.
Finally, you should probably look at the long polling technique to reduce the number of repeated GET requests issued by your clients to the server. In essence, the trick is to issue very long GET requests to the server to watch for server data changes. The GET will not return a response until either the data has changed or the very long timeout fires. If the latter, the client just re-issues the same long-lived request to watch for changes again. See also topics like Comet and Web Sockets which are similar in approach.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse