REST API: Metadata goes to DB, file to storage. To proxy or not to proxy through API end-point?

I'm currently planning a REST-style API. The problem I have is that the client will send one or more files, belonging to the same "document", but while the metadata is to be stored in a DB, the files are going to file storage (probably S3, in my case).
The way I see it, there are two ways of doing it:
1. Send the metadata to the API end-point, which responds with the location for storing the files. And then, in a separate request, store the files directly.
2. Send metadata and files, in the same request, to the API, which acts as a proxy and takes care of sending the various parts to their final destinations.
The good thing about 1. is that the API server will have less to deal with, so can be smaller, and bandwidth is only paid once (client -> storage). Giving a good UX is, on the other hand, likely to be harder, and there will be more state to keep track of.
With 2. it's easy to ensure the transaction is atomic, since the API server is the sole gatekeeper. However, the server will need to be more powerful, and bandwidth may be paid twice (client -> API -> storage).
So, what's the best way of dealing with this situation, and if I go with 1., are there any problems to look out for?
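The extra state that option 1 brings with it can be sketched roughly as follows. This is only an in-memory illustration, not a real S3 integration: DocumentApi, create_document, confirm_upload and the storage host are hypothetical names, and in a real setup the returned locations would probably be something like pre-signed upload URLs.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Document:
    metadata: dict
    expected_files: set
    uploaded_files: set = field(default_factory=set)

    @property
    def status(self) -> str:
        # A document only counts as complete once every file has arrived.
        if self.uploaded_files == self.expected_files:
            return "complete"
        return "pending"


class DocumentApi:
    """In-memory stand-in for the API server plus its metadata DB."""

    def __init__(self, storage_base: str = "https://storage.example.invalid"):
        self.storage_base = storage_base  # hypothetical storage host
        self.documents = {}

    def create_document(self, metadata: dict, filenames: list) -> dict:
        """Step 1: store the metadata, answer with one upload location per file."""
        doc_id = uuid.uuid4().hex
        self.documents[doc_id] = Document(metadata, set(filenames))
        locations = {name: f"{self.storage_base}/{doc_id}/{name}" for name in filenames}
        return {"id": doc_id, "status": "pending", "upload_to": locations}

    def confirm_upload(self, doc_id: str, filename: str) -> str:
        """Step 2: the client (or a storage event) reports a finished upload."""
        doc = self.documents[doc_id]
        if filename not in doc.expected_files:
            raise KeyError(f"unexpected file: {filename}")
        doc.uploaded_files.add(filename)
        return doc.status
```

The "pending" status is the state that has to be tracked, and eventually cleaned up, when a client never finishes its uploads.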

Assuming you have external clients, I believe that #2 is the better bet. The way to catch and keep clients is to have the best possible UX, with a simple interface that is easy to learn and use. As you said, you also get to keep atomic transactions, which will save you plenty of headaches. In my experience, server power is relatively cheap, and if forwarding the files to storage takes a while, you can always send a 202 (Accepted) back to the client instead of a 201 (Created).

Related

How Caching Is Done Under The Hood in REST APIs?

One of the properties of REST APIs is cacheability. I want to understand how caching is done. Is it on the client side (say, in an API client such as Postman or Insomnia), on the server side, or both?
Suppose a resource is accessed as
GET /services/data/{api_version}/{product_tag}/{resource}/{id} and
we get a response.
If we again trigger the same endpoint call almost instantly, we get another response.
Assuming the API cached the response on the first call, there are two scenarios:
Data did not change in between two calls. In that case, caching gives correct result.
Data did change between calls. If client relies on cache, stale data is served to user.
How does the client determine that the data changed and serve the latest result? Is it something like setting a dirty bit, as we do in operating systems?
I know cache invalidation is one of the toughest problems in computer science and depends on the scenario, but in general:
What should be cached on the client side and what on the server side? A cache kept by Postman cannot be used by Insomnia.
How can I always serve the latest data while using the cache to its fullest?
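One common mechanism behind the "fresh or stale" decision is an expiry time: the response says how long it may be reused, and the cache only goes back to the origin once that time has passed. Below is a minimal sketch of that idea, assuming a fetch callable that returns the body together with a max-age in seconds. FreshnessCache and its parameters are made-up names for illustration, not part of Postman, Insomnia or any library.

```python
import time


class FreshnessCache:
    """Tiny client-side cache keyed by URL, honouring a max-age in seconds."""

    def __init__(self, fetch, clock=time.monotonic):
        self.fetch = fetch    # callable(url) -> (body, max_age_seconds)
        self.clock = clock    # injectable so the behaviour can be tested
        self.entries = {}     # url -> (body, expires_at)

    def get(self, url):
        entry = self.entries.get(url)
        if entry is not None and self.clock() < entry[1]:
            return entry[0]                  # still fresh: no request is sent
        body, max_age = self.fetch(url)      # missing or stale: ask the origin
        self.entries[url] = (body, self.clock() + max_age)
        return body
```

With this scheme a client can be served stale data for at most max-age seconds, which is the trade-off the second scenario above describes.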

Is there standard way of making multiple API calls combined into one HTTP request?

While designing REST APIs I run into batch operations from time to time (e.g. deleting or updating many entities at once), where the goal is to reduce the overhead of many TCP client connections. In a particular situation the problem is usually solved by adding a custom API method for the specific operation (e.g. POST /files/batchDelete, which accepts the ids in the request body). That doesn't look pretty from the point of view of REST API design principles, but it does the job.
But a general solution to the problem is still desirable to me. Recently I found the Google Cloud Storage JSON API batching documentation, which looks like a pretty general solution: a similar format could be used for any HTTP API, not just Google Cloud Storage. So my question is: does anybody know of a general standard (a standard or a draft of one, a guideline, a community effort or similar) for combining multiple API calls into one HTTP request?
I'm aware of the capabilities of HTTP/2, which include using a single TCP connection for many HTTP requests, but my question is addressed at the application level. In my opinion that still makes sense because, despite the ability to use HTTP/2, handling this at the application level seems like the only way to guarantee it for any client, including HTTP/1 clients, and HTTP/1 is currently the most used version of HTTP.
TL;DR
Neither REST nor HTTP is ideal for batch operations.
Usually caching, which is one of REST's constraints and is not optional but mandatory, prevents batch processing in some form.
It might be beneficial not to expose the data to update or remove in batch as resources of their own but as data elements within a single resource, like a data table in an HTML page. Here updating or removing all or part of the entries should be straightforward.
If the system in general is write-intensive, it is probably better to think of other solutions, such as exposing the DB directly to those clients to spare a further level of indirection and complexity.
Utilization of caching may take a lot of workload off the server and even spare unnecessary connections.
To start with, neither REST nor HTTP is ideal for batch operations. As Jim Webber pointed out, the application domain of HTTP is the transfer of documents over the Web. This is what HTTP does and this is what it is good at. However, any business rules we conclude are just a side effect of the document management, and we have to come up with solutions to turn these document-management side effects into something useful.
As REST is just a generalization of the concepts used in the browsable Web, it is no surprise that the same concepts that apply to Web development also apply to REST development in some form. A question like how something should be done in REST therefore usually revolves around answering how something should be done on the Web.
As mentioned before, HTTP isn't ideal for batch processing. Sure, a GET request may retrieve multiple results, though in reality you obtain one response containing links to further resources. According to the HTTP specification, the creation of a resource has to be indicated with a Location header that points to the newly created resource. POST is defined as an all-purpose method that allows tasks to be performed according to server-specific semantics, so you could basically use it to create multiple resources at once. However, the HTTP spec clearly lacks support for indicating the creation of multiple resources at once, as the Location header may only appear once per response and may only define one URI. So how can a server indicate the creation of multiple resources to the client?
A further indication that HTTP isn't ideal for batch processing is that a URI must reference a single resource. That resource may change over time, though the URI can't ever point to multiple resources at once. The URI itself is, more or less, used as key by caches which store a cacheable response representation for that URI. As a URI may only ever reference one single resource, a cache will also only ever store the representation of one resource for that URI. A cache will invalidate a stored representation for a URI if an unsafe operation is performed on that URI. In case of a DELETE operation, which is by nature unsafe, the representation for the URI the DELETE is performed on will be removed. If you now "redirect" the DELETE operation to remove multiple backing resources at once, how should a cache take notice of that? It only operates on the URI invoked. Hence even when you delete multiple resources in one go via DELETE a cache might still serve clients with outdated information as it simply didn't take notice of the removal yet and its freshness value would still indicate a fresh-enough state. Unless you disable caching by default, which somehow violates one of REST's constraints, or reduce the time period a representation is considered fresh enough to a very low value, clients will probably get served with outdated information. You could of course perform an unsafe operation on each of these URIs then to "clear" the cache, though in that case you could have invoked the DELETE operation on each resource you wanted to batch delete itself to start with.
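The stale-cache effect described above can be reproduced with a toy model. This is a deliberately simplified sketch: UriCache and Origin are hypothetical classes, freshness lifetimes are ignored, and a DELETE on /files stands in for whatever batch endpoint removes several backing resources at once.

```python
class UriCache:
    """Cache that only knows about the URI each request targets."""

    def __init__(self):
        self.stored = {}

    def handle(self, method, uri, origin):
        if method == "GET":
            if uri not in self.stored:
                self.stored[uri] = origin.get(uri)
            return self.stored[uri]
        # Unsafe method: invalidate the entry for this URI only, then forward.
        self.stored.pop(uri, None)
        return origin.unsafe(method, uri)


class Origin:
    """Hypothetical origin server holding two file resources."""

    def __init__(self):
        self.files = {"/files/1": "one", "/files/2": "two"}

    def get(self, uri):
        return self.files.get(uri, "404")

    def unsafe(self, method, uri):
        if method == "DELETE" and uri == "/files":
            self.files.clear()           # batch delete of every file
        elif method == "DELETE":
            self.files.pop(uri, None)    # delete of a single resource
        return "204"
```

After a GET on /files/1 followed by the batch DELETE on /files, the cache keeps answering GET /files/1 with the old representation, because only the entry for /files was invalidated. A DELETE on /files/1 itself clears that entry.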
It gets a bit easier though if the batch of data you want to remove is not explicitly captured via their own resources but as data of a single resource. Think of a data-table on a Web page where you have certain form-elements, such as a checkbox you can click on to mark an entry as delete candidate and then after invoking the submit button send the respective selected elements to the server which performs the removal of these items. Here only the state of one resource is updated and thus a simple POST, PUT or even PATCH operation can be performed on that resource URI. This also goes well with caching as outlined before as only one resource has to be altered, which through the usage of unsafe operations on that URI will automatically lead to an invalidation of any stored representation for the given URI.
The above-mentioned use of form elements to mark certain elements for removal depends, however, on the media type issued. In the case of HTML, its forms section specifies the available components and their affordances. An affordance is the knowledge of what you can and should do with certain objects: a button or link may want to be pushed, a text field may expect numeric or alphanumeric input which may further be length limited, and so on. Other media types, such as hal-forms, halform or ion, attempt to provide form representations and components for a JSON-based notation; however, support for such media types is still quite limited.
As one of your concerns is the number of client connections to your service, I assume you have a write-intensive scenario, as in read-intensive cases caching would probably take a good chunk of load off your server. For example, the BBC once reported that they could reduce the load on their servers drastically just by introducing a one-minute caching interval for recently requested resources. This mainly affected their start page and the linked articles, as people clicked on the latest news more often than on old news. While receiving a couple of thousand, if not hundreds of thousands of, requests per minute, they could significantly reduce the number of requests actually reaching the server and thereby take a huge load off their servers.
Write-intensive use cases, however, can't benefit from caching as much as read-intensive ones, as the cache would get invalidated quite often and the actual request would be forwarded to the server for processing. If the API is more or less used to perform CRUD operations, as so many "REST" APIs do in reality, it is questionable whether it wouldn't be preferable to expose the database directly to the clients. Almost all modern database vendors ship with sophisticated user-rights management options and allow you to create views that can be exposed to certain users. The "REST API" on top of it basically just adds a further level of indirection and complexity in such a case. By exposing the DB directly, performing batch updates or deletions shouldn't be an issue at all, as support for such operations should already be built into the DB layer through the respective query languages.
Regarding the number of connections clients create: HTTP from 1.0 on allows the reuse of connections via the Connection: keep-alive header directive. In HTTP/1.1 persistent connections are used by default unless closing is explicitly requested via the Connection: close header directive. HTTP/2 introduced multiplexed connections that allow many channels (streams), and therefore many requests, to share the same connection at the same time. This is more or less a fix for the connection limitation suggested in RFC 2616, which plenty of Web developers worked around by using CDNs and similar techniques. AFAIK most implementations currently use a maximum of 100 channels, and therefore simultaneous downloads, per connection.
Opening and closing a connection usually takes a bit of time and server resources, and the more open connections a server has to deal with, the more a system may suffer, though open connections with hardly any traffic aren't a big issue for most servers. While connection creation used to be considered the costly part, with persistent connections that cost has moved towards the number of requests issued, hence the wish to send batch requests, which HTTP is not really made for. Again, as mentioned throughout the post, through smart use of caching plenty of requests may never reach the server at all, which makes it probably one of the best strategies for reducing the number of simultaneous requests. Probably the best advice in such a case is to look at which resources are requested frequently, which requests take up a lot of processing capacity, and which ones can easily be answered from a cache.
"reduce overhead of many tcp client connections"
If this is the crux of the issue, the easiest way to solve this is to switch to HTTP/2.
In a way, HTTP/2 does exactly what you want. You open one connection, and using that connection you can send many HTTP requests in parallel. Unlike batching in a single HTTP request, it's mostly transparent for clients, and requests and responses can be processed out of order.
Ultimately, batching multiple operations in a single HTTP request is always a network hack.
HTTP/2 is widely available. If HTTP/1.1 is still the most used version (this might be true, but the gap is closing), this has more to do with servers not yet being set up for it, not clients.

Can you create a rest api to display info from another site?

So far all the guides I've found for creating REST APIs are for displaying stuff from your own site, but can you display stuff from another site?
Typically you'd do this by:
Proxying calls: When a request comes into your server, make a request to the remote server and pass it back to the user. You'll want to make sure you can make the requests quickly and cache results aggressively. You'll probably want to use a short timeout for the remote call and rate-limit API requests so your server can't be blocked making all these remote calls.
Pre-fetching: Download a data dump periodically, or pre-fetch the data you need, so you can store it locally.
Keep in mind:
Are you allowed to use the API this way, according to its terms of use? If it's a website you're scraping, it may be okay for small hobby use, but not for a large commercial operation.
The remote source probably has its own rate limits in place. Can you realistically provide your service under those limits?
As mentioned, cache aggressively to avoid re-requesting the same data. Get to know the HTTP caching standards (Cache-Control, ETag and similar headers) to minimise network activity.
If you are proxying, consider choosing a data center near the API's data center to reduce latency.
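The proxying approach, with a short timeout and aggressive caching, might look roughly like the sketch below. CachingProxy, its ttl parameter and the fall-back-to-stale behaviour are assumptions made for illustration. Only urllib.request.urlopen and its timeout argument come from the Python standard library.

```python
import time
import urllib.request


def fetch_remote(url, timeout=2.0):
    """Default fetcher: a plain GET with a short timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


class CachingProxy:
    def __init__(self, fetch=fetch_remote, ttl=300.0, clock=time.monotonic):
        self.fetch = fetch    # callable(url) -> bytes, replaceable in tests
        self.ttl = ttl        # seconds a cached body is reused without refetching
        self.clock = clock
        self.cache = {}       # url -> (body, stored_at)

    def get(self, url):
        hit = self.cache.get(url)
        if hit is not None and self.clock() - hit[1] < self.ttl:
            return hit[0]
        try:
            body = self.fetch(url)
        except OSError:
            # Remote is slow or down: fall back to a stale copy if there is one.
            if hit is not None:
                return hit[0]
            raise
        self.cache[url] = (body, self.clock())
        return body
```

Serving a stale copy when the remote call fails is one possible design choice. Whether that is acceptable depends on how fresh the data has to be and on the remote site's terms of use.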

Caching in a Service oriented architecture

In a distributed systems environment, we have a RESTful service that needs to provide high read throughput at low latency. Due to limitations in the database technology, and given it's a read-heavy system, we decided to use MemCached. Now, in an SOA there are at least two choices for the location of the cache: either the client looks in the cache before calling the server, or the client always calls the server, which looks in the cache. In both cases, caching itself is done in a distributed MemCached server.
Option 1: Client -> RESTful Service -> MemCached -> Database
OR
Option 2: Client -> MemCached -> RESTful Service -> Database
I have an opinion, but I'd love to hear arguments for and against either option from SOA experts in the community. Please assume either option is feasible; it's an architecture question. I'd appreciate you sharing your experience.
I have seen
Option 1: Client -> RESTful Service -> Cache Server -> Database
working very well. The pros, IMHO, are that you can operate and use this layer in a way that takes part of the load off the DB, assuming your end users issue a lot of similar requests. After all, the client can decide what storage to spare for caching, and how often to clear it.
I prefer Option 1 and I am currently using it. In this way it is easier to control the load on the DB (just as #ekostatinov mentioned). I have lots of data that are required for every user in the system, but the data is never changed (such as some system rules, types of items, etc). It really reduces the DB load. In this way you can also control the behavior of the cache (such as when to clear the items).
Option 1 is the preferred option, as it makes memcache an implementation detail of the service. The other option means that if the business changes and some things can no longer be kept in the cache (or other things now can), the clients would have to change. Option 1 hides all that behind the service interface.
Additionally, option 1 lets you evolve the service as you wish: maybe later you decide you need a new technology, maybe you solve the performance problem in the DB, and so on. Again, option 1 lets you make all these changes without dragging the clients into the mess.
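A minimal sketch of option 1, with the cache kept as an implementation detail behind the service interface. Plain dicts stand in for MemCached and the database here, and ItemService with its method names is hypothetical.

```python
class ItemService:
    """Option 1: clients only see the service; the cache sits behind it."""

    def __init__(self, database, cache=None):
        self.database = database                      # dict standing in for the DB
        self.cache = {} if cache is None else cache   # dict standing in for MemCached

    def get_item(self, item_id):
        if item_id in self.cache:
            return self.cache[item_id]     # cache hit: the DB is not touched
        value = self.database[item_id]     # cache miss: read from the DB
        self.cache[item_id] = value
        return value

    def update_item(self, item_id, value):
        self.database[item_id] = value
        self.cache.pop(item_id, None)      # the service decides when to invalidate
```

Clients only ever call get_item and update_item, so swapping the cache technology or removing the cache altogether would not change anything on their side.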
Is the RESTful API exposed to external consumers? In that case it is up to the consumer to decide whether they want to use a cache and how stale the data they use may be.
As far as the RESTful service goes, the service is the container of the business logic and the authority over the data, so it decides how much to cache, when cache entries expire, when to flush, etc. A client consuming the REST service always assumes that the service is providing it with the latest data. Hence option 1 is preferred.
Who is the client in this case?
Is it a wrapper for your REST API? Are you providing both the client and the service?
I can share my experience with the Enduro/X middleware implementation. For local XATMI service calls, any client process connects to shared memory (LMDB) and checks for the result there. If a response is saved there, it returns the data directly from shared memory. If the data is not there, the client process takes the longer path and performs the IPC. In the case of REST access, network clients still perform the HTTP invocation, but the HTTP server, acting as an XATMI client, returns the data from shared memory. In real life, this technique greatly boosted a web frontend application that used the middleware via REST calls.

A RESTful approach to data synchronization

Assume the following scenario: a web application serves up resources through a RESTful API. A number of clients consume this API. The goal is to keep the data on the clients synchronized with the web application (in both directions).
The easiest way to do this is to ask the API if any of the resources have changed since the client last synchronized with the API. This means that the client needs to ask the API for the appropriate resources accompanied by a timestamp (to see if the data needs to be updated). This seems to me like the approach with the least overhead in terms of needless consumption of bandwidth.
However, I have the feeling that this approach has a few downsides in terms of design and responsibilities. For example, the API shouldn't have to deal with checking whether the resources are out of date. It seems that the only responsibility of the API should be to serve up the resources when asked without having to deal with the updating aspect. By following this second approach, the client would ask for a lot of data every time it wants to update its data to keep it synchronized with the web application. In other words, the client would check whether the data it got back is newer than the locally stored data. If this process takes place every few minutes, this might become a significant burden for the system.
Am I seeing this correctly or is there a middle road that I am overlooking?
This is a pretty common problem, and a RESTful approach can help you solve it. HTTP (the application protocol typically used to build RESTful services) supports a variety of techniques that can be used to keep API clients in sync with the data on the server side.
If the client receives a Last-Modified or ETag header in an HTTP response, it may use that information to make conditional GET calls in the future. This allows the server to quickly indicate with a 304 (Not Modified) response that the client's previously stored representation of the resource is still valid and accurate. This will allow the server (or even better, an intermediate proxy or cache server) to be as efficient as possible in how it responds to the client's requests, potentially reducing costly round-trips to a back-end data store.
If a response contains a Last-Modified header and the client wishes to take advantage of the performance optimization available with it, it must include an If-Modified-Since header in a subsequent GET call to the same URI, passing in the same timestamp value it received. This instructs the server to only fetch the information from the authoritative back-end source if it knows it has changed since that time. Your server will have to be built to support this technique, of course.
A similar principle applies to ETag headers. An ETag is an opaque identifier, often a simple hash code, representing a specific state of the resource at a particular point in time. If the resource changes in any way, so does its ETag value. If the client sees an ETag in a response, it should pass it in an If-None-Match header in subsequent GET requests to the same URI, thereby allowing the server to quickly determine if the client has an up-to-date representation of the resource.
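On the server side, the entity-tag check could be sketched like this. It is a simplified illustration: make_etag and conditional_get are hypothetical helpers, the tag is derived by hashing the body (one common choice, not a requirement), and the client is assumed to send back a single tag in the If-None-Match request header. Real requests may carry a list of tags or "*".

```python
import hashlib


def make_etag(body: bytes) -> str:
    # One common choice: a quoted, truncated hash of the representation.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def conditional_get(body: bytes, request_headers: dict):
    """Answer a GET for a resource whose current representation is `body`.

    Returns a (status, response_headers, response_body) tuple.
    """
    etag = make_etag(body)
    if request_headers.get("If-None-Match") == etag:
        # The client's copy is current: no body needs to be sent.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

The first GET returns 200 with the tag. Repeating the request with that tag in If-None-Match yields 304 with an empty body for as long as the resource has not changed.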
Finally, you should probably look at the long polling technique to reduce the number of repeated GET requests issued by your clients to the server. In essence, the trick is to issue very long GET requests to the server to watch for server data changes. The GET will not return a response until either the data has changed or the very long timeout fires. If the latter, the client just re-issues the same long-lived request to watch for changes again. See also topics like Comet and Web Sockets which are similar in approach.
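The server side of long polling boils down to "block until something changes or the timeout fires". A minimal sketch of that core, assuming a version counter per resource. Watchable and wait_for_change are hypothetical names, and the HTTP layer around them is left out.

```python
import threading


class Watchable:
    """Server-side value that long-poll requests can wait on."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._cond = threading.Condition()

    def update(self, value):
        with self._cond:
            self._value = value
            self._version += 1
            self._cond.notify_all()    # wake every waiting long-poll request

    def wait_for_change(self, known_version, timeout):
        """Block until the version differs from known_version or the timeout fires.

        Returns (version, value, changed).
        """
        with self._cond:
            changed = self._cond.wait_for(
                lambda: self._version != known_version, timeout=timeout
            )
            return self._version, self._value, changed
```

A request handler would call wait_for_change with the version the client last saw. If changed comes back False, the client simply re-issues the request, as described above.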