So far all the guides I've found for creating REST APIs are about displaying data from your own site, but can you display data from another site?
Typically you'd do this by:
Proxying calls: When a request comes into your server, make a request to the remote server and pass the result back to the user. You'll want to make these requests quickly and cache the results aggressively. You'll probably also want a short timeout on the remote call, and to rate-limit incoming API requests, so your server can't get tied up making all these remote calls (see the sketch below).
Pre-fetching: Periodically downloading a data dump, or pre-fetching the data you need, so you can store it locally.
Keep in mind:
Are you allowed to use the API this way, according to its terms of use? If it's a website you're scraping, it may be okay for small hobby use, but not for a large commercial operation.
The remote source probably has its own rate limits in place. Can you realistically provide your service under those limits?
As mentioned, cache aggressively to avoid re-requesting the same data. Get to know the HTTP caching standards (the Cache-Control, ETag, etc. headers) to minimise network activity.
If you are proxying, consider choosing a data center near the API's data center to reduce latency.
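As a rough illustration of the proxying approach, here is a minimal sketch (Python, using Flask and requests) of an endpoint that forwards calls with a short timeout and caches responses for a fixed TTL. The upstream URL, TTL, and timeout values are assumptions for illustration only, not part of the answer above.

```python
# Minimal proxy sketch: forwards calls upstream with a short timeout and
# caches results aggressively. UPSTREAM, CACHE_TTL and REMOTE_TIMEOUT are
# illustrative assumptions.
import time

import requests
from flask import Flask, jsonify

app = Flask(__name__)

UPSTREAM = "https://api.example.com"   # hypothetical remote API
CACHE_TTL = 60                         # seconds; tune to how fresh the data must be
REMOTE_TIMEOUT = 2                     # short timeout so a slow upstream can't tie us up

_cache = {}  # path -> (expires_at, payload)

@app.get("/proxy/<path:path>")
def proxy(path):
    entry = _cache.get(path)
    if entry and entry[0] > time.time():
        return jsonify(entry[1])                      # serve from cache whenever possible
    try:
        resp = requests.get(f"{UPSTREAM}/{path}", timeout=REMOTE_TIMEOUT)
        resp.raise_for_status()
    except requests.RequestException:
        if entry:                                     # fall back to stale data if we have it
            return jsonify(entry[1])
        return jsonify({"error": "upstream unavailable"}), 502
    payload = resp.json()
    _cache[path] = (time.time() + CACHE_TTL, payload)
    return jsonify(payload)
```

A production proxy would also add per-client rate limiting and honour the upstream's own Cache-Control headers, as noted above.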
One of the properties of REST APIs is cacheability. I want to understand how caching is done. Is it done on the client side (say, in an API client like Postman or Insomnia), on the server side, or both?
Suppose a resource is accessed as
GET /services/data/{api_version}/{product_tag}/{resource}/{id}
and we get a response. If we trigger the same endpoint call again almost instantly, we get another response.
Assuming the API cached the response on the first call, there are two scenarios:
The data did not change between the two calls. In that case, the cache gives the correct result.
The data did change between the calls. If the client relies on the cache, stale data is served to the user.
How does the client determine that the data changed, so it can serve the latest result? Is it something like setting a dirty bit, as we do in operating systems?
I know cache invalidation is one of the toughest problems in computer science and depends on the scenario, but in general:
What should be cached on the client side and what on the server side? A cache kept by Postman cannot be used by Insomnia.
How do you always serve the latest data while still using the cache to its fullest?
I'm developing an app that would download code from GitHub once in a while. The queries (per IP) will stay well under the limits listed in the GitHub documentation (queries per minute and per hour). The queries are simple cURL requests to GitHub's raw HTTP code pages, without going through the API.
Let's suppose 50k users will query a GH resource on the same day: are the queriers risking some kind of ban?
The rate limits for the raw and archive endpoints (which are the same service) are currently the same as for authenticated API endpoints: 5000 per hour. It does sometimes happen that, due to the way the rate-limiting works, the actual number of requests you can make is higher, but you should not rely on that.
In general, those endpoints are not designed to be a CDN or code distribution network for your app. They're designed instead to give individual users easy access to the raw contents of a few files without having to clone the entire repository. If you do this anyway and end up using excessive resources for your repository, GitHub Support will reach out and ask you to stop, and your repository may be suspended if the excessive resource use isn't stopped promptly.
If you are going to make any sort of automated requests to those endpoints, it's strongly recommended that you send a unique User-Agent header so your requests can be effectively identified, and preferably that the header contain some identifying information (a URL or email address related to the project) so that you can be contacted about problems.
Your app should also gracefully handle errors like 403 and 429 and back off or stop trying to connect. This is true of every HTTP client.
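As a concrete illustration of those two recommendations (a unique, contactable User-Agent and graceful backoff on 403/429), here is a small Python sketch using the requests library. The User-Agent string, contact URL, and backoff parameters are assumptions for illustration.

```python
# Polite client sketch for raw file downloads: identifies itself via
# User-Agent and backs off on 403/429 responses.
import time

import requests

HEADERS = {"User-Agent": "my-app/1.0 (+https://example.com/contact)"}  # hypothetical identity

def fetch_raw(url, max_attempts=5):
    delay = 1
    for _ in range(max_attempts):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        if resp.status_code in (403, 429):
            # Honour Retry-After when it is a plain number of seconds,
            # otherwise back off exponentially.
            retry_after = resp.headers.get("Retry-After")
            time.sleep(int(retry_after) if retry_after and retry_after.isdigit() else delay)
            delay *= 2
            continue
        resp.raise_for_status()
        return resp.text
    raise RuntimeError("giving up after repeated 403/429 responses")
```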
If you want to distribute code or assets for your app, you should do so using a CDN backed by a server you control rather than the GitHub raw endpoints. You should be sure to serve this data over HTTPS and implement a secure digital signature mechanism to prevent malicious code or assets from being distributed.
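If you do go the route of signed assets, a minimal verification sketch might look like the following, here assuming an Ed25519 detached signature checked with the 'cryptography' package; the key format and function names are assumptions, since the answer only calls for "a secure digital signature mechanism" in general.

```python
# Sketch: verify a detached Ed25519 signature over a downloaded asset.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_asset(asset_bytes: bytes, signature: bytes, public_key_bytes: bytes) -> bool:
    # public_key_bytes is the 32-byte raw Ed25519 public key shipped with the app.
    public_key = Ed25519PublicKey.from_public_bytes(public_key_bytes)
    try:
        public_key.verify(signature, asset_bytes)
        return True
    except InvalidSignature:
        return False
```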
I'm currently planning a REST-style API. The problem I have is that the client will send one or more files, belonging to the same "document", but while the metadata is to be stored in a DB, the files are going to file storage (probably S3, in my case).
The way I see it, there are two ways of doing it:
Send the metadata to the API endpoint, which responds with a location for storing the files. Then, in a separate request, upload the files directly to that location.
Send metadata and files, in the same request, to the API, which acts as a proxy and takes care of sending the various parts to their final destinations.
The good thing about 1. is that the API server has less to deal with, so it can be smaller, and bandwidth is only paid for once (client -> storage). On the other hand, giving a good UX is likely to be harder, and there will be more state to keep track of.
With 2. it's easy to ensure the transaction is atomic, since the API server is the sole gatekeeper. However, the server will need to be more powerful, and bandwidth may be paid twice (client -> API -> storage).
So, what's the best way of dealing with this situation, and if going with 1., what problems should I look out for?
Assuming you have external clients, I believe that #2 is the better bet. The way to catch and keep clients is to have the best possible UX, with a simple, easy to learn and use interface. As you said, you also get to keep atomic transactions, which will save you plenty of headaches. In my experience, server power is relatively cheap, and you can always send a 202 back to the client instead of a 201.
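For illustration, a minimal sketch of option 2 (the API as sole gatekeeper) might look like the following, using Flask and boto3. The bucket name, metadata fields, and save_metadata helper are hypothetical, and a real implementation would need to clean up the uploaded objects if the metadata write fails.

```python
# Option 2 sketch: one request carries metadata and files; the API streams
# the files to S3 and records the metadata, then replies 201 (or 202 if
# further processing happens asynchronously).
import uuid

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-document-files"  # hypothetical bucket

def save_metadata(document_id, metadata, keys):
    """Persist metadata in your DB of choice; stubbed out here."""
    pass

@app.post("/documents")
def create_document():
    document_id = str(uuid.uuid4())
    keys = []
    for upload in request.files.getlist("files"):
        key = f"{document_id}/{upload.filename}"
        s3.upload_fileobj(upload.stream, BUCKET, key)   # stream to S3, not to local disk
        keys.append(key)
    save_metadata(document_id, request.form.to_dict(), keys)
    return jsonify({"id": document_id, "files": keys}), 201
```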
Assume the following scenario: a web application serves up resources through a RESTful API, and a number of clients consume this API. The goal is to keep the data on the clients synchronized with the web application (in both directions).
The easiest way to do this is to ask the API whether any of the resources have changed since the client last synchronized. This means that the client asks the API for the appropriate resources, accompanied by a timestamp (to see if the data needs to be updated). This seems to me like the approach with the least overhead in terms of needless bandwidth consumption.
However, I have the feeling that this approach has a few downsides in terms of design and responsibilities. For example, the API shouldn't have to deal with checking whether the resources are out of date; its only responsibility should be to serve up the resources when asked, without dealing with the updating aspect. Following that second approach, though, the client would have to ask for a lot of data every time it wants to synchronize with the web application, and then check locally whether the data it got back is newer than what it has stored. If this process takes place every few minutes, it could become a significant burden on the system.
Am I seeing this correctly or is there a middle road that I am overlooking?
This is a pretty common problem, and a RESTful approach can help you solve it. HTTP (the application protocol typically used to build RESTful services) supports a variety of techniques that can be used to keep API clients in sync with the data on the server side.
If the client receives a Last-Modified or ETag header in an HTTP response, it may use that information to make conditional GET calls in the future. This allows the server to quickly indicate with a 304 Not Modified response that the client's previously stored representation of the resource is still valid and accurate. It also allows the server (or, even better, an intermediate proxy or cache server) to be as efficient as possible in how it responds to the client's requests, potentially reducing costly round-trips to a back-end data store.
If a response contains a Last-Modified header and the client wishes to take advantage of the performance optimization it offers, it must include an If-Modified-Since header in a subsequent GET call to the same URI, passing in the same timestamp value it received. This instructs the server to fetch the information from the authoritative back-end source only if it knows the data has changed since that time. Your server will have to be built to support this technique, of course.
A similar principle applies to ETag headers. An ETag is a simple hash code representing a specific state of the resource at a particular point in time. If the resource changes in any way, so does its ETag value. If the client sees an ETag in a response, it should pass it back (in an If-None-Match header) in subsequent GET requests to the same URI, thereby allowing the server to quickly determine whether the client has an up-to-date representation of the resource.
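To make the conditional-GET flow concrete, here is a small client-side sketch using Python's requests library; the URL and the shape of the cached record are assumptions for illustration.

```python
# Conditional GET sketch: send the stored validators and keep the old
# representation when the server answers 304 Not Modified.
import requests

def fetch_with_validators(url, cached=None):
    """cached is a dict like {"etag": ..., "last_modified": ..., "body": ...} or None."""
    headers = {}
    if cached:
        if cached.get("etag"):
            headers["If-None-Match"] = cached["etag"]
        if cached.get("last_modified"):
            headers["If-Modified-Since"] = cached["last_modified"]
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304 and cached:
        return cached                     # our stored representation is still valid
    resp.raise_for_status()
    return {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "body": resp.json(),
    }
```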
Finally, you should probably look at the long polling technique to reduce the number of repeated GET requests your clients issue to the server. In essence, the trick is to issue a very long-lived GET request to the server to watch for data changes. The GET does not return a response until either the data has changed or the long timeout fires; in the latter case, the client simply re-issues the same long-lived request to keep watching for changes. See also topics like Comet and WebSockets, which are similar in approach.
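A naive long-polling client might look like the sketch below; the endpoint, the "since" parameter, and the timeout values are assumptions, and the server has to be built to hold such requests open.

```python
# Long-polling loop sketch: the server is assumed to hold the request open
# for up to ~60 seconds and return 204 when nothing has changed.
import requests

def watch_for_changes(url, handle_change, since=None):
    while True:
        try:
            resp = requests.get(url, params={"since": since}, timeout=70)
        except requests.Timeout:
            continue                      # no answer in time; re-issue the long poll
        if resp.status_code == 204:
            continue                      # server timed out with "no changes"
        resp.raise_for_status()
        payload = resp.json()
        since = payload.get("version", since)   # hypothetical change marker
        handle_change(payload)
```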
I am consuming a rate-limited web service, which only allows me to make 5 calls per second. I am using my server to proxy these calls to a web client:
Mashery > My Web Server > Client's Browser
I have optimized the usage of this web service, but there are still occasional times when I go over the rate limit. What I would like to do instead is hold the client's request for one second (or longer if warranted) before making the web service call to Mashery.
There are some ways I can solve this by building a simple queue system with a database back-end, but I'd rather avoid that if something already exists. Does something already exist to rate-limit the consuming side of this?
To reliably throttle requests to only 5 per second, I would employ a queue/worker system. But before that, I would simply request a higher rate limit by contacting API support for that platform. Of course, I would also consider caching if it's a read-only method, you're pulling down a lot of the same data, and doing so is compliant with the API TOS.
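As a sketch of the queue/worker idea, a single worker draining an in-process queue at no more than 5 calls per second could look like this in Python; the queue contents and the call_mashery helper are hypothetical stand-ins for your actual upstream call.

```python
# Throttled worker sketch: client requests are enqueued, and one worker
# drains the queue while never exceeding MAX_PER_SECOND upstream calls.
import queue
import threading
import time

MAX_PER_SECOND = 5
requests_queue = queue.Queue()

def call_mashery(job):
    """Make the actual upstream call and deliver the result; stubbed out here."""
    pass

def worker():
    min_interval = 1.0 / MAX_PER_SECOND
    while True:
        job = requests_queue.get()        # blocks until a client request is queued
        started = time.monotonic()
        call_mashery(job)
        # Sleep off whatever is left of this slot so the limit is never exceeded.
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        requests_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```

A database-backed queue (or an existing job system) buys you persistence across restarts, but the throttling logic itself is no more than this.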