I am testing nginx as a reverse proxy cache with REST resources (Spring MVC + ETag). Every GET is cached correctly.
Is it possible to clear the nginx cache for a specific resource whenever it gets updated through an HTTP PUT or POST?
PS: I am also testing Varnish Cache, and I have the same question about it.
Thanks!
I had the same question and found that Nginx allows you to "purge" the cached requests:
proxy_cache_path /tmp/cache keys_zone=mycache:10m levels=1:2 inactive=60s;

map $request_method $purge_method {
    PURGE   1;
    default 0;
}

server {
    listen      80;
    server_name www.example.com;

    location / {
        proxy_pass http://localhost:8002;
        proxy_cache mycache;
        proxy_cache_purge $purge_method;
    }
}
Then
curl -X PURGE -D - "http://www.example.com/*"
See:
https://www.nginx.com/products/nginx/caching/#proxy_cache_purge
https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_purge
So I guess you could make those calls after each POST/PUT/PATCH/DELETE.
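For instance, a minimal sketch of wiring the purge into your write path, assuming the NGINX Plus configuration above, the Python requests library, and a hypothetical resource path:

import requests

CACHE_HOST = "http://www.example.com"  # the nginx instance configured above (assumption)

def update_resource(path, payload):
    # Write through to the backend as usual.
    response = requests.put(f"{CACHE_HOST}{path}", json=payload)
    response.raise_for_status()

    # Purge the cached GET for the same resource so the next read is fresh.
    # A wildcard path (e.g. "/cars*") would also drop cached query-string variants.
    requests.request("PURGE", f"{CACHE_HOST}{path}").raise_for_status()
    return response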
I see 4 caveats:
This seems considerably more complicated (and probably slower) than the invalidation offered by other caching strategies, so if you need to invalidate very often I'd recommend a different approach (e.g. caching the database queries in Redis).
I haven't tried it myself, and I'm not sure it's available in the open-source version of Nginx (the documentation only mentions Nginx Plus, which is their paid product).
What granularity does this solution offer? For example, is it possible to invalidate a request like GET /cars?year=2010&color=red while keeping GET /cars?year=2020&color=green cached?
I don't know whether this is a common thing to do; I couldn't find much more about it.
You have not specified what sort of caching you are implementing, as there are several options within Nginx.
From your query, I assume you are referring to static files, like images, which are uploaded to your site.
Proxy Caching
This is where Nginx caches the response from a backend server. There is no point in activating this for static files in the first place. The proxy cache is simply a store on your hard disc, and the cost of retrieving such files is the same as if you just let Nginx serve them from their actual locations on the filesystem.
FastCGI Caching
Same as Proxy Caching. No point for the type of files that may be uploaded using POST or PUT.
Memcache
Here, the items are stored in RAM and there is a benefit to this. There are the basic Memcached module and the extended Memc module, both of which have procedures for adding items to and removing them from the cache.
Your query, however, suggests you are using one of the first two, and as said, there is absolutely no benefit in doing this for the type of files that may be uploaded using POST or PUT. When cached by Nginx, they will be read from the disc location they are kept on, just as they would be if referenced from their original disc location. There is also the overhead of copying them from the original disc location to another disc location.
Unless, of course, I am missing something.
I'm learning about HATEOAS. The backend server I'm working on will use a third-party REST API that uses HATEOAS. That API has an endpoint that returns the URL for each resource, and it also returns the related resource links with regular requests.
But I'm wondering what a good way to manage these links on the server would be, to avoid hardcoding them. For example, if the third party changes the URL of a resource, how will the server detect that change? Are there any standard practices for managing HATEOAS resource links?
Possible ways I can think of:
When the server starts, get all the resource URLs and cache them. Whenever the third-party API needs to be called, reuse these cached URLs. Whenever there is a 404 or a related error, update the resource URL. Or update the URLs periodically at intervals.
Get the resource URL each time before calling the endpoint. This is the simplest, but it essentially doubles the number of requests.
But neither sounds like a robust approach.
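To make the first option a bit more concrete, roughly what I have in mind (hypothetical entry point; this assumes the API's root document returns a simple name-to-URL map, which may not match the real third-party format):

import requests

API_ROOT = "https://thirdparty.example.com/api"  # hypothetical entry point

links = {}  # resource name -> URL, filled at startup and refreshed on errors

def refresh_links():
    links.clear()
    links.update(requests.get(API_ROOT).json())

def call(resource):
    if not links:
        refresh_links()
    response = requests.get(links[resource])
    if response.status_code == 404:
        # The URL may have changed: refresh the link index and retry once.
        refresh_links()
        response = requests.get(links[resource])
    return response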
While discovery is generally a good thing and should allow a HATEOAS system to introduce changes in ways that 'hardcoded URLs' don't, if URLs start breaking arbitrarily I would still consider this a major issue.
You should be able to store URLs/links on your side and have some expectation that they keep working.
There are some mechanisms that deal with changes, though:
The server should return 301/308 redirects if a resource moved. If this is the case, you should also update your references.
The server can emit Sunset or Deprecation headers. See: https://www.rfc-editor.org/rfc/rfc8594
Those are more general answers, but ultimately the existence of best practices does not mean that vendors will abide by them. With that in mind, I think your best bet is to try to find out what your vendor's deprecation policy is and see what they recommend.
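As a rough sketch of how a client might react to those two mechanisms (this assumes the Python requests library and a hypothetical link store; the Sunset header is the one from RFC 8594 above):

import requests

link_store = {"cars": "https://api.example.com/cars"}  # hypothetical stored HATEOAS links

def fetch(name):
    # Don't follow redirects automatically, so we can notice a 301/308 and learn from it.
    response = requests.get(link_store[name], allow_redirects=False)

    if response.status_code in (301, 308) and "Location" in response.headers:
        # The resource moved permanently: update the stored reference, then retry.
        link_store[name] = response.headers["Location"]
        response = requests.get(link_store[name])

    if "Sunset" in response.headers:
        # RFC 8594: the resource is scheduled to become unavailable at this date.
        print(f"{name} will be retired on {response.headers['Sunset']}")

    return response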
Use a cached resource if it is valid, request a refresh when you don't have a local valid copy.
RFC 7234 defines the caching semantics of HTTP.
Ideally, you don't implement the caching rules yourself, but instead you use a general purpose cache.
In its ideal form, your bespoke implementation is talking to a headless browser, and the headless browser worries about the caching rules for you.
In theory, you need the initial URL to start the process, and everything else comes from that.
Each resource you get from the server should include links to other edges on the graph of service for that resource.
So, once you get the initial resource, all of the rest come automatically.
That said, it's not untoward to have "well known" entry points that are, ideally, unchanging URLs. But in the end, those are just "bookmarks", and not necessarily guaranteed end points.
Consider a shopping site such as Amazon. Outside of amazon.com, you don't know any of their URLs. They're all provided on the various forms and pages, and the human simply navigates the site. Those URLs can be changing all the time, and no one would know. With HATEOAS, it's up to the machine to follow the links, rather than a human. But the process of navigation is the same.
As others have mentioned, the idea of caching a root resource has merit. You then rely on the caching headers to tell you how often you have to refresh the links.
But that said, operationally, there's no difference between following a normal link, and following a cached link. Underneath, the cached resource loads faster, but you still need to "follow the link". Because that's where the caching behavior kicks in. This is different from assuming the link is good, assuming you know the result of a resource lookup. Your application follows the link. Always. The underlying infrastructure is responsible for making it efficient.
So, your code should not, say, load up a root resource, then stuff a map full of links, and then assume they're good. Rather, the code should request the root resource, perhaps as a Map of links (datatypes for the win), and let the next layer handle the details. Because it all depends on the type of caching involved. Some have coded durations where no follow-up is necessary. For others, you make the request anyway, and the server tier responds with "nothing changed", so you can use your local copy, but you're still required to ask in the first place.
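A minimal sketch of that "ask anyway, let the server say nothing changed" pattern, assuming the Python requests library and hand-rolled ETag handling (in practice you would let a general-purpose cache do this for you, as noted above):

import requests

cache = {}  # url -> (etag, body); a stand-in for a real HTTP cache

def get(url):
    headers = {}
    if url in cache:
        # We have a local copy, but we still follow the link and ask the server.
        headers["If-None-Match"] = cache[url][0]

    response = requests.get(url, headers=headers)

    if response.status_code == 304:
        # "Nothing changed": reuse the local copy.
        return cache[url][1]

    if "ETag" in response.headers:
        cache[url] = (response.headers["ETag"], response.text)
    return response.text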
Those are implementation details that the SERVER mandates (not the client). It's a server contract. If they want you pinging them each and every time, so be it. That's the contract they're presenting to you, and if you want to be a Good Citizen, then you should honor that contract.
Ideally, the server makes good decisions on these kinds of issues for the sake of efficiency, but in the end it's really up to them.
The client has to go along. The client in a HATEOAS system cedes a lot to the server. They're simply not decisions for the client to make.
Does anybody know if there is an option to combine Varnish with Solr?
What I'd like to do is:
User requests a URL
Varnish doesn't have a cached version, or only has an outdated one
Varnish calls the backend and finally receives the response
This is the point where I'd like to hook in and pass the backend response to "./bin/solr post ..." so my Solr index is updated immediately every time I deliver a new content version.
Is this possible?
Thanks in advance
Boris
This is a twofold issue - one, sticking Varnish in front of Solr works as you'd expect. It takes the load away from Solr and allows Varnish to return content without having to query Solr.
However, you should keep your indexing process separate from the Varnish pipeline, otherwise you could have a very bad time if multiple threads start asking for an indexing process to be started within a short period of time. The proper way of doing this is to have a sensible time to live for responses in Varnish, expire entries when needed (through a ban or something similar in Varnish, or by attaching an index version identifier to your request to Solr), but launch and perform indexing outside of Varnish's delivery of documents.
When the indexing completes you issue a ban to Varnish that tells Varnish that any existing, cached responses are invalid - this makes Varnish start querying your backend again. This way Varnish can do what it's great for, caching content, and you can keep the logic that decides when to update your index and when an update is necessary outside of Varnish.
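For illustration, a hedged sketch of issuing that ban over HTTP once indexing finishes; it assumes the Python requests library and a VCL that translates a BAN request from trusted clients into a ban() on the URL pattern (the hostnames and paths are hypothetical):

import requests

VARNISH = "http://varnish.example.com"  # hypothetical Varnish instance

def reindex_and_invalidate(run_indexing):
    # Run the Solr indexing job outside of Varnish's delivery path.
    run_indexing()  # e.g. a wrapper around "./bin/solr post ..."

    # Tell Varnish that the cached search responses are now stale.
    # This only works if your VCL maps the BAN method to a ban() call.
    response = requests.request(
        "BAN",
        f"{VARNISH}/",
        headers={"X-Ban-Url": "/search.*"},
    )
    response.raise_for_status()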
And while Solr does effective caching, Varnish does a far better job (as it can consider only the response and not have to look at anything further behind in the chain) and can thus alleviate the load from repetitive queries.
Web servers typically have a document root denoting the filesystem sub-tree visible via the web. For example, if the document root is /home/foouser/public_html/, then the web server would map a request for http://www.foo.com/pics/foo.jpg to /home/foouser/public_html/pics/foo.jpg. This results in a series of disk requests to obtain the inode number of foo.jpg.
Do web servers do any optimizations to reduce the number of disk accesses, or is it the role of the server admin to set the document root as close to "/" as possible, to reduce the number of disk accesses in the filename-to-inode-number translation?
I know this isn't directly the answer to your question, but by setting up a caching strategy you can drastically reduce disk reads. Especially if your static content is not hosted on your server.
Options:
Host static content on a CDN:
Pros: Off-load all load onto someone else's network. Cost?
Cons: Potentially less control. Cost?
Use Contendo/Akamai, which is also a CDN, but with some differences.
Pros: Host your content, but after the first read the CDN will handle caching based on the headers you send with your content (static or not)
Cons: Sometimes headers are really annoying to manage. Cache busting (breaking your own cache) can be annoying to handle when you want to replace old content.
Cache things locally. If you are making a DB request, for instance, you can cache the result. The next time your code runs, check your in-memory cache first (as opposed to making a DB request immediately). You could also cache entire pages, then at an application controller/route level check whether there is a cached version of the page/asset and serve that (see the sketch after this list).
Pros: Lots of control. You can cache almost anything.
Cons: A ton of work to set up caching on every little thing. You need a strategy for every part of your website.
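A minimal sketch of the local-caching idea from the last option (the database call is a stand-in; a real setup would also need expiry and invalidation when the data changes):

import functools

def db_query(sql, *params):
    # Stand-in for a real (slow) database call.
    print("hitting the database:", sql, params)
    return {"sql": sql, "params": params}

@functools.lru_cache(maxsize=1024)
def get_user(user_id):
    # The first call per user_id hits the database; later calls come from memory.
    return db_query("SELECT * FROM users WHERE id = ?", user_id)

get_user(42)  # hits the database
get_user(42)  # served from the in-memory cache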
My recommendation is to start out by moving your assets to Amazon S3, Rackspace, or something similar. Joyent has something for this as well. You could then enable CloudFront for S3, which turns on the CDN and caches things in various regions. This is a really cheap solution (depending on the amount of files you have).
You could also go the Contendo route.
The caching on the application side route takes quite a bit of work and completely depends on your server/language/db/configuration.
I'm trying to speed up my benchmark (3 tier web architecture), and I have some general questions related to Memcache(d) and Varnish.
What is the difference?
It seems to me that Varnish sits behind the web server, caching web pages, and doesn't require changes in code, just configuration.
On the other hand, Memcached is a general-purpose caching system, mostly used to cache results from the database, and it does require a change in the get method (a cache lookup first).
Can I use both? Varnish in front of the web server and Memcached for database caching?
What is a better option?
(scenario 1 - mostly write,
scenario 2 - mostly read,
scenario 3 - read and write are similar)
Varnish is in front of the webserver; it works as a reverse http proxy that caches.
You can use both.
Mostly write -- Varnish will need to have affected pages purged. This will result in an overhead and little benefit for modified pages.
Mostly read -- Varnish will probably cover most of it.
Similar read & write -- Varnish will serve a lot of the pages for you, Memcache will provide info for pages that have a mixture of known and new data allowing you to generate pages faster.
An example that could apply to stackoverflow.com: adding this comment invalidated the page cache, so this page would have to be cleared from Varnish (and also my profile page, which probably isn't worth caching to begin with. Remembering to invalidate all affected pages may be a bit of an issue). All the comments, however, are still in Memcache, so the database only has to write this comment. Nothing else needs to be done by the database to generate the page. All the comments are pulled by Memcache, and the page is recached until somebody affects it again (perhaps by voting my answer up). Again, the database writes the vote, all other data is pulled from Memcache, and life is fast.
Memcache saves your DB from doing a lot of read work; Varnish saves your dynamic web server from CPU load by making it generate pages less frequently (and it would lighten the DB load a bit as well, if not for Memcache already doing so).
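A minimal sketch of that read-through pattern on the Memcache side (assuming the pymemcache Python client, a local memcached instance, and a hypothetical database helper):

from pymemcache.client.base import Client

mc = Client(("localhost", 11211))  # assumes memcached is running locally

def fetch_comments_from_db(post_id):
    # Stand-in for the expensive database read.
    return f"comments for post {post_id}"

def load_comments(post_id):
    key = f"comments:{post_id}"
    cached = mc.get(key)
    if cached is not None:
        return cached.decode()  # served from RAM, no database read

    comments = fetch_comments_from_db(post_id)
    mc.set(key, comments.encode(), expire=300)  # cache for five minutes
    return comments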
My experience comes from using Varnish with Drupal. In as simple terms as possible, here's how I'd answer:
In general, Varnish works for unauthenticated (via cookie) traffic and memcached will cache authenticated traffic.
So use both.
I am in need of a scalable and performant HTTP application/server that will be used for static file serving/uploading. So I only need support for GET and PUT operations.
However, there are a few extra features that I need:
Custom authentication: I need to check credentials against a database for each request, so I must be able to integrate proprietary database interaction.
Support for signed access keys: Access to resources via PUT should be signed using a key, like http://uri/?key=foo. The key contains information about the request, like md5(user + path + secret), which allows me to block unwanted requests. The application/server should allow me to check for this (see the sketch after this list).
Performance: I'd like to avoid piping content as much as possible. Otherwise the whole application could be implemented in Perl/etc. in a few lines as CGI.
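A small sketch of the signing scheme I have in mind for the second point (hypothetical names; md5 as in the example above, though an HMAC would likely be preferable):

import hashlib

SECRET = "change-me"  # shared only between the URL generator and the server

def sign(user, path):
    return hashlib.md5(f"{user}{path}{SECRET}".encode()).hexdigest()

def is_allowed(user, path, key_from_query_string):
    # Recompute the signature server-side and compare it with the ?key=... value.
    return key_from_query_string == sign(user, path)

# Example: a signed upload URL for user "alice"
print(f"http://uri/upload/foo.jpg?key={sign('alice', '/upload/foo.jpg')}")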
Perlbal (in webserver mode) looks nice; however, the single-threaded model does not fit my database lookups, and it also does not support query strings.
Lighttpd/Nginx/… have some modules for these tasks; however, it is not feasible to put everything together without ending up writing my own extensions/modules.
So how would you solve this? Are there other lightweight web servers available for this?
Should I implement an application inside of a web server (i.e. CGI)? How can I avoid/speed up piping content between the web server and my application?
Thanks in advance!
Have a look at nodejs http://nodejs.org/
There are a few modules for static web servers and database interfaces:
http://wiki.github.com/ry/node/modules
You might have to write your own file upload handler, or use one from this example http://www.componentix.com/blog/13
nginx + spawn-fcgi + an FCGI application written in C + memcached + sqlite serves a similar task well; latency is about 20-30 ms for small payloads and fast connections from the same local network. As far as I know, the production server handles about 100-150 requests per second with no problem. On a test server I peaked at up to 20k requests per second, again with no problem; average latency was about 60 ms. Aggressive caching and UNIX domain sockets are the key.
I do not know how that configuration would behave with frequent PUT requests; in our task they are very rare and typically batched.