Is there any difference between a CQ Dispatcher cache flush (triggered from the publish instance) and Dispatcher cache invalidation?
Any help please?
Dispatcher is a reverse proxy server that can cache data from an HTTP source. In the case of AEM, that source is normally a publish or author instance, although in theory it can be any resource provider. This backend is called a "renderer".
Cache invalidation is an HTTP operation triggered by the publisher to mark the cache of a resource as invalid on the Dispatcher. This operation only deletes the cached resource(s); it does not re-fetch them.
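For illustration, a minimal sketch of such an invalidation request sent to the Dispatcher (host and content path are placeholders; the handler URL assumes the default /dispatcher/invalidate.cache):

curl -X POST \
  -H "CQ-Action: Activate" \
  -H "CQ-Handle: /content/mysite/en/page" \
  -H "Content-Length: 0" \
  http://dispatcher.example.com/dispatcher/invalidate.cache

This deletes the matching files from the cache without re-fetching them.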
Flush is the workflow associated with publishing a page: when new content or a resource is published, the publish/author instance invalidates the corresponding cache entries on the Dispatcher. Invalidating the cache on publish is the common scenario, so that new content becomes available on your site.
There are scenarios where you want to refresh the cache without re-publishing the content. For example, after a release you might want to regenerate all pages from the publisher because the changes are not editorial, so no author is going to re-publish the content. In that case you simply invalidate the cache without going through the publish workflow. In practice it is often easier to delete the cache directory on the Dispatcher than to flush every page, but that is a matter of preference. This is where the separation of flush and invalidation really matters; apart from that, nothing is fundamentally different, as the end result is almost the same.
This Adobe article seems to use "flush" and "invalidate" interchangeably.
It says:
Manually Invalidating the Dispatcher Cache
To invalidate (or flush) the Dispatcher cache without activating a page, you can issue an HTTP request to the dispatcher. For example, you can create a CQ application that enables administrators or other applications to flush the cache.

The HTTP request causes Dispatcher to delete specific files from the cache. Optionally, the Dispatcher then refreshes the cache with a new copy.
It also talks about configuring a "Dispatcher Flush" agent, and the config for that agent invokes an HTTP request that has "invalidate.cache" in the URL.
CQ basically calls the "Dispatcher Flush Rule Service" in OSGi, which uses the replication action type "Invalidate Cache". In other words, to flush the cache, the CQ replication agents invoke the action called "Invalidate Cache".
The terminology is a little confusing, but it is just a service/action combination in OSGi.
There are two ways in which the cache gets modified:
1. Content update
2. Auto-Invalidation
A content update comes into play when an AEM page is modified.
Auto-invalidation is used when there are many automatically generated pages: instead of deleting each file, the flush updates the .stat file, and the Dispatcher compares cached files against it to decide which ones are out of date.
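As a rough sketch, the pieces of dispatcher.any involved in auto-invalidation look like this (values are illustrative defaults): /statfileslevel controls how deep the .stat files are created, and /invalidate selects which cached files are compared against the .stat timestamp and treated as stale after a flush:

/statfileslevel "2"
/invalidate
  {
  /0000 { /glob "*" /type "deny" }
  /0001 { /glob "*.html" /type "allow" }
  }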
Related
When our team runs a performance test against the AEM pre-prod author servers, the author server load increases sharply and response times go up. We need a way to reduce the author server response time while the test cases are executing.
Add a dispatcher in front of the author server. Though usually intended for publish instances, dispatchers can be set up for authors as well and alleviate the load on the server by returning common files. Of course, make sure to configure it properly to avoid caching editable content.
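For example, a rough dispatcher.any cache fragment for an author farm might deny caching by default and only allow static client libraries (the docroot and paths below are assumptions):

/cache
  {
  /docroot "/var/www/author/cache"
  /rules
    {
    /0001 { /glob "*" /type "deny" }
    /0002 { /glob "/etc.clientlibs/*" /type "allow" }
    }
  }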
I noticed that when using curl to get content from github using this format:
https://raw.githubusercontent.com/${org}/${repo}/${branch}/path/to/file
It will sometimes return cached/stale content. For example with this sequence of operations:
curl https://raw.githubusercontent.com/${org}/${repo}/${branch}/path/to/file
Push a new commit to that branch
curl https://raw.githubusercontent.com/${org}/${repo}/${branch}/path/to/file
Step 3 will return the same content as step 1 and not reflect the new commit.
How can I avoid getting a stale version?
I noticed that the GitHub web UI adds a token to the URL, e.g. ?token=AABCIPALAGOZX5R, which presumably avoids getting cached content. What's the nature of this token and how can I emulate it? Would tacking on ?token=$(date +%s) work?
Also, I'm looking for a way to avoid the stale content without having to switch to a commit hash in the URL, since that would require more changes. However, if that's the only way to achieve it, then I'll go that route.
GitHub caches this data because otherwise frequently requested files would involve serving a request to the backend service each time and this is more expensive than serving a cached copy. Using a CDN provides improved performance and speed. You cannot bypass it.
The token you're seeing in the URL is a temporary token that is issued for the logged-in user. You cannot use a random token, since that won't pass authentication.
If you need the version of that file in a specific commit, then you'll need to explicitly specify that commit. However, do be aware that you should not do this with some sort of large-scale automated process as a way to bypass caching. For example, you should not try to do this to always get the latest version of a file for the purposes of a program you're distributing or multiple instances of a service you're running. You should provide that data yourself, using a CDN if necessary. That way, you can decide for yourself when the cache needs to be expired and get both good performance and the very latest data.
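For example, requesting the file at a specific commit (the SHA below is a placeholder) always returns that exact version, sidestepping any staleness tied to the branch name:

curl https://raw.githubusercontent.com/${org}/${repo}/<commit-sha>/path/to/file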
If you run such a process anyway, you may cause an outage or overload, and your repository or account may be suspended or blocked.
I was looking at service worker practices and workbox.
There are many articles talking about precaching; Workbox even provides the special method precacheAndRoute() for just that. I guess I understand the conceptual difference between a precache and a runtime cache, but what confuses me is why precaching is treated so specially.
All the articles I've read about precaching emphasize how it makes a web app available when the client is offline. Isn't that what any cache (even if it's not a precache) is for? I mean, it seems that a runtime cache can also achieve just that if configured properly. Does it have to be a precache for the web app to work offline?
The only obvious difference is when the caches are created. Well, if the client is offline, no cache can be created at all, whether it is a precache or a runtime cache; and if the caches were created during the last visit when the client was online, how does it matter whether the cache responding on the current visit is a precache or a runtime cache?
Consider two abstract cases for comparison. Say we have two different service workers, one (/precache/sw.js) that only does precaching and another (/runtime/sw.js) that only does runtime caching, where /precache and /runtime host the same web app (meaning the same assets to be cached).
Under what scenario could the web app at /precache and the one at /runtime behave differently because of the different service worker setups?
In my understanding,
If the cache cannot be created (e.g. the client is offline on the first visit), then precache and runtime cache shouldn't be any different.
If a precache can be created successfully (i.e. the client is online on the first visit), a runtime cache can be too. (Let's not go too wild with cases like the client being online only at certain moments; they would still be the same in my examples.)
If the caches are available, then neither the precache nor the runtime cache has anything to do, so they are still the same.
The only scenario I could think of where precaching shows an advantage is when the cache needs to be updated on the current visit, where a precache makes sure the current visit gets up-to-date info. If that is the case, wouldn't a NetworkFirst runtime cache do just about the same? And still, this has nothing to do with "offline", which is what almost every article I've read about service worker precaching mentions.
How does online/offline make precache a hero?
What did I miss here, what's so special about precaching?
One scenario where it is different could be the following.
What the app is like:
You have a landing page for your app.
You have a handful of routes that can be navigated to
Runtime-cache strategy:
If the user goes to the landing page, only the landing page assets would get cached.
Pre-cache strategy:
If the user goes to the landing page, all of the configured pre-cached assets would get cached.
Difference:
So if the user only goes to the landing page and then later goes offline, the pre-cache strategy would allow them to navigate and interact in some way with the other routes of your app, while the runtime-cache strategy would not allow any navigation to the other routes.
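To make this concrete, here is a minimal sketch of the two hypothetical workers from the question, written with Workbox (the CDN URL/version and the asset list are assumptions):

// /precache/sw.js — every listed asset is fetched and cached at install time,
// even if the user only ever visits the landing page.
importScripts('https://storage.googleapis.com/workbox-cdn/releases/6.5.4/workbox-sw.js');
workbox.precaching.precacheAndRoute([
  { url: '/index.html', revision: '1' },
  { url: '/about.html', revision: '1' }, // available offline even if never visited
  { url: '/app.js', revision: '1' },
]);

// /runtime/sw.js — a URL only enters the cache after it has been requested once,
// so /about.html is not available offline until it has been visited while online.
importScripts('https://storage.googleapis.com/workbox-cdn/releases/6.5.4/workbox-sw.js');
workbox.routing.registerRoute(
  ({ request }) => ['document', 'script', 'style'].includes(request.destination),
  new workbox.strategies.StaleWhileRevalidate()
);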
First, your side-by-side service workers are scoped to those folders or paths, so they are isolated from each other.
Second, you should define a caching strategy for your application that mixes precached assets with dynamic ones, plus an invalidation routine/logic.
You want to precache as much as possible without breaking any dynamic parts of your application, so cache the common JS, CSS, images, fonts, and pages that are used over and over.
Of course have an invalidation strategy in place to keep these up to date.
Next, handle non-cached network-addressable resources (URLs) from the fetch event handler. Cache them as it makes sense, and invalidate cached assets as it makes sense.
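As an illustration, a minimal sketch of that runtime handling with the plain Service Worker API (the cache name is an assumption): serve from the cache if possible, otherwise go to the network and keep a copy for next time:

self.addEventListener('fetch', (event) => {
  event.respondWith(
    caches.open('runtime-v1').then(async (cache) => {
      const cached = await cache.match(event.request);
      if (cached) return cached;
      const response = await fetch(event.request);
      // Only keep successful same-origin responses
      if (response.ok && response.type === 'basic') {
        cache.put(event.request, response.clone());
      }
      return response;
    })
  );
});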
For some applications I cache the entire thing. They are usually on the small side, a few dozen to a few hundred pages for example. For a site like Amazon I would never do that, LOL. No matter how much is cached, I always have an invalidation and update strategy that makes sense for the application/site.
I added a few workbox.routing.registerRoute calls using staleWhileRevalidate to my app and so far it has passed most Lighthouse tests under PWA. I am not currently using precaching at all. My question is: is it mandatory? What am I missing without precaching? workbox.routing.registerRoute is already caching everything I need. Thanks!
Nothing is mandatory. :-)
Using stale-while-revalidate for all of your assets, as well as for your HTML, is definitely a legitimate approach. It means that you don't have to do anything special as part of your build process, for instance, which could be nice in some scenarios.
Whenever you're using a strategy that reads from the cache, whether it's via precaching or stale-while-revalidate, there's going to be some sort of revalidation step to ensure that you don't end up serving out of date responses indefinitely.
If you use Workbox's precaching, that revalidation is efficient, in that the browser only needs to make a single request for your generated service-worker.js file, and that response serves as the source of truth for whether anything precached actually changed. Assuming your precached assets don't change that frequently, the majority of the time your service-worker.js will be identical to the last time it was retrieved, and there won't be any further bandwidth or CPU cycles used on updating.
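For illustration, the precache manifest inside the generated service worker carries a revision per URL (the entries and hashes below are made up); comparing those revisions against what is already cached is what lets a single request for service-worker.js reveal exactly which precached URLs changed:

importScripts('https://storage.googleapis.com/workbox-cdn/releases/6.5.4/workbox-sw.js');
// Typically injected by a build tool; shown here with hypothetical entries.
workbox.precaching.precacheAndRoute([
  { url: '/index.html', revision: 'a1b2c3' },
  { url: '/app.js', revision: 'd4e5f6' },
  { url: '/styles.css', revision: '0789ab' },
]);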
If you use runtime caching with a stale-while-revalidate policy for everything, then that "while-revalidate" step happens for each and every response. You'll get the "stale" response back to the page almost immediately, so your overall performance should still be good, but you're incurring extra requests made by your service worker "in the background" to re-fetch each URL, and update the cache. There's an increase in bandwidth and CPU cycles used in this approach.
Apart from using additional resources, another reason you might prefer precaching to stale-while-revalidate is that you can populate your full cache ahead of time, without having to wait for the first time they're accessed. If there are certain assets that are only used on a subsection of your web app, and you'd like those assets to be cached ahead of time, that would be trickier to do if you're only doing runtime caching.
And one more advantage offered by precaching is that it updates your cache en masse. This helps avoid scenarios where, e.g., one JavaScript file was updated by virtue of being requested on a previous page, but then when you navigate to the next page, the newer JavaScript isn't compatible with the DOM provided by your stale HTML. Precaching everything reduces the chance of these versioning mismatches happening. (Especially if you do not enable skipWaiting.)
When do we use refetching dispatcher flush agents and what is the purpose of using them?
I didn't find much info about this in AEM documentation.
The reason for using a refetch flush agent is to make sure your pages are cached on the dispatcher immediately after replication.
With a plain flush agent, you would flush the cache and the flushed content would only be retrieved from the publisher again after it is first requested. This creates a potential risk because if your website suddenly gets a peak of high traffic, it's possible for many requests for previously flushed pages to hit the Publisher in a very short period of time. For example, you flush a lot of pages at night when the traffic is low and in the morning, your users start coming to the site to see what's new. In this scenario, it's likely for the Dispatcher to receive multiple concurrent requests for the same page and forward them to the Publisher so you're looking at more than a single request per page.
To quote the Adobe documentation:
Deleting cached files in this manner is appropriate for web sites that are not likely to receive simultaneous requests for the same page.
Using a refetch flush agent allows you to pre-populate the cache as it instructs the Dispatcher to retrieve a page from the Publish instance immediately after the flush occurs. This way the Dispatcher is unlikely to call the Publisher to process multiple concurrent requests for the same content and you're in control of when the re-fetch happens. Any potential increase in traffic that happens later on will just result in pages being served from the Dispatcher cache without affecting the Publish instance.
Refetch agents give you more control over when the Publish instance is supposed to render the pages. You're in control of the replication events and you know when the pages will have to be rendered by the Publish instance. For example, you can do a refetch flush at night, when the traffic is low and make sure every page gets cached overnight before actual users start calling your site, increasing the load on the servers.
To quote the docs again:
Delete and immediately re-cache files when web sites are likely to receive simultaneous client requests for the same page. Immediate recaching ensures that Dispatcher retrieves and caches the page only once, instead of once for each of the simultaneous client requests.
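For reference, the delete-and-refetch variant lists the pages to re-request in the body of the invalidation call; a rough sketch (host, path, and headers are illustrative, and the exact format can differ between Dispatcher versions, so check the Adobe documentation):

POST /dispatcher/invalidate.cache HTTP/1.1
Host: dispatcher.example.com
CQ-Action: Activate
CQ-Handle: /content/mysite/en/page
Content-Type: text/plain
Content-Length: 28

/content/mysite/en/page.html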
A word of warning. You have to be very careful about using refetch agents when trying to replicate a large portion of content or if your custom AEM code is not very fast. If you activate a lot of pages at the same time, you may end up performing a DDoS attack on yourself, with the Dispatcher hammering the Publisher with a very large number of requests. The effects will differ depending on the performance of your AEM code. Flushing all of your content with an immediate refetch at the same time is a very bad idea, especially if your site requires a lot of resources to render a page.