When do we use re-fetching Dispatcher flush agents, and what is the purpose of using them?
I didn't find much info about this in the AEM documentation.
The reason for using a refetch flush agent is to make sure your pages are cached on the dispatcher immediately after replication.
With a plain flush agent, you flush the cache, and the flushed content is only retrieved from the Publisher again after it is first requested. This creates a potential risk: if your website suddenly gets a traffic spike, many requests for previously flushed pages can hit the Publisher in a very short period of time. For example, you flush a lot of pages at night when traffic is low, and in the morning your users start coming to the site to see what's new. In this scenario, the Dispatcher is likely to receive multiple concurrent requests for the same page and forward them all to the Publisher, so you're looking at more than a single request per page.
To quote the Adobe documentation:
Deleting cached files in this manner is appropriate for web sites that are not likely to receive simultaneous requests for the same page.
Using a refetch flush agent allows you to pre-populate the cache, as it instructs the Dispatcher to retrieve a page from the Publish instance immediately after the flush occurs. This way, the Dispatcher is unlikely to have to call the Publisher for multiple concurrent requests for the same content, and you're in control of when the re-fetch happens. Any increase in traffic that happens later on will just result in pages being served from the Dispatcher cache without affecting the Publish instance.
Refetch agents give you more control over when the Publish instance has to render the pages. You control the replication events, so you know when the pages will have to be rendered by the Publish instance. For example, you can do a refetch flush at night when the traffic is low, and make sure every page gets cached overnight before actual users start hitting your site and increasing the load on the servers.
To quote the docs again:
Delete and immediately re-cache files when web sites are likely to receive simultaneous client requests for the same page. Immediate recaching ensures that Dispatcher retrieves and caches the page only once, instead of once for each of the simultaneous client requests.
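To make the difference concrete, here is a rough sketch of the two invalidation requests as they might be issued by hand against the Dispatcher's invalidation handler. The host and content paths are placeholders, a flush agent sends equivalent requests for you on replication events, and the exact headers and body format should be double-checked against the Adobe documentation on manually invalidating the Dispatcher cache:

```js
// Hypothetical manual flush requests (Node 18+ fetch); host and content
// paths are placeholders for your own setup.
const DISPATCHER = "http://dispatcher.example.com";

async function plainFlush(path) {
  // Plain flush: deletes the cached files. The cache is repopulated
  // only when a client next requests the page.
  await fetch(`${DISPATCHER}/dispatcher/invalidate.cache`, {
    method: "POST",
    headers: { "CQ-Action": "Activate", "CQ-Handle": path },
  });
}

async function flushWithRefetch(path) {
  // Flush with re-fetch: the same request, but the body lists the URIs
  // the Dispatcher should immediately request from the Publisher again.
  await fetch(`${DISPATCHER}/dispatcher/invalidate.cache`, {
    method: "POST",
    headers: {
      "CQ-Action": "Activate",
      "CQ-Handle": path,
      "Content-Type": "text/plain",
    },
    body: `${path}.html\n`,
  });
}

await plainFlush("/content/mysite/en/news");
await flushWithRefetch("/content/mysite/en/news");
```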
A word of warning: you have to be very careful with refetch agents when replicating a large portion of content, or if your custom AEM code is slow. If you activate a lot of pages at the same time, you may end up performing a DDoS attack on yourself, with the Dispatcher overwhelming the Publisher with a very large number of requests. The effects will differ depending on the performance of your AEM code. Flushing all of your content with an immediate refetch at the same time is a very bad idea, especially if your site requires a lot of resources to render a page.
The mapping between the dispatcher and the publisher is very important when designing the application. There are two ways:
One to One -> one publisher is connected to one dispatcher
One to Many -> one publisher is connected to 3 or more dispatchers
I could not understand which one should be selected when. Can anyone tell me the pros and cons of each option?
In general, publishers and dispatchers have different roles in your setup. You need as many of each as your load requires. In theory you can start with two of them. Whenever they cannot handle the load (CPU or disk over 100%), you add another one. (AEM as a Cloud Service actually does this dynamically.)
With some experience you can forecast the number of required dispatchers and publishers.
The following scenarios will cause a high load on the dispatchers:
many static pages (which seldom change) and a lot of static assets (images, PDFs, ...)
few pages and extremely high traffic for those
These apply when your site as a whole caches very well, because the dispatcher is a cache in front of the "CMS". In that case you probably need several dispatchers for each publisher = one to many (good caching is great, because a dispatcher is cheaper and can handle more load than a publisher).
The following scenarios will cause a higher load on the publisher, in which case you will have a one-to-one setup:
There is a CDN in front of the CMS. The CDN does a lot of static caching, so the cache hit ratio of the dispatcher goes down.
A lot of static content is already handled outside of the CMS (e.g. images are served elsewhere, such as by Adobe Dynamic Media).
You have many dynamic pages (rendered separately for each user, e.g. a banking application).
PS: you will have at least one dispatcher for each publisher. As a reverse proxy, it has an important security function. It is also a major safeguard against downtime. I know a customer that, during maintenance, runs only the dispatchers for up to 24 hours; they then just serve the static content like a normal Apache web server.
I was looking at service worker practices and workbox.
There are many articles talking about precaching, and Workbox even provides the special method precacheAndRoute() for just that. I guess I understand the conceptual difference between precache and runtime cache, but what confuses me is why precache is treated so specially.
All the articles I've read about precaching emphasize how it makes a web app available when the client is offline. Isn't that what a cache (even if it's not a precache) is for? I mean, it seems that a runtime cache can also achieve just that if configured properly. Does it have to be a precache for the web app to work offline?
The only obvious difference is when the caches are created. Well, if the client is offline, no cache can be created, no matter whether it is a precache or a runtime cache; and if the caches were created during the last visit when the client was online, how does it matter whether the cache that responds on the current visit is a precache or a runtime cache?
Consider two abstract cases to compare. Say we have two different service workers: one (/precache/sw.js) only does precaching and the other (/runtime/sw.js) only does runtime caching, where /precache and /runtime host the same web app (meaning the same assets to be cached).
Under what scenario could the web apps at /precache and /runtime run differently due to the different service worker setup?
In my understanding,
If the cache cannot be created (e.g. offline on first visit), then precache and runtime cache shouldn't be any different.
If a precache can be created successfully (i.e. the client is online on first visit), a runtime cache can too. (Let's not go too wild with cases like the client being online only at certain moments; they should still be the same in my examples.)
If caches are available, then precache and runtime cache have nothing to do, hence are still the same.
The only scenario I can think of where precache shows an advantage is when the cache needs to be updated on the current visit, where precaching makes sure the current visit gets up-to-date info. If this is the case, wouldn't a NetworkFirst runtime cache do just about the same? And still, that has nothing to do with "offline", which almost every article I've read about service worker precaching mentions.
How does online/offline make precache a hero?
What did I miss here? What's so special about precaching?
One scenario where it is different could be the following.
What the app is like:
You have a landing page for your app.
You have a handful of routes that can be navigated to.
Cache Strat:
If the user goes to the landing page, only the landing page assets would get cached.
Pre-cache Strat:
If the user goes to the landing page, all of the configured pre-cached assets would get cached.
Difference:
So if the user only goes to the landing page and then later goes offline, the pre-cache strat would allow them to navigate and interact in some way with the other routes of your app, while the runtime-cache strat would not allow navigation to the other routes.
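A minimal Workbox sketch of the two setups (the URLs and revision hashes are placeholders, and the file assumes a bundler resolves the imports):

```js
// sw.js — contrast of precaching vs. runtime caching with Workbox.
import { precacheAndRoute } from "workbox-precaching";
import { registerRoute } from "workbox-routing";
import { StaleWhileRevalidate } from "workbox-strategies";

// Pre-cache strat: everything listed here is fetched and cached at
// install time, so /about and /contact work offline even if the user
// never visited them while online.
precacheAndRoute([
  { url: "/index.html", revision: "abc123" },
  { url: "/about.html", revision: "def456" },
  { url: "/contact.html", revision: "789ghi" },
  { url: "/app.js", revision: "jkl012" },
]);

// Cache strat (runtime): a page is cached only after it has been
// requested once, so only pages the user actually visited work offline.
registerRoute(
  ({ request }) => request.destination === "document",
  new StaleWhileRevalidate({ cacheName: "pages" })
);
```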
First, your side-by-side service workers are each scoped to their own folder or path, so they are isolated from each other.
Second, you should define a caching strategy for your application that mixes precached assets with runtime-cached dynamic content, plus invalidation logic.
You want to precache as much as possible without breaking the dynamic nature of your application. So cache the common JS, CSS, images, fonts, and pages that are used over and over.
Of course have an invalidation strategy in place to keep these up to date.
Next, handle non-cached, network-addressable resources (URLs) from the fetch event handler, caching and invalidating them as makes sense, as sketched below.
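For the runtime part, a bare-bones fetch handler might look like this (the cache name and the decision of what to cache are application-specific):

```js
// Hedged sketch of runtime caching in a plain service worker fetch handler.
self.addEventListener("fetch", (event) => {
  event.respondWith(
    caches.match(event.request).then((cached) => {
      if (cached) return cached; // cache hit: serve the stored response
      return fetch(event.request).then((response) => {
        // Only cache successful GET responses.
        if (event.request.method === "GET" && response.ok) {
          const copy = response.clone();
          caches.open("runtime-v1").then((cache) => cache.put(event.request, copy));
        }
        return response;
      });
    })
  );
});
```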
For some applications I cache the entire thing. They are usually on the small side, a few dozen to a few hundred pages for example. For a site like Amazon I would never do that LOL. No matter how much is cached, I always have an invalidation and update strategy that makes sense for the application/site.
I added a few workbox.routing.registerRoute calls using staleWhileRevalidate to my app and so far it has passed most Lighthouse PWA audits. I am not currently using precaching at all. My question is: is it mandatory? What am I missing without precaching? workbox.routing.registerRoute is already caching everything I need. Thanks!
Nothing is mandatory. :-)
Using stale-while-revalidate for all of your assets, as well as for your HTML, is definitely a legitimate approach. It means that you don't have to do anything special as part of your build process, for instance, which could be nice in some scenarios.
Whenever you're using a strategy that reads from the cache, whether it's via precaching or stale-while-revalidate, there's going to be some sort of revalidation step to ensure that you don't end up serving out of date responses indefinitely.
If you use Workbox's precaching, that revalidation is efficient, in that the browser only needs to make a single request for your generated service-worker.js file, and that response serves as the source of truth for whether anything precached actually changed. Assuming your precached assets don't change that frequently, the majority of the time your service-worker.js will be identical to the last time it was retrieved, and there won't be any further bandwidth or CPU cycles used on updating.
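For reference, that generated service-worker.js and its precache manifest typically come out of a build step along these lines (a sketch using workbox-build; the directory, glob patterns, and output path are placeholders for your own setup):

```js
// build-sw.js — generating the precache manifest with workbox-build.
const { generateSW } = require("workbox-build");

generateSW({
  globDirectory: "dist/",
  globPatterns: ["**/*.{html,js,css,png}"],
  swDest: "dist/service-worker.js",
}).then(({ count, size }) => {
  // Each precached URL gets a revision hash in the manifest, so the
  // browser re-downloads an asset only when its hash changes.
  console.log(`Precached ${count} files, totalling ${size} bytes.`);
});
```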
If you use runtime caching with a stale-while-revalidate policy for everything, then that "while-revalidate" step happens for each and every response. You'll get the "stale" response back to the page almost immediately, so your overall performance should still be good, but you're incurring extra requests made by your service worker "in the background" to re-fetch each URL, and update the cache. There's an increase in bandwidth and CPU cycles used in this approach.
Apart from using additional resources, another reason you might prefer precaching to stale-while-revalidate is that you can populate your full cache ahead of time, without having to wait for the first time they're accessed. If there are certain assets that are only used on a subsection of your web app, and you'd like those assets to be cached ahead of time, that would be trickier to do if you're only doing runtime caching.
And one more advantage offered by precaching is that it updates your cache en masse. This helps avoid scenarios where, e.g., one JavaScript file was updated by virtue of being requested on a previous page, but then when you navigate to the next page, the newer JavaScript isn't compatible with the DOM provided by your stale HTML. Precaching everything reduces the chances of these versioning mismatches happening. (Especially if you do not enable skipWaiting.)
I want to find out whether there is any difference between a CQ dispatcher cache flush (from the publish instance) and dispatcher cache invalidation.
Any help please?
Dispatcher is a reverse proxy server that can cache data from an HTTP source. In the case of AEM, that source is normally the publish or author instance, although in theory it can be any resource provider. This backend is called a "Renderer".
Cache invalidation is an HTTP operation triggered by the publisher that marks the cache of a resource as invalid on the dispatcher. This operation only deletes the resource(s); it does not refresh them.
Flush is the workflow associated with publishing a page: it invalidates the dispatcher cache from the publish/author instance when new content or a new resource is published. It is a very common scenario to invalidate the cache during publish so that the new content is available on your site.
There are scenarios where you want to refresh the cache without re-publishing the content. For example, after a release you might want to regenerate all the pages from the publisher, because the changes are not editorial and hence none of the authors will be publishing content. In this case, you simply invalidate the cache without using the publish workflow. In practice it's often easier to zap the cache directory on the dispatcher than to flush all the pages, but that's a preference. This is where the separation of flush and invalidation really matters; apart from that, nothing is really different, as the end result is almost the same.
This Adobe article seems to use "flush" and "invalidate" interchangeably.
It says:
Manually Invalidating the Dispatcher Cache
To invalidate (or flush) the Dispatcher cache without activating a page, you can issue an HTTP request to the dispatcher. For example, you can create a CQ application that enables administrators or other applications to flush the cache.
The HTTP request causes Dispatcher to delete specific files from the cache. Optionally, the Dispatcher then refreshes the cache with a new copy.
It also talks about configuring a "Dispatcher Flush" agent, and the config for that agent invokes an HTTP request that has "invalidate.cache" in the URL.
CQ basically calls the "Dispatcher Flush Rule Service" from OSGi, which invokes the replication action type "Invalidate Cache". So, to flush the cache, the CQ replication agents call the action named "Invalidate Cache".
The terminology is a little confusing, but it's just a service and action combination in OSGi.
There are two things through which the cache is modified:
1. Content update
2. Auto-Invalidation
A content update comes into the picture when any AEM page is modified.
Auto-invalidation is used when there are many automatically generated pages: the dispatcher flush agent checks for the latest version of the files and marks files out of date accordingly by updating the stat file.
I am noticing latency in REST data the first time I visit a web site served via Azure Mobile Services. Is there a cache, or does a connection time out after a set amount of time? I am worried about the user experience of waiting 7-8 seconds for the data to load (and there is not a lot of data; I am testing with 10 records returned). Once the first connection is made, subsequent visits load quickly, but if I don't visit the site for a while, we are back to 7-8 seconds on first load.
Reason: the latency comes from the "shared" mode. When the first call to the service is made, it performs a "cold start" (initializing and starting the virtual server, etc.).
As you described in your question, after a while of the service not being used, it is put into "sleep mode" again.
Solution: if you do not want this waiting time, you can set your service to "reserved" mode, which keeps the service active at all times, even when you do not access it for a while. But be aware that this incurs extra fees.
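If the extra fees for the reserved tier are not an option, one common (if crude) workaround for cold starts is to keep the instance warm by pinging it periodically. A sketch, where the endpoint URL and the interval are placeholders (not part of any Azure API):

```js
// Hedged keep-alive sketch: ping the shared-mode service regularly so it
// is less likely to be put to sleep between real visits.
const SERVICE_URL = "https://yourservice.azure-mobile.net/api/ping";

setInterval(async () => {
  try {
    await fetch(SERVICE_URL); // any lightweight endpoint will do
  } catch (err) {
    console.error("Keep-alive ping failed:", err);
  }
}, 5 * 60 * 1000); // every 5 minutes
```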