Does retaining modified pages in the page cache make more sense (over unmodified)?

I am working on a page cache replacement policy. I have read many existing algorithms, and most of them prefer to retain modified pages in the cache. I don't really understand the reason behind it. Is it due to eviction cost, or do modified pages have a higher chance of being used again?

Among the many different policies, LRU (least recently used) gives good results and can be implemented efficiently with hardware support.
"Is it due to eviction cost, or do modified pages have a higher chance of being used again?"
Yes, both.
According to locality of reference, a recently modified page has a higher chance of being referenced again.
Another reason for retaining modified pages in the cache is that every replacement of a modified page (which has a higher chance of being referenced again) requires two transfers: first the modified page is written back to disk, and then the requested page is brought into main memory. This is very costly. In the case of an unmodified page (which has a lower chance of being referenced), only one transfer takes place, i.e. the requested page is brought into memory.
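A minimal sketch of that cost argument (illustrative only; the page object and the notion of one "transfer" per full-page disk I/O are simplifying assumptions, not any particular kernel's implementation):

```javascript
// Illustrative only: cost of evicting a page from the page cache,
// counting one "transfer" per full-page disk I/O.
function evictionCost(page) {
  // A dirty (modified) page must be written back before its frame can be
  // reused, so replacing it costs two transfers: write-back + read of the
  // newly requested page. A clean page can simply be dropped, leaving only
  // the read of the new page.
  return page.dirty ? 2 : 1;
}

// All else being equal, a policy that prefers clean pages as victims
// (i.e. retains modified pages) pays less I/O per replacement.
const candidates = [
  { id: 1, dirty: true },   // retained: eviction would cost 2 transfers
  { id: 2, dirty: false },  // evicted first: eviction costs 1 transfer
];
const victim = [...candidates].sort((a, b) => evictionCost(a) - evictionCost(b))[0];
console.log(`Evict page ${victim.id} at a cost of ${evictionCost(victim)} transfer(s)`);
```

In other words, all else being equal, evicting a clean victim costs half the I/O of evicting a dirty one.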

Precaching with service worker, why does it matter? What did I miss?

I was looking at service worker practices and workbox.
There are many articles talking about precaching; Workbox even provides the dedicated method precacheAndRoute() for just that. I think I understand the conceptual difference between precaching and runtime caching, but what confuses me is why precaching is treated so specially.
All the articles I've read about precaching emphasize how it makes a web app available when the client is offline. Isn't that what any cache (even if it's not a precache) is for? It seems a runtime cache can achieve the same thing if configured properly. Does it have to be a precache for the web app to work offline?
The only obvious difference is when the caches are created. If the client is offline, no cache can be created at all, whether it is a precache or a runtime cache; and if the caches were created during the last visit, when the client was online, how does it matter whether the cache that responds on the current visit is a precache or a runtime cache?
Consider two abstract cases for comparison. Say we have two different service workers: one (/precache/sw.js) does only precaching and the other (/runtime/sw.js) does only runtime caching, where /precache and /runtime host the same web app (meaning the same assets to be cached).
Under what scenario could the web apps at /precache and /runtime behave differently because of the different service worker setups?
In my understanding:
If the cache cannot be created (e.g. the client is offline on the first visit), then precaching and runtime caching shouldn't be any different.
If the precache can be created successfully (i.e. the client is online on the first visit), then the runtime cache can be too. (Let's not go too wild with cases like the client being online only at certain moments; they should still be the same in my examples.)
If the caches are already available, then neither precaching nor runtime caching has anything to do, so they are still the same.
The only scenario I could think of where precaching shows an advantage is when the cache needs to be updated on the current visit, where precaching makes sure the current visit gets up-to-date info. If that is the case, wouldn't a NetworkFirst runtime cache do just about the same? And still, it has nothing to do with "offline", which is what almost every article I've read about service worker precaching mentions.
How do online/offline make precaching a hero?
What did I miss here; what's so special about precaching?
One scenario where it is different could be the following.
What the app is like:
You have a landing page for your app.
You have a handful of routes that can be navigated to
Runtime cache strategy:
If the user goes to the landing page, only the landing page assets get cached.
Pre-cache strategy:
If the user goes to the landing page, all of the configured pre-cache assets get cached.
Difference:
If the user only visits the landing page and later goes offline, the pre-cache strategy still lets them navigate to and interact (in some way) with the other routes of your app, while the runtime cache strategy does not allow any navigation to the other routes.
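A minimal sketch of those two setups using the Workbox global API (the file layout, asset names, and revision strings are placeholders, not a recommended configuration):

```javascript
// /runtime/sw.js -- runtime caching only.
// Assets are cached the first time they are requested, so routes the user
// never visited before going offline have nothing in the cache.
importScripts('workbox-sw.js'); // adjust path/version to your setup

workbox.routing.registerRoute(
  ({ request }) => ['document', 'script', 'style'].includes(request.destination),
  new workbox.strategies.StaleWhileRevalidate({ cacheName: 'runtime-cache' })
);

// /precache/sw.js -- precaching only.
// Every listed asset is fetched and cached at install time, so the other
// routes work offline even if the user has only ever seen the landing page.
importScripts('workbox-sw.js');

workbox.precaching.precacheAndRoute([
  { url: '/index.html', revision: 'rev-1' },
  { url: '/route-a.html', revision: 'rev-1' },
  { url: '/route-b.html', revision: 'rev-1' },
  { url: '/app.js', revision: 'rev-1' },
]);
```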
First, your side by side service workers are restricted to those folders or paths. So they are isolated from each other.
Second, you should define a caching strategy for your application that has a mixture of precached assets and dynamic content, plus an invalidation routine/logic.
You want to preCache as much as possible without breaking any dynamic nature of your application. So cache common JS, CSS, images, fonts and pages that are used over and over.
Of course have an invalidation strategy in place to keep these up to date.
Next handle non-cached network addressable resources (URLs) from the fetch event handler. Cache them as it makes sense. And invalidate cached assets as it makes sense.
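A rough sketch of that mixture with the Workbox global API (cache names, URL patterns, and expiration values are assumptions for illustration; the expiration class is named ExpirationPlugin in Workbox v5+ and Plugin in v4):

```javascript
importScripts('workbox-sw.js'); // adjust path/version to your setup

// Precache the common shell: the JS, CSS, and pages used over and over.
// The revision values come from your build step and drive invalidation.
workbox.precaching.precacheAndRoute([
  { url: '/index.html', revision: 'rev-1' },
  { url: '/styles/app.css', revision: 'rev-1' },
  { url: '/scripts/app.js', revision: 'rev-1' },
]);

// Handle remaining network-addressable resources from the fetch handler,
// with an expiration policy acting as the invalidation routine.
workbox.routing.registerRoute(
  ({ request }) => request.destination === 'image',
  new workbox.strategies.CacheFirst({
    cacheName: 'images',
    plugins: [
      new workbox.expiration.ExpirationPlugin({
        maxEntries: 60,
        maxAgeSeconds: 7 * 24 * 60 * 60, // one week
      }),
    ],
  })
);

// Dynamic pages: prefer fresh content, fall back to cache when offline.
workbox.routing.registerRoute(
  ({ request }) => request.mode === 'navigate',
  new workbox.strategies.NetworkFirst({ cacheName: 'pages' })
);
```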
For some applications I cache the entire thing. They are usually on the small side, a few dozen to a few hundred pages, for example. For a site like Amazon I would never do that, LOL. No matter how much is cached, I always have an invalidation and update strategy that makes sense for the application/site.

Is Precaching with Workbox mandatory for PWA?

I added a few workbox.routing.registerRoute calls using staleWhileRevalidate to my app, and so far it has passed most Lighthouse tests under PWA. I am not currently using precaching at all. My question is: is it mandatory? What am I missing without precaching? workbox.routing.registerRoute is already caching everything I need. Thanks!
Nothing is mandatory. :-)
Using stale-while-revalidate for all of your assets, as well as for your HTML, is definitely a legitimate approach. It means that you don't have to do anything special as part of your build process, for instance, which could be nice in some scenarios.
Whenever you're using a strategy that reads from the cache, whether it's via precaching or stale-while-revalidate, there's going to be some sort of revalidation step to ensure that you don't end up serving out of date responses indefinitely.
If you use Workbox's precaching, that revalidation is efficient, in that the browser only needs to make a single request for your generated service-worker.js file, and that response serves as the source of truth for whether anything precached actually changed. Assuming your precached assets don't change that frequently, the majority of the time your service-worker.js will be identical to the last time it was retrieved, and there won't be any further bandwidth or CPU cycles used on updating.
If you use runtime caching with a stale-while-revalidate policy for everything, then that "while-revalidate" step happens for each and every response. You'll get the "stale" response back to the page almost immediately, so your overall performance should still be good, but you're incurring extra requests made by your service worker "in the background" to re-fetch each URL, and update the cache. There's an increase in bandwidth and CPU cycles used in this approach.
Apart from using additional resources, another reason you might prefer precaching to stale-while-revalidate is that you can populate your full cache ahead of time, without having to wait until each asset is accessed for the first time. If there are certain assets that are only used on a subsection of your web app, and you'd like those assets to be cached ahead of time, that would be trickier to do if you're only doing runtime caching.
And one more advantage offered by precaching is that it will update your cache en masse. This helps avoid scenarios where, e.g., one JavaScript file was updated by virtue of being requested on a previous page, but then when you navigate to the next page, the newer JavaScript isn't compatible with the DOM provided by your stale HTML. Precaching everything reduces the chances of these versioning mismatches happening. (Especially if you do not enable skipWaiting.)
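For illustration, here is a hedged sketch of the build step that stamps those revision hashes into the precache manifest, assuming Workbox v5+ and workbox-build's injectManifest (file names and glob patterns are placeholders). The hashes are what make the update check cheap: the browser re-fetches only service-worker.js, and an asset is re-downloaded only if its revision changed.

```javascript
// Hypothetical build script; paths and patterns depend on your project.
const { injectManifest } = require('workbox-build');

injectManifest({
  swSrc: 'src/sw.js',               // contains precacheAndRoute(self.__WB_MANIFEST)
  swDest: 'dist/service-worker.js', // generated worker with revision hashes baked in
  globDirectory: 'dist',
  globPatterns: ['**/*.{html,js,css,png}'],
}).then(({ count, size }) => {
  console.log(`Precache manifest covers ${count} files, ${size} bytes.`);
});
```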

CDN for a RESTful API?

I have a RESTful API whose resources update once a week. That is, I update each resource once a week and then allow clients to access it. It's an ever-changing calculator.
There are probably 10,000 resources which could be requested.
Is it possible to put something like this behind a CDN? Traditionally CDNs are for undeniably static content, i.e. images. I'm not sure where my situation sits on the dynamic <-> static spectrum.
Cheers
"90% of the resources might not even get called, and if they are, they will get called only a few times. It won't be a mass of repetitive calls."
Right there in your comments, you just showed me that a CDN is not beneficial to you.
Usually the way a CDN works is that, on the first call, the content is downloaded from the origin server to the regional CDN node and then delivered to the client, meaning the first GET sees no improvement. Subsequent GETs to the same regional node do get the speed improvement. If you have little to no repetitive calls, you will not see any noticeable improvement.
As I said in the comments, for small files, clients are probably spending as much time on the DNS lookup as they are on the download. Look into a global DNS solution (like Anycast) to reduce connection times. This is easy to set up and requires little to no maintenance.
I think it's entirely reasonable to put it behind a CDN if you think your content will reach the appropriate level of scale. As long as the cache-control headers are set such that the latest content is loaded when the cached version may be stale, you'll be fine.
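As a minimal sketch of that idea (Express is assumed here purely for illustration; the route, max-age, and helper are placeholders, not tuned recommendations), weekly-updated resources could advertise their cacheability to a CDN like this:

```javascript
const express = require('express');
const app = express();

app.get('/api/resources/:id', (req, res) => {
  // Resources change roughly once a week, so let the CDN serve cached copies
  // for a day before revalidating; tune max-age to your actual update schedule.
  res.set('Cache-Control', 'public, max-age=86400');
  res.json({ id: req.params.id, value: computeResource(req.params.id) });
});

// Placeholder for whatever produces the weekly-updated resource.
function computeResource(id) {
  return `resource-${id}`;
}

app.listen(3000);
```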
The main benefit of CDNs comes when resources are requested from a variety of different sources, and so siteY.com can use the same cached version of a resource as siteX.com. Do you anticipate your resources will be requested from various different sources?

Any optimizations in reducing the number of disk accesses for inode number lookup by web-servers?

Web servers typically have a document root denoting the filesystem sub-tree visible via the web. Consequently, for example, if the document root is /home/foouser/public_html/, then the web server would map a request for http://www.foo.com/pics/foo.jpg to /home/foouser/public_html/pics/foo.jpg. This results in a series of disk requests to obtain the inode number of foo.jpg.
Do web servers do any optimizations to reduce the number of disk accesses, or is it the role of the server admin to set the document root as close to "/" as possible, to reduce the number of disk accesses in the filename-to-inode-number translation?
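A small illustrative sketch of why that mapping can involve several lookups (this is not how any particular web server or kernel implements it; real systems keep dentry/inode caches in memory):

```javascript
// Illustrative only: each path component below the root may require reading
// a directory from disk to translate the name into an inode number,
// unless that translation is already cached.
const path = require('path');

const documentRoot = '/home/foouser/public_html';
const requestPath = '/pics/foo.jpg';
const fullPath = path.join(documentRoot, requestPath);

const components = fullPath.split('/').filter(Boolean);
// [ 'home', 'foouser', 'public_html', 'pics', 'foo.jpg' ]
console.log(`Up to ${components.length} name-to-inode translations for ${fullPath}`);
```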
I know this isn't directly the answer to your question, but by setting up a caching strategy you can drastically reduce disk reads. Especially if your static content is not hosted on your server.
Options:
Host static content on a CDN:
Pros: Off-load all load onto someone else's network. Cost?
Cons: Potentially less control. Cost?
Use Contendo/Akamai, which is also a CDN, but with some differences.
Pros: You host your content, but after the first read the CDN will handle caching based on the headers you send with your content (static or not).
Cons: Sometimes headers are really annoying to manage. Cache busting (breaking your own cache) can be annoying to handle when you want to replace old content.
Cache things locally. If you are making a DB request, for instance, you can cache the result; the next time your code runs, check your in-memory cache first (as opposed to making a DB request immediately), as in the sketch after this list. You could also cache entire pages, then at the application controller/route level check whether there is a cached version of the page/asset and serve that.
Pros: Lots of control. You can cache almost anything.
Cons: A ton of work to set up caching on every little thing. You need a strategy for every part of your website.
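A minimal sketch of the "cache things locally" option mentioned above (Node.js; the TTL, key scheme, and queryDatabase() helper are illustrative assumptions):

```javascript
// In-memory cache in front of a DB call: serve from memory while fresh,
// otherwise hit the database and refresh the cached entry.
const cache = new Map();
const TTL_MS = 60 * 1000; // how long a cached entry stays fresh

async function getUser(id) {
  const key = `user:${id}`;
  const hit = cache.get(key);
  if (hit && Date.now() - hit.storedAt < TTL_MS) {
    return hit.value; // no DB round trip, no disk access
  }
  const value = await queryDatabase('SELECT * FROM users WHERE id = ?', [id]);
  cache.set(key, { value, storedAt: Date.now() });
  return value;
}

// Placeholder for whatever DB client you actually use.
async function queryDatabase(sql, params) {
  /* ... */
}
```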
My recommendation is to start out by moving your assets to Amazon S3 or Rackspace or something similar. Joyent has an offering for this as well. You could then enable CloudFront for S3, which turns on the CDN and caches things in various regions. This is a really cheap solution (depending on the number of files you have).
You could also go the contendo route.
The caching on the application side route takes quite a bit of work and completely depends on your server/language/db/configuration.

Why does the SessionSize in the WicketDebugBar differ from the pagemap serialized on disk?

I activated the Wicket DebugBar in order to trace my session size. When I navigate the web site, the indicated session size stays stable at about 25 KB.
At the same time, the pagemap serialized on disk grows by about 25 KB for each page view.
What does that mean? From what I understood, the pagemap on disk keeps all the pages. But why does the session always stay at about 25 KB?
What is the impact on a big website? If I have 1,000 parallel web sessions, will the web server need 25 MB to hold them and the disk 250 MB (10 pages * 25 KB * 1000)?
I will run some load tests to check.
The debug bar value is telling you the size of your session in memory. As you browse to another page, the old page is serialized to the session store. This provides, among other things, back button support without killing your memory footprint.
So, to answer your first question, the size on disk grows because it is holding historical data while your session stays about the same because it is holding active data.
To answer your second question, it's been some time since I have looked at it, but I believe the disk session store is capped at 10 MB or so. Furthermore, you can change the behavior of the session store to meet your needs, but that's a whole different discussion.
See this Wiki page, which describes the storage mechanisms in Wicket 1.5. It is a bit different from 1.4, but there is no such document for 1.4.
Update: the Wiki page has been moved to the guide: https://ci.apache.org/projects/wicket/guide/7.x/guide/internals.html#pagestoring