I have a question about page replacement algorithms. FIFO suffers from Belady's Anomaly but LRU doesn't. Does anyone know why LRU doesn't suffer? I've been searching for the reason on the internet but no luck.
Because LRU is a stacking algorithm, and using k frames will always be a subset of k + n frames for LRU. Thus, any page-faults that may occur for k + n frames will also occur for k frames, which in turn means that LRU doesn't suffer Belady's anomaly.
Because FIFO assumes that the mere fact that a page has been occupying memory for a long time that it is the safest to replace, where in reality that simply isn't the case. Rather, where FIFO fails is that statistically, if a page has been called frequently, it's more likely to be called again than another page which has been called recently. In other words, frequency is a far better determiner of page loading than age.
Similar to Caspar's answer, however I found the explanation from my textbook (slightly edited) to be a bit more clear.
[LRU belongs] to a class of page-replacement algorithms, called stack algorithms, [which] can never exhibit Belady’s anomaly.
A stack algorithm is an algorithm for which it can be shown that the set of pages in memory for N frames is always a subset of the set of pages that would be in memory with N + 1 frames. [Therefore an additional frame will never cause an additional page fault.]
For LRU replacement, the set of pages in memory would be the N most recently referenced pages. If the number of frames is increased, these N pages will still be the most recently referenced and so will still be in memory.
Silberschatz, A., Galvin, P. B., & Gagne, G. (2014). Operating System Concepts (9th ed.). Singapore: Wiley.
Related
I have a book statement:
Implementation of LRU in full associative TLB is very expensive, so the general way is to use random substitution.
I don't understand why it's expensive under full associative cache. Isn't that just adding an additional reference bit...?
LRU requires maintaining a total order relation between all valid cache lines in a cache set. For example, consider a 3-way cache set with the following lines A, B, and C ordered from the most recently accessed to the least recently accessed (represented as ABC). If C is accessed next, then the order becomes CAB. If a new line, D, needs to be filled in the same cache set, since there are no invalid lines, the LRU replacement policy will choose B to be evicted and replaced by the new line. Then the order becomes DCA.
For a 3-way cache, there are up to 3*2 = 6 possible orders for the lines in each set. In general, for an N-way cache, there are up to N! (N factorial) possible orders. Theoretically, you need at least log2(N!) bits (rounded up to the nearest integer) per cache set to maintain the LRU property accurately. Note that log2(N!) is Θ(Nlog(N)), so it grows superlinearly with respect to the number of ways. No normal person likes anything whose cost grows superlinearly.
A particularly cheap case is a 2-way cache, where the LRU state requires only log2(2!) = 1 bits, i.e., a single bit. It is much more expensive for any other number of ways though.
In practice, though, there is no easy way to maintain a single number that represents the LRU state of a set. If the current LRU state is X and then some access to a line occurs, how can the next LRU state be determined? There is no simple mathematical relation that can be implemented in hardware. So instead of using a single number, a realistic implementation would use multiple numbers, one per cache line. In this case, these numbers are called ages. Such design would even require (many) more bits than the theoretical minimum log2(N!) to maintain the LRU state.
Aside from the hardware overhead, the LRU replacement policy is not necessarily optimal for performance. It depends on the memory access patterns of the applications in the target market domain and the rest of the cache hierarchy.
LRU has been used in many real processors. Caches that are 2-way associative typically use LRU. For example, AMD SledgeHammer uses LRU for both L1I and L1D caches. The Itanium 2 processor's L1 instruction cache uses LRU and it is 4-way associative. Usually, when the number of ways is larger than two, caches don't use LRU.
I have a bunch of threads with a bunch of counters. Threads decrement the counters, and interesting things happen if a counter hits zero. This is trivial to implement with atomic ops.
However, it gets harder if we require two properties to hold regardless of the number of threads or counters:
Scalability: Decrementing a counter is O(polylog).
Compactness: The memory per counter is O(1).
I know how to do either one of these in isolation: the trivial implementation is compact and hierarchical counting networks [4] are scalable). Is it possible to do both?
Note: Since O(n) threads can't make O(n) different changes O(1) memory in time less than O(n), solving this requires sharing data structure between the different counters.
[4]: J. Aspnes, M. Herlily, and N. Shavit. Counting networks. Journal of the ACM, 41(5):1020-1048, Sept 1994.
Update: Jed Brown pointed out the obvious fact that O(1) time is impossible. Changed to polylog.
Have you tried Dr. Cliff Click's ConcurrentAutoTable (Counter) from the high-scale-lib:
http://sourceforge.net/projects/high-scale-lib/files/high-scale-lib/high-scale-lib-v1.1.1/
http://www.youtube.com/watch?v=WYXgtXWejRM
http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf
http://www.infoq.com/news/2008/05/click_non_blocking/
http://www.azulsystems.com/events/javaone_2008/2008_CodingNonBlock.pdf
There was a paper for scalable counters. You basically have a tree, where each thread has a node, and a thread wishing to inc/dec posts that fact, and then it begins to climb the tree, up to the counter, which is at the top, accumulating inc/dec values on the way, then applying the totals to the counter at the top. (That's the gist of it - lots of extra detail).
It distributes the inc/dec away from a single cache line, which of course is something that prevents scalability.
Check out the white papers in the wiki at http://www.liblfds.org - you'll find it there.
I have a project where I am asked to develop an application to simulate how different page replacement algorithms perform (with varying working set size and stability period). My results:
Vertical axis: page faults
Horizontal axis: working set size
Depth axis: stable period
Are my results reasonable? I expected LRU to have better results than FIFO. Here, they are approximately the same.
For random, stability period and working set size doesnt seem to affect the performance at all? I expected similar graphs as FIFO & LRU just worst performance? If the reference string is highly stable (little branches) and have a small working set size, it should still have less page faults that an application with many branches and big working set size?
More Info
My Python Code | The Project Question
Length of reference string (RS): 200,000
Size of virtual memory (P): 1000
Size of main memory (F): 100
number of time page referenced (m): 100
Size of working set (e): 2 - 100
Stability (t): 0 - 1
Working set size (e) & stable period (t) affects how reference string are generated.
|-----------|--------|------------------------------------|
0 p p+e P-1
So assume the above the the virtual memory of size P. To generate reference strings, the following algorithm is used:
Repeat until reference string generated
pick m numbers in [p, p+e]. m simulates or refers to number of times page is referenced
pick random number, 0 <= r < 1
if r < t
generate new p
else (++p)%P
UPDATE (In response to #MrGomez's answer)
However, recall how you seeded your input data: using random.random,
thus giving you a uniform distribution of data with your controllable
level of entropy. Because of this, all values are equally likely to
occur, and because you've constructed this in floating point space,
recurrences are highly improbable.
I am using random, but it is not totally random either, references are generated with some locality though the use of working set size and number page referenced parameters?
I tried increasing the numPageReferenced relative with numFrames in hope that it will reference a page currently in memory more, thus showing the performance benefit of LRU over FIFO, but that didn't give me a clear result tho. Just FYI, I tried the same app with the following parameters (Pages/Frames ratio is still kept the same, I reduced the size of data to make things faster).
--numReferences 1000 --numPages 100 --numFrames 10 --numPageReferenced 20
The result is
Still not such a big difference. Am I right to say if I increase numPageReferenced relative to numFrames, LRU should have a better performance as it is referencing pages in memory more? Or perhaps I am mis-understanding something?
For random, I am thinking along the lines of:
Suppose theres high stability and small working set. It means that the pages referenced are very likely to be in memory. So the need for the page replacement algorithm to run is lower?
Hmm maybe I got to think about this more :)
UPDATE: Trashing less obvious on lower stablity
Here, I am trying to show the trashing as working set size exceeds the number of frames (100) in memory. However, notice thrashing appears less obvious with lower stability (high t), why might that be? Is the explanation that as stability becomes low, page faults approaches maximum thus it does not matter as much what the working set size is?
These results are reasonable given your current implementation. The rationale behind that, however, bears some discussion.
When considering algorithms in general, it's most important to consider the properties of the algorithms currently under inspection. Specifically, note their corner cases and best and worst case conditions. You're probably already familiar with this terse method of evaluation, so this is mostly for the benefit of those reading here whom may not have an algorithmic background.
Let's break your question down by algorithm and explore their component properties in context:
FIFO shows an increase in page faults as the size of your working set (length axis) increases.
This is correct behavior, consistent with Bélády's anomaly for FIFO replacement. As the size of your working page set increases, the number of page faults should also increase.
FIFO shows an increase in page faults as system stability (1 - depth axis) decreases.
Noting your algorithm for seeding stability (if random.random() < stability), your results become less stable as stability (S) approaches 1. As you sharply increase the entropy in your data, the number of page faults, too, sharply increases and propagates the Bélády's anomaly.
So far, so good.
LRU shows consistency with FIFO. Why?
Note your seeding algorithm. Standard LRU is most optimal when you have paging requests that are structured to smaller operational frames. For ordered, predictable lookups, it improves upon FIFO by aging off results that no longer exist in the current execution frame, which is a very useful property for staged execution and encapsulated, modal operation. Again, so far, so good.
However, recall how you seeded your input data: using random.random, thus giving you a uniform distribution of data with your controllable level of entropy. Because of this, all values are equally likely to occur, and because you've constructed this in floating point space, recurrences are highly improbable.
As a result, your LRU is perceiving each element to occur a small number of times, then to be completely discarded when the next value was calculated. It thus correctly pages each value as it falls out of the window, giving you performance exactly comparable to FIFO. If your system properly accounted for recurrence or a compressed character space, you would see markedly different results.
For random, stability period and working set size doesn't seem to affect the performance at all. Why are we seeing this scribble all over the graph instead of giving us a relatively smooth manifold?
In the case of a random paging scheme, you age off each entry stochastically. Purportedly, this should give us some form of a manifold bound to the entropy and size of our working set... right?
Or should it? For each set of entries, you randomly assign a subset to page out as a function of time. This should give relatively even paging performance, regardless of stability and regardless of your working set, as long as your access profile is again uniformly random.
So, based on the conditions you are checking, this is entirely correct behavior consistent with what we'd expect. You get an even paging performance that doesn't degrade with other factors (but, conversely, isn't improved by them) that's suitable for high load, efficient operation. Not bad, just not what you might intuitively expect.
So, in a nutshell, that's the breakdown as your project is currently implemented.
As an exercise in further exploring the properties of these algorithms in the context of different dispositions and distributions of input data, I highly recommend digging into scipy.stats to see what, for example, a Gaussian or logistic distribution might do to each graph. Then, I would come back to the documented expectations of each algorithm and draft cases where each is uniquely most and least appropriate.
All in all, I think your teacher will be proud. :)
My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)
On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach at an estimation for the processor load is OK, though, and is sensible: Make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative time is usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one so thorough for other platforms, except processor / instruction set manuals, but they deal with assembly instructions)
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.
xCode has instruments to measure performance of each function/operation, you can simply use them.
Should I avoid recursion with code that runs on the iPhone?
Or put another way, does anyone know the max stack size on the iphone?
Yes, avoiding recursion is a good thing on all embedded platforms.
Not only does it lowers or even removes the chance of a stack-overflow, it often gives you faster code as well.
You can always rewrite a recursive algorithm to be iterative. That's not always practical though (think quicksort). A way to get around this is to rewrite the algorithms in a way that the recursion depth is limited.
The introsort is a perfect example how it's done in practice. It limits the recursion depth of a quicksort to log2 (number-of-elements). So on a 32 bit machine you will never recurse deeper than 32.
http://en.wikipedia.org/wiki/Introsort
I've written quite a bit of software for embedded platforms in the past (car entertainment systems, phones, game-consoles and the like) and I always made sure that I put a upper limit on the recursion depth or avoided recursion at the first place.
As a result none of my programs ever died with a stack-overflow and most programs are happy with 32kb of stack. This pays off big time once you need multiple threads as each thread gets it's own stack.. You can save megabytes of memory that way.
I see a couple of answers that boil down to "don't use recursion". I disagree - it's not like the iPhone is some severely-constrained embedded system. If a problem is inherently recursive, feel free to express it that way.
Unless you're recursing to a stack depth of hundreds or thousands of frames, you'll never have an issue.
The max stack size on the iphone?
The iPhone runs a modified OSX in which every process is given a valid memory space, just like in most operating systems.
It's a full processor, so stack grows up, and heap grows down (or vice versa depending on your perspective). This means that you won't overflow the stack until you run out of memory allocated to your program.
It's best to avoid recursion when you can for stack and performance reasons (function calls are expensive, relative to simple loops), but in any case you should decide what limits you can place on recursive functions, and truncate them if they go too long.