Difference between instruction reference and data refernce - cpu-architecture

I'm not able to get why 0.4 is used in finding miss rate of data cache rather than using 0.3 as it is also given that 30% of the instructions are data refernces.
(https://i.stack.imgur.com/jJpC1.png)

The given fact is that there are 40 misses per thousand instructions for data cache. Data cache will only be accessed for Load/Store instructions. There are only 40% of Load/Store Instructions. Therefore in order to get the miss rate, you need to divide the value by 0.4.

Related

What is the benefit of having the registers as a part of memory in AVR microcontrollers?

Larger memories have higher decoding delay; why is the register file a part of the memory then?
Does it only mean that the registers are "mapped" SRAM registers that are stored inside the microprocessor?
If not, what would be the benefit of using registers as they won't be any faster than accessing RAM? Furthermore, what would be the use of them at all? I mean these are just a part of the memory so I don't see the point of having them anymore. Having them would be just as costly as referencing memory.
The picture is taken from Avr Microcontroller And Embedded Systems The: Using Assembly and C by Muhammad Ali Mazidi, Sarmad Naimi, and Sepehr Naimi
AVR has some instructions with indirect addressing, for example LD (LDD) – Load Indirect From Data Space to Register using Z:
Loads one byte indirect with or without displacement from the data space to a register. [...]
The data location is pointed to by the Z (16-bit) Pointer Register in the Register File.
So now you can move from a register by loading its data-space address into Z, allowing indirect or indexed register-to-register moves. Certainly one can think of some usage where such indirect access would save the odd instruction.
what would be the benefit of using registers as they won't be any faster than accessing RAM?
accessing General purpose Registers is faster than accessing Ram
first of all let us define how fast measured in microControllers .... fast mean how many cycle the instruction will take to excute ... LOOk at the avr architecture
See the General Purpose Registers GPRs are input for the ALU , and the GPRs are controlled by instruction register (2 byte width) which holds the next instruction from the code memory.
Let us examine simple instruction ADD Rd , Rr; where Rd,Rr are any two register in GPRs so 0<=r,d<=31 so each of r and d could be rebresented in 5 bit,now open "AVR Instruction Set Manual" page number 32 look at the op-code for this simple add instraction is 000011rdddddrrrr and becuse this op-code is two byte(code memory width) this will fetched , Decoded and excuit in one cycle (under consept of pipline ofcourse) jajajajjj only one cycle seems cool to me
I mean these are just a part of the memory so I don't see the point of having them anymore. Having them would be just as costly as referencing memory
You suggest to make the all ram as input for the ALU; this is a very bad idea: a memory address takes 2 bytes.
If you have 2 operands per instruction as in Add instruction you will need 4 Byte for saving only the operands .. and 1 more byte for the op-code of the operator itself in total 5 byte which is waste of memory!
And furthermore this architecture could only fetch 2 bytes at a time (instruction register width) so you need to spend more cycles on fetching the code from code memory which is waste of cycles >> more slower system
Register numbers are only 4 or 5 bits wide, depending on the instruction, allowing 2 per instruction with room to spare in a 16-bit instruction word.
conclusion GPRs' existence are crucial for saving code memory and program execution time
Larger memories have higher decoding delay; why is the register file a part of the memory then?
When cpu deal with GPRs it only access the first 32 position not all the data space
Final comment
don't disturb yourself by time diagram for different ram technology because you don't have control on it ,so who has control? the architecture designers , and they put the limit of the maximum crystal frequency you can use with there architecture and everything will be fine .. you only concern about cycles consuming with your application

SD card lifetime optimization

Simple question:
Which approach is best in terms of prolonging the life expectancy of an SD card?
Writing 10-minute files with 10 Hz lines of data input (~700 kB each)
1) directly to the SD card
or
2) to the internal memory of the device, then moving the file to the SD card
?
The amount of data being written to the SD card remains the same. The question is simply if a lot of tiny file operations (6000 lines written in the course of ten minutes, 100 ms apart) or one file operation moving the entire file containing the 6000 lines onto the card as once is better. Or does it even matter? Of course the card specifications are hugely important as well, but let's leave that out of the discussion.
1) You should only write to fill flash page boundaries discussed here:
https://electronics.stackexchange.com/questions/227686/sd-card-sector-size
2) Keeping fault-tolerant track of how much data is written where also needs to be written. That counts as a write hit on FAT etc as well, on a page that gets more traffic than others. Avoid if possible (ie fdup/fclose/fopen append) techniques which cause buffer and directory cached data to be flushed. But I would use this trick every minute or so so you never lose more than a minute of data on a crash or accidental removal.
3) OS-supported wear leveling will solve the above, if properly implemented. I have read horror stories about flash memories being destroyed in days.
4) Calculate expected life using the total wear-leveled lifetime writes spec of that memory. Usually in TB's. If you see numbers in the decades, don't bother doing more than (1).
5) Which OS and file-system you are using matters somewhat. For example EXT3 is supposedly faster than EXT2 due to less drive access at a slightly higher risk ratio. Since your question doesn't ask about OS/FS you use, I'll leave the rest of that up to you.

Total MongoDB storage size

I have a sharded and replicated MongoDB with dozens millions of records. I know that Mongo writes data with some padding factor, to allow fast updates, and I also know that to replicate the database Mongo should store operation log which requires some (actually, a lot of) space. Even with that knowledge I have no idea how to estimate the actual size required by Mongo given a size of a typical database record. By now I have a descrepancy with a factor of 2 - 3 between weekly repairs.
So the question is: How to estimate a total storage size required by MongoDB given an average record size in bytes?
The short answer is: you can't, not based solely on avg. document size (at least not in any accurate way).
To explain more verbosely:
The space needed on disk is not simply a function of the average document size. There is also the space needed for any indexes you create. Then there is the space needed if you do trigger those moves (despite padding, this does happen) - that space is placed on a list to be re-used but depending on the data you subsequently insert, it may or may not be possible to re-use that space.
You can also add into the fact that pre-allocation will mean that occasionally a handful of documents will increase your on-disk space utilization by ~2GB as a new data file is allocated. Of course, with sufficient data, this will be essentially a rounding error but it is worth bearing in mind.
The only way to estimate this type of data to size ratio, assuming a consistent usage pattern, is to trend it over time for your particular use case and track the disk space usage versus the data inserted (number of documents might be better than data volume depending on variability of doc size).
Similarly, if you track the insertion rate, doc size and the space gained back from a resync/repair. FYI - you can resync a secondary from scratch to get a "fresh" copy of the data files rather than running a repair, which can be less disruptive, and use less space depending on your set up.

Virtual Memory Page Replacement Algorithms

I have a project where I am asked to develop an application to simulate how different page replacement algorithms perform (with varying working set size and stability period). My results:
Vertical axis: page faults
Horizontal axis: working set size
Depth axis: stable period
Are my results reasonable? I expected LRU to have better results than FIFO. Here, they are approximately the same.
For random, stability period and working set size doesnt seem to affect the performance at all? I expected similar graphs as FIFO & LRU just worst performance? If the reference string is highly stable (little branches) and have a small working set size, it should still have less page faults that an application with many branches and big working set size?
More Info
My Python Code | The Project Question
Length of reference string (RS): 200,000
Size of virtual memory (P): 1000
Size of main memory (F): 100
number of time page referenced (m): 100
Size of working set (e): 2 - 100
Stability (t): 0 - 1
Working set size (e) & stable period (t) affects how reference string are generated.
|-----------|--------|------------------------------------|
0 p p+e P-1
So assume the above the the virtual memory of size P. To generate reference strings, the following algorithm is used:
Repeat until reference string generated
pick m numbers in [p, p+e]. m simulates or refers to number of times page is referenced
pick random number, 0 <= r < 1
if r < t
generate new p
else (++p)%P
UPDATE (In response to #MrGomez's answer)
However, recall how you seeded your input data: using random.random,
thus giving you a uniform distribution of data with your controllable
level of entropy. Because of this, all values are equally likely to
occur, and because you've constructed this in floating point space,
recurrences are highly improbable.
I am using random, but it is not totally random either, references are generated with some locality though the use of working set size and number page referenced parameters?
I tried increasing the numPageReferenced relative with numFrames in hope that it will reference a page currently in memory more, thus showing the performance benefit of LRU over FIFO, but that didn't give me a clear result tho. Just FYI, I tried the same app with the following parameters (Pages/Frames ratio is still kept the same, I reduced the size of data to make things faster).
--numReferences 1000 --numPages 100 --numFrames 10 --numPageReferenced 20
The result is
Still not such a big difference. Am I right to say if I increase numPageReferenced relative to numFrames, LRU should have a better performance as it is referencing pages in memory more? Or perhaps I am mis-understanding something?
For random, I am thinking along the lines of:
Suppose theres high stability and small working set. It means that the pages referenced are very likely to be in memory. So the need for the page replacement algorithm to run is lower?
Hmm maybe I got to think about this more :)
UPDATE: Trashing less obvious on lower stablity
Here, I am trying to show the trashing as working set size exceeds the number of frames (100) in memory. However, notice thrashing appears less obvious with lower stability (high t), why might that be? Is the explanation that as stability becomes low, page faults approaches maximum thus it does not matter as much what the working set size is?
These results are reasonable given your current implementation. The rationale behind that, however, bears some discussion.
When considering algorithms in general, it's most important to consider the properties of the algorithms currently under inspection. Specifically, note their corner cases and best and worst case conditions. You're probably already familiar with this terse method of evaluation, so this is mostly for the benefit of those reading here whom may not have an algorithmic background.
Let's break your question down by algorithm and explore their component properties in context:
FIFO shows an increase in page faults as the size of your working set (length axis) increases.
This is correct behavior, consistent with Bélády's anomaly for FIFO replacement. As the size of your working page set increases, the number of page faults should also increase.
FIFO shows an increase in page faults as system stability (1 - depth axis) decreases.
Noting your algorithm for seeding stability (if random.random() < stability), your results become less stable as stability (S) approaches 1. As you sharply increase the entropy in your data, the number of page faults, too, sharply increases and propagates the Bélády's anomaly.
So far, so good.
LRU shows consistency with FIFO. Why?
Note your seeding algorithm. Standard LRU is most optimal when you have paging requests that are structured to smaller operational frames. For ordered, predictable lookups, it improves upon FIFO by aging off results that no longer exist in the current execution frame, which is a very useful property for staged execution and encapsulated, modal operation. Again, so far, so good.
However, recall how you seeded your input data: using random.random, thus giving you a uniform distribution of data with your controllable level of entropy. Because of this, all values are equally likely to occur, and because you've constructed this in floating point space, recurrences are highly improbable.
As a result, your LRU is perceiving each element to occur a small number of times, then to be completely discarded when the next value was calculated. It thus correctly pages each value as it falls out of the window, giving you performance exactly comparable to FIFO. If your system properly accounted for recurrence or a compressed character space, you would see markedly different results.
For random, stability period and working set size doesn't seem to affect the performance at all. Why are we seeing this scribble all over the graph instead of giving us a relatively smooth manifold?
In the case of a random paging scheme, you age off each entry stochastically. Purportedly, this should give us some form of a manifold bound to the entropy and size of our working set... right?
Or should it? For each set of entries, you randomly assign a subset to page out as a function of time. This should give relatively even paging performance, regardless of stability and regardless of your working set, as long as your access profile is again uniformly random.
So, based on the conditions you are checking, this is entirely correct behavior consistent with what we'd expect. You get an even paging performance that doesn't degrade with other factors (but, conversely, isn't improved by them) that's suitable for high load, efficient operation. Not bad, just not what you might intuitively expect.
So, in a nutshell, that's the breakdown as your project is currently implemented.
As an exercise in further exploring the properties of these algorithms in the context of different dispositions and distributions of input data, I highly recommend digging into scipy.stats to see what, for example, a Gaussian or logistic distribution might do to each graph. Then, I would come back to the documented expectations of each algorithm and draft cases where each is uniquely most and least appropriate.
All in all, I think your teacher will be proud. :)

Prefetch distance and degree of prefetch

what is the difference between prefetch distance and degree of prefetching?
Prefetching typically deals with entire cache lines. So a given prefetch request will bring in the cache line that would hold the specified address.
Due to the huge differences in memory speeds, it can take many cycles to bring data into the cache. Some latencies are in the dozens of cycles if not longer. Now, the only way to really benefit from a prefetch is to issue it far enough ahead of the actual use of the data so that there's enough time for the machine to pull the data into the cache. This implies that data access be predictable so one can anticipate what memory needs to be in the cache. The simplest case is marching through a linear array. Now, a common scenario (in 'scientific code') is a loop that reads the data then processes it. The cache miss penalty may be high and the processor may be very fast, and simply prefetching the next cache line may not be sufficient as we may be finished processing the array corresponding to the current cache line and waiting for the data in the neighbouring cache line before the data has arrived at the cache. So we may have to fetch further away than the next cache line.
How far ahead you prefetch is the distance e.g. 512 bytes. The degree of prefetching is the distance in terms of cache lines i.e. if your cache line is 256 bytes, the degree of prefetching is 2.
Prefetching degree is the number of cache lines to prefetch at each trigger.
Prefetching distance is the concept from array within loop. D = ceil(l / s), l is average memory latency in terms of number of cycles, s is cycle time of shortest execution path. D is be number of iterations ahead for a certain array element, so that memory latency is covered.