Paging and TLB - operating system

A hierarchical memory system that uses cache memory has a cache access time of 50 nanoseconds, a main memory access time of 300 nanoseconds, 75% of memory requests are reads, a hit ratio of 0.8 for read accesses, and the write-through scheme is used. What will be the average access time of the system for both read and write requests?
A 157.5 ns
B 110 ns
C 75 ns
D 82.5 ns

Answer is A: 157.5 ns for read and write.
Explanation:
1) Average read access time = 0.8*50 + 0.2*(50+300) = 110 ns.
2) For the combined read & write average, take 75% of the average read access time (0.75*110) plus 25% of the write time, which under write-through always goes to main memory (0.25*300):
0.75*110 + 0.25*300 = 157.5 ns.
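The same arithmetic as a minimal Python sketch (values taken from the question; the only assumption is that a read miss pays the cache lookup plus the main memory access, as in the explanation above):

# Sketch of the average access time calculation above.
cache_t, mem_t = 50, 300          # ns
read_hit_ratio = 0.8
read_fraction = 0.75

# Read: hit -> cache only; miss -> cache lookup + main memory access.
avg_read = read_hit_ratio * cache_t + (1 - read_hit_ratio) * (cache_t + mem_t)

# Write-through: every write goes to main memory.
avg_write = mem_t

avg_access = read_fraction * avg_read + (1 - read_fraction) * avg_write
print(avg_read)    # 110.0 ns
print(avg_access)  # 157.5 ns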

Related

Interpreting .Q.w[] for potential problems?

From this page we know that .Q.w[] gives us for example:
used| 108432 / bytes malloced
heap| 67108864 / heap bytes available
peak| 67108864 / heap high-watermark in bytes
wmax| 0 / workspace limit from -w param
mmap| 0 / amount of memory mapped
syms| 537 / number of symbols interned
symw| 15616 / bytes used by 537 symbols
If I wanted to monitor the instance for memory issues (e.g. memory full), should I be looking at used or heap or a combination?
If you want to monitor how much is currently being used, look at used, but it's only a rough estimate of the actual usage, as it doesn't take into account the memory used by interned strings (symbols) or memory-mapped files.
Monitoring the heap is useful to get a sense of how your memory spikes (and peak gives the maximum spike), but it isn't necessarily ideal for telling you when you're close to your limit: if you have a big memory spike that hits the limit, the process will die before you have a chance to observe that the spike was close to the limit.
Ultimately I would monitor both (and peak) and allow yourself buffers in both cases. Have a low-level alert if the heap/peak reaches say 50% of the limit, higher levels at 60%, 70% etc. Then also monitor your used as a percentage of your heap/peak. If your used is a high percentage of your heap - and your heap is a high percentage of your limit - then this could be alarming. Essentially your process could either be:
1. Low-to-medium memory usage but spiking:
If used is generally a low-to-medium percentage of the heap/peak, then your process is using low-to-medium memory but spiking. This is pretty harmless and expected if crunching a lot of data.
2. used is a high % of heap/peak, and heap/peak is a high % of the limit:
Here you might have a situation where a process is storing more and more memory without releasing it. So used is continually growing and the heap/peak is continually growing with it. This is a problem if left unchecked.
So essentially you want to capture behaviour 2 while allowing behaviour 1.
There are some other behaviour patterns as well, but this is the general gist. Whether or not automatic garbage collection is enabled also plays into it. If automatic garbage collection isn't enabled and used is a lot less than heap, then the process is holding on to memory that it doesn't need.
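As a rough Python sketch of the thresholding described above (purely illustrative: the stats snapshot is copied from the example output, while wmax_bytes and the percentage thresholds are assumed values, since wmax is 0 in that output, i.e. no -w limit was set):

# Hypothetical sketch of the alerting logic described above.
# 'stats' stands in for a snapshot of .Q.w[] pulled from the kdb+ process.
stats = {"used": 108432, "heap": 67108864, "peak": 67108864}
wmax_bytes = 8 * 1024**3          # assumed -w limit in bytes (example only)

heap_pct_of_limit = 100 * stats["peak"] / wmax_bytes
used_pct_of_heap = 100 * stats["used"] / stats["heap"]

if heap_pct_of_limit > 70:
    print("high alert: peak heap is %.0f%% of the -w limit" % heap_pct_of_limit)
elif heap_pct_of_limit > 50:
    print("low alert: peak heap is %.0f%% of the -w limit" % heap_pct_of_limit)

# Behaviour 2 from above: used is a high share of heap AND heap is near the limit.
if used_pct_of_heap > 80 and heap_pct_of_limit > 70:
    print("warning: memory appears to be growing without being released")

# With the example snapshot above, no alert fires.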

Calculating memory stalls while adding second level cache

I am trying to calculate memory stall cycles per instructions when adding the second level cache.
I have the following given values:
Direct-mapped cache with 128 blocks
16 KB cache
2 ns cache access time
1 GHz clock rate
1 CPI
80 clock cycles miss penalty
5% miss rate
1.8 memory accesses per instruction
16-bit memory address
L2 cache:
4% miss rate
6 clock cycles miss penalty
As I understand it, the way to calculate the Memory stall cycles is by using the following formula:
Memory stall cycles = Memory accesses x Miss rate x Miss penalty
Which can be simplified as:
Memory stall cycles = instructions per program x misses per instruction x miss penalty
What I did was to multiply 1.8 x (.05 +.04) x (80 + 6) = 13.932
Would this be correct or am I missing something?
First of all, I am not sure about the given parameters for miss penalty for L1 and L2 (L1 being 80 cycles and L2 being 6 cycles).
Anyway using the data as it is:
You issue 1 instruction per clock
There are 1.8 memory accesses per instruction.
There is a 5% chance that an access misses L1 and a further 4% chance that it also misses L2. You only access main memory if you miss in both L1 and L2: 0.05 * 0.04 = 0.002 = 0.2%. This means that per memory access, you are likely to go to main memory 0.2% of the time.
Since you have 1.8 memory accesses per instruction, you are likely to access main memory 1.8 * 0.002 = 0.0036 times per instruction (0.36%).
When you encounter a miss in both L1 and L2, you will be stalled for 80 + 6 = 86 cycles (ignoring any optimizations).
Per instruction, you encounter only 0.0036 main memory accesses. Hence the memory stall cycles per instruction are 0.0036 * 86 = 0.3096.
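For what it's worth, here is the same arithmetic as a small Python sketch, following the assumption above that a miss in both levels costs the full 80 + 6 cycles:

# Sketch of the stall calculation above.
accesses_per_instr = 1.8
l1_miss_rate = 0.05
l2_miss_rate = 0.04               # of the accesses that reach L2
penalty = 80 + 6                  # cycles, as given

global_miss_rate = l1_miss_rate * l2_miss_rate             # 0.002 per access
misses_per_instr = accesses_per_instr * global_miss_rate   # 0.0036
stall_cycles_per_instr = misses_per_instr * penalty
print(stall_cycles_per_instr)     # ~0.31 stall cycles per instruction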

Operating Systems Virtual Memory

I am a student taking an operating systems course for the first time. I have a doubt about the performance-degradation calculation when using demand paging. In the Silberschatz book on operating systems, the following lines appear:
"If we take an average page-fault service time of 8 milliseconds and a
memory-access time of 200 nanoseconds, then the effective access time in
nanoseconds is
effective access time = (1 - p) x (200) + p (8 milliseconds)
= (1 - p) x 200 + p x 8,000,000
= 200 + 7,999,800 x p.
We see, then, that the effective access time is directly proportional to the
page-fault rate. If one access out of 1,000 causes a page fault, the effective
access time is 8.2 microseconds. The computer will be slowed down by a factor
of 40 because of demand paging! "
How did they calculate the slowdown here? Is 'performance degradation' and slowdown the same?
This whole thing is nonsensical. It assumes a fixed page-fault rate p, which is not realistic in itself. That rate is the fraction of memory accesses that result in a page fault.
1 - p is the fraction of memory accesses that do not result in a page fault.
T = (1 - p) x 200 ns + p x (8 ms) is then the average time of a memory access.
Expanded:
T = 200 ns + p x (8 ms - 200 ns)
T = 200 ns + p x (7,999,800 ns)
The whole thing is rather silly.
All you really need to know is a nanosecond is 1/billionth of a second.
A millisecond is 1/thousandth of a second.
Using these figures, there is a factor of a million difference between the access time in memory and in disk.
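To answer where the factor of 40 comes from: plugging the book's p = 1/1000 into the same formula gives it directly. A quick Python sketch:

# Effective access time with the book's numbers.
mem_ns = 200
fault_ns = 8_000_000              # 8 ms page-fault service time
p = 1 / 1000                      # one access in 1,000 causes a page fault

eat_ns = (1 - p) * mem_ns + p * fault_ns
print(eat_ns)                     # ~8199.8 ns, i.e. ~8.2 microseconds
print(eat_ns / mem_ns)            # ~41, i.e. slowed down by a factor of about 40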

Units of perf stat statistics

I'm using perf stat for some purposes, and to better understand the working of the tool I wrote a program that copies one file's contents into another. I ran the program on a 750 MB file and the stats are below:
31691336329 L1-dcache-loads
44227451 L1-dcache-load-misses
15596746809 L1-dcache-stores
20575093 L1-dcache-store-misses
26542169 cache-references
13410669 cache-misses
36859313200 cycles
75952288765 instructions
26542163 cache-references
What are the units of each number? What I mean is: are they bits, bytes, or something else? Thanks in advance.
The unit is "single cache access" for loads, stores, references and misses. Loads correspond to count of load instructions, executed by processors; same for stores. Misses is the count, how much loads and stores were unable to get their data loaded from the cache of this level: L1 data cache for L1-dcache- events; Last Level Cache (usually L2 or L3 depending on your platform) for cache- events.
31 691 336 329 L1-dcache-loads
44 227 451 L1-dcache-load-misses
15 596 746 809 L1-dcache-stores
20 575 093 L1-dcache-store-misses
26 542 169 cache-references
13 410 669 cache-misses
Cycles is the total count of CPU ticks for which the CPU executed your program. If you have a 3 GHz CPU, there will be around 3 000 000 000 cycles per second at most. If the machine was busy, there will be fewer cycles available for your program:
36 859 313 200 cycles
This is the total count of instructions executed by your program:
75 952 288 765 instructions
(I will use G suffix as abbreviation for billion)
From these numbers we can conclude: 76G instructions executed in 37G cycles (around 2 instructions per CPU tick, a rather high level of IPC). You gave no information about your CPU and its frequency, but assuming a 3 GHz CPU, the running time was near 12 seconds.
Of the 76G instructions, you have 31G load instructions (42%) and 15G store instructions (21%), so only 37% of instructions were non-memory instructions. I don't know what the size of the memory references was (byte loads and stores, 2-byte, or wide SSE moves), but 31G load instructions looks too high for a 750 MB file (a mean of 0.02 bytes per load, while the shortest possible load or store is a single byte). So I think your program made several copies of the data, or the file was bigger. 750 MB in 12 seconds looks rather slow (60 MB/s), but this can be true if the first file was read and the second written to disk without caching by the Linux kernel (do you have an fsync() call in your program? Are you profiling your CPU or your HDD?). With cached files and/or a RAM drive (tmpfs, a filesystem stored in RAM) this speed should be much higher.
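Those back-of-the-envelope figures can be reproduced with a few lines of Python (the 3 GHz clock is an assumption, as noted above):

# Rough reproduction of the estimates above (3 GHz is an assumed clock).
instructions = 75_952_288_765
cycles       = 36_859_313_200
loads        = 31_691_336_329
stores       = 15_596_746_809
clock_hz     = 3e9                # assumption; not reported by the poster

ipc = instructions / cycles
runtime_s = cycles / clock_hz
print(round(ipc, 2))                          # ~2.06 instructions per cycle
print(round(runtime_s, 1))                    # ~12.3 seconds
print(round(100 * loads / instructions))      # ~42% load instructions
print(round(100 * stores / instructions))     # ~21% store instructions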
Modern versions of perf do some simple calculations in perf stat and may also print units, as shown here: http://www.bnikolic.co.uk/blog/hpc-prof-events.html
perf stat -d md5sum *
578.920753 task-clock # 0.995 CPUs utilized
211 context-switches # 0.000 M/sec
4 CPU-migrations # 0.000 M/sec
212 page-faults # 0.000 M/sec
1,744,441,333 cycles # 3.013 GHz [20.22%]
1,064,408,505 stalled-cycles-frontend # 61.02% frontend cycles idle [30.68%]
104,014,063 stalled-cycles-backend # 5.96% backend cycles idle [41.00%]
2,401,954,846 instructions # 1.38 insns per cycle
# 0.44 stalled cycles per insn [51.18%]
14,519,547 branches # 25.080 M/sec [61.21%]
109,768 branch-misses # 0.76% of all branches [61.48%]
266,601,318 L1-dcache-loads # 460.514 M/sec [50.90%]
13,539,746 L1-dcache-load-misses # 5.08% of all L1-dcache hits [50.21%]
0 LLC-loads # 0.000 M/sec [39.19%]
(wrongevent?)0 LLC-load-misses # 0.00% of all LL-cache hits [ 9.63%]
0.581869522 seconds time elapsed
UPDATE Apr 18, 2014
please explain why cache-references are not correlating with L1-dcache numbers
Cache-references DOES correlate with the L1-dcache numbers: cache-references is close to L1-dcache-store-misses or L1-dcache-load-misses. Why are the numbers not equal? Because in your CPU (Core i5-2320) there are 3 levels of cache: L1, L2, L3; and the LLC (last level cache) is L3. So a load or store instruction first tries to get/save its data in/from the L1 cache (L1-dcache-loads, L1-dcache-stores). If the address was not cached in L1, the request goes to L2 (L1-dcache-load-misses, L1-dcache-store-misses). In this run we have no exact data on how many requests were served by L2 (those counters were not included in the default set of perf stat). But we can assume that some loads/stores were served and some were not. The requests not served by L2 then go to L3 (LLC), and we see that there were 26M references to L3 (cache-references), half of which (13M) were L3 misses (cache-misses, served from main RAM). The other half were L3 hits.
44M + 20M = 64M misses from L1 were passed on to L2. 26M requests were passed from L2 to L3, so those are the L2 misses. Therefore 64M - 26M = 38 million requests were served by L2 (L2 hits).
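The same bookkeeping as a short Python sketch:

# Sketch of the cache-hierarchy arithmetic above.
l1_load_misses  = 44_227_451
l1_store_misses = 20_575_093
llc_references  = 26_542_169      # requests that reached L3
llc_misses      = 13_410_669      # served by main memory

l2_requests = l1_load_misses + l1_store_misses   # everything L1 missed goes to L2
l2_misses   = llc_references                     # what L2 couldn't serve went to L3
l2_hits     = l2_requests - l2_misses
l3_hits     = llc_references - llc_misses

print(l2_requests)   # ~64M requests reaching L2
print(l2_hits)       # ~38M served by L2
print(l3_hits)       # ~13M served by L3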

Interrupt time in DMA operation

I'm facing difficulty with the following question:
Consider a disk drive with the following specifications.
16 surfaces, 512 tracks/surface, 512 sectors/track, 1 KB/sector, rotation speed 3000 rpm. The disk is operated in cycle stealing mode whereby whenever 1 byte word is ready it is sent to memory; similarly for writing, the disk interface reads a 4 byte word from the memory in each DMA cycle. Memory Cycle time is 40 ns. The maximum percentage of time that the CPU gets blocked during DMA operation is?
The solution to this question provided on the only site is:
Revolutions Per Min = 3000 RPM
or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50
= 6553600 ............. (1)
Interrupt = 6553600 takes 0.2621 sec
Percentage Gain = (0.2621/1)*100
= 26 %
I have understood the solution up to (1).
Can anybody explain to me where 0.2621 comes from? How is the interrupt time calculated? Please help.
Reversing from the numbers you've given, that's 6553600 * 40 ns, which gives 0.2621 sec.
One quite obvious problem is that the comments in the calculations are somewhat wrong. It's not
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50 <- WRONG
The numbers are 512K / 4 * 50. So it's in bytes. How could that be called 'number of tracks'? Reading a full track takes one full rotation, so the number of tracks readable in 1 second is 50, as there are 50 RPS.
However, the total number of bytes readable in 1 s is then just 512K * 50, since 512K is the amount of data on one track.
But then it is further divided by 4..
So, I guess, the actual comments should be:
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
Interrupts per second = (2^19/2^2) * 50 = 6553600 (*)
Interrupt triggers one memory op, so then:
total wasted: 6553600 * 40ns = 0.2621 sec.
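With that reading (one 40 ns memory cycle stolen per 4-byte word), the numbers reproduce as follows; a quick Python sketch:

# Sketch of the cycle-stealing calculation above (one 40 ns memory cycle per 4-byte word).
bytes_per_track   = 512 * 1024    # 512 sectors/track * 1 KB/sector
revs_per_second   = 3000 / 60     # 50 RPS
mem_cycle_s       = 40e-9

words_per_second  = (bytes_per_track / 4) * revs_per_second   # DMA cycles per second
blocked_s_per_s   = words_per_second * mem_cycle_s
print(words_per_second)           # 6553600.0
print(blocked_s_per_s)            # ~0.262 -> CPU blocked ~26% of the time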
However, I don't really like how the 'number of interrupts per second' is calculated. I currently don't see/feel/guess how or why it's just bytes/4.
The only VAGUE explanation of that "divide it by 4" I can think of is:
At each byte written to the controller's memory, an event is triggered. However, the DMA controller can read only PACKETS of 4 bytes. So the hardware DMA controller must WAIT until there are at least 4 bytes ready to be read. Only then does the DMA kick in and halt the bus (or part of it) for the duration of the one memory cycle needed to copy the data. As the bus is frozen, the processor MAY have to wait. It doesn't NEED to; it can be doing its own ops and working on cache, but if it tries touching memory, it will have to wait until the DMA finishes.
However, I don't like a few things in this "explanation", and I cannot guarantee that it is valid. It really depends on what architecture you are analyzing and how the DMA/CPU/bus are organized.
The only mistake is that it's not
no. of tracks read
It's actually the number of interrupts that occurred (the number of times the DMA came up with its data; that is how many times the CPU will be blocked).
But again I don't know why it has been multiplied by 50, probably because of the 1 second, but I wish to solve this without multiplying by 50.
My solution:
Here, in 1 rotation the interface can read 512 KB of data. 1 rotation time = 0.02 sec. So the preparation time for one byte of data is about 39.1 ns, and for 4 B it takes about 156.4 ns. Memory cycle time = 40 ns. So the % of time the CPU gets blocked = 40/(40+156.4) = 0.2036, which is about 20%. But in the answer booklet the options given are A) 10 B) 25 C) 40 D) 50. Tell me if I'm doing something wrong?