Cache Memory and How it is filled? - cpu-cache

If a computer has L1, L2, and L3 caches and 2 cores available, when the L1 on core 1 is filled, does the L1 in core 2 start trying to store data, or does the L2 in core 1 start trying to store the data?

Related

Calculating memory stalls while adding second level cache

I am trying to calculate memory stall cycles per instructions when adding the second level cache.
I have the following given values:
Direct-mapped cache with 128 blocks
16 KB cache
2 ns cache access time
1 GHz clock rate
1 CPI
80 clock cycles miss penalty
5% miss rate
1.8 memory accesses per instruction
16-bit memory address
L2 cache:
4% miss rate
6 clock cycles miss penalty
As I understand it, the way to calculate the Memory stall cycles is by using the following formula:
Memory stall cycles = Memory accesses x Miss rate x Miss penalty
Which can be simplified as:
Memory stall cycles = instructions per program x misses per instruction x miss penalty
What I did was to multiply 1.8 x (.05 +.04) x (80 + 6) = 13.932
Would this be correct or am I missing something?
First of all, I am not sure about the given miss-penalty parameters for L1 and L2 (L1 being 80 cycles and L2 being only 6 cycles; normally the penalty grows the further a level is from the core).
Anyway using the data as it is:
You issue 1 instruction per clock.
There are 1.8 memory accesses per instruction.
There is a 5% chance that an access misses L1, and a further 4% chance that it also misses L2. You only access main memory if you miss in both L1 and L2, which happens with probability 0.05 * 0.04 = 0.002 = 0.2%. This means that per memory access, you are likely to go to main memory 0.2% of the time.
Since you have 1.8 memory accesses per instruction, you are likely to access main memory 0.002 * 1.8 = 0.0036 = 0.36% of the time per instruction.
When you miss in both L1 and L2, you are stalled for 80 + 6 = 86 cycles (ignoring any optimizations).
Per instruction, you encounter a main memory access only 0.36% of the time. Hence the memory stall cycles per instruction are 0.0036 * 86 = 0.3096.
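The arithmetic above can be checked in a few lines of Python; this sketch treats the 4% L2 miss rate as conditional on already missing L1, exactly as the answer does:

```python
# Memory stall cycles per instruction for the two-level cache above.
# Assumption: the 4% L2 miss rate is conditional on an L1 miss, and the
# full 80 + 6 cycle penalty is paid only on a global (L1 and L2) miss.
mem_accesses_per_instr = 1.8
l1_miss_rate = 0.05               # 5% of accesses miss L1
l2_miss_rate = 0.04               # 4% of those also miss L2

global_miss_rate = l1_miss_rate * l2_miss_rate   # 0.002 = 0.2% per access
penalty = 80 + 6                                 # combined miss penalty, cycles

stalls_per_instr = mem_accesses_per_instr * global_miss_rate * penalty
print(round(stalls_per_instr, 4))                # 0.3096
```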

L2 cache CPU comparisons

So, I have an older Q8300 quad-core CPU with 4 MB of L2 cache, and I have been thinking of upgrading soon to a 6600K.
I took a look at the specs of the 6600K and noticed it has a smaller L2 cache, around 1 MB. Does this matter, or does the 6600K's smaller L2 cache not matter even in scenarios where applications depend on a larger L2 cache to push data through? I want to make sure I understand how the L2 cache pipeline works in a streamlined way, so I can figure out whether I should just keep my Q8300 for L2-dependent applications.
I was thinking that maybe, due to the more advanced architecture of the 6600K, it passes data through to L3 more efficiently? I'm not sure about all this and I am still learning, so if someone could explain it, that would be awesome.
It's a little confusing, since the Q8300 is based on the several-generations-older Merom/Conroe architecture, which has only 2 cache levels, while the i5-6600K is Skylake-based and has 3 cache levels (some Skylake parts actually have 4, with an additional 128 MB of eDRAM, but not the 6600K, I think).
What you should compare between the two products is the "last-level" shared cache: this is usually laid out as a cache slice co-located with each core, so its size usually scales with core count (except on chopped products). Your old system has 4 MB of last-level (L2) cache across its 4 cores (two 2 MB halves, each shared by a pair of cores), i.e. 1 MB per core, while the new one has 1.5 MB per core as the last level (L3), 6 MB overall, so both the per-core and the total capacity go up. A single thread would enjoy the larger shared capacity, and if you activate all cores, you would probably get higher aggregate performance, depending on the task.
On top of that, you get a whole new cache level with 256 KB per core, which is now the L2. This does not really add to your overall capacity, as the last-level cache is usually inclusive (see: Inclusive or exclusive? L1, L2 cache in Intel Core IvyBridge processor), so you are duplicating the same data in the L3, but it does serve to compensate for the additional distance to the L3. Also note that performance should have grown quite dramatically over the 7 (!) generations that passed between these 2 CPUs, so you probably shouldn't worry too much:
http://cpuboss.com/cpus/Intel-Core2-Quad-Q8300-vs-Intel-Core-i5-6600K

Paging and TLB operating system

A hierarchical memory system that uses cache memory has a cache access time of 50 nanoseconds, a main memory access time of 300 nanoseconds, 75% of memory requests are reads, a hit ratio of 0.8 for read accesses, and the write-through scheme is used. What will be the average access time of the system for both read and write requests?
A 157.5 ns
B 110 ns
C 75 ns
D 82.5 ns
The answer is A, 157.5 ns for read and write.
Explanation:
1) average_read_access_time = 0.8*50 + 0.2*(50 + 300) = 110 ns (a hit costs the cache access; a miss costs the cache access plus a main memory access).
2) For the combined read & write average, take 75% of the read access time plus 25% of the write time; with write-through, writes go to main memory:
0.75*110 + 0.25*300 = 157.5 ns
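A short Python sketch of the same calculation (note the modeling choice, following the explanation above, that a write pays only the main-memory time under write-through):

```python
cache_t, mem_t = 50, 300           # access times in ns
read_frac, hit_ratio = 0.75, 0.8   # 75% reads, 0.8 read hit ratio

# Read: a hit costs the cache access; a miss costs cache + main memory
t_read = hit_ratio * cache_t + (1 - hit_ratio) * (cache_t + mem_t)

# Write-through: every write goes to main memory
t_write = mem_t

t_avg = read_frac * t_read + (1 - read_frac) * t_write
print(round(t_read, 1), round(t_avg, 1))   # 110.0 157.5
```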

Units of perf stat statistics

I'm using perf stat for some purposes, and to better understand how the tool works, I wrote a program that copies one file's contents into another. I ran the program on a 750 MB file and the stats are below:
31691336329 L1-dcache-loads
44227451 L1-dcache-load-misses
15596746809 L1-dcache-stores
20575093 L1-dcache-store-misses
26542169 cache-references
13410669 cache-misses
36859313200 cycles
75952288765 instructions
What are the units of each number? What I mean is: is it bits, bytes, or something else? Thanks in advance.
The unit is "single cache access" for loads, stores, references, and misses. Loads correspond to the count of load instructions executed by the processor; same for stores. Misses is the count of loads and stores that were unable to get their data from the cache at this level: the L1 data cache for the L1-dcache- events; the last-level cache (usually L2 or L3, depending on your platform) for the cache- events.
31 691 336 329 L1-dcache-loads
44 227 451 L1-dcache-load-misses
15 596 746 809 L1-dcache-stores
20 575 093 L1-dcache-store-misses
26 542 169 cache-references
13 410 669 cache-misses
Cycles is the total count of CPU ticks for which the CPU executed your program. If you have a 3 GHz CPU, there will be around 3 000 000 000 cycles per second at most. If the machine was busy, there will be fewer cycles available for your program:
36 859 313 200 cycles
This is the total count of instructions executed from your program:
75 952 288 765 instructions
(I will use G suffix as abbreviation for billion)
From the numbers we can conclude: 76G instructions executed in 37G cycles (around 2 instructions per CPU tick, a rather high level of IPC). You gave no information about your CPU and its frequency, but assuming a 3 GHz CPU, the running time was near 12 seconds.
Of the 76G instructions, 31G are load instructions (42%) and 15G are store instructions (21%); so only 37% of instructions were not memory instructions. I don't know what the size of the memory references was (byte loads and stores, 2-byte, or wide SSE moves), but 31G load instructions looks too high for a 750 MB file (the mean would be 0.02 bytes per load, while the shortest possible load or store is a single byte). So I think your program made several copies of the data, or the file was bigger. 750 MB in 12 seconds looks rather slow (60 MB/s), but this can be true if the first file was read from and the second file was written to the disk without caching by the Linux kernel (do you have an fsync() call in your program? Are you profiling your CPU or your HDD?). With cached files and/or a RAM drive (tmpfs, the filesystem stored in RAM), this speed should be much higher.
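The IPC and runtime estimates above are simple division; the 3 GHz clock is an assumption, since the question does not give the CPU frequency:

```python
instructions = 75_952_288_765
cycles = 36_859_313_200

ipc = instructions / cycles        # instructions retired per CPU cycle
print(round(ipc, 2))               # 2.06

clock_hz = 3e9                     # assumed 3 GHz; not given in the question
seconds = cycles / clock_hz        # estimated running time
print(round(seconds, 1))           # 12.3
```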
Modern versions of perf do some simple calculations in perf stat and may also print units, as shown here: http://www.bnikolic.co.uk/blog/hpc-prof-events.html
perf stat -d md5sum *
578.920753 task-clock # 0.995 CPUs utilized
211 context-switches # 0.000 M/sec
4 CPU-migrations # 0.000 M/sec
212 page-faults # 0.000 M/sec
1,744,441,333 cycles # 3.013 GHz [20.22%]
1,064,408,505 stalled-cycles-frontend # 61.02% frontend cycles idle [30.68%]
104,014,063 stalled-cycles-backend # 5.96% backend cycles idle [41.00%]
2,401,954,846 instructions # 1.38 insns per cycle
# 0.44 stalled cycles per insn [51.18%]
14,519,547 branches # 25.080 M/sec [61.21%]
109,768 branch-misses # 0.76% of all branches [61.48%]
266,601,318 L1-dcache-loads # 460.514 M/sec [50.90%]
13,539,746 L1-dcache-load-misses # 5.08% of all L1-dcache hits [50.21%]
0 LLC-loads # 0.000 M/sec [39.19%]
(wrongevent?)0 LLC-load-misses # 0.00% of all LL-cache hits [ 9.63%]
0.581869522 seconds time elapsed
UPDATE Apr 18, 2014
please explain why cache-references are not correlating with L1-dcache numbers
Cache-references DOES correlate with the L1-dcache numbers: cache-references is of the same order as L1-dcache-load-misses and L1-dcache-store-misses. Why are the numbers not equal? Because your CPU (Core i5-2320) has 3 levels of cache: L1, L2, L3; and the LLC (last-level cache) is L3. A load or store instruction first tries to get/save its data in the L1 cache (L1-dcache-loads, L1-dcache-stores). If the address is not cached in L1, the request goes to L2 (L1-dcache-load-misses, L1-dcache-store-misses). In this run we have no exact data on how many requests were served by L2 (those counters are not included in the default set of perf stat). But we can assume that some loads/stores were served and some were not. Requests not served by L2 then go to L3 (the LLC), and we see that there were 26M references to L3 (cache-references), and half of them (13M) were L3 misses (cache-misses, served by main RAM). The other half were L3 hits.
44M + 20M = 64M misses from L1 were passed to L2. 26M requests were passed from L2 to L3; those are the L2 misses. So 64M - 26M = 38 million requests were served by L2 (L2 hits).
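The same bookkeeping in Python, using the exact counters from the question (this assumes, as above, that every LLC reference is an L2 miss):

```python
l1_load_misses  = 44_227_451
l1_store_misses = 20_575_093
llc_references  = 26_542_169   # requests that reached L3 (i.e. L2 misses)
llc_misses      = 13_410_669   # requests served by main RAM

to_l2   = l1_load_misses + l1_store_misses   # L1 misses passed down to L2
l2_hits = to_l2 - llc_references             # requests served by L2
l3_hits = llc_references - llc_misses        # requests served by L3

print(to_l2, l2_hits, l3_hits)               # 64802544 38260375 13131500
```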

Calculate the performance of a multicore architecture?

Consider a multicore architecture with 10 computing cores: 2 processor cores and 8 coprocessor cores. Each processor core can deliver 2.0 GFLOPS, while each coprocessor core can deliver 1.0 GFLOPS. All computing cores can perform calculations simultaneously. Any instruction can execute on either a processor or a coprocessor core unless there are explicit restrictions.
If 70% of dynamic instructions in an application are parallelizable, what is the maximum average performance (Flops) you can get in the optimal situation? Please note that the remaining 30% instructions can be executed only after the execution of the parallel 70% is over.
Consider another application where all the dynamic instructions can be partitioned into 6 groups (A, B, C, D, E, F) with the following dependency. For example, A --> C implies that all the instructions in A need to be completed before starting the execution of instructions in C. Each of the first four groups (A, B, C and D) contains 20% of the dynamic instructions whereas each of the remaining two groups (E and F) contains 10% of the dynamic instructions. All the instructions in each group must be executed sequentially on the same processor or coprocessor core. How to schedule them on the multicore architecture to achieve the best possible performance? What is the maximum average performance (Flops) now?
A(20%) --> C(20%) --\
                     +--> E(10%) --> F(10%)
B(20%) --> D(20%) --/
For the first part, you need to use Amdahl's Law, which is:
max speed-up = 1 / ((1 - p) + p/n)
where p is the parallelizable part. n is the improvement factor in executing the parallel portion.
(Note that the Amdahl's Law formula can be used for first order estimates on other types of changes. E.g., given a factor of N reduction in ALU energy use and P fraction of energy used by the ALU, one can find the improvement in total energy use.)
In your case, since the serial portion would be executed on the higher-performance (2 GFLOPS) processor core, n is 6 ((8 coprocessor cores * 1 GFLOPS/core + 2 processor cores * 2 GFLOPS/core) / 2 GFLOPS per processor core).
A quick calculation shows the max speed-up you can get is 2.4 relative to 1 processor core. The maximum FLOPS would therefore be the speed-up times the speed if the whole program were executed serially on one processor core, i.e., 2.4 * 2 GFLOPS = 4.8 GFLOPS.
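Plugging the first part into the formula:

```python
p = 0.7                                  # parallelizable fraction
# n: throughput of the whole chip relative to one 2 GFLOPS processor core
n = (8 * 1.0 + 2 * 2.0) / 2.0            # = 6.0

speedup = 1 / ((1 - p) + p / n)          # Amdahl's Law
gflops  = speedup * 2.0                  # serial baseline is a 2 GFLOPS core
print(round(speedup, 2), round(gflops, 2))   # 2.4 4.8
```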
For the second part, note that initially there are two independent instruction streams: A -> C and B -> D. Since the system has two processor cores, both streams can be executed in parallel on the higher-performance processor cores. Furthermore, both contain the same amount of work (40% of the total each), so on same-performance cores they will complete at the same time.
Since E depends on results from both C and D, it must start after both finish. E and F would then execute on a processor core (which core is arbitrary, since E must wait for the tasks running on both processor cores to complete).
As you can see, 80% of the program (40% for A+C, 40% for B+D) can be parallelized by a factor of 2, and 20% of the program (E+F) is serial. You can then just plug the numbers into the Amdahl's Law formula (p = 0.8, n = 2).
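And the second part, with the 80/20 split derived above:

```python
p, n = 0.8, 2                 # 80% runs on 2 cores in parallel, 20% serial
speedup = 1 / ((1 - p) + p / n)   # Amdahl's Law
gflops  = speedup * 2.0           # relative to one 2 GFLOPS processor core
print(round(speedup, 3), round(gflops, 3))   # 1.667 3.333
```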