Main memory bandwidth measurement - bandwidth

I want to measure the main memory bandwidth and while looking for the methodology, I found that,
many used 'bcopy' function to copy bytes from a source to destination and then measure the time which they report as the bandwidth.
Others ways of doing it is to allocate and array and walk through the array (with some stride) - this basically gives the time to read the entire array.
I tried doing (1) for data size of 1GB and the bandwidth I got is '700MB/sec' (I used rdtsc to count the number of cycles elapsed for the copy). But I suspect that this is not correct because my RAM config is as follows:
Speed: 1333 MHz
Bus width: 32bit
As per wikipedia, the theoretical bandwidth is calculated as follows:
clock speed * bus width * # bits per clock cycle per line (2 for ddr 3
ram) 1333 MHz * 32 * 2 ~= 8GB/sec.
So mine is completely different from the estimated bandwidth. Any idea of what am I doing wrong?
=========
Other question is, bcopy involves both read and write. So does it mean that I should divide the calculated bandwidth by two to get only the read or only the write bandwidth? I would like to confirm whether the bandwidth is just the inverse of latency? Please suggest any other ways of measuring the bandwidth.

I can't comment on the effectiveness of bcopy, but the most straightforward approach is the second method you stated (with a stride of 1). Additionally, you are confusing bits with bytes in your memory bandwidth equation. 32 bits = 4bytes. Modern computers use 64 bit wide memory buses. So your effective transfer rate (assuming DDR3 tech)
1333Mhz * 64bit/(8bits/byte) = 10666MB/s (also classified as PC3-10666)
The 1333Mhz already has the 2 transfer/clock factored in.
Check out the wiki page for more info: http://en.wikipedia.org/wiki/DDR3_SDRAM
Regarding your results, try again with the array access. Malloc 1GB and traverse the entire thing. You can sum each element of the array and print it out so your compiler doesn't think it's dead code.
Something like this:
double time;
int size = 1024*1024*1024;
int sum;
*char *array = (char*)malloc(size);
//start timer here
for(int i=0; i < size; i++)
sum += array[i];
//end timer
printf("time taken: %f \tsum is %d\n", time, sum);

Related

How far is it safe to regularly write to the ESP32 flash, considering the flash MTBF?

What would be the best practice for writing to the flash on a regular basis.
Considering the hardware I am working on is supposed to have 10 to 20 years longevity, what would be your recommandation? For example, is it ok I write some state variables every 15 minutes thru Preferences?
That depends on
number of erase cycles your Flash supports,
size of the NVS partition where you store data and
size and structure of the data that you store.
Erase cycles mean how many times a single sector of Flash can be erased before it's no longer guaranteed to work. The number is found in the datasheet of the Flash chip that you use. It's usually 10K or 100K.
Preferences library uses the ESP-IDF NVS library. This requires an NVS partition to store data, the size of which determines how many Flash sectors get reserved for this purpose. Every time you store a value, NVS writes the data together with its own overhead (total of 32 bytes for primitive data types like ints and floats, more for strings and blobs) into the current Flash sector. When the current sector is full, it erases the next sector and proceeds to write there; thereby using up sectors in a round robin fashion as write requests come in.
If we assume that your Flash has 100K erase cycles, size your NVS partition is 128 KiB and you store a set of 8 primitive values (any int or float) every 15 minutes:
Each store operation uses 8 * 32 = 256 bytes (32 B per data value).
You can repeat that operation 131072 / 256 = 512 times before you've written to every sector of your 128KiB NVS partition (i.e. erased every sector once)
You can repeat that cycle 100K times so you can do 512 * 100000 = 51200000 or roughly 5.1M store operations before you've erased every sector its permitted maximum number of times.
Considering the interval of 15 minutes creates 365 * 24 * 4 = 35040 operations per year, you'd have 51200000 / 35040 = 1461 years until Flash is dead.
Obviously, if your Flash chip is rated at 10K erase cycles, it drops to only 146 years.
There's probably some NVS overhead in there somewhere that I didn't account for, and the Flash erase cycle ratings are not 100% reliable so I'd cut it in half for good measure - I would expect 700 or 70 years in real life.
If you store non-primitive values (strings, blobs) then the estimate changes based on the length of that data. I don't know to calculate the exact Flash space used by those but I'd guess 32B plus length of your data multiplied by 10% of NVS overhead. Plug in the numbers, see for yourself.

number of misses in spatial locality

I am confused how the number of misses calculated in the example below (from Computer Architecture: A Quantitative Approach)
Example:
For the code below, determine which accesses are likely to cause data cache
misses. Next, insert prefetch instructions to reduce misses. Finally, calculate the
number of prefetch instructions executed and the misses avoided by prefetching.
Let’s assume we have an 8 KB direct-mapped data cache with 16-byte blocks,
and it is a write-back cache that does write allocate. The elements of a and b are 8
bytes long since they are double-precision floating-point arrays. There are 3 rows
and 100 columns for a and 101 rows and 3 columns for b. Let’s also assume they
are not in the cache at the start of the program.
for (i = 0; i < 3; i = i+1)
for (j = 0; j < 100; j = j+1)
a[i][j] = b[j][0] * b[j+1][0];
Answer:
The compiler will first determine which accesses are likely to cause cache
misses; otherwise, we will waste time on issuing prefetch instructions for data
that would be hits. Elements of a are written in the order that they are stored in
memory, so a will benefit from spatial locality: The even values of j will miss
and the odd values will hit. Since a has 3 rows and 100 columns, its accesses will
lead to 3 × (100/2), or 150 misses.
How 150 misses calculated or why divided 100 by 2?

Operating Systems Virtual Memory

I am a student reading Operating systems course for the first time. I have a doubt in the calculation of the performance degradation calculation while using demand paging. In the Silberschatz book on operating systems, the following lines appear.
"If we take an average page-fault service time of 8 milliseconds and a
memory-access time of 200 nanoseconds, then the effective access time in
nanoseconds is
effective access time = (1 - p) x (200) + p (8 milliseconds)
= (1 - p) x 200 + p x 8.00(1000
= 200 + 7,999,800 x p.
We see, then, that the effective access time is directly proportional to the
page-fault rate. If one access out of 1,000 causes a page fault, the effective
access time is 8.2 microseconds. The computer will be slowed down by a factor
of 40 because of demand paging! "
How did they calculate the slowdown here? Is 'performance degradation' and slowdown the same?
This is whole thing is nonsensical. It assumes a fixed page fault rate P. That is not realistic in itself. That rate is a fraction of memory accesses that result in a page fault.
1-P is the fraction of memory accesses that do not result in a page fault.
T= (1-P) x 200ns + p (8ms) is then the average time of a memory access.
Expanded
T = 200ns + p (8ms - 200ns)
T = 200ns + p (799980ns)
The whole thing is rather silly.
All you really need to know is a nanosecond is 1/billionth of a second.
A microsecond is 1/thousandth of a second.
Using these figures, there is a factor of a million difference between the access time in memory and in disk.

Interrupt time in DMA operation

I'm facing difficulty with the following question :
Consider a disk drive with the following specifications .
16 surfaces, 512 tracks/surface, 512 sectors/track, 1 KB/sector, rotation speed 3000 rpm. The disk is operated in cycle stealing mode whereby whenever 1 byte word is ready it is sent to memory; similarly for writing, the disk interface reads a 4 byte word from the memory in each DMA cycle. Memory Cycle time is 40 ns. The maximum percentage of time that the CPU gets blocked during DMA operation is?
the solution to this question provided on the only site is :
Revolutions Per Min = 3000 RPM
or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50
= 6553600 ............. (1)
Interrupt = 6553600 takes 0.2621 sec
Percentage Gain = (0.2621/1)*100
= 26 %
I have understood till (1).
Can anybody explain me how has 0.2621 come ? How is the interrupt time calculated? Please help .
Reversing form the numbers you've given, that's 6553600 * 40ns that gives 0.2621 sec.
One quite obvious problem is that the comments in the calculations are somewhat wrong. It's not
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50 <- WRONG
The numbers are 512K / 4 * 50. So, it's in bytes. How that could be called 'number of tracks'? Reading the full track is 1 full rotation, so the number of tracks readable in 1 second is 50, as there are 50 RPS.
However, the total bytes readable in 1s is then just 512K * 50 since 512K is the amount of data on the track.
But then it is further divided by 4..
So, I guess, the actual comments should be:
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
Interrupts per second = (2^19/2^2) * 50 = 6553600 (*)
Interrupt triggers one memory op, so then:
total wasted: 6553600 * 40ns = 0.2621 sec.
However, I don't really like how the 'number of interrupts per second' is calculated. I currently don't see/fell/guess how/why it's just Bytes/4.
The only VAGUE explanation of that "divide it by 4" I can think of is:
At each byte written to the controller's memory, an event is triggered. However the DMA controller can read only PACKETS of 4 bytes. So, the hardware DMA controller must WAIT until there are at least 4 bytes ready to be read. Only then the DMA kicks in and halts the bus (or part of) for a duration of one memory cycle needed to copy the data. As bus is frozen, the processor MAY have to wait. It doesn't NEED to, it can be doing its own ops and work on cache, but if it tries touching the memory, it will need to wait until DMA finishes.
However, I don't like a few things in this "explanation". I cannot guarantee you that it is valid. It really depends on what architecture you are analyzing and how the DMA/CPU/BUS are organized.
The only mistake is its not
no. of tracks read
Its actually no. of interrupts occured (no. of times DMA came up with its data, these many times CPU will be blocked)
But again I don't know why 50 has been multiplied,probably because of 1 second, but I wish to solve this without multiplying by 50
My Solution:-
Here, in 1 rotation interface can read 512 KB data. 1 rotation time = 0.02 sec. So, one byte data preparation time = 39.1 nsec ----> for 4B it takes 156.4 nsec. Memory Cycle time = 40ns. So, the % of time the CPU get blocked = 40/(40+156.4) = 0.2036 ~= 20 %. But in the answer booklet options are given as A) 10 B)25 C)40 D)50. Tell me if I'm doing wrong ?

Increasing/Altering Matlab-Arduino analogRead() sampling rate

I have been controlling Arduino from Matlab using ArduinoIO-Matlab interface. My current setup is I have 3 EMG Muscle Sensors (from Advancer Technologies) are connected to the Arduino at analog pin 1,2, and 3. Arduino is connected to Matlab. I am trying to collect data from these three pins simultaneously and store them in an matrix size 1000x3. My issue is the rate at which Matlab is sampling from the analog pin. It takes about 25 seconds to collect 1000 readings from the 3 pins simultaneously. I know arduino itself samples at a higher rate. Below is my code. How do I alter this to get a sampling rate of about like 1000 samples in 10 seconds ?
ar = arduino('COM3');
ax = zeros(1000,3);
for ai = 1:1000
ax(ai,:) = [ar.analogRead(1) ar.analogRead(2) ar.analogRead(3)];
end
delete(ar);
This is the time taken by the above code (profile viewer):
time calls line
< 0.01 1 3 ax = zeros(1000,3);
4
< 0.01 1 5 for ai = 1:1000
25.07 1000 6 ax(ai,:) = [ar.analogRead(1) ar.analogRead(2) ar.analogRead(3)];
1000 7 end
8
1.24 1 9 delete(ar);
Please let me know if there is something else that I need to clarify.
Thanks :Denter code here
You need to modify the arduino c++ code (.pde file).
In this code you should sample the signal as you prefer (1000 for example) and then transfer the sampled data to matlab using serial.writeln() method.
This will give you a sampling rate of ~3KHz (depending on alot of factors)...
The following very probably explains the result that you are seeing and why you need to do something like what Muhammad's answer suggests. While this reason was implied by his answer it was not spelt out so that others can avoid the 'trap'.
I do not have access to the underlying code and systems needed to check this answer with certainty. This answer is based on "typical methods" and has a modest chance of being sheer poppycock [tm], but the exact fit between observation and standard methods suggests this is what is happening. A very little delving by someone with the requisite system to hand will demonstrate if this is correct.
When data is sent one data sample at a time you incur a per-sample overhead significantly in excess of the time taken to just transfer the raw data.
You say it takes 25 seconds to transfer 3000 samples.
The time per sample = 25/3000 = 8.333 ms per sample.
Assume a 9600 baud data transfer rate.
The default communications speed is liable to but 9600 baud. This can be checked but the result suggests that this may be correct and making slightly different assumptions provides an equally good explanation.
Serial coms usually uses N81 format = 1 start bit, 8 data bits, 1 stop bit per 8 bit byte.
So 1 bit takes 1/9600 s
and 10 bits take 10/9600 = 1.042 mS
And sample time / byte time
= 8.333 / 1.042 = 7.997 word times.
In fact if you do the calculations without rounding or truncation, ie
25 / 3000 x 9600/10 = 8.000.... .
ie your transfer is taking EXACTLY 8 x 9600 baud word times per sample.
Equally, this is exactly 4 x 4800 baud or 2 x 2400 baud transfer times.
I have not examined the format used but imagine that to work with the PC monitor program the basic serial routine may use
2 x data bytes + CR + LF = 4 bytes.
That assumes a 16 bit variable sent as 2 x 8 bit binary words.
More likely = either
- 16 bits sent as 4 x ASCII characters or
- 24 bits sent as 6 x ASCII characters.
In the absence of suitably deep delving, the use of 6 ASCII words and a CR + LF at 9600 baud provides such a good fit using typical parameters that Occam probably opines that this is the best starting point. Regardless of whether the total requirement is 8 or 4 or 2 bytes, the somewhat serendipitous exact match between your observed data rate and standard baud rates suggests that this provides the basic reason for what you see.
Looking at the code will rapidly show what baud rate, data length and packing is used.