Intel x86: why do aligned and non-aligned accesses have the same performance? - x86-64

The Intel CPU manual (Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, section 8.1.1) says that "nonaligned data accesses will seriously impact the performance of the processor". So I wrote a test to demonstrate this, but the result is that aligned and nonaligned data accesses show the same performance. Why? Could someone help? My code is shown below:
#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <string.h>

using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage: ./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    cout << "offset = " << offset << endl;
    const uint64_t BUFFER_SIZE = 800000000;
    uint8_t* data_ptr = new uint8_t[BUFFER_SIZE];
    if (data_ptr == nullptr) {
        cout << "apply for memory failed" << endl;
        return 0;
    }
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 300;
    cout << "start" << endl;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        // align: offset = 0, nonalign: offset = 1-7
        for (uint64_t j = offset; j <= BUFFER_SIZE - 8; j += 8) {
            volatile auto tmp = *(uint64_t*)&data_ptr[j]; // read from memory
            // mov rax,QWORD PTR [rbx+rdx*1]   // rbx+rdx*1 = 0x7fffc76fe019
            // mov QWORD PTR [rsp+0x8],rax
            ++tmp;
            // mov rcx,QWORD PTR [rsp+0x8]
            // add rcx,0x1
            // mov QWORD PTR [rsp+0x8],rcx
            *(uint64_t*)&data_ptr[j] = tmp; // write to memory
            // mov QWORD PTR [rbx+rdx*1],rcx
        }
    }
    auto end = get_time_ns();
    cout << "time elapse " << end - start << "ns" << endl;
    delete[] data_ptr;
    return 0;
}
RESULT:
offset = 0
start
time elapse 32991486013ns
offset = 1
start
time elapse 34089866539ns
offset = 2
start
time elapse 34011790606ns
offset = 3
start
time elapse 34097518021ns
offset = 4
start
time elapse 34166815472ns
offset = 5
start
time elapse 34081477780ns
offset = 6
start
time elapse 34158804869ns
offset = 7
start
time elapse 34163037004ns

On most modern x86 cores, the performance of aligned and misaligned is the same only if the access does not cross a specific internal boundary.
The exact size of the internal boundary varies based on the core architecture of the relevant CPU, but on Intel CPUs from the last decade, the relevant boundary is the 64-byte cache line. That is, accesses which fall entirely within a 64-byte cache line perform the same regardless of whether they are aligned or not.
However, if a (necessarily misaligned) access crosses a cache line boundary on an Intel chip, a penalty of about 2x is paid in both latency and throughput. The bottom-line impact of this penalty depends on the surrounding code and will often be much less than 2x, sometimes close to zero. This modest penalty may be much larger if a 4K page boundary is also crossed.
Aligned accesses never cross these boundaries, so cannot suffer this penalty.
The broad picture is similar for AMD chips, though the relevant boundary has been smaller than 64 bytes on some recent chips, and the boundary is different for loads and stores.
I have included additional details in the load throughput and store throughput sections of a blog post I wrote.
Testing It
Your test wasn't able to show the effect, for several reasons:
The test didn't allocate aligned memory: you can't reliably cross a cache line by using an offset from a region with unknown alignment.
You iterated 8 bytes at a time, so the majority of the writes (7 out of 8) fall entirely within a cache line and have no penalty, leaving a small signal that is only detectable if the rest of your benchmark is very clean.
You used a buffer too large to fit in any level of the cache. The split-line effect is only really obvious at L1, or when splitting lines means you bring in twice the number of lines (e.g., random access). Since you access every line linearly in either scenario, you'll be limited by DRAM-to-core throughput regardless of splits: the split writes have plenty of time to complete while waiting for main memory.
You used a local volatile auto tmp and ++tmp, which creates a volatile on the stack and a lot of aligned loads and stores to preserve volatile semantics: these wash out the effect you are trying to measure.
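The second point can be sanity-checked with a small sketch (Python here, purely illustrative): at a stride of 8 bytes, an 8-byte access splits a 64-byte line only when its offset modulo 64 exceeds 56, i.e. 1 access in 8 for any misaligned start.

```python
# Fraction of 8-byte accesses that straddle a 64-byte cache line when
# stepping through a buffer at a stride of 8 bytes.
def split_fraction(offset, stride=8, width=8, line=64):
    splits = 0
    total = 0
    for j in range(offset, offset + 1024, stride):
        total += 1
        if j % line + width > line:  # access crosses into the next line
            splits += 1
    return splits / total

# An aligned start never splits; any misaligned start splits 1 access in 8.
print(split_fraction(0))  # -> 0.0
print(split_fraction(1))  # -> 0.125
```

So even when every split carries a 2x penalty, only an eighth of the accesses pay it, diluting the signal.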
Here is my modification of your test, operating only in the L1 region and advancing 64 bytes at a time, so that every store will be a split if any is:
#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <string.h>
#include <iomanip>

using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage: ./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    const uint64_t BUFFER_SIZE = 10000;
    alignas(64) uint8_t data_ptr[BUFFER_SIZE];
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 1000000;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        uint64_t src = rand();
        // align: offset = 0, nonalign: offset = 1-7
        for (uint64_t j = offset; j + 64 <= BUFFER_SIZE; j += 64) {
            memcpy(data_ptr + j, &src, 8);
        }
    }
    auto end = get_time_ns();
    cout << "time elapsed " << std::setprecision(2) << (end - start) / ((double)LOOP_CNT * BUFFER_SIZE / 64) <<
        "ns per write (rand:" << (int)data_ptr[rand() % BUFFER_SIZE] << ")" << endl;
    return 0;
}
Running this for all alignments from 0 to 64, I get:
$ g++ test.cpp -O2 && for off in {0..64}; do printf "%2d :" $off && ./a.out $off; done
0 :time elapsed 0.56ns per write (rand:0)
1 :time elapsed 0.57ns per write (rand:0)
2 :time elapsed 0.57ns per write (rand:0)
3 :time elapsed 0.56ns per write (rand:0)
4 :time elapsed 0.56ns per write (rand:0)
5 :time elapsed 0.56ns per write (rand:0)
6 :time elapsed 0.57ns per write (rand:0)
7 :time elapsed 0.56ns per write (rand:0)
8 :time elapsed 0.57ns per write (rand:0)
9 :time elapsed 0.57ns per write (rand:0)
10 :time elapsed 0.57ns per write (rand:0)
11 :time elapsed 0.56ns per write (rand:0)
12 :time elapsed 0.56ns per write (rand:0)
13 :time elapsed 0.56ns per write (rand:0)
14 :time elapsed 0.56ns per write (rand:0)
15 :time elapsed 0.57ns per write (rand:0)
16 :time elapsed 0.56ns per write (rand:0)
17 :time elapsed 0.56ns per write (rand:0)
18 :time elapsed 0.56ns per write (rand:0)
19 :time elapsed 0.56ns per write (rand:0)
20 :time elapsed 0.56ns per write (rand:0)
21 :time elapsed 0.56ns per write (rand:0)
22 :time elapsed 0.56ns per write (rand:0)
23 :time elapsed 0.56ns per write (rand:0)
24 :time elapsed 0.56ns per write (rand:0)
25 :time elapsed 0.56ns per write (rand:0)
26 :time elapsed 0.56ns per write (rand:0)
27 :time elapsed 0.56ns per write (rand:0)
28 :time elapsed 0.57ns per write (rand:0)
29 :time elapsed 0.56ns per write (rand:0)
30 :time elapsed 0.57ns per write (rand:25)
31 :time elapsed 0.56ns per write (rand:151)
32 :time elapsed 0.56ns per write (rand:123)
33 :time elapsed 0.56ns per write (rand:29)
34 :time elapsed 0.55ns per write (rand:0)
35 :time elapsed 0.56ns per write (rand:0)
36 :time elapsed 0.57ns per write (rand:0)
37 :time elapsed 0.56ns per write (rand:0)
38 :time elapsed 0.56ns per write (rand:0)
39 :time elapsed 0.56ns per write (rand:0)
40 :time elapsed 0.56ns per write (rand:0)
41 :time elapsed 0.56ns per write (rand:0)
42 :time elapsed 0.57ns per write (rand:0)
43 :time elapsed 0.56ns per write (rand:0)
44 :time elapsed 0.56ns per write (rand:0)
45 :time elapsed 0.56ns per write (rand:0)
46 :time elapsed 0.57ns per write (rand:0)
47 :time elapsed 0.57ns per write (rand:0)
48 :time elapsed 0.56ns per write (rand:0)
49 :time elapsed 0.56ns per write (rand:0)
50 :time elapsed 0.57ns per write (rand:0)
51 :time elapsed 0.56ns per write (rand:0)
52 :time elapsed 0.56ns per write (rand:0)
53 :time elapsed 0.56ns per write (rand:0)
54 :time elapsed 0.55ns per write (rand:0)
55 :time elapsed 0.56ns per write (rand:0)
56 :time elapsed 0.56ns per write (rand:0)
57 :time elapsed 1.1ns per write (rand:0)
58 :time elapsed 1.1ns per write (rand:0)
59 :time elapsed 1.1ns per write (rand:0)
60 :time elapsed 1.1ns per write (rand:0)
61 :time elapsed 1.1ns per write (rand:0)
62 :time elapsed 1.1ns per write (rand:0)
63 :time elapsed 1ns per write (rand:0)
64 :time elapsed 0.56ns per write (rand:0)
Note that offsets 57 through 63 all take about 2x as long per write, and those are exactly the offsets that cross a 64-byte (cache line) boundary for an 8-byte write.
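As a cross-check of that boundary arithmetic (a small Python sketch, separate from the C++ benchmark): an 8-byte store starting at offset o crosses a 64-byte line exactly when o mod 64 > 56, which picks out precisely offsets 57 through 63.

```python
# Offsets in [0, 64] at which an 8-byte store crosses a 64-byte cache-line
# boundary; these should match the slow offsets in the output above.
LINE, WIDTH = 64, 8
slow = [o for o in range(65) if o % LINE + WIDTH > LINE]
print(slow)  # -> [57, 58, 59, 60, 61, 62, 63]
```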

Related

How can I interrupt a 'loop' in kdb?

numb is a list of numbers:
q))input
42 58 74 51 63 23 41 40 43 16 64 29 35 37 30 3 34 33 25 14 4 39 66 49 69 13..
31 41 39 27 9 21 7 25 34 52 60 13 43 71 10 42 19 30 46 50 17 33 44 28 3 62..
15 57 4 55 3 28 14 21 35 29 52 1 50 10 39 70 43 53 46 68 40 27 13 69 20 49..
3 34 11 53 6 5 48 51 39 75 44 32 43 23 30 15 19 62 64 69 38 29 22 70 28 40..
18 30 60 56 12 3 47 46 63 19 59 34 69 65 26 61 50 67 8 71 70 44 39 16 29 45..
I want to iterate through each row and calculate the sum of the first 2, then 3, then 4 numbers, etc. If that sum is greater than 1000 I want to stop the iteration on that particular row, jump to the next row, and do the same thing. This is my code:
{[input]
tot::tot+{[x;y]
if[1000<sum x;:count x;x,y]
}/[input;input]
}each numb
My problem here is that after the count of x is added to tot, the over keeps going on the same row. How can I exit the over and jump to the next row?
UPDATE: (QUESTION STILL OPEN) I do appreciate all the answers so far, but I am not looking for an efficient way to sum the first n numbers. My question is how do I break the over and jump to the next line. I would like to achieve the same thing as with these small scripts:
C++
for (int i = 0; i <= 100; i++) {
    if (i == 50) { printf("for loop exited at: %i ", i); break; }
}
Python
for i in range(100):
    if i == 50:
        print(i)
        break
R
for(i in 1:100){
    if(i == 50){
        print(i)
        break
    }
}
I think this is what you are trying to accomplish.
sum {(x & sums y) ? x}[1000] each input
It takes a cumulative sum of each row and takes an element-wise minimum between that sum and the input limit, thereby capping the output at the limit, like so:
q)(100 & sums 40 43 16 64 29)
40 83 99 100 100
It then uses the ? operator to find the first occurrence of that limit (i.e. the element where the limit was equaled or passed). In the example the first 100 occurs at index 3, as ? is 0-indexed. You might want to add one to include in the count the first element past the limit.
q)40 83 99 100 100 ? 100
3
And then it sums this count over all rows of the input.
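The capped-cumulative-sum trick translates directly to other languages; here is a Python sketch of the same logic (illustrative only; it assumes the limit is actually reached somewhere in the row):

```python
# Mirror of the q expression {(x & sums y) ? x}: cap the running sum at
# the limit, then find the 0-based index of the first element hitting it.
def tips_to_limit(row, limit):
    total, capped = 0, []
    for v in row:
        total += v
        capped.append(min(total, limit))
    # like q's ?, this is 0-indexed; raises ValueError if never reached
    return capped.index(limit)

print(tips_to_limit([40, 43, 16, 64, 29], 100))  # -> 3
```

This reproduces the worked example above: the capped sums are 40 83 99 100 100, and the first 100 appears at index 3.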
You could use converge in this case to exit when you fail to satisfy a condition:
https://code.kx.com/q/ref/adverbs/#converge-repeat
The first parameter would be a function that does your check based on the current value of x which will be the next value to be passed in the main function.
For your example I've made a projection using the main input line, then increased the indexes of what I am summing each time:
q)numb
98 11 42 97 89 80 73 35 4 30
86 33 38 86 26 15 83 71 21 22
23 43 41 80 56 11 22 28 47 57
q){[input] {x+1}/[{100>sum (y+1)#x}[input;];0] }each numb
1 1 2
This returns, for each row, the first index where the running sum is over 100.
However, this isn't really an ideal use case for kdb; it could instead be done with something like
(sums#/:numb) binr\: 100
Maybe your real example makes more sense.
You can use while loops in kdb, although kdb developers are generally too afraid of being openly mocked and laughed at for doing so:
q){i:0;while[i<>50;i+:1];:"loop exited at ",string i}`
"loop exited at 50"
Kdb does have a "stop loop" mechanism, but only in the case of a monadic function with a single seed value:
/keep squaring until number is no longer less than 1000, starting at 2
q){x*x}/[{x<1000};2]
65536
/keep dealing random numbers under 20 until you get an 18 (seed value 0 is irrelevant)
q){first 1?20}\[18<>;0]
0 19 17 12 15 10 18
However this doesn't really fit your use case and as other people have pointed out, this is not how you would/should solve this problem in kdb.
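For readers more familiar with imperative languages, q's f/[cond;seed] form corresponds to a plain while loop; here is a Python sketch of the squaring example above:

```python
# q's f/[cond; seed] keeps applying f while cond holds on the current value.
def iterate_while(f, cond, x):
    while cond(x):
        x = f(x)
    return x

# Keep squaring until the value is no longer less than 1000, starting at 2:
# 2 -> 4 -> 16 -> 256 -> 65536, then stop.
print(iterate_while(lambda x: x * x, lambda x: x < 1000, 2))  # -> 65536
```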

Implementing matching pursuit algorithm

I have implemented the matching pursuit algorithm but I'm unable to get the required result.
Here is my code:
D=[1 6 11 16 21 26 31 36 41 46
2 7 12 17 22 27 32 37 42 47
3 8 13 18 23 28 33 38 43 48
4 9 14 19 24 29 34 39 44 49
5 10 15 20 25 30 35 40 45 50];
b=[6;7;8;9;10];
n=size(D);
A1=zeros(n);
R=b;
H=10;
if (H <= 0)
    error('The number of iterations needs to be greater than 0')
end
for k = 1:1:H
    [c,d] = max(abs(D'*R)); %//'
    A1(:,d) = D(:,d);
    D(:,d) = 0;
    y = A1\b;
    R = b - A1*y;
end
Output
y=
0.8889
0
0
0
0
0
0
0
0
0.1111
I should get only one non-zero value at (2,1) and the other values should be zero, but I'm getting two non-zero values. Can you please help me find where the error is?
Thanks.
I checked with:
http://www.scholarpedia.org/article/Matching_pursuit
Your functions need to be normalized!
D = D./repmat(sum(D,1),5,1);
I get the following algorithm:
D=[1 6 11 16 21 26 31 36 41 46
2 7 12 17 22 27 32 37 42 47
3 8 13 18 23 28 33 38 43 48
4 9 14 19 24 29 34 39 44 49
5 10 15 20 25 30 35 40 45 50];
D = D./repmat(sum(D,1),5,1);
b=[6;7;8;9;10];
n=size(D);
A1=zeros(n);
R=b;
H=100;
if (H <= 0)
    error('The number of iterations needs to be greater than 0')
end
a = zeros(1,H);
G = zeros(size(D,1),H);
for k = 1:1:H
    ip = D'*R;
    [~,d] = max(abs(ip)); %//'
    G(:,k) = D(:,d);
    a(k) = ip(d);
    R = R - a(k)*G(:,k);
end
% recover signal:
Rrec = zeros(size(R));
for i = 1:H
    Rrec = Rrec + a(i)*G(:,i);
end
figure();
plot(b);
hold on;
plot(Rrec)
It approximates the signal quite well, but not with D(:,2) first as expected. Maybe it is a starting point...
Here is the updated code, based on the algorithm provided at https://en.wikipedia.org/wiki/Matching_pursuit:
clc;
clear all;
D=[1 6 11 16 21 26 31 36 41 46
2 7 12 17 22 27 32 37 42 47
3 8 13 18 23 28 33 38 43 48
4 9 14 19 24 29 34 39 44 49
5 10 15 20 25 30 35 40 45 50];
b=[6;7;8;9;10];
H=10;
for index = 1:10
    G(:,index) = D(:,index)./norm(D(:,index));
end
G1 = G;
n = size(G);
R = b;
if (H <= 0)
    error('The number of iterations needs to be greater than 0')
end
if (H > size(D,2))
    error('The number of iterations needs to be less than the dictionary size')
end
bIndex = 1:size(G,2);
for k = H:-1:1
    innerProduct = [];
    for index = 1:size(G,2)
        innerProduct(index) = dot(R,G(:,index));
    end
    [c,d] = max(abs(innerProduct));
    An(H-k+1) = innerProduct(d);
    R = R - (An(H-k+1)*G(:,d));
    G(:,d) = [];
    strong(H-k+1) = bIndex(d);
    bIndex(d) = [];
end
G_new = G1(:,strong);
%% reconstruction
bReconstructed = zeros(size(G_new,1),1);
for index = 1:size(G_new,2)
    bReconstructed(:,index) = (An(index)*G_new(:,index));
end
b_new = sum(bReconstructed,2)
Yes, the atoms in the dictionary must be normalized so that the inner products of the current residual with different atoms can be compared fairly.
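To make the structure of the greedy loop concrete, here is a minimal matching-pursuit sketch in plain Python (the two-atom dictionary and signal are hypothetical, chosen so the recovery is exact):

```python
# Minimal matching pursuit: atoms must be unit-norm so that inner
# products with the residual are comparable across atoms.
def matching_pursuit(atoms, b, iters):
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    residual = list(b)
    coeffs = [0.0] * len(atoms)
    for _ in range(iters):
        # pick the atom with the largest |<residual, atom>|
        ips = [dot(residual, a) for a in atoms]
        d = max(range(len(atoms)), key=lambda k: abs(ips[k]))
        coeffs[d] += ips[d]
        # subtract that atom's contribution from the residual
        residual = [r - ips[d] * ai for r, ai in zip(residual, atoms[d])]
    return coeffs, residual

# Two orthonormal atoms; the signal is 3*a0 + 2*a1, recovered exactly.
atoms = [[1.0, 0.0], [0.0, 1.0]]
coeffs, residual = matching_pursuit(atoms, [3.0, 2.0], 2)
print(coeffs)    # -> [3.0, 2.0]
print(residual)  # -> [0.0, 0.0]
```

With a non-orthogonal dictionary the residual shrinks but generally does not reach zero in finitely many steps, which is why the MATLAB versions above iterate H times.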
You may want to check my OMP implementation, which also includes incremental Cholesky updates for the least-squares step in OMP, at https://github.com/indigits/sparse-plex/blob/master/library/%2Bspx/%2Bpursuit/%2Bsingle/omp_chol.m
I have written detailed tutorial notes on OMP in my library documentation at https://sparse-plex.readthedocs.io/en/latest/book/pursuit/omp/index.html
My library sparse-plex contains a C implementation of OMP which is close to 4 times faster than the fastest MATLAB implementations. See the discussion at https://sparse-plex.readthedocs.io/en/latest/book/pursuit/omp/fast_omp.html

Slow speed of UDP reception in Matlab

My FPGA is sending UDP packets over the network using 100 Mbps Ethernet, and I have written MATLAB code to capture the data. The problem is that I am getting a very low speed in MATLAB, around 50 kbps, during reception. The FPGA kit is connected to a Gbps switch and then to the PC. There is no internet cable in the switch.
I am pasting the MATLAB code below. If I try to increase the speed by increasing the buffer size, packets are dropped. The current settings were found through trial and error; with them I receive all data successfully. Is there any way to increase the data reception speed in MATLAB?
Code (UDP from FPGA to MATLAB):
clc
clear all
close all
u = udp('192.168.0.100','RemotePort',4660,'Localport',4661);
set(u,'DatagramTerminateMode','off');
set(u,'InputBufferSize',18);
set(u,'Timeout',0.1);
fopen(u);
x = tic;
for i = 1:1000
    a(:,i) = fread(u,18);
end
fclose(u);
delete(u);
t = toc(x);
bw = (1000*18*8)/t;
A modified version of the above code (for ease of understanding), plus an image showing the problem: the data variable with a buffer size of 20 packets (18 bytes/packet). The data must not be all zero, as pointed out in the image; the zeros represent missed packets.
clc
clear all
close all
packet_size = 18;                % Size of 1 packet
buffer_size = 1*packet_size;     % Buffer to store 1 packet of packet_size bytes
buffer_read_count = 10;          % How many times the buffer must be read
u = udp('192.168.0.100','RemotePort',4660,'Localport',4661);
set(u,'DatagramTerminateMode','off');
set(u,'InputBufferSize',buffer_size);
set(u,'Timeout',0.5);
fopen(u);
x = tic;
for i = 1:buffer_read_count
    [a, count] = fread(u,buffer_size);   % Read the complete buffer in one fread()
    if (count == buffer_size)
        data(:,i) = a;   % If read bytes (count) == buffer_size, store in data
    end
end
fclose(u);
delete(u);
t = toc(x);
bw = (buffer_read_count*buffer_size*8)/t;   % Speed / BW of UDP reception
I looked at your code and made some basic corrections; let me know if it speeds up your code.
u = udp('192.168.0.100','RemotePort',4660,'Localport',4661);
set(u,'DatagramTerminateMode','off', ...
    'InputBufferSize',18, ...
    'Timeout',0.1);   % I think only one call of set is needed here
fopen(u);
x = tic;
% Pre-allocate a before the loop
a = zeros(YourNumberOfLine, 1000);
for ii = 1:1000   % always use ii and jj, not i and j
    a(:,ii) = fread(u,18);
end
fclose(u);
delete(u);
t = toc(x);
bw = (1000*18*8)/t;
Let me summarize my comments.
Low code efficiency
As @m_power has pointed out, using i and j slows down your code a bit. See this for more information. In MATLAB, you should always use ii and jj instead.
You didn't initialize data. See how MathWorks explains this. If #1 "slows by a bit", #2 slows a lot.
Since your code is slow, it's not guaranteed that each time the FPGA sends a packet, your PC can find an available buffer to receive it.
Full buffer
if (count == buffer_size)
    data(:,i) = a;   % If read bytes (count) == buffer_size, store in data
end
So if the packet is smaller than the buffer, data(:,i) gets nothing? That is the most probable reason why you are getting zeros in columns 3, 4, and 5.
Empty buffer
Zeros in columns 3, 4 and 5 may also originate from an empty buffer, once you have made the previous changes. The buffer is not guaranteed to hold anything when MATLAB reads it, so some for iterations may catch zero-length content, giving data(:,ii) = 0.
Use a while loop to solve this issue: only count non-empty buffer readings.
ii = 0;
while (ii < buffer_read_count)
    [a, count] = fread(u, buffer_size);
    if count   % non-empty reading
        ii = ii + 1;
        data(1:count,ii) = a;
    end
end
....incomplete packets?
You wait for a full buffer because each time you want to read an entire packet? I suddenly realized it; how stupid I was!
But what you have done is keep reading the buffer and throwing away the data whenever it is shorter than the buffer length.
Instead, you'll need to aggregate the data across loop iterations.
data = zeros(buffer_size, buffer_read_count);
total_size = buffer_read_count*buffer_size;
ptr = 1;   % 1-D array index of data
while (ptr < total_size)
    [a, count] = fread(u, buffer_size);
    if count   % non-empty reading
        if ((ptr+count) > total_size)
            data(ptr:end) = a(1:(total_size-ptr+1));
            ptr = total_size;
        else
            data(ptr:(ptr+count-1)) = a;
            ptr = ptr + count;
        end
    end
end
Test - I changed fread to a random integer generator, with ii recording how many times the buffer is read.
clear all; clc;
buffer_size = 18;
buffer_read_count = 10;
data = zeros(buffer_size, buffer_read_count);
total_size = buffer_read_count*buffer_size;
ptr = 1;   % 1-D array index of data
ii = 1;
while (ptr < total_size)
    count = randi(buffer_size);
    a = randi(9, count, 1) + ii*10;   % 10's digit shows the buffer-reading number
    ii = ii + 1;
    % [a, count] = fread(u, buffer_size);
    if count   % non-empty reading
        if ((ptr+count) > total_size)
            data(ptr:end) = a(1:(total_size-ptr+1));
            ptr = total_size;
        else
            data(ptr:(ptr+count-1)) = a;
            ptr = ptr + count;
        end
    end
end
disp(data)
The result is
13 38 51 63 72 93 104 125 141 164
12 35 53 63 73 96 101 123 148 168
14 33 55 68 72 99 106 124 142 168
14 37 51 69 77 91 109 127 145 165
12 33 57 66 76 96 114 137 143 168
14 39 56 63 72 94 117 139 144 169
11 46 55 61 72 93 111 139 146 164
16 42 58 68 75 93 119 135 153 164
26 41 58 66 79 109 126 139 152 166
33 43 58 69 75 102 122 132 152 177
35 48 53 61 81 108 125 131 153 174
36 49 55 66 95 102 125 133 165 177
31 47 57 63 94 109 129 136 164 179
35 47 51 72 98 108 128 135 162 175
36 43 51 74 94 104 129 139 169 175
32 46 53 74 95 107 127 144 164 173
38 48 55 78 97 105 124 145 168 171
39 44 59 77 98 108 129 147 166 172
As you can see, each fread output length is either equal to or less than the buffer size, but the loop only moves on to the next column once the current one has been completely received.
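The pointer bookkeeping in that loop is language-independent; here is a Python sketch (where reader is a hypothetical stand-in for fread) showing the same aggregation of variable-sized reads into a fixed-size flat buffer:

```python
# Aggregate variable-sized reads into a flat buffer of total_size items,
# the same bookkeeping the MATLAB while loop performs with ptr.
def fill_buffer(reader, total_size):
    data, ptr = [0] * total_size, 0
    while ptr < total_size:
        chunk = reader()          # may return fewer items than requested
        if chunk:                 # skip empty reads
            take = min(len(chunk), total_size - ptr)
            data[ptr:ptr + take] = chunk[:take]
            ptr += take
    return data

# Fake reader: yields chunks of varying length, like a UDP socket.
chunks = iter([[1, 2, 3], [], [4], [5, 6, 7, 8], [9, 10, 11]])
print(fill_buffer(lambda: next(chunks), 10))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Note the final chunk is truncated to fit, mirroring the `a(1:(total_size-ptr+1))` branch in the MATLAB version.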

MATLAB - Counting the # of rows that are only numbered with multiples of 5

I am currently doing a project that involves MATLAB, and I just can't seem to figure out a way to solve this problem. I have a data set that looks like this:
262 23 34
262 23 34
262 23 35
262 23 38
262 23 38
262 23 39
262 23 40
262 23 41
262 23 42
262 23 43
262 23 45
262 23 46
262 23 47
262 23 48
262 23 50
262 23 50
262 23 51
262 23 52
262 23 55
262 23 57
262 23 58
263 0 0
263 0 2
263 0 4
263 0 7
263 0 10
263 0 15
263 0 25
263 0 29
263 0 32
263 0 39
263 1 1
272 23 28
272 23 30
272 23 56
273 0 1
273 0 2
273 0 3
273 0 3
273 0 4
273 0 4
273 0 5
273 0 5
273 0 6
273 0 8
273 0 10
273 0 32
273 0 37
From left to right, the columns represent the Julian day, the hour (UTC), and the minute at which a tip of the rain gauge was made.
I need to calculate 5-minute totals and their accumulation for each day: for example, on day 262 the rain tips total from minute 13-15 (since the information before 23:34 is not provided), then 13-20 accumulated, 13-25, 13-30, etc. Like I said, each recorded time is when one tip was made, and the precipitation amount of one tip is 0.01 inch. So all I need to know is how many tips were made within each day, in 5-minute intervals.
Could anyone help me please?
Perhaps something like this:
%# all possible days represented in the data
%# (you could also do this for all 1:365 days)
days = unique(X(:,1));
%# cumulative counts in each 5-minute interval
%# (each column represents one of the days)
counts = zeros(24*60/5, numel(days));
for i = 1:numel(days)
    %# indices of the instances in that day
    idx = (X(:,1) == days(i));
    %# convert hour/minute into minute units
    m = X(idx,2).*60 + X(idx,3);
    %# count occurrences in each 5-minute bin
    c = accumarray(fix(m/5)+1, 1, [24*60/5 1]);
    %# take the cumulative sum and store it
    counts(:,i) = cumsum(c);
end
so for day=262 we have:
>> counts(end-6:end,1)
ans =
0 %# [00:00, 23:30)
2 %# [00:00, 23:35)
6 %# [00:00, 23:40)
10 %# [00:00, 23:45)
14 %# [00:00, 23:50)
18 %# [00:00, 23:55)
21 %# [00:00, 00:00 of next day)
You can convert day-hour-min into minutes. Suppose your data is stored in an n-by-3 matrix called data (how original). Then
>> minutes = data * [24*60; 60; 1]; % counting minutes from the beginning
Now you have to define the bins' edges (the intervals for summation):
>> edges = min(minutes) : 5 : max(minutes); % you might want to round the lower and upper limits to align with a 5-minute interval
Use histc to count how many drops fall in each bin:
>> drops = histc(minutes, edges);
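Either MATLAB approach reduces to "bin the tip times into 5-minute buckets, then accumulate". A Python sketch with hypothetical tip times (the first few tips of day 262 from the data above) shows the idea, analogous to accumarray followed by cumsum:

```python
# Count rain-gauge tips per 5-minute bin within one day, then take the
# running total, mirroring accumarray(fix(m/5)+1, 1) + cumsum.
def cumulative_tip_counts(tips):
    """tips: list of (hour, minute) tuples within a single day."""
    bins = [0] * (24 * 60 // 5)          # 288 five-minute bins per day
    for hour, minute in tips:
        bins[(hour * 60 + minute) // 5] += 1
    total, cum = 0, []
    for c in bins:
        total += c
        cum.append(total)
    return cum

tips = [(23, 34), (23, 34), (23, 35), (23, 38), (23, 38)]
cum = cumulative_tip_counts(tips)
# cum[282] covers [00:00, 23:35): the two tips at 23:34.
# cum[283] covers [00:00, 23:40): adds the tips at 23:35, 23:38, 23:38.
print(cum[282], cum[283])  # -> 2 5
```

Multiplying each count by 0.01 inch per tip then gives the accumulated precipitation.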

What is another similar logic that is faster than ismember?

Continuing my research, I need logic similar to ismember but with a faster execution time.
This is the relevant part of my code, and the matrix:
StartPost = [14 50 30 1 72 44 76 68 63 80 25 41;
             14 50 30 1 61 72 42 46 67 77 81 27;
             35 23 8 54 19 70 48 75 66 79 2 84;
             35 23 8 54 82 72 78 68 19 2 48 66;
             69 24 36 52 63 47 5 18 11 82 1 15;
             69 24 36 52 48 18 1 12 80 63 6 84;
             73 38 50 7 1 33 24 68 29 20 62 84;
             73 38 50 7 26 61 65 32 22 18 2 69]
for h = 2:2:8
    ...
    done = all(ismember(StartPost(h,1:4), StartPost(h-1,1:4)));
    ...
end
I checked the code using the Profile Viewer and found that this part is what makes my code execute slowly.
If anyone has experience with this logic, please share. Thanks.
MATLAB has several undocumented built-in functions which can help you achieve the same results as other functions, only faster.
In your case, you can use ismembc:
done = all(ismembc(StartPost(h, 1:4), sort(StartPost(h-1, 1:4))));
Note that ismembc(A, B) requires matrix B to be sorted and not to contain any NaN values.
Here's the execution time difference for your example:
tic
for h = 2:2:8
    done = all(ismember(StartPost(h, 1:4), StartPost(h-1, 1:4)));
end
toc
Elapsed time is 0.029888 seconds.
tic
for h = 2:2:8
    done = all(ismembc(StartPost(h, 1:4), sort(StartPost(h-1, 1:4))));
end
toc
Elapsed time is 0.006820 seconds.
This is roughly 4 times faster (0.0299 s vs. 0.0068 s).
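The reason ismembc requires sorted input is that sorted data allows a binary search for each element instead of a linear scan; the same trade-off can be sketched in Python with bisect (illustrative, not MATLAB):

```python
from bisect import bisect_left

# Membership test against a sorted sequence via binary search, analogous
# to ismembc's requirement that its second argument be pre-sorted.
def ismembc_like(values, sorted_ref):
    def contains(x):
        i = bisect_left(sorted_ref, x)
        return i < len(sorted_ref) and sorted_ref[i] == x
    return [contains(v) for v in values]

row_prev = sorted([14, 50, 30, 1])   # sort once, like sort(StartPost(h-1,1:4))
print(ismembc_like([14, 50, 30, 1], row_prev))       # -> [True, True, True, True]
print(all(ismembc_like([14, 50, 30, 2], row_prev)))  # -> False
```

Sorting once and reusing the sorted row across lookups is what makes this pay off, since each lookup then costs O(log n) rather than O(n).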