Marking values from the previous N number of days in KDB based on criteria? - kdb

Initial Table
company time value
-------------------------
a 00:00:15.000 100
a 00:00:30.000 100
b 00:01:00.000 100
a 00:01:10.000 100
a 00:01:15.000 100
a 00:01:20.000 300
a 00:01:25.000 100
b 00:01:30.000 400
a 00:01:50.000 100
a 00:02:00.000 100
a 00:03:00.000 200
Let t = 1 hour.
For each row, I would like to look back t time.
Entries falling within t form a time window. I would like to get (max(time window) - min(time window)) / (number of events).
For example, if it is 12:00 now and there are five events in total (12:00, 11:50, 11:40, 11:30, 10:30), four of which fall within the window t (12:00, 11:50, 11:40, 11:30), the result will be (12:00 - 11:30) / 4.
Additionally, the window should only account for rows with the same value and company name.
Resultant Table
company time value x
--------------------------------
a 00:00:15.000 100 0 (First event A).
a 00:00:30.000 100 15 (30 - 15 / 2 events).
b 00:01:00.000 100 0 (First event of company B).
a 00:01:10.000 100 55/3 = 18.33 (1:10 - 0:15 / 3 events).
a 00:01:15.000 100 60/4 = 15 (1:15 - 0:15 / 4 events).
a 00:01:20.000 300 0 (Different value).
a 00:01:25.000 100 55/4 = 13.75 (01:25 - 0:30 / 4 events).
b 00:01:30.000 400 0 (Different value and company).
a 00:01:50.000 100 40/4 = 10 (01:50 - 01:10 / 4 events).
a 00:02:00.000 100 50/5 = 10 (02:00 - 01:10 / 5 events).
a 00:03:00.000 200 0 (Different value).
Any help will be greatly appreciated. If it helps, I asked a similar question, which worked splendidly: Sum values from the previous N number of days in KDB?
Table Query
([] company:`a`a`b`a`a`a`a`b`a`a`a; time: 00:00:15.000 00:00:30.000 00:01:00.000 00:01:10.000 00:01:15.000 00:01:20.000 00:01:25.000 00:01:30.000 00:01:50.000 00:02:00.000 00:03:00.000; v: 100 100 100 100 100 300 100 400 100 100 200)

You may wish to use the following:
q)update x:((time-time[time binr time-01:00:00])%60000)%count each v where each time within/:flip(time-01:00:00;time) by company,v from t
company time v x
---------------------------------
a 00:15:00.000 100 0
a 00:30:00.000 100 7.5
b 01:00:00.000 100 0
a 01:10:00.000 100 18.33333
a 01:15:00.000 100 15
a 01:20:00.000 300 0
a 01:25:00.000 100 13.75
b 01:30:00.000 400 0
a 01:50:00.000 100 10
a 02:00:00.000 100 10
a 03:00:00.000 200 0
It uses time binr time-01:00:00 to get the index of the min time for the previous 1 hour of each time.
Then (time-time[time binr time-01:00:00])%60000 gives the respective time range (i.e., time - min time) for each time in minutes.
count each v where each time within/:flip(time-01:00:00;time) gives the number of rows within this range.
Dividing the two and grouping by company,v restricts the calculation to rows that have the same company and v values.
Hope this helps.
Kevin

If your table is ordered by time, then the solution below will give you the required result. If it is not already ordered, you can sort it by time using xasc.
I have also modified the table to have time with different hour values.
q) t:([] company:`a`a`b`a`a`a`a`b`a`a`a; time: 00:15:00.000 00:30:00.000 01:00:00.000 01:10:00.000 01:15:00.000 01:20:00.000 01:25:00.000 01:30:00.000 01:50:00.000 02:00:00.000 03:00:00.000; v: 100 100 100 100 100 300 100 400 100 100 200)
q) f:{(`int$x-x i) % 60000*1+til[count x]-i:x binr x-01:00:00}
q) update res:f time by company,v from t
Output
company time v res
---------------------------------
a 00:15:00.000 100 0
a 00:30:00.000 100 7.5
b 01:00:00.000 100 0
a 01:10:00.000 100 18.33333
a 01:15:00.000 100 15
a 01:20:00.000 300 0
a 01:25:00.000 100 13.75
b 01:30:00.000 400 0
a 01:50:00.000 100 10
a 02:00:00.000 100 10
a 03:00:00.000 200 0
You can modify the function f to change time window value. Or change f to accept that as an input parameter.
Explanation:
We pass time vector by company, value to a function f. It deducts 1 hour from each time value and then uses binr to get the index of the first time entry within 1-hour window range from the input time vector.
q) i:x binr x-01:00:00
0 0 0 0 1 2 2
After that, it uses these indexes to calculate the number of events in each window. The count is multiplied by 60000 because the time difference, once cast to int, is in milliseconds, so dividing by 60000 times the count gives minutes per event.
q) 60000*1+til[count x]-i
60000 120000 180000 240000 240000 240000 300000
Finally, we subtract the min time from the max time of each window and divide by the counts above. Since the time vector is ordered (ascending), the input time vector itself supplies the max values, and the min values sit at the indexes in i.
q) (`int$x-x i) % 60000*1+til[count x]-i
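For readers who want to cross-check the numbers outside q, here is a minimal Python sketch of the same idea. It is not the answerer's code: it assumes times are given as minutes since midnight and arrive sorted ascending, `bisect_left` plays the role of binr, and the names `rolling_span_per_event` and `WINDOW` are made up for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

WINDOW = 60  # look-back window in minutes (assumption: times are minutes since midnight)

def rolling_span_per_event(rows, window=WINDOW):
    """rows: list of (company, minute, value), sorted by minute.
    For each row, return (current time - earliest time in window) / events in window,
    counting only rows with the same company and value."""
    groups = defaultdict(list)          # (company, value) -> times seen so far
    out = []
    for company, minute, value in rows:
        times = groups[(company, value)]
        times.append(minute)
        # first index whose time is >= minute - window (same role as binr in q)
        lo = bisect_left(times, minute - window)
        n_events = len(times) - lo      # events inside the window, current row included
        out.append((minute - times[lo]) / n_events)
    return out

rows = [("a", 15, 100), ("a", 30, 100), ("b", 60, 100), ("a", 70, 100),
        ("a", 75, 100), ("a", 80, 300), ("a", 85, 100), ("b", 90, 400),
        ("a", 110, 100), ("a", 120, 100), ("a", 180, 200)]
result = rolling_span_per_event(rows)
```

With the sample table, `result` should line up with the res column above, for example 7.5 for the second row and 13.75 for the 01:25 row.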

Related

Resampling multiple data columns from minutes to hours in matlab

I have a big data set of minutely data with multiple columns that needs to be converted from minutes to hours.
I am new to matlab and tried
data_minute = rand(data); % synthetic data
data_hour = mean(reshape(data_minute, 60, []))
which only gives me the hourly data from one row.
I wasn't able to work through every column with something like:
for i = 1:n_columns
data_hour(:,i) = mean(reshape(data_minute(:,i),60, []));
end
Trying a for loop to sample every 60 data points also didn't work out.
Searching Google for a solution didn't give me a result I understood.
Update:
For clarification the data looks something like this:
minute value
1 501
2 479
3 449
4 463
5 404
6 173
7 141
8 141
9 141
10 140
11 140
12 140
13 140
14 202
15 206
16 206
.. ...
525604 120
This sounds like a job for timetable and retime. First make a timetable, using a duration for the "time" variable - it's easy to create a duration array using the minutes function. For example:
>> tt = timetable(minutes(0:1000)', rand(1001, 1));
>> % Just look at the first few rows of 'tt':
>> head(tt)
ans =
8×1 timetable
Time Var1
_____ ________
0 min 0.31907
1 min 0.98605
2 min 0.71818
3 min 0.41318
4 min 0.09863
5 min 0.73456
6 min 0.63731
7 min 0.073842
>> % use 'retime' to get the hourly means:
>> rt = retime(tt, 'hourly', 'mean')
rt =
17×1 timetable
Time Var1
_______ _______
0 min 0.47755
60 min 0.47877
120 min 0.48007
180 min 0.55399
240 min 0.5142
300 min 0.5656
360 min 0.50957
420 min 0.48986
480 min 0.49568
540 min 0.55133
600 min 0.49981
660 min 0.53677
720 min 0.49343
780 min 0.53409
840 min 0.47901
900 min 0.55287
960 min 0.48173
We want to downsample the data with an aggregation or an interpolation of all the measurements grouped by hour.
If we take this example data matrix:
M = [10, 3,4,5,6;
2000, 3,4,3,5;
5000, 4,4,4,4]
Say that the first column corresponds to the time in seconds, and the other columns correspond to your measurements.
Solution 1: Aggregation with accumarray
% we start by calculating the time in hour (3600 seconds in one hour).
hour = ceil(M(:,1)/3600)
% We extract the measurements
val = M(:,2:end)
% nrow = How many different measurements ?
nrow = size(val,2);
% How many unique hour ?
[uid,~,id] = unique(hour);
% creation of a sub index grouping the measurements by hour and by column
sub = [repmat(id,nrow,1),kron(1:nrow,ones(1,length(id))).']
sub =
1 1
1 1
2 1
1 2
1 2
2 2
1 3
1 3
2 3
1 4
1 4
2 4
%We calculate the result using accumarray (first column = hour):
RES = [uid,accumarray(sub,val(:),[],@median)] %if you want the mean choose @mean
RES =
1.0000 3.0000 4.0000 4.0000 5.5000
2.0000 4.0000 4.0000 4.0000 4.0000
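The grouping that accumarray performs here can also be sketched in plain Python, which may help when checking the result by hand. This is only an illustration under the same assumptions as the example (first column is time in seconds, remaining columns are measurements); the function name `aggregate_by_hour` is hypothetical.

```python
from collections import defaultdict
from math import ceil
from statistics import median

def aggregate_by_hour(rows, agg=median):
    """rows: list of [time_in_seconds, m1, m2, ...].
    Returns one [hour, agg(m1), agg(m2), ...] row per distinct hour."""
    buckets = defaultdict(list)
    for t, *vals in rows:
        # same hour index as ceil(M(:,1)/3600) in the MATLAB code
        buckets[ceil(t / 3600)].append(vals)
    return [[hour] + [agg(col) for col in zip(*vals_list)]
            for hour, vals_list in sorted(buckets.items())]

M = [[10, 3, 4, 5, 6],
     [2000, 3, 4, 3, 5],
     [5000, 4, 4, 4, 4]]
RES = aggregate_by_hour(M)
```

Passing `agg=statistics.mean` instead of the default would give hourly means rather than medians.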
Solution 2: Interpolation with interp1
You can interpolate your data with interp1
interp_second = unique(floor(M(:,1)/3600))*3600
% creation of a unique index
uid = unique(ceil(M(:,1)/3600))
% We extract the measurements
val = M(:,2:end)
% Result (first column = hour)
RES = [uid,interp1(M(:,1),val,interp_second)]
Conclusion
I would recommend solution 1, because the method is more robust.

kdb how to aj with the first time of appearance

Here is my problem:
I have two tables:
q)t1:([]sym:1 5;x: 90 90)
q)t2:([]sym: 2 3 4 6 7 8; y: 100 200 300 400 500 600)
If I do aj[`sym;t2;t1], all 6 rows in the result table will contain x with value 90.
But what I want is value 90 in column x only in the rows with sym 2 and 6, i.e. the first sym in t2 that comes after each sym in t1.
In other words, I want the result table to be like this:
q)([]sym:2 3 4 6 7 8; y: 100 200 300 400 500 600; x:90 0N 0N 90 0N 0N)
sym y x
----------
2 100 90
3 200
4 300
6 400 90
7 500
8 600
Could anyone tell me how I can achieve this? Thank you so much!
Not sure if aj can be used in this sense. This might give you what you need:
q)t2 lj 1!update sym:{x x binr y}[t2.sym;sym] from t1
sym y x
----------
2 100 90
3 200
4 300
6 400 90
7 500
8 600
Uses binr to find the first value in t2.sym that is greater than or equal to each sym in t1, then joins only on that.
EDIT: note also that binr is >= ..... If you need strictly greater than you could use:
q)t2 lj 1!update sym:{x 1+x bin y}[t2.sym;sym] from t1
sym y x
----------
2 100 90
3 200
4 300
6 400 90
7 500
8 600
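The difference between the two lookups can be illustrated with a short Python sketch, where `bisect_left` stands in for binr (first key >= the value) and `bisect_right` for 1 + bin (first key strictly greater). This is an analogy rather than a translation of the q join, and `join_on_next` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right

def join_on_next(t2_sym, t1_sym, t1_x, strict=False):
    """For each t1 key, find the first t2 key that is >= it (or > it when strict)
    and attach the t1 value there; every other t2 row gets None."""
    find = bisect_right if strict else bisect_left
    x = [None] * len(t2_sym)
    for key, val in zip(t1_sym, t1_x):
        idx = find(t2_sym, key)
        if idx < len(t2_sym):           # skip keys beyond the last t2 key
            x[idx] = val
    return x

t2_sym = [2, 3, 4, 6, 7, 8]             # must be sorted ascending
x = join_on_next(t2_sym, [1, 5], [90, 90])
```

For the tables in the question both variants give the same answer; they only differ when a sym in t1 also appears in t2.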
You can use aj to get, for each row of t2, the index of the last t1 row whose sym is less than or equal to it, then use a vector conditional to keep x only where that index has just increased, i.e.
select sym, y, x:?[c>prev c;x;0N] from aj[`sym; t2; update c:i from t1]

Count group of ones in series

I have a series a=[100 200 1 1 1 243 300 1 1 1 1 1 400 1 900 600 900 1 1 1 ]
I have to count how many times 1 occurs in each group of consecutive 1s.
First group of 1's, sum is 3 (lying between 200 and 243).
Second group of ones lying between 300 and 400 is 5. Sum of all ones in each group is [3 5 1 3].
Please give me some suggestions.
Use diff on a==1. Pad with false on both sides to ensure the count is correct no matter what the starting or ending values of a are. Finally, find the start and end of each run and subtract:
d = diff([false, a==1, false]);
result = find(d==-1) - find(d==1);
In your example this gives
result =
3 5 1 3
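The same pad, diff, and subtract steps can be written out in plain Python to see what each stage produces. This is a sketch mirroring the MATLAB lines above; the function name `run_lengths` is made up for illustration.

```python
def run_lengths(seq, target=1):
    """Length of each run of consecutive `target` values in seq."""
    # pad with False on both sides so runs touching either end are closed
    mask = [False] + [v == target for v in seq] + [False]
    # discrete difference: +1 where a run starts, -1 just past where it ends
    d = [int(b) - int(a) for a, b in zip(mask, mask[1:])]
    starts = [i for i, v in enumerate(d) if v == 1]
    ends = [i for i, v in enumerate(d) if v == -1]
    return [e - s for s, e in zip(starts, ends)]

a = [100, 200, 1, 1, 1, 243, 300, 1, 1, 1, 1, 1, 400, 1, 900, 600, 900, 1, 1, 1]
result = run_lengths(a)
```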

Limited Sum in Matlab

Hi, let's say that I have a 5x5 matrix.
B=[1 2 3 4 5; 10 20 30 40 50; 100 200 300 400 500; 1000 2000 3000 4000 5000; 10000 20000 30000 40000 50000];
How do I use the sum function to sum rows 2 to 4 and get the result:
A = [1110;2220;3330;4440]
You'll find some useful information about matrix indexing in the documentation at http://www.mathworks.co.uk/help/matlab/math/matrix-indexing.html
To illustrate your example, you can use B(2:4,:) to retrieve the following:
ans =
10 20 30 40 50
100 200 300 400 500
1000 2000 3000 4000 5000
You can then use the sum function as follows to achieve your desired result:
A = sum(B(2:4,:))
I hope this helps!
All the best,
Matt
MATLAB>> sum(B(2:4,1:4))
ans =
1110 2220 3330 4440
If you want to transpose the result, add ' at the end.
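As a quick cross-check of the indexing, here is the same row selection and column-wise sum sketched in plain Python (0-based slices, so MATLAB rows 2:4 become `B[1:4]`). The variable names are illustrative only.

```python
B = [[1, 2, 3, 4, 5],
     [10, 20, 30, 40, 50],
     [100, 200, 300, 400, 500],
     [1000, 2000, 3000, 4000, 5000],
     [10000, 20000, 30000, 40000, 50000]]

rows = B[1:4]                                   # rows 2..4 in MATLAB's 1-based numbering
col_sums = [sum(col) for col in zip(*rows)]     # one sum per column, all five columns
first_four = col_sums[:4]                       # restrict to columns 1..4
```

Summing over all columns yields five values; restricting to the first four columns gives the four values asked for in the question.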

Matlab simulation: Query regarding generating random numbers

I am doing some simulation studies, and for the initial studies I am trying to simulate 100 gas particles and then group these particles randomly into 5 groups, 10 or 100 times (no group may be empty). After that I have to find the group with the most particles and its particle count.
for example
100 gas particles
1 2 3 4 5(groups) Total particle group/Highest number
20|20|20|20|20 100 1-2-3-4-5/20
70|16|04|01|09 100 1/70
18|28|29|10|15 100 3/29
.
.
etc
I have used this to generate 5 random numbers a single time:
for i=1:1
randi([1,100],1,5)
end
ans =
50 41 9 60 88
But how will I find the highest number and its group?
Use the max function :
a = [50 41 9 60 88];
[C,I] = max(a)
C should be equal to 88 and I to 5.
For the special case of ties (the first line in your example), see the documentation of max: the index returned is that of the first occurrence of the maximum.
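To tie the two parts of the question together, here is a small Python sketch, offered as one possible approach rather than the method from the answer. It assumes the groups must sum to exactly 100 and be non-empty, which it achieves by drawing distinct cut points; `random_groups` and `largest_group` are hypothetical names.

```python
import random

def random_groups(total=100, n_groups=5, rng=random):
    """Split `total` particles into n_groups non-empty groups
    by picking n_groups - 1 distinct cut points between 1 and total - 1."""
    cuts = sorted(rng.sample(range(1, total), n_groups - 1))
    edges = [0] + cuts + [total]
    return [b - a for a, b in zip(edges, edges[1:])]

def largest_group(counts):
    """Return (max count, 1-based index of its first occurrence), like [C,I] = max(a)."""
    c = max(counts)
    return c, counts.index(c) + 1

groups = random_groups()
c, i = largest_group(groups)
```

Calling `random_groups()` in a loop 10 or 100 times gives the repeated trials; `largest_group` reports only the first group in case of a tie.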